Cross-Source Event Linking: Deterministic Fault Binding for Carrier SLAs

Cross-source event linking is the deterministic binding layer inside the Fault Correlation & Rule Engines pipeline. Its operational intent is precise: transform asynchronous, heterogeneous telemetry streams into unified incident contexts before downstream routing or ticket generation occurs. This stage deliberately excludes raw ingestion, schema normalization, and final ITSM dispatch, focusing exclusively on temporal alignment, semantic matching, and causal binding across disparate monitoring domains. For NOC engineers and platform teams, the linking layer is the critical decision boundary where raw noise is filtered and actionable fault groups are constructed — directly shaping Mean Time to Acknowledge (MTTA), Mean Time to Resolution (MTTR), and carrier-grade SLA compliance.

Operational Intent and Boundaries

What enters this stage is a stream of already-normalized events. Upstream, the records arriving here have been parsed and typed against the contracts defined in Event Schema Design, so every payload carries a canonical device_fqdn, interface_idx, severity, and epoch timestamp. The linking layer does not re-parse Cisco syslog lines (that belongs to Syslog Format Parsing) and does not decode trap OIDs (that belongs to SNMP Trap Standardization). It assumes clean input and concentrates on one question: which of these independent events describe the same underlying fault?

What exits is a sealed correlation group — a set of member events, a stable correlation_id, and a single resolved severity — handed to severity calibration and routing. What is explicitly excluded is causality ranking across competing root causes (deferred to topology and scoring), historical warehousing, and ticket creation itself. By holding a narrow scope, the linking layer stays auditable: every binding decision can be replayed from the window state and the rule trace, which is what carrier change-control reviews demand. A linking engine that quietly absorbed parsing or routing responsibilities would become impossible to reason about during a 02:00 outage.

Pipeline Architecture and Temporal Alignment

The linking engine executes a stateful correlation pipeline that evaluates incoming events against active correlation windows. When a syslog alert, SNMP trap, or streaming telemetry payload arrives, the engine reads the canonical identifiers — device FQDN, interface index, and routing instance — and maps them onto a shared event bus keyed by network element. Temporal sliding windows, typically 60–300 seconds depending on network scale and protocol convergence characteristics, buffer candidate events for cross-referencing. A declarative rule framework then applies logical join predicates to the buffered payloads. This extends the baseline single-stream pattern matching of the parent pipeline by enforcing multi-source join conditions: a group is only viable when the candidates originate from two or more distinct telemetry sources.

Raw temporal proximity alone produces high false-positive rates in dense carrier networks. To eliminate spurious bindings, the linking layer queries a live topology graph to validate physical and logical adjacency. When an optical transport alarm and an IP/MPLS routing event share a temporal window, the engine traverses the inventory graph to confirm whether the affected optical port terminates on the same chassis as the impacted IP interface. This validation prevents unrelated events from being artificially grouped and ensures correlation groups reflect real network dependency paths. The methodology for integrating real-time graph traversal into correlation logic is formalized in Topology-Aware Correlation, which dictates how adjacency matrices and service dependency trees constrain the linking scope.
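
The chassis check described above can be sketched roughly as follows. This is a minimal illustration, assuming the inventory graph has been flattened into an in-memory mapping from (device, interface) to a chassis identifier; the `inventory` dict, the `same_chassis` helper, and all hostnames are hypothetical and do not come from any real inventory API.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical flattened inventory: (device_fqdn, interface_idx) -> chassis id.
PortKey = Tuple[str, str]


def same_chassis(inventory: Dict[PortKey, str], ports: Iterable[PortKey]) -> bool:
    """Return True only if every port resolves to one known chassis."""
    chassis_ids = set()
    for port in ports:
        chassis: Optional[str] = inventory.get(port)
        if chassis is None:
            # Unknown port: refuse to bind rather than guess.
            return False
        chassis_ids.add(chassis)
    return len(chassis_ids) == 1


# Illustrative data only.
inventory: Dict[PortKey, str] = {
    ("otn-1.pop-a.example.net", "1/1/3"): "chassis-17",
    ("pe-1.pop-a.example.net", "ge-0/0/3"): "chassis-17",
    ("pe-2.pop-b.example.net", "ge-0/0/1"): "chassis-42",
}

optical = ("otn-1.pop-a.example.net", "1/1/3")
ip_same = ("pe-1.pop-a.example.net", "ge-0/0/3")
ip_other = ("pe-2.pop-b.example.net", "ge-0/0/1")

bind_ok = same_chassis(inventory, [optical, ip_same])
bind_rejected = same_chassis(inventory, [optical, ip_other])
```

Treating an unknown port as a rejection keeps the gate conservative: a gap in the inventory produces a missed binding that can be reviewed, not a spurious group.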

Diagram: temporal and topological cross-source linking.

Production-Ready Python Implementation

Production deployments require non-blocking I/O, strict type safety, and deterministic window eviction. The pattern below pairs a Pydantic V2 model — used to enforce the schema contract at the linking boundary, exactly as recommended in Validating NetFlow Events with Pydantic — with an asyncio-driven engine and a deque-backed sliding window. It isolates the rule-evaluation boundary from topology validation and severity scoring.

import asyncio
import time
import logging
from collections import deque
from enum import Enum
from typing import Any, Deque, Dict, List, Optional

from pydantic import BaseModel, ConfigDict, Field, field_validator

logger = logging.getLogger("correlation_engine")


class Severity(str, Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"
    INFO = "info"


class NetworkEvent(BaseModel):
    # Strict mode rejects silent coercion (e.g. "5" -> 5), so a malformed
    # upstream record is rejected at the boundary instead of poisoning a group.
    model_config = ConfigDict(frozen=True, strict=True)

    event_id: str
    source: str                      # "syslog" | "snmp" | "telemetry"
    device_fqdn: str
    interface_idx: Optional[str] = None
    severity: Severity
    timestamp: float                 # wall-clock epoch seconds
    payload: Dict[str, Any] = Field(default_factory=dict)

    @field_validator("device_fqdn")
    @classmethod
    def _normalize_fqdn(cls, v: str) -> str:
        # FQDNs arrive with inconsistent casing across collectors; fold to
        # lower-case so the window key groups the same element reliably.
        return v.strip().lower()


class SlidingWindow:
    def __init__(self, ttl_seconds: float = 120.0) -> None:
        self.ttl = ttl_seconds
        self.buffer: Deque[NetworkEvent] = deque()

    def add(self, event: NetworkEvent) -> None:
        self.buffer.append(event)
        self._evict_expired()

    def _evict_expired(self) -> None:
        # Event timestamps are wall-clock (epoch) seconds, so compare against
        # time.time(), not time.monotonic(), which uses a different reference.
        # Head-only eviction assumes events arrive in roughly timestamp order.
        cutoff = time.time() - self.ttl
        while self.buffer and self.buffer[0].timestamp < cutoff:
            self.buffer.popleft()

    def candidates(self) -> List[NetworkEvent]:
        self._evict_expired()
        return list(self.buffer)


class CrossSourceLinker:
    def __init__(self, window_ttl: float = 120.0) -> None:
        self.windows: Dict[str, SlidingWindow] = {}
        self.window_ttl = window_ttl

    def _window_for(self, key: str) -> SlidingWindow:
        if key not in self.windows:
            self.windows[key] = SlidingWindow(ttl_seconds=self.window_ttl)
        return self.windows[key]

    async def process_event(self, event: NetworkEvent) -> Optional[Dict[str, Any]]:
        # Window is keyed per network element so unrelated devices never share
        # a buffer — the single biggest source of cross-device false positives.
        window = self._window_for(event.device_fqdn)
        window.add(event)

        candidates = window.candidates()
        if len(candidates) < 2:
            return None

        linked = await self._evaluate_rules(candidates)
        if not linked:
            return None

        # topology check runs off the hot path; await keeps the loop responsive
        if not await self._validate_topology(linked):
            logger.info("topology_reject device=%s n=%d", event.device_fqdn, len(linked))
            return None

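        # Anchor the ID on the oldest in-window member rather than the newest
        # arrival, so later events in the same window reuse the same ID.
        window_epoch = int(min(e.timestamp for e in linked))
        correlation_id = f"corr-{event.device_fqdn}-{window_epoch}"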
        return {
            "correlation_id": correlation_id,
            "events": [e.event_id for e in linked],
            "resolved_severity": self._resolve_severity(linked),
            "member_count": len(linked),
            "sealed_at": time.time(),
        }

    async def _evaluate_rules(self, candidates: List[NetworkEvent]) -> Optional[List[NetworkEvent]]:
        # Multi-source join: a viable group needs >= 2 distinct telemetry sources.
        # Production dispatches this to a compiled predicate AST / policy engine.
        sources = {e.source for e in candidates}
        return candidates if len(sources) >= 2 else None

    async def _validate_topology(self, events: List[NetworkEvent]) -> bool:
        # Stubbed adjacency lookup; real deployments query a cached graph
        # (Neo4j / RedisGraph / in-memory matrix) — see Topology-Aware Correlation.
        await asyncio.sleep(0)
        return True

    def _resolve_severity(self, events: List[NetworkEvent]) -> Severity:
        order = (Severity.CRITICAL, Severity.MAJOR, Severity.MINOR, Severity.INFO)
        for sev in order:
            if any(e.severity == sev for e in events):
                return sev
        return Severity.INFO

The engine consumes events from an asyncio.Queue populated by the ingestion tier, so linking never blocks collectors during a storm. When sustained ingest exceeds correlation throughput, the same queue provides the natural backpressure signal described in Async Batch Processing, letting the linking stage drain in bounded micro-batches rather than collapsing under head-of-line blocking.
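
One way the bounded micro-batch drain could look is sketched below. It assumes a bounded `asyncio.Queue` sits between ingestion and linking; the `drain_batch` helper, the `max_batch` parameter, and the queue size are illustrative choices, not part of the engine above.

```python
import asyncio
from typing import Any, List


async def drain_batch(queue: "asyncio.Queue[Any]", max_batch: int = 100) -> List[Any]:
    """Wait for one item, then take whatever else is ready, up to max_batch."""
    batch = [await queue.get()]
    while len(batch) < max_batch:
        try:
            batch.append(queue.get_nowait())
        except asyncio.QueueEmpty:
            break
    return batch


async def demo() -> List[int]:
    # Bounded queue: put() suspends producers once maxsize items are waiting,
    # which is the backpressure signal seen by the ingestion tier.
    queue: "asyncio.Queue[int]" = asyncio.Queue(maxsize=8)
    for i in range(5):
        await queue.put(i)

    sizes: List[int] = []
    while not queue.empty():
        batch = await drain_batch(queue, max_batch=2)
        sizes.append(len(batch))
    return sizes


batch_sizes = asyncio.run(demo())
```

Blocking on the first item and then taking only what is already queued keeps latency low when traffic is light while still amortising per-batch overhead during a storm.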

Topology Validation and False-Positive Suppression

Temporal joins without spatial validation generate alert storms that degrade NOC response times. Cross-source event linking enforces adjacency constraints by querying a live inventory graph (Neo4j, RedisGraph, or a cached adjacency matrix) before sealing a group. If a BGP session drop and an optical port CRC error share a 90-second window but reside on logically isolated routing instances, the engine rejects the binding. In tier-1 carrier environments this spatial filtering cuts false-positive flood rates by 60–85%, holding the steady-state false-positive routing rate below roughly 2% and protecting the MTTA budget that SLA acknowledgment clauses depend on.

Threshold tuning must account for protocol-specific convergence delays. OSPF/IS-IS reconvergence typically completes within 10–30 seconds, while BGP route-withdrawal propagation can exceed 120 seconds. Setting window TTLs by protocol prevents premature group closure that would split one incident into several tickets. The dynamic side of this — adjusting tolerances as the network and its baselines drift — is governed by Threshold Tuning Methods, while the adjacency-validation and dependency-tree traversal it relies on are detailed in Building Graph-Based Fault Trees in Python.

Diagram: per-element sliding window — two in-window events bind, an expired event is evicted.

Configuration and Tuning Parameters

The linking layer exposes a small set of parameters whose defaults should be derived from the network’s own convergence behaviour, never copied blindly:

Window TTL (window_ttl) — base sliding-window length. Default 120 s. Use 60–90 s for IGP-dominated access domains where reconvergence is fast, and 180–300 s for BGP-heavy cores where withdrawal propagation is slow. TTL set too short fragments one fault into multiple groups; too long inflates per-group member counts and latency.
Minimum distinct sources (min_sources) — default 2. Raising to 3 sharply suppresses noise on chatty elements but risks missing genuine two-signal faults (e.g. interface-down + BGP-flap); keep at 2 for transport and raise selectively per element class.
Severity decay half-life — default 45 s. Governs how quickly an older member’s severity contribution is discounted; tune against observed flap intervals so a re-arming alarm does not keep a group at CRITICAL.
Topology cache TTL — default 30 s. Adjacency lookups read a warmed cache; a stale cache after a maintenance window is a leading cause of wrong bindings, so invalidate on inventory change events rather than on a fixed timer alone.
Max group size — default 200. A hard cap that forces a group to seal under a storm, bounding worst-case latency; oversized groups almost always indicate a window TTL that is too generous.
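
The parameters above could be carried in a small validated configuration object. The sketch below uses the defaults listed in this section; the `LinkerConfig` class, its field names other than `window_ttl` and `min_sources`, the validation bounds, and the two example profiles are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkerConfig:
    window_ttl: float = 120.0          # seconds, base sliding-window length
    min_sources: int = 2               # distinct telemetry sources per group
    severity_half_life: float = 45.0   # seconds
    topology_cache_ttl: float = 30.0   # seconds
    max_group_size: int = 200          # hard cap that forces a seal

    def __post_init__(self) -> None:
        # Bounds mirror the 60-300 s window range discussed in this document.
        if not 60.0 <= self.window_ttl <= 300.0:
            raise ValueError("window_ttl outside the 60-300 s range")
        if self.min_sources < 2:
            raise ValueError("cross-source linking needs at least 2 sources")
        if self.severity_half_life <= 0 or self.topology_cache_ttl <= 0:
            raise ValueError("half-life and cache TTL must be positive")
        if self.max_group_size < self.min_sources:
            raise ValueError("max_group_size must admit a viable group")


# Hypothetical per-domain profiles; the exact values are examples only.
IGP_ACCESS = LinkerConfig(window_ttl=75.0)
BGP_CORE = LinkerConfig(window_ttl=240.0)
```

Rejecting out-of-range values at construction time means a mistyped TTL fails at deploy time and not during an incident.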

Severity Resolution and Weighted Decay

Cross-source linking does not simply aggregate maximum severity; it applies weighted decay that accounts for source reliability, event age, and historical noise patterns. A flapping SNMP interface-down alarm paired with a verified telemetry-based packet-loss metric should not inherit the SNMP trap’s historical noise weight. Instead, the engine applies exponential decay to older events and discounts sources with known flapping baselines, so the sealed group carries a calibrated severity that reflects actual service impact rather than raw alarm volume.

This scoring blends historical mean-time-between-failures (MTBF) data, source trust coefficients, and service-tier multipliers. The linking layer computes a provisional severity to make routing decisions deterministic, but the authoritative breakdown of decay functions, trust weighting, and SLA-aligned threshold calibration lives in Severity Scoring Algorithms, which the routing tier consults before a ticket is opened.
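
As a rough sketch of the provisional calculation, the snippet below combines an exponential age decay with a per-source trust coefficient and takes the strongest surviving contribution. The numeric severity weights, the trust values, the `Member` type, and the choice of `max` as the combining rule are all assumptions for illustration; only the 45 s half-life comes from the defaults listed earlier, and the authoritative functions live in Severity Scoring Algorithms.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Assumed numeric weights; the document does not specify them.
SEVERITY_WEIGHT: Dict[str, float] = {
    "critical": 1.0,
    "major": 0.7,
    "minor": 0.4,
    "info": 0.1,
}


@dataclass(frozen=True)
class Member:
    source: str
    severity: str
    timestamp: float  # epoch seconds


def decay_factor(age_seconds: float, half_life: float = 45.0) -> float:
    """Exponential decay: the contribution halves every half_life seconds."""
    return 0.5 ** (max(age_seconds, 0.0) / half_life)


def provisional_score(
    members: Iterable[Member],
    now: float,
    trust: Dict[str, float],
    half_life: float = 45.0,
) -> float:
    """Strongest decayed, trust-weighted severity among the members."""
    return max(
        (
            SEVERITY_WEIGHT[m.severity]
            * trust.get(m.source, 1.0)
            * decay_factor(now - m.timestamp, half_life)
            for m in members
        ),
        default=0.0,
    )


# An old CRITICAL trap from a low-trust source versus a fresh MAJOR metric.
members = [
    Member("snmp", "critical", timestamp=0.0),
    Member("telemetry", "major", timestamp=90.0),
]
score = provisional_score(members, now=90.0, trust={"snmp": 0.5, "telemetry": 1.0})
```

In this example the 90-second-old trap contributes 1.0 × 0.5 × 0.25 = 0.125 while the fresh telemetry metric contributes 0.7, so the stale, noisy alarm no longer pins the group at its raw maximum.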

Debugging Workflow and Observability

Production correlation engines require deterministic tracing and state introspection. The following checklist is standard for carrier-grade linking deployments:

Correlation ID propagation. Every event carries a UUID from ingestion; the linking engine stamps the sealed group with a stable correlation_id, giving end-to-end traceability across syslog, telemetry collectors, and ITSM dispatchers.
Window state inspection. Expose a read-only HTTP/gRPC endpoint that dumps active sliding windows, candidate counts, and eviction rates so engineers can confirm whether TTLs are too aggressive or too permissive.
Rule-trace logging. Emit structured JSON for every predicate evaluation: log rule_id, match_status, candidate_count, distinct_sources, and topology_validation_result. This removes guesswork during false-positive investigations.
Backpressure and queue monitoring. Track asyncio.Queue depth and consumer lag. If ingest exceeds correlation throughput, rising depth and lag should raise an alert and engage the queue's backpressure before correlation latency violates the SLA.
Metric aggregation. Track events_processed, groups_sealed, false_positive_rejected, avg_correlation_latency_ms, p99_correlation_latency_ms, and severity_escalation_rate. Alert when p99 latency exceeds 200 ms, which typically signals topology cache misses or rule-engine contention.
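
The rule-trace item in the checklist above could be implemented along these lines. This is a sketch assuming one JSON object per predicate evaluation written through the standard `logging` module; the `rule_trace` helper, the logger name, and the `"match"`/`"no_match"` status strings are hypothetical, while the field names follow the checklist.

```python
import json
import logging
from typing import Optional

trace_logger = logging.getLogger("correlation_engine.rule_trace")


def rule_trace(
    rule_id: str,
    matched: bool,
    candidate_count: int,
    distinct_sources: int,
    topology_ok: Optional[bool],
) -> str:
    """Emit one structured JSON line per predicate evaluation and return it."""
    record = {
        "rule_id": rule_id,
        "match_status": "match" if matched else "no_match",
        "candidate_count": candidate_count,
        "distinct_sources": distinct_sources,
        # None means topology validation did not run for this evaluation.
        "topology_validation_result": topology_ok,
    }
    line = json.dumps(record, sort_keys=True)
    trace_logger.info(line)
    return line


line = rule_trace("multi_source_join", True, 3, 2, None)
```

Sorting the keys keeps successive trace lines byte-comparable, which helps when diffing a replayed window against the original run.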

Failure Modes and Mitigation

The linking stage degrades gracefully rather than dropping events silently:

Dead-letter isolation. Events that fail Pydantic validation or reference an unknown element are routed to a dead-letter queue with the rejection reason attached, never discarded. The DLQ is replayable once the upstream schema or inventory gap is fixed.
Circuit breaker on topology lookups. If the inventory graph is unreachable or its p99 lookup latency climbs past the breaker threshold, the engine trips into temporal-only grouping and stamps each affected group with an sla_degraded flag, so the NOC knows bindings are unverified and reviews them manually.
Fallback routing. When the breaker is open, groups still seal and route — to a generic transport queue rather than a topology-derived team — preserving MTTA at the cost of routing precision. This is a deliberate trade: a slightly misrouted ticket beats a dropped incident.
Bounded eviction under storm. The max-group-size cap and per-element windows ensure a single noisy chassis cannot exhaust memory or starve other elements; excess events evict by TTL and surface in eviction-rate metrics.
Exactly-once sealing. A correlation_id derived from element plus window epoch makes re-processing idempotent, so a worker restart or active-active failover never seals the same incident twice.
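
The topology circuit breaker and its degraded fallback might be sketched as below. The sketch assumes a simple consecutive-failure counter with a fixed cool-down; the `TopologyBreaker` class, the `seal` helper, the thresholds, and the injected `clock` are illustrative, and a production breaker would also need to track lookup latency as described above.

```python
import time
from typing import Callable, Dict, Optional


class TopologyBreaker:
    """Opens after consecutive lookup failures; closes after a cool-down."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after:
            # Cool-down elapsed: close and let the next lookup probe the graph.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record(self, success: bool) -> None:
        if success:
            self._failures = 0
            return
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


def seal(breaker: TopologyBreaker, lookup: Callable[[], bool]) -> Dict[str, bool]:
    """Seal decision; temporal-only and flagged when topology is unavailable."""
    if breaker.is_open:
        return {"sealed": True, "sla_degraded": True}
    try:
        adjacent = lookup()
    except (ConnectionError, TimeoutError):
        breaker.record(False)
        return {"sealed": True, "sla_degraded": True}
    breaker.record(True)
    return {"sealed": adjacent, "sla_degraded": False}
```

Injecting the clock keeps the breaker testable without sleeping, and using a monotonic clock for the cool-down avoids spurious resets when the wall clock is adjusted.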

When a group is finally sealed and severity-calibrated, the engine routes a single consolidated payload — for example, a BGP peer flap coinciding with a physical interface error becomes one ticket for the IP/MPLS team instead of duplicate routing and transport alarms. The protocol-specific binding strategy for that exact pattern is worked through in Correlating BGP Flaps with Interface Down Events. By holding strict temporal, spatial, and severity boundaries, the linking layer also gives downstream ML root-cause analysis clean, causally valid incident graphs rather than noisy, artificially aggregated telemetry.

Frequently Asked Questions

What is the difference between cross-source event linking and topology-aware correlation? Linking decides which events belong in the same incident group using temporal windows, multi-source join rules, and a topology adjacency check. Topology-aware correlation owns the deeper graph traversal — suppressing downstream alarms and ranking root cause across a dependency tree. Linking calls topology validation as a gate; it does not replace it.

Why key the sliding window per network element instead of one global window? A single global window forces unrelated devices to share a buffer, which is the largest source of cross-device false positives. Keying per device_fqdn means only events from the same element can bind, and topology validation then confirms the binding is physically real.

How should window TTLs be chosen for mixed IGP and BGP networks? Set TTLs by protocol convergence. Fast IGP domains (OSPF/IS-IS reconverge in 10–30 s) suit 60–90 s windows; BGP cores where withdrawal can exceed 120 s need 180–300 s. A TTL that is too short splits one fault into several tickets; too long inflates member counts and latency.

What happens to events when the topology graph is unavailable? A circuit breaker trips the engine into temporal-only grouping, seals groups with an sla_degraded flag, and routes them to a generic transport queue. Nothing is dropped — MTTA is preserved at the cost of routing precision until the inventory source recovers.

Related

Up to the parent reference: Fault Correlation & Rule Engines — the end-to-end correlation pipeline this stage sits inside
Topology-Aware Correlation — validate chassis and routing adjacency before sealing a group
Severity Scoring Algorithms — weighted decay and SLA-aligned severity calibration
Threshold Tuning Methods — keep window TTLs and tolerances calibrated as the network drifts
Correlating BGP Flaps with Interface Down Events — the canonical two-signal binding pattern, end to end
