Topology-Aware Correlation: Deterministic Fault Isolation and SLA-Driven Routing

Topology-aware correlation is the graph-anchored decision stage inside the Fault Correlation & Rule Engines pipeline. In a carrier network, raw alarm floods rarely reflect true service health: a single core transport fault can light up hundreds of downstream access alarms within seconds. This stage maps every event onto the physical and logical network graph so that fault isolation becomes deterministic rather than heuristic. Its operational intent is precise — suppress downstream noise, preserve upstream impact signals, and route a ticket only when a verified topological dependency confirms a service-affecting condition. For NOC engineers and platform teams, this layer is where alarm volume stops driving ticket volume.

Operational Intent and Boundaries

What enters this stage is a stream of already-linked candidate events. Upstream, Cross-Source Event Linking has bound syslog, SNMP, and streaming telemetry into temporally aligned groups carrying canonical identifiers — device_fqdn, interface_idx, severity, and an epoch timestamp. The topology layer assumes that normalization is complete; it does not re-parse traps or syslog lines, and it does not open tickets. It answers one question: given the live dependency graph, which event is the root cause, and which events are merely its consequences?

What exits is a single resolved incident: one root-cause network element, a set of suppressed downstream members, and a composite SLA impact payload handed to severity calibration and routing. Explicitly excluded are final severity weighting (deferred to Severity Scoring Algorithms), threshold adaptation (owned by Threshold Tuning Methods), and ITSM dispatch. Holding this narrow scope keeps every suppression decision auditable: an engineer can replay the traversal from the graph snapshot and the visited-node trace, which is what carrier change-control reviews ask for after a 02:00 outage.

Pipeline Architecture and Graph Traversal

The topology engine maintains a directed graph representation of the network aligned with the YANG network-topology model of RFC 8345. Vertices represent network elements — routers, switches, optical nodes — while edges encode physical links, logical tunnels, and service dependencies. When a linked group arrives, the engine binds each member to its element identifier and port-level address, then validates the elements against the live inventory snapshot drawn from the CMDB or network inventory API. From the fault origin it performs a breadth-first traversal upward through aggregation layers, identifying whether the condition isolates a single access node or cascades into a core transport segment. Downstream dependents of the resolved root are marked suppressed; the upstream impact flag propagates to the highest affected element so that ticket routing aligns with real service boundaries rather than arbitrary alarm classifications.

Diagram: topology-aware suppression of downstream alarms.
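As a rough illustration of the graph model described above, the sketch below builds a small directed graph and walks it in both directions. The element names (core-1, agg-1, acc-1 and so on), the attribute values, and the upstream_levels helper are hypothetical; the only assumption carried over from the text is that edges point from an upstream element to the element that depends on it.

```python
import networkx as nx

# Hypothetical three-tier slice. Edges point from the upstream element to the
# element that depends on it, so predecessors() walks up and descendants()
# walks down.
graph = nx.DiGraph()
graph.add_node("core-1", sla_tier="premium", noc_tier="noc_transport")
graph.add_edges_from([
    ("core-1", "agg-1"),
    ("core-1", "agg-2"),
    ("agg-1", "acc-1"),
    ("agg-1", "acc-2"),
    ("agg-2", "acc-3"),
])


def upstream_levels(g: nx.DiGraph, origin: str) -> list:
    """Return the ancestors of origin grouped by hop distance, nearest first."""
    levels, seen, frontier = [], {origin}, [origin]
    while frontier:
        parents = sorted({p for n in frontier for p in g.predecessors(n)} - seen)
        if parents:
            levels.append(parents)
        seen.update(parents)
        frontier = parents
    return levels


levels = upstream_levels(graph, "acc-1")      # aggregation layer, then core
suppressed = nx.descendants(graph, "agg-1")   # everything below agg-1
```

If agg-1 were resolved as the root, acc-1 and acc-2 would be the suppressed dependents, while acc-3 (behind agg-2) would stay eligible to raise its own incident.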

Production-Ready Python Implementation

The pipeline must handle incremental topology updates, cache invalidation, and concurrent event ingestion without introducing blocking latency. The pattern below pairs a networkx directed graph with an asyncio-driven correlator. It traverses upstream to find the highest-impact ancestor, then suppresses that root’s descendants so a transport fault never spawns a ticket storm. The detailed dependency-tree construction this relies on is worked through in Building Graph-Based Fault Trees in Python.

import asyncio
import logging
from typing import Dict, List, Optional, Set
from dataclasses import dataclass
from datetime import datetime
import networkx as nx

logger = logging.getLogger("topology.correlator")


@dataclass(frozen=True)
class NetworkEvent:
    event_id: str
    source_ne: str          # canonical network element id (matches graph nodes)
    port: str
    severity: str
    timestamp: datetime
    raw_payload: dict


class TopologyCorrelator:
    def __init__(self, graph: nx.DiGraph) -> None:
        self.graph = graph
        # Suppressed dependents and active roots are keyed by element id so the
        # next event on a suppressed node is dropped before any traversal runs.
        self.suppressed_nodes: Set[str] = set()
        self.active_incidents: Dict[str, List[NetworkEvent]] = {}
        # A single lock serializes graph reads against CMDB sync writes, so a
        # traversal never observes a half-applied topology mutation.
        self._graph_lock = asyncio.Lock()

    async def process_event(self, event: NetworkEvent) -> Optional[Dict]:
        """Ingest event, traverse topology, return routing payload if actionable."""
        if event.source_ne in self.suppressed_nodes:
            logger.debug("event=%s suppressed_by=topology", event.event_id)
            return None

        async with self._graph_lock:
            # Validate node existence in the current topology snapshot.
            if event.source_ne not in self.graph:
                logger.warning("event=%s unknown_ne=%s action=queue_for_sync",
                               event.event_id, event.source_ne)
                return None

            root_cause_node = self._traverse_upstream(event.source_ne)
            if root_cause_node is None:
                return None

            sla_payload = self._calculate_sla_impact(root_cause_node, event)
            self._apply_suppression(event, root_cause_node)
        return sla_payload

    def _traverse_upstream(self, node: str) -> Optional[str]:
        """BFS up the dependency edges; return the first ancestor reached that
        owns an open incident or has no predecessors (a root / edge element)."""
        visited: Set[str] = set()
        queue = [node]
        while queue:
            current = queue.pop(0)
            if current in visited:
                continue
            visited.add(current)

            # If an ancestor already owns an open incident, fold into it.
            if current in self.active_incidents:
                return current

            predecessors = list(self.graph.predecessors(current))
            if not predecessors:
                return current          # reached a root / edge element
            queue.extend(predecessors)
        return node

    def _calculate_sla_impact(self, node: str, event: NetworkEvent) -> Dict:
        """Emit a structured skeleton the severity-scoring layer populates."""
        return {
            "root_cause_ne": node,
            "trigger_event": event.event_id,
            "composite_severity": event.severity,
            "sla_tier": self.graph.nodes[node].get("sla_tier", "standard"),
            "routing_queue": self.graph.nodes[node].get("noc_tier", "noc_general"),
            "suppressed_downstream": len(nx.descendants(self.graph, node)),
        }

    def _apply_suppression(self, event: NetworkEvent, root_node: str) -> None:
        """Mark downstream dependents suppressed to prevent ticket storms."""
        dependents = nx.descendants(self.graph, root_node)
        self.suppressed_nodes.update(dependents)
        self.active_incidents.setdefault(root_node, []).append(event)

The correlator consumes events from an asyncio.Queue fed by the linking tier, so topology traversal never blocks collectors during a storm. When sustained ingest exceeds traversal throughput, that queue is the natural backpressure signal handled by Async Batch Processing, letting this stage drain in bounded micro-batches rather than collapsing under head-of-line blocking.
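One way to picture the bounded micro-batch drain is the sketch below. It is not the Async Batch Processing implementation; drain_in_batches, the None shutdown sentinel, and the max_batch default are assumptions made for illustration, and the handler stands in for a coroutine such as process_event.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def drain_in_batches(
    queue: asyncio.Queue,
    handler: Callable[[Any], Awaitable[Any]],
    max_batch: int = 64,
) -> List[int]:
    """Drain the queue in batches of at most max_batch; return the batch sizes.

    A None item is treated as a shutdown sentinel (an assumption of this
    sketch) and is expected to be the last item enqueued.
    """
    batch_sizes: List[int] = []
    stop = False
    while not stop:
        batch = [await queue.get()]            # wait for at least one item
        while len(batch) < max_batch:
            try:
                batch.append(queue.get_nowait())
            except asyncio.QueueEmpty:
                break                          # take only what is already queued
        if None in batch:
            stop = True
            batch = batch[:batch.index(None)]
        for event in batch:
            await handler(event)
        if batch:
            batch_sizes.append(len(batch))
    return batch_sizes


async def _demo():
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)  # bounded: full queue = backpressure
    seen = []

    async def handler(event):
        seen.append(event)

    for i in range(10):
        queue.put_nowait(f"evt-{i}")
    queue.put_nowait(None)
    sizes = await drain_in_batches(queue, handler, max_batch=4)
    return seen, sizes


seen, sizes = asyncio.run(_demo())
```

Because the queue is bounded, a producer that awaits queue.put() slows down once the correlator falls behind, which is the backpressure signal the paragraph above refers to.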

Topology Validation and False-Positive Suppression

Temporal grouping alone produces high false-positive rates in dense carrier networks; spatial validation is what makes a suppression decision safe. Before sealing an incident, the engine confirms that the candidate members share a real dependency path (common parent nodes, shared fiber spans, or overlapping BGP peer groups) rather than merely arriving in the same window. When an optical transport alarm and an IP/MPLS routing event coincide, the traversal verifies that the affected optical port terminates on the same chassis as the impacted IP interface before grouping them. Validation reads a warmed adjacency cache (an in-memory matrix, or a graph store such as RedisGraph or Neo4j). The sub-millisecond hot path applies to the in-process case; a networked store adds a round trip to every lookup.

This spatial filtering is the single biggest lever on quality. In tier-1 environments it cuts downstream alarm volume by 85% or more during transport outages while holding the steady-state false-positive routing rate below roughly 2%. The cost of getting it wrong is asymmetric: a stale graph after a maintenance window can suppress a genuine fault, so adjacency cache invalidation must be driven by inventory-change events, not by a fixed timer alone.
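A minimal sketch of the spatial check, under a deliberately narrow assumption: two elements count as related only if one is an ancestor of the other or they share a direct parent. The function names and the sample topology are hypothetical, and fiber-span or BGP peer-group overlap is not modelled here.

```python
import networkx as nx


def spatially_related(graph: nx.DiGraph, a: str, b: str) -> bool:
    """True if a and b lie on one dependency path or share a direct parent."""
    if a not in graph or b not in graph:
        return False                      # unknown elements never validate
    if a == b:
        return True
    if a in nx.ancestors(graph, b) or b in nx.ancestors(graph, a):
        return True
    return bool(set(graph.predecessors(a)) & set(graph.predecessors(b)))


def validated_members(graph: nx.DiGraph, anchor: str, candidates: list) -> list:
    """Keep only the candidates that are spatially related to the anchor."""
    return [c for c in candidates if spatially_related(graph, anchor, c)]


# Hypothetical topology: edges point from upstream to dependent.
topo = nx.DiGraph([
    ("core-1", "agg-1"), ("core-1", "agg-2"),
    ("agg-1", "acc-1"), ("agg-1", "acc-2"),
    ("agg-2", "acc-3"),
])

group = validated_members(topo, "agg-1", ["acc-1", "acc-3", "acc-2", "ghost"])
```

Requiring a direct shared parent rather than any common ancestor is a design choice of this sketch: in a single-rooted network every pair of elements shares the core, so a looser test would validate everything.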

Diagram: upstream BFS from the alarm origin resolves the root and shades the suppressed subtree.

Configuration and Tuning Parameters

The topology layer exposes a small set of parameters whose defaults should be derived from the network’s own behaviour, never copied blindly:

Graph snapshot refresh interval — default 300 s. Bounds how stale the working graph can be between full CMDB syncs. Pair it with event-driven invalidation on inventory mutations; relying on the timer alone is the leading cause of wrong bindings after a maintenance change.
Topology cache TTL — default 30 s. Adjacency lookups read this warmed cache on the hot path. Too long risks suppressing real faults across a freshly changed link; too short drives cache-miss latency into the traversal budget.
Max traversal depth — default 12 hops. A hard ceiling on upstream BFS so a malformed or cyclic graph cannot stall a worker. Ring and mesh topologies should raise this only after confirming the graph is acyclic at the dependency layer.
Suppression TTL — default 600 s. How long a downstream element stays suppressed after the root seals. Too long masks a second, independent fault on the same subtree; too short re-admits the original alarm storm.
Max incident fan-out — default 500 descendants. Bounds the suppression set a single root may claim; an incident exceeding it almost always signals an over-broad edge in the graph rather than a genuine mega-outage, and is flagged for review.
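The parameters above could be carried in a small immutable settings object, as sketched below with the stated defaults. TopologyTuning and SuppressionTable are hypothetical names, and the injectable clock is an assumption added to keep the TTL logic easy to exercise; the sketch shows only how the suppression TTL and fan-out cap might be enforced.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class TopologyTuning:
    snapshot_refresh_s: int = 300
    cache_ttl_s: int = 30
    max_traversal_depth: int = 12
    suppression_ttl_s: int = 600
    max_incident_fanout: int = 500

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")


class SuppressionTable:
    """Suppressed element ids with per-entry expiry."""

    def __init__(self, tuning: TopologyTuning,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._tuning = tuning
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def suppress(self, dependents: Iterable[str]) -> bool:
        """Suppress dependents; return False (suppressing nothing) when the
        set exceeds the fan-out cap so the caller can flag it for review."""
        dependents = set(dependents)
        if len(dependents) > self._tuning.max_incident_fanout:
            return False
        deadline = self._clock() + self._tuning.suppression_ttl_s
        for ne in dependents:
            self._expires_at[ne] = deadline
        return True

    def is_suppressed(self, ne: str) -> bool:
        deadline = self._expires_at.get(ne)
        if deadline is None:
            return False
        if self._clock() >= deadline:
            del self._expires_at[ne]      # lazy eviction once the TTL lapses
            return False
        return True
```

Evicting lazily on lookup keeps the hot path free of a background sweeper, at the cost of expired entries lingering until the element is next seen.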

Debugging Workflow and Observability

Topology correlation failures typically stem from stale inventory data, misaligned adjacency definitions, or race conditions during concurrent ingestion. The following checklist is standard for carrier-grade deployments:

Trace correlation misses. Inject a correlation_trace_id into every event at ingestion and log the traversal path, visited nodes, and suppression decision as structured JSON. Query traces where suppressed=false and ticket_created=false to surface graph traversal dead-ends.
Validate adjacency consistency. Run a nightly diff between the live graph snapshot and the authoritative CMDB export. Flag any node where graph.predecessors(node) returns an empty set despite a known aggregation uplink — a missing edge silently breaks upstream propagation.
Audit suppression leaks. Track suppressed_nodes set size against ingest rate. A sudden spike means over-aggressive upstream suppression; a flatline means traversal is failing to mark dependents. A TTL-based eviction policy prevents unbounded memory growth.
Concurrency stress testing. Drive bursts of 10k+ events/sec through asyncio.gather() and confirm graph mutations are serialized by the lock so no traversal observes a partial CMDB sync.
Metric aggregation. Emit events_processed, roots_resolved, downstream_suppressed, unknown_ne_rate, avg_traversal_latency_ms, and p99_traversal_latency_ms. Alert when p99 traversal latency exceeds 50 ms, which typically signals topology cache misses or lock contention during a sync window.
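The adjacency consistency check from the list above might look like the following sketch. It assumes both the live graph and the CMDB export can be reduced to (upstream, downstream) edge pairs; diff_adjacency and the result keys are hypothetical names. With networkx, the live side would come from graph.edges().

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (upstream element, dependent element)


def diff_adjacency(live_edges: Iterable[Edge],
                   cmdb_edges: Iterable[Edge]) -> Dict[str, List]:
    """Compare the live graph against the authoritative CMDB export."""
    live, cmdb = set(live_edges), set(cmdb_edges)
    missing = cmdb - live      # in the CMDB but absent from the live graph
    stale = live - cmdb        # in the live graph but no longer in the CMDB
    has_live_uplink = {child for _, child in live}
    # Elements the CMDB says have an uplink, yet the live graph gives them
    # no predecessor at all: upstream traversal would stop at them.
    orphaned = sorted({child for _, child in missing
                       if child not in has_live_uplink})
    return {
        "missing_in_graph": sorted(missing),
        "stale_in_graph": sorted(stale),
        "orphaned_uplinks": orphaned,
    }


report = diff_adjacency(
    live_edges=[("core-1", "agg-1"), ("agg-1", "acc-1"), ("agg-9", "acc-9")],
    cmdb_edges=[("core-1", "agg-1"), ("agg-1", "acc-1"), ("agg-1", "acc-2")],
)
```

The orphaned_uplinks list is the one to alert on first, since those are the nodes a traversal would wrongly treat as roots.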

Failure Modes and Mitigation

The topology stage degrades gracefully rather than dropping events silently:

Dead-letter isolation. Events that reference an unknown network element are routed to a dead-letter queue with the rejection reason attached and replayed once the CMDB sync closes the inventory gap — never discarded.
Circuit breaker on inventory lookups. If the graph source is unreachable or its p99 lookup latency climbs past the breaker threshold, the engine trips into pass-through mode: it stops suppressing, stamps each incident topology_degraded, and routes on temporal grouping alone so nothing is masked while bindings are unverified.
Fallback routing. Under an open breaker, incidents still seal and route — to a generic transport queue rather than a topology-derived team — preserving MTTA at the cost of routing precision. A slightly misrouted ticket beats a dropped incident.
Bounded suppression under storm. The max-fan-out cap and suppression TTL ensure a single noisy root cannot suppress an unbounded subtree or starve memory; excess descendants surface in suppression-leak metrics rather than silently expanding.
Exactly-once sealing. A correlation_id derived from the root element plus the window epoch makes re-processing idempotent, so a worker restart or active-active failover never opens the same incident twice.
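Two of the mitigations above lend themselves to a short sketch: the pass-through breaker and the idempotent correlation_id. Everything named here is an assumption for illustration: InventoryBreaker counts consecutive lookup failures only (the latency trigger is omitted), the transport_generic queue name is invented, and the 300 s window length is a placeholder rather than a documented default.

```python
import hashlib
from typing import Dict


class InventoryBreaker:
    """Opens after consecutive lookup failures; closes on the next success."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self._threshold = failure_threshold
        self._failures = 0
        self._open = False

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._open = True

    def record_success(self) -> None:
        self._failures = 0
        self._open = False

    @property
    def is_open(self) -> bool:
        return self._open


def route(incident: Dict, breaker: InventoryBreaker) -> Dict:
    """Pass-through when the breaker is open: no suppression, generic queue."""
    if breaker.is_open:
        return {**incident,
                "topology_degraded": True,
                "suppress_downstream": False,
                "routing_queue": "transport_generic"}  # hypothetical queue name
    return {**incident, "topology_degraded": False, "suppress_downstream": True}


def correlation_id(root_ne: str, event_epoch: float, window_s: int = 300) -> str:
    """Derive a stable id from the root element and the window start epoch."""
    window_start = int(event_epoch) // window_s * window_s
    digest = hashlib.sha256(f"{root_ne}|{window_start}".encode()).hexdigest()
    return digest[:16]
```

Because the id depends only on the root element and the window boundary, a restarted worker that re-processes the same events computes the same key and can skip an incident it has already sealed.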

When an incident is finally resolved and impact-scored, the engine routes a single consolidated payload — for example, a core transport fault that lit up forty access alarms becomes one ticket for the transport team instead of forty. The deterministic suppression here also gives downstream ML root-cause ranking clean, causally valid dependency graphs, letting it surface latent single points of failure in ring topologies before they breach an SLA. The protocol-specific binding that feeds this stage is detailed in Correlating BGP Flaps with Interface Down Events.

Frequently Asked Questions

What is the difference between topology-aware correlation and cross-source event linking?

Linking decides which events belong in the same temporal group using sliding windows and multi-source join rules. Topology-aware correlation takes that group and traverses the live dependency graph to choose the root cause, suppress its downstream consequences, and propagate the upstream impact flag. Linking calls a lightweight adjacency check as a gate; this stage owns the deep traversal.

How does the engine decide which alarm is the root cause?

It performs a breadth-first traversal upward along dependency edges from each member element. The first ancestor reached that either already owns an open incident or has no predecessors (an edge or core root) is selected as the root cause; every descendant of that root is then marked suppressed.

What happens when the topology graph is stale or unavailable?

A circuit breaker trips the engine into pass-through mode. It stops suppressing, stamps incidents topology_degraded, and routes on temporal grouping to a generic queue. Nothing is masked or dropped, so MTTA is preserved at the cost of routing precision until the inventory source recovers.

How much alarm noise does topology suppression actually remove?

In tier-1 transport outages, anchoring suppression to verified dependencies removes 85% or more of downstream alarms while holding the steady-state false-positive routing rate below roughly 2%, which is what keeps the MTTA budget inside SLA acknowledgment clauses.

Related

Up to the parent reference: Fault Correlation & Rule Engines, the end-to-end correlation pipeline this stage sits inside
Cross-Source Event Linking: binds heterogeneous telemetry into the groups this stage resolves
Severity Scoring Algorithms: weighted decay and SLA-aligned calibration applied after the root is found
Threshold Tuning Methods: keeps adjacency tolerances and suppression TTLs calibrated as the network drifts
Building Graph-Based Fault Trees in Python: constructs the dependency graph this traversal runs over
