Why must a fault tree be a directed acyclic graph rather than a general graph?

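Backward traversal walks every ancestor path from a symptom to its source. If the dependency graph contains a cycle, which routing reconvergence can introduce, a naive path walk can loop forever, and even a visited-set traversal such as nx.ancestors loses a well-defined root because every node on the cycle becomes an ancestor of every other. Enforcing the DAG invariant at edge insertion, by rejecting any edge whose target can already reach its source (a reverse-edge check alone catches only two-node loops), keeps traversal bounded and root-cause isolation deterministic.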

How is the confidence score for a root-cause candidate calculated?

Confidence is the product of the edge propagation_prob values along the shortest path from the candidate to the triggered symptom node. If the candidate's own alarm timestamp falls outside the correlation window, the score is multiplied by a temporal-decay factor, for example 0.5, so a stale upstream node cannot hijack a fresh storm. Candidates above a configurable floor such as 0.75 are forwarded to ITSM.

How do you keep graph traversal from blocking the async ingestion pipeline?

Graph mutation is cheap and runs inline, but the CPU-bound traversal is offloaded with loop.run_in_executor so it runs on a thread pool instead of the event loop. Each batch is also wrapped in asyncio.wait_for with a 2-second timeout, and traversal depth is capped at max_hops=6, so a dense reconvergence storm cannot stall the coroutines draining ingest.

What stops topology drift from producing wrong root-cause bindings?

Scheduled reconciliation runs the graph against a source-of-truth inventory such as NetBox or a CMDB, prunes orphaned nodes, and validates edge directionality before sync. Any binding made against an edge that cannot be verified is stamped topology_unverified and routed to manual NOC review rather than dropped, so precision degrades gracefully instead of emitting false incidents.

Building Graph-Based Fault Trees in Python

Cascading alarm storms across multi-domain telecom infrastructures routinely inflate mean time to resolution (MTTR) by burying one primary failure beneath hundreds of derivative symptoms. When a core fiber cut triggers optical-layer degradation, IP routing reconvergence, and downstream BGP session flaps, flat threshold-based alerting opens a ticket per symptom: a single fiber event can surface 80–150 distinct alarms inside 30 seconds, and the NOC spends 10–15 minutes of its mean time to acknowledge (MTTA) just deciding which alarm is the cause. A graph-based fault tree replaces that manual deduction with deterministic isolation by modeling network dependencies as a directed acyclic graph (DAG) and walking it backward from the symptoms to the source. This page builds that engine in Python end to end, with a networkx schema, an async ingestion hook, and the confidence-scored traversal that collapses the storm into one routable incident.

Schema Alignment and Taxonomy Anchor

This pattern is the implementation core of Topology-Aware Correlation, the graph-anchored decision stage inside the Fault Correlation & Rule Engines pipeline. It assumes the events arriving at the fault tree are already normalized: syslog, SNMP traps, and streaming telemetry have been bound into temporally aligned groups by Cross-Source Event Linking, and every payload carries the canonical identifiers defined in Event Schema Design — device_fqdn, interface_idx, severity, and an epoch timestamp. The fault tree never re-parses raw protocol data; it consumes typed events (validate them with the same strict-mode approach shown in Validating NetFlow Events with Pydantic) and answers exactly one question: given the live dependency graph, which node is the root cause and which nodes are merely its consequences?

Graph Schema and Dependency Mapping

A production-grade fault tree requires a strict node-edge schema that mirrors physical and logical topology. Nodes represent discrete network elements or interfaces, while directed edges encode failure propagation paths. Each node carries immutable identifiers (node_id, layer, vendor), mutable state attributes (alarm_severity, last_seen_ts, health_score), and propagation metadata (expected_latency_ms, suppression_weight). Edges define causal relationships: an optical port failure propagates to attached router interfaces, which propagate to BGP peers, which propagate to downstream service endpoints.

The graph must be a DAG to guarantee termination during traversal. Telecom routing protocols occasionally introduce logical loops during reconvergence, so the implementation enforces acyclicity through protocol-aware pruning at insertion time. When ingesting topology data, map physical adjacency to upstream edges and logical dependency to downstream edges. Assign edge weights proportional to propagation probability: a DWDM amplifier failure carries a higher propagation weight to its attached transponders than a transient interface CRC error carries to a routing adjacency. Anchoring the graph to actual service paths rather than a static inventory snapshot is what keeps suppression decisions auditable when change-control reviews replay them at 02:00.
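
The acyclicity check described above can be sketched as a reachability test. This is a minimal illustration, not the engine's own guard: the helper name would_create_cycle and the node names are hypothetical, and the only library calls used are nx.has_path and DiGraph.has_edge.

```python
import networkx as nx


def would_create_cycle(graph: nx.DiGraph, source: str, target: str) -> bool:
    """Return True if adding source -> target would close a cycle."""
    if source == target:
        return True  # a self-loop is the shortest possible cycle
    # A cycle appears exactly when target can already reach source.
    if source in graph and target in graph:
        return nx.has_path(graph, target, source)
    return False  # at least one endpoint is new, so no return path exists yet


g = nx.DiGraph()
g.add_edge("amp-1", "rtr-1")
g.add_edge("rtr-1", "bgp-1")

# Candidate edge bgp-1 -> amp-1. A reverse-edge test looks only for the direct
# edge amp-1 -> bgp-1, which does not exist, so it would accept the candidate.
reverse_edge_only = g.has_edge("amp-1", "bgp-1")
closes_cycle = would_create_cycle(g, "bgp-1", "amp-1")
```

Here reverse_edge_only is False while closes_cycle is True, which is the three-node loop a direct reverse-edge comparison lets through. The reachability test costs one graph search per insertion, which is usually acceptable because topology edges change far less often than alarms arrive.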

Diagram: a fault-propagation DAG across network layers, walked backward from the symptom to isolate the root cause.

Production Code: DAG Initialization and Event Ingestion

The implementation uses networkx for graph operations and dataclasses for schema enforcement, and it checks the DAG constraint on every edge insertion so a reconvergence loop can never put a cycle into the graph. Heterogeneous alarm payloads are normalized into FaultNode state before mutation, and repeat alarms on a known node escalate severity rather than duplicating the node.

import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Any
import networkx as nx

logger = logging.getLogger("fault_tree_engine")


@dataclass
class FaultNode:
    node_id: str
    layer: str  # "optical", "ip_transport", "bgp", "service"
    vendor: str
    alarm_severity: float = 0.0
    last_seen_ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    health_score: float = 1.0
    suppression_weight: float = 1.0


@dataclass
class FaultEdge:
    source_id: str
    target_id: str
    propagation_prob: float  # 0.0 - 1.0
    expected_latency_ms: int
    causal_type: str  # "physical", "logical", "protocol"


class GraphFaultTree:
    def __init__(self, correlation_window_ms: int = 300_000):
        self.graph = nx.DiGraph()
        self.graph.graph["correlation_window_ms"] = correlation_window_ms
        self.graph.graph["strict_dag"] = True
        self._node_registry: Dict[str, FaultNode] = {}
        self._edge_registry: Dict[tuple, FaultEdge] = {}

    def add_node(self, node: FaultNode) -> None:
        if node.node_id in self._node_registry:
            self._update_node_state(node)
            return
        self._node_registry[node.node_id] = node
        self.graph.add_node(
            node.node_id, layer=node.layer, vendor=node.vendor,
            health_score=node.health_score, severity=node.alarm_severity,
        )

    def add_edge(self, edge: FaultEdge) -> None:
        key = (edge.source_id, edge.target_id)
        if key in self._edge_registry:
            return
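        # Enforce the DAG constraint before insertion: reject self-loops and any
        # edge whose target can already reach its source (that would close a cycle).
        # A reverse-edge test alone would miss loops longer than two nodes.
        if edge.source_id == edge.target_id or (
            edge.source_id in self.graph
            and edge.target_id in self.graph
            and nx.has_path(self.graph, edge.target_id, edge.source_id)
        ):
            logger.warning("Cyclic dependency rejected: %s -> %s",
                           edge.source_id, edge.target_id)
            return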
        self.graph.add_edge(
            edge.source_id, edge.target_id,
            propagation_prob=edge.propagation_prob,
            expected_latency_ms=edge.expected_latency_ms,
            causal_type=edge.causal_type,
        )
        self._edge_registry[key] = edge

    def _update_node_state(self, node: FaultNode) -> None:
        existing = self._node_registry[node.node_id]
        existing.alarm_severity = max(existing.alarm_severity, node.alarm_severity)
        existing.last_seen_ts = node.last_seen_ts
        self.graph.nodes[node.node_id]["severity"] = existing.alarm_severity
        # Keep the registry and the graph attribute in sync
        existing.health_score = node.health_score
        self.graph.nodes[node.node_id]["health_score"] = existing.health_score

Traversal, Severity Scoring, and the Async Ingestion Hook

Once alarms are ingested, the engine performs time-windowed backward traversal to find the highest-probability root cause. The confidence of each candidate is the product of the edge propagation probabilities along the shortest path from candidate to symptom, penalized by a temporal-decay factor when the candidate’s own alarm falls outside the correlation window. This keeps a stale upstream node from hijacking a fresh storm.

# Method of GraphFaultTree: indent this definition one level inside the class body.
def correlate_root_cause(self, triggered_nodes: List[str]) -> Dict[str, Any]:
    """Backward DAG traversal within the correlation window.
    Returns ranked root-cause candidates with confidence scores."""
    candidates: Dict[str, float] = {}
    window_ms = self.graph.graph["correlation_window_ms"]
    now = datetime.now(timezone.utc)

    for start_node in triggered_nodes:
        if start_node not in self.graph:
            continue
        for ancestor in nx.ancestors(self.graph, start_node):
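            # Unweighted shortest_path returns the fewest-hop path, which is not
            # necessarily the highest-probability one.
            path = nx.shortest_path(self.graph, ancestor, start_node)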
            confidence = 1.0
            for i in range(len(path) - 1):
                edge_data = self.graph[path[i]][path[i + 1]]
                confidence *= edge_data["propagation_prob"]
            # Temporal decay: penalize candidates alarmed outside the window
            anc = self._node_registry.get(ancestor)
            if anc and (now - anc.last_seen_ts).total_seconds() * 1000 > window_ms:
                confidence *= 0.5
            candidates[ancestor] = max(candidates.get(ancestor, 0.0), confidence)

    ranked = sorted(candidates.items(), key=lambda x: x[1], reverse=True)
    return {"root_causes": ranked, "evaluated_at": now.isoformat()}
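
As a worked example of the scoring rule above, the sketch below multiplies propagation_prob along a path on a three-node graph. It also shows a variation that is not part of the original engine: weighting edges by -log(p) so that Dijkstra returns the most probable path instead of the fewest-hop one. The helper names path_confidence and neg_log_prob, the node names, and the probabilities are all illustrative assumptions.

```python
import math

import networkx as nx


def path_confidence(graph: nx.DiGraph, path: list) -> float:
    """Multiply propagation_prob along the consecutive edges of a path."""
    confidence = 1.0
    for u, v in zip(path, path[1:]):
        confidence *= graph[u][v]["propagation_prob"]
    return confidence


def neg_log_prob(u, v, data):
    """Dijkstra edge weight: -log(p). Returning None hides edges with p <= 0."""
    p = data["propagation_prob"]
    return -math.log(p) if p > 0 else None


g = nx.DiGraph()
g.add_edge("amp-1", "rtr-1", propagation_prob=0.9)
g.add_edge("rtr-1", "svc-1", propagation_prob=0.9)
g.add_edge("amp-1", "svc-1", propagation_prob=0.5)  # weak direct dependency

fewest_hops = nx.shortest_path(g, "amp-1", "svc-1")
most_probable = nx.shortest_path(g, "amp-1", "svc-1", weight=neg_log_prob)

hop_conf = path_confidence(g, fewest_hops)     # 0.5 via the direct edge
prob_conf = path_confidence(g, most_probable)  # 0.9 * 0.9 = 0.81 via rtr-1
```

With a 0.75 floor, the fewest-hop score (0.5) would hold this candidate back while the most-probable-path score (0.81) would forward it, so the choice of path metric is worth making deliberately. Minimizing the sum of -log(p) is equivalent to maximizing the product of p, which is why an ordinary shortest-path search can be reused.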

Graph traversal is CPU-bound and synchronous, so it must never run on the event loop directly: a 50,000-node telecom graph can spend tens of milliseconds in nx.shortest_path, and blocking there stalls every other ingest coroutine. Offloading to the default thread pool keeps the loop responsive, although under CPython's GIL it does not make the traversal itself run in parallel. Wrap the correlation call in run_in_executor so the async pipeline that feeds this engine stays non-blocking, exactly as the ingest tier described in Implementing asyncio for High-Volume SNMP expects:

import asyncio


async def ingest_and_correlate(tree: GraphFaultTree, batch: list) -> Dict[str, Any]:
    """Async hook: mutate graph state inline, offload traversal to a thread."""
    triggered = []
    for event in batch:  # events are already schema-validated upstream
        tree.add_node(FaultNode(
            node_id=event["device_fqdn"],
            layer=event["layer"], vendor=event["vendor"],
            alarm_severity=event["severity"],
        ))
        triggered.append(event["device_fqdn"])

    loop = asyncio.get_running_loop()
    # Offload the CPU-bound DAG walk so the event loop keeps draining ingest
    result = await loop.run_in_executor(None, tree.correlate_root_cause, triggered)
    return result
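
The hook above places no upper bound on traversal depth or duration. A minimal sketch of both limits follows, assuming the max_hops=6 and 2-second values quoted elsewhere on this page; the helper names ancestors_within and correlate_with_timeout are hypothetical and are not part of GraphFaultTree.

```python
import asyncio
from typing import Any, Callable, Optional

import networkx as nx

MAX_HOPS = 6      # example value from the text; keep it in config in practice
TIMEOUT_S = 2.0   # example value from the text


def ancestors_within(graph: nx.DiGraph, node: Any, max_hops: int = MAX_HOPS) -> set:
    """Ancestors of node reachable in at most max_hops edges."""
    # Searching the reversed view turns "ancestors" into a depth-bounded BFS.
    reached = nx.single_source_shortest_path_length(
        graph.reverse(copy=False), node, cutoff=max_hops
    )
    return set(reached) - {node}


async def correlate_with_timeout(
    func: Callable[..., Any], *args: Any, timeout: float = TIMEOUT_S
) -> Optional[Any]:
    """Run a blocking callable in the default executor with a deadline."""
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(loop.run_in_executor(None, func, *args), timeout)
    except asyncio.TimeoutError:
        # The worker thread is not interrupted; only the await is abandoned.
        return None


chain = nx.path_graph(10, create_using=nx.DiGraph)  # 0 -> 1 -> ... -> 9
near = ancestors_within(chain, 9, max_hops=3)       # only the three closest nodes
```

Returning None on timeout is a placeholder: a caller would substitute whatever partial-result or dead-letter handling the pipeline uses. Because a timed-out thread keeps running until it finishes, the depth cap is what actually limits the work; the timeout only limits how long the coroutine waits.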

Only candidates above a configurable confidence floor (for example 0.75) are forwarded to ITSM; lower-confidence events route to a staging queue for enrichment rather than immediate ticket creation. The final SLA weighting of the winning candidate is deferred to Severity Scoring Algorithms, keeping the fault tree responsible only for isolation, not for severity policy.
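
The routing rule in the previous paragraph can be sketched as a simple partition over the ranked list. The function name route_candidates, the queue labels, and the inclusive comparison at the floor are assumptions made for illustration; 0.75 is the example value from the text.

```python
from typing import Dict, List, Tuple

CONFIDENCE_FLOOR = 0.75  # example value from the text; load from config in practice

Candidate = Tuple[str, float]


def route_candidates(
    ranked: List[Candidate], floor: float = CONFIDENCE_FLOOR
) -> Dict[str, List[Candidate]]:
    """Split ranked (node_id, confidence) pairs into ITSM and staging queues."""
    routed: Dict[str, List[Candidate]] = {"itsm": [], "staging": []}
    for node_id, confidence in ranked:
        # Assumption: a score exactly at the floor is forwarded.
        queue = "itsm" if confidence >= floor else "staging"
        routed[queue].append((node_id, confidence))
    return routed


routed = route_candidates([("rtr-1", 0.9), ("amp-1", 0.81), ("olt-7", 0.4)])
```

The input has the same shape as the root_causes list returned by correlate_root_cause, so the two can be chained directly.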

Mitigation and Hardening

Topology fault trees fail in predictable ways. Each path below is a concrete production failure mode and the guardrail that contains it:

Topology drift: the graph diverges from reality as ports are repatched. Run scheduled reconciliation against the source-of-truth inventory (NetBox, CMDB), prune orphaned nodes, and validate edge directionality before sync. Stamp any binding made against an unverified edge as topology_unverified so it routes to manual NOC review instead of being dropped.
Traversal timeouts: a dense reconvergence storm explodes the ancestor set. Cap depth at max_hops=6 for telecom domains and wrap each correlation batch in asyncio.wait_for(..., timeout=2.0); on timeout, emit the highest-confidence partial result and dead-letter the batch for replay.
Cyclic dependency injection: a mislearned protocol adjacency tries to create a loop. The add_edge guard must reject the edge at insertion whenever the target can already reach the source; checking only for a direct reverse edge misses loops of three or more nodes. With that reachability test in place, the DAG invariant holds and traversal always terminates.
State memory bloat: healthy nodes accumulate forever. Apply TTL-based eviction for nodes with health_score > 0.9 and no active alarm for 24h, and extract a networkx subgraph for the hot path so the working set stays cache-resident.
Ticket routing loops: duplicate ITSM payloads reopen a closed incident. Attach a monotonic correlation_id per fault-tree instance and reject any payload matching an active correlation_id inside the suppression window.
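
The topology-drift guardrail can be sketched as a reconciliation pass. This is an illustration under stated assumptions: inventory_nodes and inventory_edges stand in for whatever a NetBox or CMDB export provides (no real inventory API is called here), and the function name reconcile is hypothetical.

```python
import networkx as nx


def reconcile(graph: nx.DiGraph, inventory_nodes: set, inventory_edges: set) -> dict:
    """Prune nodes missing from inventory and flag edges it cannot confirm."""
    orphans = [n for n in graph.nodes if n not in inventory_nodes]
    graph.remove_nodes_from(orphans)  # also drops their incident edges

    unverified = []
    for u, v in list(graph.edges):
        # Direction matters: (u, v) must appear in inventory as written.
        verified = (u, v) in inventory_edges
        graph[u][v]["topology_unverified"] = not verified
        if not verified:
            unverified.append((u, v))
    return {"pruned": orphans, "unverified": unverified}


g = nx.DiGraph()
g.add_edge("amp-1", "rtr-1")
g.add_edge("rtr-1", "bgp-1")
g.add_edge("old-sw", "rtr-1")  # device no longer present in inventory

report = reconcile(
    g,
    inventory_nodes={"amp-1", "rtr-1", "bgp-1"},
    inventory_edges={("amp-1", "rtr-1")},
)
```

After the pass, old-sw is gone and the rtr-1 to bgp-1 edge carries topology_unverified=True, so a binding that crosses it can be routed to manual review rather than discarded.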

Operational Hardening Notes

Process alarm batches through a worker pool where each worker holds a thread-local GraphFaultTree, avoiding lock contention on the shared graph; merge state periodically into a central read-only topology cache rather than mutating it from every worker. Warm that cache at startup from inventory so the first storm of the shift does not pay a cold-graph penalty. Tune correlation_window_ms to the slowest expected propagation chain — 300s comfortably covers an optical-to-service cascade, but DC fabrics with sub-second BFD can drop to 60s and cut false bindings. Pre-compute and cache nx.ancestors for high-fan-out core nodes between topology syncs; on a 50k-node graph this turns a recurring p99 traversal of ~40ms into a low-single-digit lookup. Keep the confidence floor and max_hops in config, not code, so NOC teams can retune them against drift without a redeploy — the same discipline applied in Dynamic Threshold Adjustment for Latency Alerts.
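
The ancestor pre-computation mentioned above can be sketched as a small memoizing wrapper. The class name AncestorCache is hypothetical, and the sketch assumes the caller invalidates the cache on every topology sync; it does not detect graph changes by itself.

```python
import networkx as nx


class AncestorCache:
    """Memoize nx.ancestors per node; clear the cache on every topology sync."""

    def __init__(self, graph: nx.DiGraph):
        self._graph = graph
        self._cache = {}

    def ancestors(self, node) -> frozenset:
        if node not in self._cache:
            self._cache[node] = frozenset(nx.ancestors(self._graph, node))
        return self._cache[node]

    def invalidate(self) -> None:
        """Call after any node or edge change; stale sets would misattribute faults."""
        self._cache.clear()


g = nx.DiGraph()
g.add_edge("amp-1", "rtr-1")
g.add_edge("rtr-1", "svc-1")

cache = AncestorCache(g)
first = cache.ancestors("svc-1")   # computed once and stored
g.add_edge("psu-1", "amp-1")       # topology changes between syncs
stale = cache.ancestors("svc-1")   # still the old set until invalidated
cache.invalidate()
fresh = cache.ancestors("svc-1")   # now includes psu-1
```

The stale read in the middle is deliberate: it shows why invalidation has to be tied to the sync job, since a cached set silently omits any upstream node added after it was computed.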

Advanced Patterns: Predictive and AI-Driven RCA

Once deterministic fault trees stabilize MTTR, the graph becomes a foundation for prediction. Exporting historical traversal outcomes, propagation weights, and node health trajectories lets SRE teams train gradient-boosted models that forecast failure likelihood before alarms trigger. Graph neural networks extend this further, surfacing non-obvious cross-layer dependencies that static rules miss. The transition from reactive DAG traversal to predictive analytics requires consistent telemetry normalization, but the node-edge schema above stays identical — ensuring backward compatibility and an incremental, low-risk rollout.

Related

Up to the parent stage: Topology-Aware Correlation — the graph-anchored decision layer this fault tree implements
Correlating BGP Flaps with Interface Down Events — the two-signal binding that feeds verified events into the graph
Implementing Weighted Severity Scoring — turn the isolated root cause into an SLA-aligned severity
Dynamic Threshold Adjustment for Latency Alerts — keep correlation windows calibrated as the network drifts
Implementing asyncio for High-Volume SNMP — the non-blocking ingest tier that feeds this engine
