Categorizing Network Interface Errors Automatically

In carrier-grade transport and IP/MPLS networks, interface error storms remain a primary driver of NOC fatigue, ticket duplication, and inflated MTTR. Raw telemetry from multi-vendor chassis routinely emits overlapping syslog messages and SNMP traps for one physical degradation event: a single dirty fiber connector can surface as a CRC-count log line, an ifOperStatus down trap, and a cascade of carrier-loss notifications inside the same second. Without deterministic classification, the categorization stage forwards every symptom where one root cause exists: optical degradation is misrouted to the routing team, duplicate P3/P4 tickets are opened, and root-cause analysis is buried under noise. Categorizing each interface error in isolation, before it reaches a NOC queue, is what keeps misclassification from inflating MTTR and holds the false-positive ticket rate below 2%.

This page works through that exact element-level classification pattern: the precompiled threshold heuristics that separate Layer 1 degradation from logical protocol events, the bounded state table that flags a flap storm without leaking memory, and the async hook that runs the whole pass in sub-millisecond time on the hot path. A correct classification here typically collapses an 8–15 message burst into a single severity-tagged event and shaves minutes off MTTA, because transport teams receive a pre-labelled layer1 / physical_degradation fault instead of a queue of orphaned interface traps.

Schema Alignment and Taxonomy Anchor

This pattern is the worked element-level case of Error Categorization Pipelines, the deterministic classification stage inside the Ingestion & Parsing Workflows data plane. The classifier assumes clean, already-normalized input: every event arriving here carries a canonical device_id, an if_index, an if_name, an epoch timestamp, and the raw_message string, exactly as the contracts in Event Schema Design require. This page does not re-parse Cisco syslog headers or decode trap OIDs — that work belongs to upstream parsers, standardized by How to Map Cisco Syslog to RFC 5424 and Configuring SNMPv3 Trap Receivers in Python before a record ever reaches classification.

The taxonomy is deliberately small and physical-vs-logical. A physical_degradation or physical_fault event (CRC/FCS errors, optical power excursions, carrier loss) maps to the layer1 domain and the transport team; a logical_protocol_event (MTU mismatch, BFD state change, OSPF neighbor down, storm-control trip) maps to layer2/3 and the IP/MPLS engineers; a repeated state change on the same if_index is escalated as a flap storm. The severity scale itself is not invented here — it aligns to the shared definitions in Defining Severity Levels for Telecom Faults, so a critical interface fault means the same thing across ingestion, correlation, and routing.

Production Classifier

The core logic evaluates each error against operational thresholds using precompiled regular expressions from the Python re module, so per-event matching stays off the pattern-build path entirely. A strict Pydantic V2 model is the type boundary — anything that fails to construct is diverted, never coerced into a wrong category — and physical-layer anomalies are kept strictly separate from logical-layer events.

import re
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field


class FaultSeverity(str, Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


class FaultCategory(str, Enum):
    LAYER1 = "layer1"
    LAYER2_3 = "layer2/3"
    UNCLASSIFIED = "unclassified"


class InterfaceEvent(BaseModel):
    # Strict mode rejects silent coercion — a string "42" will not become an int.
    model_config = {"strict": True}

    device_id: str
    if_index: int
    if_name: str
    raw_message: str
    timestamp: float                     # UTC epoch seconds, set upstream
    error_type: Optional[str] = None
    severity: FaultSeverity = FaultSeverity.INFO
    category: FaultCategory = FaultCategory.UNCLASSIFIED
    metadata: dict = Field(default_factory=dict)


class ErrorClassifier:
    def __init__(self, crc_threshold: int = 50):
        self.crc_threshold = crc_threshold
        # Precompile once at init to eliminate per-event regex build overhead.
        self.physical_re = re.compile(
            r"(?:CRC|FCS|Frame).*error.*count\s*(\d+)|"
            r"optical.*power.*(?:low|high|degraded)|"
            r"link.*down|carrier.*loss|physical.*layer.*fault",
            re.IGNORECASE,
        )
        self.logical_re = re.compile(
            r"protocol.*mismatch|mtu.*exceeded|"
            r"bfd.*state.*change|ospf.*neighbor.*down|"
            r"storm.*control.*triggered|broadcast.*suppression",
            re.IGNORECASE,
        )
        self._count_re = re.compile(r"error.*count\s*(\d+)", re.IGNORECASE)

    def classify(self, event: InterfaceEvent) -> InterfaceEvent:
        msg = event.raw_message

        if self.physical_re.search(msg):
            count_match = self._count_re.search(msg)
            if count_match:
                count = int(count_match.group(1))
                event.error_type = "physical_degradation"
                # A count above the CRC floor is a hard fault, not a blip.
                event.severity = (
                    FaultSeverity.CRITICAL
                    if count > self.crc_threshold
                    else FaultSeverity.WARNING
                )
            else:
                event.error_type = "physical_fault"
                event.severity = FaultSeverity.CRITICAL
            event.category = FaultCategory.LAYER1
            return event

        if self.logical_re.search(msg):
            event.error_type = "logical_protocol_event"
            event.severity = FaultSeverity.WARNING
            event.category = FaultCategory.LAYER2_3
            return event

        # No deterministic match — tag unclassified rather than guess.
        event.category = FaultCategory.UNCLASSIFIED
        return event
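To make the branch order concrete, the following is a minimal standalone sketch that reuses the three patterns from the listing above without the Pydantic model. The function name classify_message and the sample messages are hypothetical and are not taken from any vendor's log format.

```python
import re

# Patterns copied from ErrorClassifier so the sketch runs without Pydantic.
PHYSICAL_RE = re.compile(
    r"(?:CRC|FCS|Frame).*error.*count\s*(\d+)|"
    r"optical.*power.*(?:low|high|degraded)|"
    r"link.*down|carrier.*loss|physical.*layer.*fault",
    re.IGNORECASE,
)
LOGICAL_RE = re.compile(
    r"protocol.*mismatch|mtu.*exceeded|"
    r"bfd.*state.*change|ospf.*neighbor.*down|"
    r"storm.*control.*triggered|broadcast.*suppression",
    re.IGNORECASE,
)
COUNT_RE = re.compile(r"error.*count\s*(\d+)", re.IGNORECASE)


def classify_message(msg: str, crc_threshold: int = 50) -> tuple:
    """Return (category, error_type, severity) for one raw message."""
    if PHYSICAL_RE.search(msg):
        match = COUNT_RE.search(msg)
        if match:
            # Strictly greater than the threshold is critical, as in classify().
            critical = int(match.group(1)) > crc_threshold
            return ("layer1", "physical_degradation",
                    "critical" if critical else "warning")
        return ("layer1", "physical_fault", "critical")
    if LOGICAL_RE.search(msg):
        return ("layer2/3", "logical_protocol_event", "warning")
    return ("unclassified", None, "info")


# Hypothetical sample messages for illustration only.
SAMPLES = [
    "CRC error count 75 on Gi0/0/1",
    "FCS error count 12 on Gi0/0/1",
    "Interface Gi0/0/1 link down",
    "OSPF neighbor 10.0.0.2 down on Gi0/0/1",
    "CRC error count: 75",
    "Fan tray 2 speed changed",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, "->", classify_message(sample))
```

Note the fifth sample: the count pattern allows only whitespace between "count" and the digits, so a message that writes the counter as "count: 75" falls through to unclassified. If upstream normalization can emit a separator there, the pattern would need widening.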

Flap detection is a second, stateful pass. Each link gets a deque that is pruned by age on every event and capped at 128 entries, so a chatty interface during network-wide reconvergence cannot exhaust the worker heap, and a hard ceiling on the state table prevents dictionary sprawl:

from collections import deque
from typing import Deque, Dict


class FlapTracker:
    def __init__(self, flap_window: float = 300.0, max_state_entries: int = 50_000):
        self.flap_window = flap_window
        self.max_state_entries = max_state_entries
        # (device_id:if_index) -> bounded deque of recent state-change epochs
        self.state_cache: Dict[str, Deque[float]] = {}

    def is_storm(self, event: InterfaceEvent) -> bool:
        key = f"{event.device_id}:{event.if_index}"
        now = event.timestamp
        buf = self.state_cache.get(key)
        if buf is None:
            # Enforce the ceiling before inserting, so the table never exceeds
            # max_state_entries. Dicts keep insertion order, so this drops the
            # oldest-inserted link key.
            if self.state_cache and len(self.state_cache) >= self.max_state_entries:
                self.state_cache.pop(next(iter(self.state_cache)))
            # Allocate the deque only for a new link, not on every event.
            buf = self.state_cache[key] = deque(maxlen=128)
        buf.append(now)

        # Sliding-window prune: drop transitions older than the flap window.
        cutoff = now - self.flap_window
        while buf and buf[0] < cutoff:
            buf.popleft()

        return len(buf) > 8        # >8 transitions in the window = flap storm
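The sliding window is easiest to check by replaying synthetic timestamps for a single link. The sketch below is a stripped-down stand-in for the per-link part of is_storm; the helper name replay and the two traffic patterns are assumptions made for illustration.

```python
from collections import deque

FLAP_WINDOW = 300.0    # seconds, the FlapTracker default
STORM_THRESHOLD = 8    # more than 8 transitions in the window is a storm


def replay(timestamps, window=FLAP_WINDOW, threshold=STORM_THRESHOLD):
    """Feed state-change epochs for one link; return the storm flag after each."""
    buf = deque(maxlen=128)
    flags = []
    for now in timestamps:
        buf.append(now)
        cutoff = now - window
        # Drop transitions that have aged out of the window.
        while buf and buf[0] < cutoff:
            buf.popleft()
        flags.append(len(buf) > threshold)
    return flags


# One transition every 10 s: the ninth event inside the window trips the flag.
fast = replay([i * 10.0 for i in range(10)])

# One transition every 40 s: at most 8 fit in any 300 s window, so no storm.
slow = replay([i * 40.0 for i in range(20)])

if __name__ == "__main__":
    print(fast)
    print(any(slow))
```

With the default window, a link has to sustain more than one transition roughly every 37 seconds before it is promoted, which is why the slower pattern never trips the flag.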

Async Ingestion Hook

The classifier is built to sit inline on the hot path without blocking the collectors feeding it. A consumer task drains an asyncio.Queue populated by the ingestion tier, runs the deterministic classification and flap passes — both pure-CPU and non-blocking — and forwards the tagged event to routing. Because no stage performs synchronous I/O, sustained error bursts on one chassis cannot stall classification for the rest of the fabric.

import asyncio
import logging

logger = logging.getLogger("interface_classifier")


async def run_worker(
    queue: "asyncio.Queue[InterfaceEvent]",
    classifier: ErrorClassifier,
    flaps: FlapTracker,
    emit: "asyncio.Queue[InterfaceEvent]",
) -> None:
    while True:
        event = await queue.get()
        try:
            event = classifier.classify(event)
            if flaps.is_storm(event):
                # A flap storm overrides per-event severity: one P2, not 12 P4s.
                event.severity = FaultSeverity.CRITICAL
                event.metadata["flap_storm"] = True
            await emit.put(event)        # bounded queue → natural backpressure
        except Exception:
            logger.exception("classify_failed event_device=%s", event.device_id)
        finally:
            queue.task_done()

When sustained ingest exceeds classification throughput, the bounded emit queue is the backpressure signal: the worker drains in bounded micro-batches via Async Batch Processing rather than collapsing under head-of-line blocking, and non-critical telemetry is shed upstream by Rate Limiting Strategies before it ever reaches the classifier. The same non-blocking discipline applied to raw collection is detailed in Implementing asyncio for High-Volume SNMP.
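The worker loop never returns on its own, so the caller owns its lifecycle. The following is a minimal sketch of one way to start, drain, and stop a loop of the same shape, using a string-tagging stand-in instead of the real classifier. The names worker, tag, and main and the queue sizes are hypothetical.

```python
import asyncio


async def worker(inbox, outbox, handle, failures):
    """Same loop shape as run_worker: get, process, forward, always task_done."""
    while True:
        item = await inbox.get()
        try:
            # put() waits when outbox is full, which is the backpressure path.
            await outbox.put(handle(item))
        except Exception as exc:
            failures.append((item, repr(exc)))   # stand-in for logging
        finally:
            inbox.task_done()


def tag(message):
    # Hypothetical stand-in for classify(); rejects one poison payload.
    if message == "poison":
        raise ValueError("cannot classify")
    return message.upper()


async def main():
    inbox = asyncio.Queue(maxsize=4)
    outbox = asyncio.Queue(maxsize=4)
    failures = []
    task = asyncio.create_task(worker(inbox, outbox, tag, failures))

    for message in ("link down", "poison", "bfd state change"):
        await inbox.put(message)

    await inbox.join()       # returns once every item has been task_done()
    task.cancel()            # the loop is infinite, so stop it explicitly
    try:
        await task
    except asyncio.CancelledError:
        pass

    tagged = []
    while not outbox.empty():
        tagged.append(outbox.get_nowait())
    return tagged, failures


TAGGED, FAILURES = asyncio.run(main())
```

Because task_done() sits in the finally branch, the poison payload still counts as processed and join() does not hang on it.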

Routing and Remediation Mapping

Once classified, each event maps to a deterministic routing matrix and its runbook:

Layer 1 / physical — auto-route to the optical transport team. Trigger fiber-testing APIs when optical power drops below -24 dBm, and attach the CRC/FCS count so the on-call engineer sees the degradation curve, not just a binary down state.
Layer 2/3 / logical — route to IP/MPLS routing engineers. These events are also the binding candidates handed to Topology-Aware Correlation, which validates chassis adjacency before a protocol flap and an interface drop are merged into one incident.
Flap storms — apply an automatic suppression window (300 s by default) after the storm threshold trips, and emit a single P2 with the aggregated flap count and interface utilization, rather than one ticket per oscillation.
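The matrix above can be sketched as a small pure function. This is an illustration under stated assumptions, not the routing layer itself: the team identifiers and action names are hypothetical placeholders, while the -24 dBm floor, the P2 priority, and the 300 s suppression window come from the rules above.

```python
from typing import Optional

OPTICAL_POWER_FLOOR_DBM = -24.0   # from the Layer 1 rule above

# Team identifiers are hypothetical; substitute the real queue names.
TEAM_BY_CATEGORY = {
    "layer1": "optical-transport",
    "layer2/3": "ip-mpls-routing",
    "unclassified": "manual-triage",
}


def route(category: str, flap_storm: bool = False,
          optical_power_dbm: Optional[float] = None) -> dict:
    """Return the routing decision for one classified event."""
    decision = {
        "team": TEAM_BY_CATEGORY.get(category, "manual-triage"),
        "actions": [],
    }
    if category == "layer1":
        decision["actions"].append("attach_crc_fcs_count")
        if (optical_power_dbm is not None
                and optical_power_dbm < OPTICAL_POWER_FLOOR_DBM):
            decision["actions"].append("trigger_fiber_test")
    elif category == "layer2/3":
        decision["actions"].append("offer_to_topology_correlation")
    if flap_storm:
        # One aggregated P2 per storm, with a suppression window.
        decision["priority"] = "P2"
        decision["actions"].append("suppress_for_300s")
    return decision


if __name__ == "__main__":
    print(route("layer1", optical_power_dbm=-26.5))
    print(route("layer2/3", flap_storm=True))
```

Keeping the decision free of I/O means the same function can be unit-tested against the runbook and called from whichever component actually opens tickets.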

Mitigation and Hardening

The classifier degrades gracefully rather than dropping events silently. Concrete failure paths and their mitigations:

Dead-letter isolation. An event that fails to construct against the InterfaceEvent model — a missing if_index, a string where a float is expected — is quarantined to a dead-letter queue with the rejection reason attached, never coerced into a wrong category. The DLQ is replayable once the upstream parser is fixed.
Unclassified is a signal, not a drop. A message that matches neither regex is tagged category=unclassified and routed to manual triage, then logged so the pattern can drive a new deterministic rule. A rising unclassified rate is the earliest warning that a vendor firmware change altered the log syntax.
Flap-storm suppression, not silent merge. When is_storm fires, the suppression window collapses subsequent transitions into the existing P2 for the dampening cycle, so a re-arming link produces one incident per cycle rather than one per flap.
Idempotent tagging. Classification is stateless and pure for a given raw_message, so a worker restart or active-active failover re-tags the same event identically and never emits a divergent category.
Poison-message containment. A single pathological payload that throws inside the pass must not stall the loop: the broad guard in the worker logs it and lets the worker continue draining. The run_worker listing stops at logging, so forwarding the payload to the DLQ has to be added in its except branch.
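Dead-letter isolation can be sketched without Pydantic by using a strict stdlib validator as a stand-in for model construction. Everything named here (validate_strict, accept_or_quarantine, the in-memory deque used as a DLQ) is hypothetical; in the pipeline described above, constructing InterfaceEvent would play the validator's role and the DLQ would be a durable, replayable store. This stand-in is also stricter than the model in one respect: it requires timestamp to be a float and does not accept an int.

```python
import json
from collections import deque

# Canonical fields the classifier expects, with their exact types.
REQUIRED_FIELDS = {
    "device_id": str,
    "if_index": int,
    "if_name": str,
    "raw_message": str,
    "timestamp": float,
}


def validate_strict(payload: dict) -> dict:
    """Stdlib stand-in for strict model construction: no coercion."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        value = payload[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return payload


def accept_or_quarantine(payload: dict, dlq: deque):
    """Return the payload if valid; otherwise quarantine it with the reason."""
    try:
        return validate_strict(payload)
    except (TypeError, ValueError) as exc:
        dlq.append({"reason": str(exc),
                    "payload": json.dumps(payload, default=str)})
        return None


dlq = deque(maxlen=10_000)   # in-memory stand-in for a durable DLQ
good = {"device_id": "pe1", "if_index": 42, "if_name": "Gi0/0/1",
        "raw_message": "link down", "timestamp": 1700000000.0}
bad = dict(good, if_index="42")   # a string where an int is required

accepted = accept_or_quarantine(good, dlq)
rejected = accept_or_quarantine(bad, dlq)
```

Storing the serialized payload next to the rejection reason is what makes the queue replayable once the upstream parser is fixed.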

Operational Hardening Notes

Performance and accuracy tuning specific to this pattern:

Compile regex once, never per event. The precompiled physical_re / logical_re objects hold a p99 classification latency at or below 1 ms even at thousands of events per second; rebuilding the pattern inside classify would put microseconds of grammar-build cost back on every event.
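Bound every flap deque. The deque(maxlen=128) cap means a single chatty link cannot exhaust memory under a storm: once the deque is full, each append drops the oldest timestamp, and the sliding-window prune removes anything older than the flap window. Keep maxlen above the storm threshold of 8 and just above the expected transitions per window, and export an eviction-rate metric so dropped transitions are visible.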
Cache the enum lookups. In a tight ingest loop, repeated string-to-enum coercion is measurable; let Pydantic resolve FaultSeverity/FaultCategory once at the model boundary and pass the enum through, never the raw string.
Tune crc_threshold per element class. Default 50 suits long-haul DWDM spans; raise it on access links where low-rate CRC noise is expected, and re-measure against SLA targets so a warning never silently becomes a critical.
Watch p99, not the mean. Alert when p99 classification latency exceeds ~5 ms — that threshold typically signals state-table contention or an unbounded regex backtrack on a pathological message, well before a stalled classifier delays every ticket behind it.
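For the p99 guidance, a rolling window of recent latencies with a nearest-rank percentile is usually enough to drive an alert. The sketch below is one possible shape, assuming a hypothetical LatencyWindow class and timed helper; the 5 ms default mirrors the alert threshold mentioned above, and the window size is an arbitrary choice.

```python
import time
from collections import deque


class LatencyWindow:
    """Rolling window of recent classification latencies in milliseconds."""

    def __init__(self, size: int = 10_000, alert_ms: float = 5.0):
        self.samples = deque(maxlen=size)   # bounded, like the flap deques
        self.alert_ms = alert_ms

    def record(self, millis: float) -> None:
        self.samples.append(millis)

    def p99(self) -> float:
        """Nearest-rank 99th percentile; 0.0 when no samples exist."""
        ordered = sorted(self.samples)
        if not ordered:
            return 0.0
        rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n) in integers
        return ordered[rank - 1]

    def breached(self) -> bool:
        return self.p99() > self.alert_ms


def timed(window: LatencyWindow, fn, *args):
    """Call fn(*args) and record its wall-clock latency in the window."""
    start = time.perf_counter()
    result = fn(*args)
    window.record((time.perf_counter() - start) * 1000.0)
    return result
```

Wrapping the classify call in timed(window, classifier.classify, event) and checking window.breached() on a timer keeps the percentile sort off the per-event path.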

Frequently Asked Questions

Why split physical from logical errors before correlation instead of after? Tagging a layer1 versus layer2/3 domain in isolation is cheap, stateless, and scales horizontally with no shared memory. It lets the correlation tier group on clean, pre-labelled inputs — a physical degradation and the protocol flaps it caused — instead of re-deriving the domain for every event during a storm.

How does the classifier avoid the memory exhaustion common in flap detection? Three bounds: precompiled regex objects (no per-event allocation), fixed-length deque state per link that evicts timestamps by age, and a hard ceiling on the state table that evicts the oldest link key once it is reached. Memory stays O(active links), not O(events seen).

What happens to a message that matches neither pattern? It is tagged category=unclassified and routed to manual triage rather than guessed, and the payload is logged for review. A sustained rise in the unclassified rate is treated as a backlog of missing deterministic rules — usually triggered by a vendor firmware or log-format change.

When does a flap storm override per-event severity? When more than 8 state changes land on the same device_id:if_index inside the flap window (300 s by default). The worker promotes the event to critical, stamps flap_storm in the metadata, and the routing layer emits one aggregated P2 instead of a ticket per oscillation.

Related

Up to the parent stage: Error Categorization Pipelines, the deterministic classification layer this element-level pattern lives inside
Upstream contract: Event Schema Design, the canonical fields the classifier consumes
Severity alignment: Defining Severity Levels for Telecom Faults, the shared scale this stage applies
Storm absorption: Async Batch Processing, bulk evaluation when single-worker capacity is exceeded
Topology validation: Topology-Aware Correlation, which merges a protocol flap with its physical interface drop only after adjacency is confirmed
