Error Categorization Pipelines for Telecom Fault Correlation & Ticket Routing

Inside the Ingestion & Parsing Workflows data plane, error categorization is the deterministic bridge between normalized telemetry and automated ticket routing. It sits immediately downstream of schema normalization and immediately upstream of the correlation tier, consuming canonical fault records and emitting category-tagged, severity-resolved events ready for grouping. The operational intent is narrow and measurable: eliminate manual triage latency by classifying each fault against a vendor-agnostic network taxonomy before it ever reaches a NOC queue. When this layer is engineered correctly, misclassification stops inflating MTTR, false-positive ticket storms collapse at the source, and operators trust the routing decisions the platform makes on their behalf.

Operational Intent and Boundary

The boundary of this workflow is deliberately sharp. What enters is a stream of already-normalized fault records, each conforming to the canonical contract defined in Event Schema Design, so every payload carries the same mandatory fields — device_id, fault_code, event_time, interface_name, metric_vector, and a raw_payload_hash — before classification begins. What exits is the same record enriched with two authoritative fields: a category drawn from a fixed operational taxonomy and a severity drawn from the standard scale. Nothing more.

What is explicitly excluded matters just as much. This stage does not re-parse raw payloads — re-parsing here would duplicate upstream work, add unacceptable latency, and break idempotency. It does not perform root-cause analysis, topology grouping, or cross-event severity arbitration; those belong to the correlation tier and depend on Topology-Aware Correlation and Severity Scoring Algorithms. The categorization engine assigns a category and a confidence-weighted severity to a single event in isolation, stays stateless, and never asks “was this interface already down?” — the moment a transformation needs cross-event memory, it has crossed into correlation territory. Keeping the boundary this tight is what lets the engine scale by adding worker replicas behind a partitioned bus with no shared mutable state.

Diagram: deterministic-first categorization with a probabilistic fallback.

Pipeline Architecture

The internal flow is stage-gated: schema guard, deterministic rule match, confidence test, probabilistic fallback, and emit. Each stage validates its input, performs one transformation, and either advances a well-formed payload or quarantines it to a dead-letter queue. Two execution paths exist by design. The deterministic path handles the overwhelming majority of well-understood signatures — interface flapping thresholds, BGP session state transitions, optical power degradation curves — at sub-millisecond latency. The probabilistic path is a deliberate safety net for ambiguous events: when a deterministic match returns confidence below the threshold, a lightweight classifier built on historical correlation matrices assigns a best-effort category rather than dropping the event or forwarding it unclassified.

This two-path model is what gives the layer predictable behaviour under both normal and degraded conditions. A clean signature resolves on the fast path; a novel or noisy signature is still categorized, just with a lower confidence score that downstream consumers and observability can act on.
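
The stage-gating described above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: `StagedPipeline`, `StageRejected`, the two stage functions, and the `BGP_SESSION_DOWN` fault code are all hypothetical names chosen for the example, and payloads are plain dicts rather than the typed model introduced later.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


class StageRejected(Exception):
    """Raised by a stage when its input is not well-formed."""


def schema_guard(payload: dict) -> dict:
    # Gate: refuse payloads that lack the fields the next stage depends on.
    if "fault_code" not in payload or "trace_id" not in payload:
        raise StageRejected("missing mandatory field")
    return payload


def tag_category(payload: dict) -> dict:
    # Hypothetical one-rule stand-in for the compiled decision graph.
    tagged = dict(payload)
    if payload["fault_code"] == "BGP_SESSION_DOWN":
        tagged.update(category="core_routing", confidence=0.92)
    else:
        tagged.update(category=None, confidence=0.0)
    return tagged


@dataclass
class StagedPipeline:
    stages: List[Tuple[str, Callable[[dict], dict]]]
    dead_letter: List[Tuple[str, dict]] = field(default_factory=list)

    def run(self, payload: dict) -> Optional[dict]:
        for name, stage in self.stages:
            try:
                payload = stage(payload)
            except StageRejected:
                # Quarantine the payload together with the gate that refused it.
                self.dead_letter.append((name, payload))
                return None
        return payload


pipeline = StagedPipeline(
    stages=[("schema_guard", schema_guard), ("rule_match", tag_category)]
)
```

Recording the name of the rejecting gate alongside the payload is a design choice of this sketch; it makes dead-letter entries self-describing when they are replayed later.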

Canonical Taxonomy and Schema Alignment

A robust categorization engine begins with a vendor-agnostic taxonomy that maps raw fault codes to operational domains — TRANSPORT_OPTICAL, CORE_ROUTING, ACCESS_LAYER, POWER_ENVIRONMENTAL. Each category carries explicit severity weights, escalation SLAs, and downstream routing tags. Schema validation happens immediately on receipt: a malformed payload must never reach the decision graph, where it could corrupt a category assignment. The severity scale itself is not invented here — it aligns to the shared definitions documented in Defining Severity Levels for Telecom Faults, so that a MAJOR fault means the same thing across ingestion, correlation, and routing.

Field alignment strictly references the upstream contract. The pipeline consumes pre-normalized JSON or Protobuf envelopes; it never re-derives fields from raw_payload. The Pydantic V2 model below is the type boundary — anything that fails to construct is diverted, not coerced.

from pydantic import BaseModel, Field
from enum import Enum
from typing import Optional


class FaultSeverity(str, Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"
    INFO = "info"


class NetworkCategory(str, Enum):
    TRANSPORT_OPTICAL = "transport_optical"
    CORE_ROUTING = "core_routing"
    ACCESS_LAYER = "access_layer"
    POWER_ENVIRONMENTAL = "power_environmental"
    # Sentinel for events neither path can label; these go to manual triage.
    UNKNOWN = "unknown"


class NormalizedFault(BaseModel):
    # Strict mode rejects silent type coercion — a string "1.0" will not become a float.
    model_config = {"strict": True, "frozen": False}

    event_id: str
    device_id: str
    fault_code: str
    event_time: float  # UTC epoch seconds, set upstream during normalization
    trace_id: str       # propagated from ingestion for end-to-end tracing
    category: Optional[NetworkCategory] = None
    severity: Optional[FaultSeverity] = None
    # ge/le constraints are enforced by Pydantic V2 at construction time.
    confidence: float = Field(default=0.0, ge=0.0, le=1.0)

Deterministic Rule Compilation and AST Execution

Primary classification executes against a compiled decision graph. Rather than hard-coding conditionals, rule definitions are authored in a small domain-specific language and parsed into an abstract syntax tree with a grammar parser such as Lark. Each rule is compiled once and the result is cached by rule text, which is what keeps per-event evaluation in the sub-millisecond range even with thousands of active rules. When a deterministic match is ambiguous, the secondary classifier applies historical fault correlation matrices to assign a fallback category. The concrete threshold-based pattern matching for a specific element class is worked through in Categorizing Network Interface Errors Automatically.

from lark import Lark, Transformer
from functools import lru_cache

RULE_GRAMMAR = r"""
rule: "IF" condition "THEN" category "SEVERITY" severity
condition: metric OPERATOR threshold
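// CNAME rather than WORD, so identifiers such as CORE_ROUTING may contain underscores.
metric: CNAME
OPERATOR: ">=" | "<=" | "==" | ">" | "<"
// SIGNED_NUMBER admits negative thresholds such as optical power in dBm.
threshold: SIGNED_NUMBER
category: CNAME
severity: CNAME

%import common.CNAME
%import common.SIGNED_NUMBER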
%import common.WS
%ignore WS
"""


class RuleCompiler(Transformer):
    def rule(self, items):
        return {"condition": items[0], "category": items[1], "severity": items[2]}
    # Additional transformer methods map AST nodes to executable callables.


# Build the parser once at import time and cache compiled rules by rule text, so a
# cache hit skips both the grammar build and the parse. Callers share the cached
# dict, so treat the result as read-only.
_RULE_PARSER = Lark(RULE_GRAMMAR, start="rule")


@lru_cache(maxsize=1024)
def compile_rule(rule_text: str) -> dict:
    tree = _RULE_PARSER.parse(rule_text)
    return RuleCompiler().transform(tree)
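
The listing leaves the step from a parsed condition to an executable callable open. One way to close it, sketched here with the standard library only and independent of Lark so that it runs on its own, is to map each grammar operator to a function from `operator` and return a closure. The name `build_predicate`, the `crc_errors` metric, and the assumption that metrics arrive as a flat name-to-value mapping are illustrative, not taken from the pipeline.

```python
import operator
from typing import Callable, Mapping

# Comparison operators accepted by the rule grammar, mapped to stdlib functions.
_OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


def build_predicate(
    metric: str, op: str, threshold: float
) -> Callable[[Mapping[str, float]], bool]:
    # A KeyError here means the rule used an operator the grammar does not define.
    compare = _OPERATORS[op]

    def predicate(metric_vector: Mapping[str, float]) -> bool:
        value = metric_vector.get(metric)
        # A metric absent from the event is a non-match, not an error.
        return value is not None and compare(value, threshold)

    return predicate


# Hypothetical rule: IF crc_errors > 100 THEN ACCESS_LAYER SEVERITY MINOR
crc_rule = build_predicate("crc_errors", ">", 100.0)
```

Resolving the operator once at compile time keeps the per-event work down to a dictionary lookup and a single comparison.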

Production-Ready Async Implementation

The categorization engine runs as an ephemeral, stateless async worker. Each normalized event triggers a non-blocking evaluation pass against the active rule set. A circuit breaker guards category assignment so that a partial ontology update or schema drift cannot cascade into a flood of misclassifications: after a run of failures the breaker opens, the engine falls back to the probabilistic classifier, and it self-heals through a half-open probe once the reset window elapses. During sustained bursts that exceed single-worker capacity, batch evaluation is delegated to the Async Batch Processing layer, which aggregates events by device cluster before categorization.

import asyncio
import logging
from dataclasses import dataclass
from typing import Dict

logger = logging.getLogger("categorization_engine")


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_timeout: float = 30.0
    failures: int = 0
    last_failure_time: float = 0.0

    def is_open(self, now: float) -> bool:
        if self.failures < self.failure_threshold:
            return False
        # Half-open once the reset window elapses: admit traffic again, but stay one
        # failure short of the threshold so a single failed probe re-opens the breaker.
        # A successful deterministic match clears the count via record_success().
        if now - self.last_failure_time >= self.reset_timeout:
            self.failures = self.failure_threshold - 1
            return False
        return True

    def record_failure(self, now: float) -> None:
        self.failures += 1
        self.last_failure_time = now

    def record_success(self) -> None:
        self.failures = 0


class CategorizationEngine:
    def __init__(self, rules: Dict[str, dict]):
        self.rules = rules
        self.circuit = CircuitBreaker()
        self.confidence_threshold = 0.75

    async def evaluate(self, event: NormalizedFault) -> NormalizedFault:
        # Time the breaker on the event loop's monotonic clock. event_time is set
        # upstream and can arrive late or out of order, which would skew the window.
        now = asyncio.get_running_loop().time()
        # Breaker open: skip the deterministic graph entirely and degrade gracefully.
        if self.circuit.is_open(now):
            logger.warning("trace=%s breaker open, using probabilistic fallback", event.trace_id)
            return await self._probabilistic_fallback(event)

        try:
            matched = self._deterministic_match(event)
            if matched.confidence >= self.confidence_threshold:
                self.circuit.record_success()
                return matched
            # No confident-enough match: fall back rather than emit a guess.
            return await self._probabilistic_fallback(event)
        except Exception as exc:  # noqa: BLE001 (any rule fault must not stall the loop)
            self.circuit.record_failure(now)
            logger.error("trace=%s rule evaluation failed: %s", event.trace_id, exc)
            return await self._probabilistic_fallback(event)

    def _deterministic_match(self, event: NormalizedFault) -> NormalizedFault:
        # Production path evaluates the compiled rule set; illustrative result here.
        event.category = NetworkCategory.CORE_ROUTING
        event.severity = FaultSeverity.MAJOR
        event.confidence = 0.92
        return event

    async def _probabilistic_fallback(self, event: NormalizedFault) -> NormalizedFault:
        # Lightweight inference over a historical correlation matrix; never blocks.
        await asyncio.sleep(0)
        event.category = NetworkCategory.ACCESS_LAYER
        event.severity = FaultSeverity.MINOR
        event.confidence = 0.61
        return event

The deterministic path is budgeted to a p99 evaluation latency of 5 ms, even though a typical cache-hit evaluation is sub-millisecond; the probabilistic path is budgeted to 45 ms. Anything beyond 100 ms is an SLA breach and pages, because a stalled categorizer silently delays every ticket behind it.
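
A small sketch of how those budgets might be checked. The three constants come from the figures above; the function names, the status strings, and the nearest-rank percentile method are assumptions made for the example.

```python
from typing import Sequence

# Budgets quoted in the text, in milliseconds.
DETERMINISTIC_BUDGET_MS = 5.0
PROBABILISTIC_BUDGET_MS = 45.0
SLA_BREACH_MS = 100.0


def p99(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty window of latency samples."""
    ordered = sorted(samples_ms)
    # Integer arithmetic for ceil(0.99 * n), avoiding float rounding at the boundary.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def classify_latency(path: str, p99_ms: float) -> str:
    """Compare a window's p99 against the budget for the given path."""
    if p99_ms > SLA_BREACH_MS:
        return "sla_breach"  # pages
    budget = DETERMINISTIC_BUDGET_MS if path == "deterministic" else PROBABILISTIC_BUDGET_MS
    return "within_budget" if p99_ms <= budget else "over_budget"
```

Evaluating the percentile over a window, rather than alerting on individual slow calls, keeps a single garbage-collection pause from paging anyone.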

Schema and Boundary Validation

Because the engine is stateless and thin, it does not re-validate every upstream field — that would duplicate the contract enforcement already done at ingestion. Instead it enforces two narrow boundary constraints that exist specifically to suppress false positives before a category is committed:

Schema guard. Every event must construct cleanly against the NormalizedFault model. An event that fails construction, whether from a string where a float is expected or from a missing fault_code, cannot be categorized deterministically, so it is quarantined to the dead-letter queue rather than coerced into a wrong category. The guard does not redo ingestion's contract enforcement; it only ensures that a malformed event which slips through never receives a category.
Confidence boundary. A deterministic match below the configured floor is treated as a non-match, not a weak match. Emitting a low-confidence category as if it were authoritative is the most common source of misrouting; the floor forces those events onto the probabilistic path where their lower confidence score is carried forward honestly.

The deliberate split keeps responsibility unambiguous: ingestion guarantees shape, categorization guarantees a domain and severity tag, and correlation guarantees meaning. No layer second-guesses another.
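
The schema guard can be sketched with Pydantic V2's `model_validate` and `ValidationError.errors()`. The model below is a trimmed, hypothetical copy of NormalizedFault so the block runs on its own; the in-memory `dead_letter` list and `drift_by_field` counter stand in for a real queue and a real metrics client.

```python
from collections import Counter
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class FaultRecord(BaseModel):
    # Trimmed stand-in for NormalizedFault; strict mode disables type coercion.
    model_config = {"strict": True}

    event_id: str
    device_id: str
    fault_code: str
    event_time: float
    trace_id: str


dead_letter: List[dict] = []
drift_by_field: Counter = Counter()


def guard(payload: dict) -> Optional[FaultRecord]:
    """Return a validated record, or quarantine the payload and return None."""
    try:
        return FaultRecord.model_validate(payload)
    except ValidationError as exc:
        for err in exc.errors():
            # loc is a tuple path to the failing field; the first element names it.
            name = str(err["loc"][0]) if err["loc"] else "<root>"
            drift_by_field[name] += 1
        dead_letter.append(payload)
        return None
```

Counting failures by the first element of `loc` is what lets a spike on one field name point at the upstream parser that changed shape.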

Configuration and Tuning Parameters

The defaults below are starting points. Tune per element class and re-measure against SLA targets.

Parameter	Default	Rationale and tuning guidance
`confidence_threshold`	0.75	The deterministic/probabilistic cutover. Raise it for transport-critical domains where a wrong category is expensive; lower it where the fallback classifier is well-trained and coverage matters more than precision.
`failure_threshold` (breaker)	5	Consecutive rule failures before the breaker opens. Too low and a transient parse error degrades the whole worker; too high and a genuine ontology defect floods bad categories before tripping.
`reset_timeout` (breaker)	30 s	Half-open probe interval. Align to how long an ontology hot-reload realistically takes so recovery is automatic.
AST cache size	1024 rules	`lru_cache` ceiling for compiled rules. Size to the active rule count plus headroom; a thrashing cache reintroduces grammar-build latency on the hot path.
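Deterministic confidence floor	≥ 0.85	The same cutover as `confidence_threshold`, raised for domains where a wrong category is expensive. Below this, treat the match as a non-match. The single most effective dial for cutting misroutes.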
Probabilistic confidence floor	≥ 0.60	Floor for accepting a fallback category at all; below it, tag the event `category=UNKNOWN` and route to manual triage rather than guess.

Adaptive behaviour pays off most during storms: when the breaker is flapping or fallback frequency spikes, it is usually cheaper to raise the deterministic confidence floor temporarily, accepting a higher manual-triage rate, than to let low-confidence categories pollute downstream routing. Properly tuned, deterministic-first evaluation routes 85–95% of events on the fast path and holds the false-positive ticket rate below 2%.
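
How the two floors interact is easiest to see in code. The sketch below assumes each path yields a `(category, confidence)` candidate or `None`; `CategorizationConfig`, `resolve`, and the path labels are hypothetical names, and the defaults mirror the table.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

Candidate = Optional[Tuple[str, float]]  # (category, confidence)


@dataclass(frozen=True)
class CategorizationConfig:
    confidence_threshold: float = 0.75  # deterministic/probabilistic cutover
    probabilistic_floor: float = 0.60   # below this, do not guess
    failure_threshold: int = 5
    reset_timeout_s: float = 30.0
    ast_cache_size: int = 1024


def resolve(det: Candidate, prob: Candidate, cfg: CategorizationConfig) -> Tuple[str, float, str]:
    """Pick the category to emit and report which path produced it."""
    if det is not None and det[1] >= cfg.confidence_threshold:
        return det[0], det[1], "deterministic"
    if prob is not None and prob[1] >= cfg.probabilistic_floor:
        return prob[0], prob[1], "probabilistic"
    # Neither path is confident enough: tag UNKNOWN and hand over to manual triage.
    return "UNKNOWN", prob[1] if prob is not None else 0.0, "manual_triage"


default_cfg = CategorizationConfig()
# Storm posture: raise the deterministic floor temporarily.
storm_cfg = replace(default_cfg, confidence_threshold=0.95)
```

Keeping the configuration frozen and deriving a storm variant with `dataclasses.replace` means a temporary change can be swapped in and out without mutating shared state.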

Debugging Workflow and Observability

Every event carries a trace_id propagated from ingestion, enabling end-to-end visibility across parsing, categorization, and routing. Standard tracing and metric instrumentation captures rule-evaluation latency, AST cache-hit rate, and confidence-score distribution. Work through this checklist when accuracy or latency regresses:

Confidence histograms. Emit a histogram of confidence per category. A sustained shift toward the probabilistic path signals taxonomy drift or missing deterministic rules — not a code defect — and is the earliest warning that the rule set is going stale.
AST cache-hit rate. Track the compile_rule hit rate as a gauge, derived from the hit and miss counts that functools.lru_cache reports through cache_info(). A falling hit rate means rule text is churning or the cache is undersized; either way it puts compile latency back on the hot path.
Rule hit counters. Expose a per-rule-ID counter. Rules with zero hits over 72 hours are candidates for deprecation; a rule whose hit rate suddenly collapses often means an upstream field rename broke its condition.
Breaker state and fallback rate. Surface the circuit-breaker state and the fraction of events taking the fallback path. A breaker stuck open, or a fallback rate climbing past its baseline, is an immediate page — categorization is degraded even if nothing has crashed.
Schema-drift alerts. Count Pydantic construction failures by field. A spike on one field name pinpoints exactly which upstream parser changed shape, turning a vague “categorization is wrong” report into a one-line root cause.
Replay buffer. Persist low-confidence and quarantined events to a time-bounded buffer keyed by trace_id. Engineers replay that exact payload against an updated rule set in staging to separate a parsing defect from a routing defect — without touching live traffic.

Keep instrumentation lock-free on the hot path; sampling is preferable to a mutex inside evaluate.
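
Two of the checklist signals can be derived with the standard library alone. In this sketch `compile_rule` is a stand-in that only exercises the cache, and the ten-bucket histogram layout is an assumption; a real deployment would export both through its metrics client.

```python
from collections import Counter
from functools import lru_cache


@lru_cache(maxsize=1024)
def compile_rule(rule_text: str) -> str:
    # Stand-in for the real compile step; only the caching behaviour matters here.
    return rule_text.upper()


def cache_hit_rate(cached_fn) -> float:
    """Hit rate in [0, 1] from lru_cache's built-in counters."""
    info = cached_fn.cache_info()  # CacheInfo(hits, misses, maxsize, currsize)
    total = info.hits + info.misses
    return info.hits / total if total else 0.0


confidence_histogram: Counter = Counter()


def observe_confidence(category: str, confidence: float) -> None:
    # Ten fixed-width buckets per category; a confidence of 1.0 folds into the top one.
    bucket = min(int(confidence * 10), 9)
    confidence_histogram[(category, bucket)] += 1
```

Inside a single-threaded asyncio worker these increments need no lock, which fits the lock-free guidance above.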

Failure Modes and Mitigation

Error categorization fails in a small number of well-understood ways, and each has a concrete containment strategy.

Ontology drift. As the network evolves, fault codes appear that no deterministic rule covers, quietly pushing more traffic onto the probabilistic path. Mitigate by alerting on fallback-rate growth (see the confidence histogram) and feeding quarantined UNKNOWN events back into rule authoring as a standing backlog.
Schema drift. An upstream parser change alters a field type and the model stops constructing. The schema guard contains the blast radius by quarantining the affected events to the dead-letter queue instead of emitting wrong categories, and the per-field drift alert names the culprit immediately.
Misclassification storms. A bad ontology hot-reload can mislabel a flood of events. The circuit breaker trips after failure_threshold consecutive faults, diverts to the probabilistic path, and self-heals via a half-open probe — degrading gracefully rather than amplifying the error.
Memory pressure under sustained bursts. Unbounded ingestion exhausts worker heap. Use bounded asyncio.Queue instances with overflow-to-disk for the lowest-severity tier, pool AST nodes, and offload the compiled-rule cache to a shared store so horizontally scaled workers stay stateless. Pair this with the Rate Limiting Strategies layer so non-critical telemetry is shed before it ever reaches categorization, while CRITICAL and MAJOR events bypass the limiter through the priority channel.
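Poison events. A single pathological payload that throws inside rule evaluation must not stall the loop. In the engine listing, the broad exception guard records a breaker failure, logs the error with its trace_id, and hands the event to the probabilistic fallback so the worker continues; a payload that keeps throwing belongs in the dead-letter queue.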
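
The bounded-queue mitigation can be sketched with `asyncio.Queue`. `BoundedIntake` and its return strings are hypothetical, and the `spilled` list stands in for overflow-to-disk; the only firm assumption taken from the text is that CRITICAL and MAJOR events are never shed.

```python
import asyncio
from typing import List

# Severities that must never be shed.
PROTECTED = {"critical", "major"}


class BoundedIntake:
    def __init__(self, maxsize: int) -> None:
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.spilled: List[dict] = []  # stand-in for overflow-to-disk

    async def offer(self, event: dict) -> str:
        try:
            self.queue.put_nowait(event)
            return "queued"
        except asyncio.QueueFull:
            if event["severity"] in PROTECTED:
                # Apply backpressure instead of dropping a high-severity event.
                await self.queue.put(event)
                return "queued_after_wait"
            self.spilled.append(event)
            return "spilled"


async def demo() -> tuple:
    intake = BoundedIntake(maxsize=1)
    first = await intake.offer({"severity": "minor", "trace_id": "t-1"})
    second = await intake.offer({"severity": "info", "trace_id": "t-2"})
    return first, second, len(intake.spilled), intake.queue.qsize()
```

Blocking on `put` for protected severities pushes pressure back to the producer, which is the behaviour the priority channel is meant to guarantee; everything else is spilled so the heap stays bounded.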

Held to these mitigations, the engine transforms raw telemetry into auto-routed, auditable incidents even through multi-domain fault storms — deterministic where it can be, probabilistic where it must be, and never silently wrong.

Frequently Asked Questions

Why categorize each event in isolation instead of grouping first? Categorization is intentionally stateless so it can scale horizontally with no shared memory. Grouping requires cross-event state and topology awareness, which belong to the correlation tier. Tagging an event with a domain and severity in isolation is what lets correlation then group on clean, pre-labelled inputs.

When does the probabilistic fallback fire, and is it a failure? It fires whenever a deterministic match returns confidence below confidence_threshold (default 0.75), or when the circuit breaker is open. It is a designed safety net, not a failure — but a sustained rise in fallback rate is a leading indicator of taxonomy drift and should be treated as a backlog of missing deterministic rules.

How is severity decided here versus in the correlation engine? This stage assigns a per-event severity from the standard scale defined in the taxonomy, used for routing priority and rate-limit bypass. The correlation engine later performs cross-event severity arbitration — decaying, escalating, or merging severities across a group. The two are complementary; categorization never arbitrates across events.

What happens to an event no rule and no classifier can confidently label? If the probabilistic path also falls below its floor (default 0.60), the event is tagged category=UNKNOWN and routed to manual triage rather than guessed. It is also written to the replay buffer so its payload can drive new rule authoring.

Related

Up to the parent reference: Ingestion & Parsing Workflows — the end-to-end data plane this stage sits inside
Worked element-level example: Categorizing Network Interface Errors Automatically — threshold heuristics for a single chassis
Upstream contract: Event Schema Design — the canonical fields categorization consumes
Storm absorption: Async Batch Processing — bulk evaluation when single-worker capacity is exceeded
Inbound traffic shaping: Rate Limiting Strategies — shed non-critical telemetry before categorization
Severity alignment: Defining Severity Levels for Telecom Faults — the shared severity scale this stage applies
