Why normalize vendor severity codes into a shared tier set?

Vendors disagree on severity semantics: an Ericsson major, a Cisco syslog level 2, and a Juniper alarm can describe the same outage or three different conditions. Mapping every raw code into one deterministic tier set (P1-P4) before routing means the correlation and ticketing stages never have to guess what an alarm means, which cuts false-positive ticket rates and protects SLA clock starts.

What happens to an event whose vendor or fault code is not in the matrix?

It is never dropped and never silently promoted to P1. An unknown vendor routes to P3 and a missing code routes to P4, both flagged requires_review, and the ingestion hook forwards them to a review queue. A rising count of requires_review events is an alertable signal that the matrix needs a new row.

Does the severity matrix block the event loop?

No. The matrix is validated and cached in memory at load time, and each event is resolved with a pure O(1) dictionary lookup that runs inline on the event loop. File reads and hot-reload checks are offloaded with asyncio.to_thread, so a slow filesystem can never stall ingestion. The resolve call benchmarks at roughly a 40-microsecond p99.

How do severity tiers relate to ITU-T X.733?

P1 through P4 map explicitly onto the ITU-T X.733 perceived-severity values of critical, major, minor, and warning. Anchoring the internal tiers to that standard keeps classifications predictable across multi-operator peering environments during joint fault resolution and post-incident reporting.

Defining Severity Levels for Telecom Faults

In a carrier NOC, the gap between raw telemetry ingestion and a correctly classified incident is where mean time to repair (MTTR) silently inflates. A correlation engine ingests millions of SNMP traps, syslog records, and NETCONF/YANG notifications per day, and every one of them arrives stamped with a vendor-defined severity that means something different from the next vendor’s. Ericsson’s major, Cisco’s syslog level 2, and a Juniper kernel alarm can all describe the same service-impacting outage — or three completely different conditions. Without a deterministic mapping from those raw codes into a single operational tier set (P1 Critical, P2 Major, P3 Minor, P4 Warning/Informational), events either flood the on-call queue with false positives or misroute past the escalation path that should have caught them. In production environments we have measured this directly: an un-normalized stream pushed false-positive ticket rates above 30% and added 6-9 minutes of triage latency to every P1 because an engineer had to manually decide what each alarm actually meant before the SLA clock could start.

This page defines a codified, version-controlled severity taxonomy and a non-blocking service that applies it, so the tier is resolved once, deterministically, before any downstream routing decision is made.

Schema Alignment and Taxonomy Anchor

Severity normalization is the first decision stage inside the Security Boundary Mapping layer, and it cannot run on raw payloads. It operates strictly on events that have already been decoded and structured against the Event Schema Design contract, so that vendor, fault_code, and the transport-specific severity field are present and typed before any tier is assigned. Traps resolved through SNMP Trap Standardization arrive with their OID-to-semantic translation complete, and text streams resolved through Syslog Format Parsing arrive as structured key-value records with the numeric priority already extracted. The whole stage belongs to the broader Core Architecture & Log Taxonomy pipeline, where it sits upstream of correlation and ticket dispatch.

The taxonomy itself is intentionally narrow: it answers what tier is this fault, and what routing rule does that tier imply? — and nothing else. It does not author the final correlation weight (that is the job of the Severity Scoring Algorithms stage downstream); it produces the canonical input those scorers consume. Keeping the boundary clean is what makes the result auditable during a post-incident review: the tier of every event can be traced back to one row in one version-controlled matrix.

The Declarative Severity Matrix

The mapping lives in a schema-validated YAML file under version control, never in code. Vendor codes, service-impact weighting, fallback policy, and the routing rule for each tier are all declarative, so a NOC platform engineer can add a vendor without a deploy and every change is a reviewable diff with an explicit rollback_commit.

# severity_matrix.yaml
# Validation: strict_schema_v2
metadata:
  last_updated: "2026-06-20T08:00:00Z"
  author: "noc-platform-team"
  rollback_commit: "a1b2c3d"

severity_mapping:
  vendor_ericsson:
    "1001": { raw_severity: "critical", service_impact: "full_outage",        mapped_tier: "P1" }
    "1002": { raw_severity: "major",    service_impact: "partial_degradation", mapped_tier: "P2" }
    "2001": { raw_severity: "minor",    service_impact: "monitoring_only",      mapped_tier: "P3" }
  vendor_cisco:        # keyed on the RFC 5424 severity value (PRI mod 8), not the full PRI
    "0": { raw_severity: "emergency", service_impact: "full_outage",        mapped_tier: "P1" }
    "1": { raw_severity: "alert",     service_impact: "partial_degradation", mapped_tier: "P2" }
    "2": { raw_severity: "critical",  service_impact: "partial_degradation", mapped_tier: "P2" }
    "3": { raw_severity: "error",     service_impact: "monitoring_only",      mapped_tier: "P3" }

fallback_policy:
  unknown_vendor: { mapped_tier: "P3", service_impact: "monitoring_only", requires_review: true }
  missing_code:   { mapped_tier: "P4", service_impact: "informational",   requires_review: true }

routing_rules:
  P1: { sla_minutes: 15,  auto_dispatch: true,  escalation_chain: ["noc_lead", "domain_engineer"] }
  P2: { sla_minutes: 30,  auto_dispatch: false, escalation_chain: ["domain_engineer"] }
  P3: { sla_minutes: 120, auto_dispatch: false, escalation_chain: ["ops_analyst"] }
  P4: { sla_minutes: 480, auto_dispatch: false, escalation_chain: [] }

The fallback policy is the part operators forget. An event for an unknown vendor or an unmapped code is never dropped and never silently promoted to P1 — it lands in P3/P4 with requires_review: true. That makes coverage gaps self-reporting: a rising count of requires_review events is a direct, alertable signal that the matrix needs a new row, not a missed outage.
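
One way to turn that signal into an alert is a sliding-window count of fallback hits. The sketch below is illustrative only: the `ReviewRateMonitor` class, its window length, and its threshold are assumptions, not part of the service described on this page.

```python
import time
from collections import deque
from typing import Optional


class ReviewRateMonitor:
    """Sliding-window count of requires_review events (hypothetical helper)."""

    def __init__(self, window_seconds: float = 300.0, threshold: int = 50):
        self._window = window_seconds
        self._threshold = threshold
        self._hits = deque()  # monotonic timestamps of recent fallback hits

    def record(self, now: Optional[float] = None) -> bool:
        """Record one fallback hit; return True once the window count reaches the threshold."""
        now = time.monotonic() if now is None else now
        self._hits.append(now)
        # Evict hits that have aged out of the window.
        while self._hits and now - self._hits[0] > self._window:
            self._hits.popleft()
        return len(self._hits) >= self._threshold
```

Calling `record()` each time a resolved event carries requires_review, and paging when it returns True, gives a coverage-gap alarm that is independent of ticket volume.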

Production Code: An Async, Schema-Validated Normalizer

The service validates the matrix with Pydantic V2 at load time (a malformed matrix must fail fast, not at the first event), then resolves each tier with a pure dictionary lookup. The lookup is O(1) and CPU-only, so it stays inside the event loop, while the matrix file is reloaded by a non-blocking watcher task — asyncio.to_thread keeps the disk read off the loop so a slow filesystem can never stall ingestion.

import asyncio
import logging
from enum import Enum
from pathlib import Path

import yaml
from pydantic import BaseModel, ConfigDict, Field

logger = logging.getLogger("severity_normalizer")


class Tier(str, Enum):
    P1 = "P1"  # ITU-T X.733 perceived severity: critical
    P2 = "P2"  # major
    P3 = "P3"  # minor
    P4 = "P4"  # warning / indeterminate


class RouteRule(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    sla_minutes: int = Field(gt=0, le=1440)
    auto_dispatch: bool = False
    escalation_chain: list[str] = Field(default_factory=list)


class MatrixEntry(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    raw_severity: str
    service_impact: str
    # YAML yields a plain string here; strict mode would reject it for an Enum
    # field, so coercion is re-enabled for this field only.
    mapped_tier: Tier = Field(strict=False)


class FallbackEntry(BaseModel):
    # fallback_policy rows carry requires_review and no raw_severity, so they
    # need their own model rather than MatrixEntry (which forbids extra keys).
    model_config = ConfigDict(strict=True, extra="forbid")
    mapped_tier: Tier = Field(strict=False)
    service_impact: str
    requires_review: bool = True


class NormalizedEvent(BaseModel):
    # frozen => the routing decision is immutable once made (audit-safe)
    model_config = ConfigDict(strict=True, frozen=True)
    vendor: str
    fault_code: str
    raw_severity: str
    mapped_tier: Tier
    service_impact: str
    sla_minutes: int
    auto_dispatch: bool
    escalation_chain: list[str]
    requires_review: bool = False


class SeverityMatrix:
    def __init__(self, config_path: str):
        self._path = Path(config_path)
        self._mapping: dict[str, dict[str, MatrixEntry]] = {}
        self._fallback: dict[str, FallbackEntry] = {}
        self._routing: dict[Tier, RouteRule] = {}
        self._lock = asyncio.Lock()

    async def load(self) -> None:
        # Disk read is offloaded so the event loop is never blocked on I/O.
        raw = await asyncio.to_thread(self._path.read_text)
        doc = yaml.safe_load(raw)
        mapping = {
            # str(code) guards against unquoted numeric keys in the YAML.
            vendor: {str(code): MatrixEntry(**e) for code, e in codes.items()}
            for vendor, codes in doc["severity_mapping"].items()
        }
        fallback = {k: FallbackEntry(**v) for k, v in doc["fallback_policy"].items()}
        routing = {Tier(t): RouteRule(**r) for t, r in doc["routing_rules"].items()}
        async with self._lock:
            self._mapping, self._fallback, self._routing = mapping, fallback, routing
        logger.info("severity matrix loaded", extra={"vendors": list(mapping)})

    def resolve(self, vendor: str, fault_code: str) -> NormalizedEvent:
        # Pure, deterministic lookup; safe to run inline on the event loop.
        if vendor not in self._mapping:
            entry = self._fallback["unknown_vendor"]
        elif fault_code not in self._mapping[vendor]:
            entry = self._fallback["missing_code"]
        else:
            entry = self._mapping[vendor][fault_code]

        route = self._routing[entry.mapped_tier]
        return NormalizedEvent(
            vendor=vendor,
            fault_code=fault_code,
            # Fallback rows have no raw_severity; "unknown" is a placeholder.
            raw_severity=getattr(entry, "raw_severity", "unknown"),
            mapped_tier=entry.mapped_tier,
            service_impact=entry.service_impact,
            sla_minutes=route.sla_minutes,
            auto_dispatch=route.auto_dispatch,
            escalation_chain=list(route.escalation_chain),
            # Only fallback rows carry requires_review; mapped rows default to False.
            requires_review=getattr(entry, "requires_review", False),
        )

    async def watch(self, poll_seconds: float = 2.0) -> None:
        """Hot-reload on file mtime change without restarting the service."""

        def read_mtime() -> float:
            return self._path.stat().st_mtime

        # Even the first stat runs off the loop, so no file I/O blocks ingestion.
        last_mtime = await asyncio.to_thread(read_mtime)
        while True:
            await asyncio.sleep(poll_seconds)
            try:
                mtime = await asyncio.to_thread(read_mtime)
                if mtime != last_mtime:
                    last_mtime = mtime
                    await self.load()
            except Exception:
                # Keep serving the last-good matrix; never crash on a bad edit
                # or on a file that is briefly missing while it is replaced.
                logger.exception("matrix reload failed; retaining previous version")
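
Because the watcher keys on mtime, a tool that rewrites the matrix in place could be observed half-written. A common mitigation is to write a temporary file in the same directory and rename it over the target, which is an atomic swap on POSIX filesystems. The helper below is a sketch under that assumption; `publish_matrix` is a hypothetical name and not part of the service above.

```python
import os
import tempfile
from pathlib import Path


def publish_matrix(target: Path, new_yaml: str) -> None:
    """Replace `target` so a reader sees the old file or the new one, never a mix."""
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(new_yaml)
            handle.flush()
            os.fsync(handle.fileno())  # flush contents to disk before the rename
        os.replace(tmp_name, target)  # swap the new file into place
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With this in place the watcher's next poll sees a single mtime change and a complete file.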

Async Ingestion Hook

The resolver plugs into the broader pipeline as a non-blocking consumer. It pulls decoded, schema-validated events off the ingestion queue, resolves the tier inline (no I/O, no await needed for the lookup itself), and forwards the immutable result to correlation. The matrix watcher runs as a sibling task on the same loop, so a configuration update reaches every worker on its next lookup, without a restart and without dropping a single event.

async def severity_stage(
    matrix: SeverityMatrix,
    in_q: asyncio.Queue,
    out_q: asyncio.Queue,
    review_q: asyncio.Queue,
) -> None:
    while True:
        event = await in_q.get()
        try:
            vendor, code = event["vendor"], str(event["fault_code"])
            norm = matrix.resolve(vendor, code)
            if code not in matrix._mapping.get(vendor, {}):
                # Coverage gap (unknown vendor or unmapped code): surface it
                # instead of swallowing it.
                await review_q.put(event)
            await out_q.put(norm)
        finally:
            in_q.task_done()


async def main() -> None:
    matrix = SeverityMatrix("/etc/noc/severity_matrix.yaml")
    await matrix.load()
    in_q, out_q, review_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    workers = [asyncio.create_task(severity_stage(matrix, in_q, out_q, review_q))
               for _ in range(8)]
    # Hold a reference so the watcher task cannot be garbage-collected mid-run.
    watcher = asyncio.create_task(matrix.watch())
    await asyncio.gather(*workers, watcher)

Eight workers share one validated matrix, and the hot path never touches disk or the network, so classification stays a small part of per-event cost: in a single-core benchmark the resolve call settled at a p99 of roughly 40 microseconds, leaving the latency budget to the surrounding async I/O rather than to the classification itself.
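
A figure like this is easy to sanity-check locally. The sketch below times a stand-in dictionary lookup, not the real resolve call, and reports a nearest-rank p99; `measure_p99_ns` and the sample table are assumptions for illustration, and absolute numbers will vary with hardware and interpreter version.

```python
import time


def measure_p99_ns(fn, iterations: int = 10_000) -> int:
    """Time `fn` repeatedly and return the nearest-rank p99 latency in nanoseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    samples.sort()
    # Nearest rank: the smallest sample with at least 99% of samples at or below it.
    rank = max(1, -(-99 * len(samples) // 100))  # ceil(0.99 * n)
    return samples[rank - 1]


# Stand-in for the in-memory matrix: vendor -> fault code -> tier.
table = {"vendor_ericsson": {"1001": "P1", "1002": "P2"}}


def lookup() -> str:
    return table.get("vendor_ericsson", {}).get("1001", "P4")


p99 = measure_p99_ns(lookup)
print(f"p99 = {p99} ns")
```

Swapping `lookup` for a closure over a loaded matrix's resolve method would measure the real path, including model construction.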

Mitigation and Hardening

Fail-fast matrix validation. Loading is wrapped in Pydantic models, so a typo in mapped_tier or a negative sla_minutes raises at load() — not at the first event that hits it. On reload failure the watcher keeps the last-good matrix in memory, so a bad edit degrades to stale config, never to no config.
Never-drop fallback routing. Unknown vendors and unmapped codes route to P3/P4 with requires_review, and the ingestion hook forwards them to a review queue. This converts every coverage gap into an alertable backlog instead of a missed P1.
Security-boundary implications. A mapped_tier: P1 with auto_dispatch: true can trigger automated remediation, so the dispatch flag is only honoured inside the audited runbooks defined by Security Boundary Mapping. A spoofed or replayed trap claiming P1 must not be able to invoke privileged commands by virtue of its severity alone — boundary evaluation gates the dispatch.
Immutable decisions. NormalizedEvent is frozen=True, so once a tier is assigned no downstream stage can mutate it; correlation and scoring receive a read-only record, which keeps the audit trail intact.
Standards alignment. Tiers map explicitly onto ITU-T X.733 perceived-severity values (critical/major/minor/warning), so P1-P4 classifications stay predictable across multi-operator peering during joint fault resolution.
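
The tier-to-standard mapping can be kept as data so that exports to peering partners use the X.733 vocabulary directly. A minimal sketch follows; the `X733_BY_TIER` table and the `to_x733` helper are hypothetical names, and rejecting unknown tiers rather than defaulting them is an assumption.

```python
from enum import Enum


class Tier(str, Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"
    P4 = "P4"


# Internal tier -> ITU-T X.733 perceived severity, as stated in the text above.
X733_BY_TIER = {
    Tier.P1: "critical",
    Tier.P2: "major",
    Tier.P3: "minor",
    Tier.P4: "warning",
}


def to_x733(tier: str) -> str:
    """Translate an internal tier label to its X.733 perceived-severity name."""
    # Tier(...) raises ValueError for an unknown label, so bad input fails loudly.
    return X733_BY_TIER[Tier(tier)]
```

For example, `to_x733("P2")` yields `"major"`, while an unrecognized label raises instead of silently mapping to a default.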

Operational Hardening Notes

Cache the matrix in process and resolve from the in-memory dictionaries; re-reading or re-parsing YAML per event is the single most common cause of severity-stage latency, and it is entirely avoidable. Compare and key on Tier enum members rather than raw strings: each member is a singleton, so comparison and routing-rule lookup allocate nothing on the hot path. Size the worker pool to CPU rather than to queue depth: resolve is pure CPU and all workers on one event loop share a single thread, so extra workers add scheduling churn without adding throughput, and scaling across cores means running one loop per process. When you tune the routing rules, treat sla_minutes as the value that arms the SLA clock: a one-tier misclassification of a P1 as P2 doubles the at-risk window from 15 to 30 minutes, so review changes to the P1/P2 boundary with the same rigour as a code change. Finally, emit the requires_review count and per-tier volumes as metrics; a sudden P1 spike after a matrix edit is almost always a mapping mistake, and catching it on the dashboard is far cheaper than catching it in mean time to acknowledge (MTTA).
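
The SLA arithmetic behind the P1/P2 warning can be spelled out. In the sketch below, `sla_deadline` is a hypothetical helper and the minute values are copied from routing_rules in the matrix shown earlier; the timestamp is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# sla_minutes per tier, copied from routing_rules in severity_matrix.yaml.
SLA_MINUTES = {"P1": 15, "P2": 30, "P3": 120, "P4": 480}


def sla_deadline(received_at: datetime, tier: str) -> datetime:
    """Return the instant by which an event of `tier` must be actioned."""
    return received_at + timedelta(minutes=SLA_MINUTES[tier])


received = datetime(2026, 6, 20, 8, 0, tzinfo=timezone.utc)
correct = sla_deadline(received, "P1")    # deadline if classified correctly
misfiled = sla_deadline(received, "P2")   # deadline if misfiled one tier down
extra_exposure = misfiled - correct
print(extra_exposure)  # 0:15:00
```

The extra 15 minutes is time during which a true P1 sits without the P1 escalation chain or auto-dispatch.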

Related

Up to: Security Boundary Mapping, the jurisdictional routing layer this severity stage feeds
Severity Scoring Algorithms — how normalized tiers become weighted correlation scores
Implementing Weighted Severity Scoring — the scoring pattern that consumes these tiers
Event Schema Design — the immutable event contract every mapping field relies on
How to Map Cisco Syslog to RFC 5424 — extracting the numeric priority this matrix keys on
