Implementing Weighted Severity Scoring

Static severity mapping in telecom fault management consistently underperforms during cascading outages. When a core BGP peer or optical transport node fails, downstream access nodes, aggregation switches, and customer premises equipment generate thousands of raw alarms. A rule engine that assigns identical priority to every event floods NOC queues and buries the true root cause, and the operational cost is direct: a node failure that should produce one Critical ticket instead produces hundreds of equal-weight alarms, and engineers burn 20–40 minutes of mean time to resolution (MTTR) manually triaging noise before automated remediation can even start. Weighted severity scoring resolves this by transforming raw alarm attributes into a single normalized routing score that reflects actual service impact, topology position, and SLA exposure — so that one upstream cause outranks its symptoms deterministically.

Schema alignment and taxonomy anchor

This page is the weight-calibration application of the parent Severity Scoring Algorithms stage inside the broader Fault Correlation & Rule Engines pipeline. That parent stage owns the end-to-end scoring contract — what enters, what tier exits, and how the scorer stays stateless; this page owns the narrower job of deriving and tuning the weights that drive that score. By the time an event reaches the scorer it has already passed validation against the canonical contract in Event Schema Design, so the weighting logic can trust that ne_type, alarm_code, and sla_tier are present, typed, and normalized rather than re-coercing them on the hot path.

The scorer consumes already-correlated events and emits a bounded score plus a deterministic tier label. It does not create tickets, mutate topology state, or decide which symptoms to suppress — those responsibilities belong to adjacent stages. Symptom suppression and adjacency resolution are owned by Topology-Aware Correlation, and the boundary calibration of the alert thresholds themselves belongs to Threshold Tuning Methods.

The weight matrix

The score is built from four primary weight vectors, each enriched onto the event before arithmetic runs:

Topology criticality (W_t) — positional weight derived from network hierarchy (core > aggregation > access).
Service impact multiplier (W_s) — customer-facing exposure pulled from CMDB/SLA tiers.
Temporal decay factor (D_t) — a time-based reduction that prevents stale alarms from monopolizing routing queues.
Correlation suppression index (C_s) — a penalty applied when an event is a downstream symptom of an already-acknowledged parent fault.

The final routing score is the product of the topology and service weights, scaled by decay, reduced by the suppression penalty, and then clamped. It is written out here to keep the clamp behaviour explicit:

decay = max(0.0, 1.0 - elapsed_minutes / decay_window)
score = clamp((W_t * W_s) * decay - C_s, lower=0.0, upper=1.0)

The clamp guarantees the output stays in [0.0, 1.0], so a misconfigured weight can saturate a score but can never push it outside the range the routing bands are defined over. The multiplicative W_t * W_s core lets upstream infrastructure failures dominate downstream symptoms, while decay prevents score inflation during prolonged fault windows.
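As a quick check on the arithmetic, the formula can be evaluated by hand for a few combinations from the weight matrix below. This is a minimal sketch: the helper names `clamp` and `routing_score` are illustrative, and the elapsed times are made-up inputs.

```python
def clamp(value: float, lower: float = 0.0, upper: float = 1.0) -> float:
    """Bound value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def routing_score(w_t: float, w_s: float, elapsed_minutes: float,
                  decay_window: float, c_s: float) -> float:
    """score = clamp((W_t * W_s) * decay - C_s), rounded to 4 places."""
    decay = max(0.0, 1.0 - elapsed_minutes / decay_window)
    return round(clamp(w_t * w_s * decay - c_s), 4)


# Fresh core-router fault on an enterprise leased line:
# 0.95 * 1.50 = 1.425, which the clamp caps at 1.0.
fresh_core = routing_score(0.95, 1.50, 0.0, 45.0, 0.0)

# Aggregation switch on wholesale transport, 15 minutes old:
# 0.70 * 1.40 * (1 - 15/45) = 0.6533.
aging_aggregation = routing_score(0.70, 1.40, 15.0, 45.0, 0.0)

# CPE symptom under the full 0.30 suppression penalty, 9 minutes old:
# 0.15 * 1.00 * 0.8 - 0.30 is negative, so the clamp floors it at 0.0.
cpe_symptom = routing_score(0.15, 1.00, 9.0, 45.0, 0.30)

print(fresh_core, aging_aggregation, cpe_symptom)
```

The first case shows why the upper clamp matters: any service multiplier above 1.0 can push the raw product past the ceiling.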

Deployment begins with a version-controlled YAML manifest that maps network element types, vendor-specific alarm codes, and SLA tiers to base weights. It is hot-reloaded via a file watcher or GitOps pipeline so calibration changes are auditable and reversible:

# weights_matrix.yaml
topology_criticality:
  core_router: 0.95
  optical_mux: 0.88
  aggregation_switch: 0.70
  access_dslam: 0.35
  cpe: 0.15

service_impact:
  enterprise_leased_line: 1.50
  wholesale_transport: 1.40
  residential_broadband: 1.00
  internal_monitoring: 0.50

vendor_alarm_overrides:
  CISCO-ENVMON-MIB::envMonFanStatusChange:
    topology_weight: 0.40
    impact_multiplier: 0.80
  HUAWEI-ALARM-MIB::hwOpticalLossOfSignal:
    topology_weight: 0.92
    impact_multiplier: 1.35

scoring_parameters:
  decay_window_minutes: 45
  max_correlation_suppression: 0.30
  score_floor: 0.0
  score_ceiling: 1.0
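One way to resolve base weights from this manifest is sketched below. The precedence (a vendor alarm override replaces both base weights) is an assumption inferred from the section name, not something the manifest states, and `resolve_base_weights` is a hypothetical helper. The matrix is inlined as a dict holding a subset of the YAML so the example runs without a YAML parser.

```python
# Subset of weights_matrix.yaml, inlined as a dict for the example.
MATRIX = {
    "topology_criticality": {"core_router": 0.95, "access_dslam": 0.35},
    "service_impact": {"enterprise_leased_line": 1.50, "residential_broadband": 1.00},
    "vendor_alarm_overrides": {
        "CISCO-ENVMON-MIB::envMonFanStatusChange": {
            "topology_weight": 0.40,
            "impact_multiplier": 0.80,
        },
    },
}


def resolve_base_weights(matrix: dict, ne_type: str, alarm_code: str,
                         sla_tier: str) -> tuple[float, float]:
    """Return (topology_weight, impact_multiplier) for one event.

    Assumption: an alarm-code override wins over the element-type and
    SLA-tier defaults. Unknown ne_type or sla_tier raises KeyError so the
    caller can decide on a fallback instead of scoring with a guess.
    """
    override = matrix["vendor_alarm_overrides"].get(alarm_code)
    if override is not None:
        return override["topology_weight"], override["impact_multiplier"]
    return matrix["topology_criticality"][ne_type], matrix["service_impact"][sla_tier]


fan_alarm = resolve_base_weights(
    MATRIX, "core_router", "CISCO-ENVMON-MIB::envMonFanStatusChange",
    "enterprise_leased_line",
)
generic_alarm = resolve_base_weights(
    MATRIX, "core_router", "UNMAPPED-ALARM", "enterprise_leased_line",
)
```

Under this assumed precedence, a fan-status change on a core router is scored from the override pair (0.40, 0.80) rather than the core-router defaults.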

Production scoring engine

The scoring service should be stateless, idempotent, and non-blocking so a single worker can fan a correlation window across the event loop without stalling ingestion. The model below uses Pydantic V2 strict validation for the inbound event and the weight matrix, and the pure-CPU score computation is wrapped in an async method so it composes cleanly with the rest of the pipeline.

from __future__ import annotations

import asyncio
import time
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, field_validator


class SeverityTier(str, Enum):
    BATCHED = "batched"
    L2_RUNBOOK = "l2_runbook"
    P1_PAGE = "p1_page"


class FaultEvent(BaseModel):
    """A normalized, already-correlated event entering the weighted scorer."""

    model_config = ConfigDict(strict=True, extra="forbid", populate_by_name=True)

    event_id: str = Field(alias="eventId")
    ne_type: str = Field(alias="neType")
    alarm_code: str = Field(alias="alarmCode")
    sla_tier: str = Field(alias="slaTier")
    timestamp_epoch: float = Field(alias="timestampEpoch", gt=0.0)
    correlation_parent_id: Optional[str] = Field(default=None, alias="correlationParentId")

    @field_validator("event_id")
    @classmethod
    def _non_empty(cls, value: str) -> str:
        if not value.strip():
            raise ValueError("event_id must not be empty")
        return value


class WeightConfig(BaseModel):
    """Immutable scoring policy resolved per event from the YAML weight matrix."""

    model_config = ConfigDict(frozen=True, extra="forbid")

    topology_weight: float = Field(ge=0.0, le=1.0)
    impact_multiplier: float = Field(ge=0.0)
    decay_window_min: float = Field(default=45.0, gt=0.0)
    max_suppression: float = Field(default=0.30, ge=0.0, le=1.0)


class WeightedScorer:
    """Stateless, deterministic weighted-severity calculator for the hot path."""

    def __init__(self, parent_severity_lookup) -> None:
        # Async callable: parent_id -> acknowledged parent score (or None).
        self._lookup_parent = parent_severity_lookup

    def _tier(self, score: float) -> SeverityTier:
        if score >= 0.85:
            return SeverityTier.P1_PAGE
        if score >= 0.60:
            return SeverityTier.L2_RUNBOOK
        return SeverityTier.BATCHED

    async def compute(self, event: FaultEvent, cfg: WeightConfig) -> dict:
        # 1. Base vector: topology criticality modulated by service impact.
        raw = cfg.topology_weight * cfg.impact_multiplier

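        # 2. Temporal decay damps stale alarms so they cannot monopolize routing.
        #    Elapsed time is floored at zero so a future-dated timestamp (clock
        #    skew) cannot push decay above 1.0 and inflate the score.
        elapsed_min = max(0.0, (time.time() - event.timestamp_epoch) / 60.0)
        decay = max(0.0, 1.0 - (elapsed_min / cfg.decay_window_min))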
        decayed = raw * decay

        # 3. Correlation suppression penalizes symptoms of an acknowledged parent.
        suppression = 0.0
        if event.correlation_parent_id:
            parent_score = await self._lookup_parent(event.correlation_parent_id)
            if parent_score is not None:
                # Heavier penalty the more severe the acknowledged parent already is.
                suppression = cfg.max_suppression * min(1.0, parent_score)

        # 4. Bounded clamp so a bad weight can never breach an SLA tier.
        final = max(0.0, min(1.0, decayed - suppression))
        final = round(final, 4)

        return {
            "event_id": event.event_id,
            "raw": round(raw, 4),
            "decay": round(decay, 4),
            "suppression": round(suppression, 4),
            "score": final,
            "tier": self._tier(final).value,
            "scored_at": time.time(),
        }

The extra="forbid" setting is deliberate: an unmodeled field from a misbehaving exporter is a schema-drift signal, not something to silently swallow. The parent-severity lookup is awaited for every correlated symptom instead of being answered from a synchronous cache that may miss, so the scorer never derives the suppression penalty from a stale or guessed value. This is the same strict-mode discipline applied to inbound protocol payloads in Validating NetFlow Events with Pydantic.

Async ingestion hook

Correlation windows arrive in bursts, so scoring must fan out across the event loop and isolate per-event failures without stalling the consumer. The hook below batches events, routes malformed payloads to a dead-letter queue, and awaits the downstream dispatch handoff so the only place the loop can stall is bounded I/O.

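import asyncio
import logging
from typing import Optional

from pydantic import ValidationError

# FaultEvent and WeightedScorer come from the scoring engine above.
# route_to_itsm is the async ITSM dispatch coroutine; it is assumed to be
# defined elsewhere in the pipeline.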

logger = logging.getLogger("severity_scoring")


async def score_batch(
    scorer: "WeightedScorer",
    raw_events: list[dict],
    resolve_weights,          # async: (ne_type, alarm_code, sla_tier) -> WeightConfig
    dlq: asyncio.Queue,
) -> list[dict]:
    """Validate, weight, and score a correlation window concurrently."""

    async def _one(raw: dict) -> Optional[dict]:
        try:
            event = FaultEvent.model_validate(raw)
        except ValidationError as exc:
            await dlq.put({"raw": raw, "errors": exc.errors(include_url=False)})
            logger.warning("scoring validation failure", extra={"raw_id": raw.get("eventId")})
            return None
        cfg = await resolve_weights(event.ne_type, event.alarm_code, event.sla_tier)
        return await scorer.compute(event, cfg)

    results = await asyncio.gather(*(_one(raw) for raw in raw_events))
    scored = [r for r in results if r is not None]
    await route_to_itsm(scored)   # awaited so backpressure propagates upstream
    await asyncio.sleep(0)        # yield to the loop between windows
    return scored

Because validation and the per-event arithmetic are pure CPU and bounded, the loop stays responsive under alarm-storm load. The only suspension points are awaited I/O: the resolve_weights call, the parent-severity lookup inside compute, and the route_to_itsm and dlq.put handoffs. Awaiting the handoffs is exactly the behaviour you want for backpressure to reach the ingestion edge.
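The failure-isolation pattern in the hook can be exercised without Pydantic. The sketch below swaps model validation for a hypothetical `check_required` function that raises ValueError, purely to show that one bad payload lands in the dead-letter queue while the rest of the window is still returned. The field list and sample events are made up.

```python
import asyncio
from typing import Optional

REQUIRED = ("eventId", "neType", "alarmCode", "slaTier")


def check_required(raw: dict) -> dict:
    """Stand-in for FaultEvent.model_validate: reject payloads with missing keys."""
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return raw


async def isolate_failures(raw_events: list[dict], dlq: asyncio.Queue) -> list[dict]:
    """Validate a window concurrently; bad events go to the DLQ, not the caller."""

    async def _one(raw: dict) -> Optional[dict]:
        try:
            return check_required(raw)
        except ValueError as exc:
            await dlq.put({"raw": raw, "error": str(exc)})
            return None

    results = await asyncio.gather(*(_one(raw) for raw in raw_events))
    return [item for item in results if item is not None]


def run_demo() -> tuple[list[dict], int]:
    async def _main() -> tuple[list[dict], int]:
        dlq: asyncio.Queue = asyncio.Queue()
        good = {"eventId": "e1", "neType": "core_router",
                "alarmCode": "LOS", "slaTier": "enterprise_leased_line"}
        bad = {"eventId": "e2"}  # missing three required fields
        kept = await isolate_failures([good, bad, good], dlq)
        return kept, dlq.qsize()

    return asyncio.run(_main())


kept, dead_lettered = run_demo()
```

Catching the exception inside `_one` is what keeps `asyncio.gather` from propagating a single failure to the whole window.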

Routing thresholds and ITSM handoff

The computed score drives deterministic ticket routing via ITSM webhooks (ServiceNow, Jira, or a custom platform). Bands are calibrated to align with SLA breach windows and engineering capacity, and they stay fixed so routing remains auditable:

Score band	Tier	Routing action	Automation trigger
`≥ 0.85`	P1 page	Auto-assign to L3 engineering	Immediate page, topology snapshot attached, runbook held for manual validation
`≥ 0.60` and `< 0.85`	L2 runbook	L2 queue with runbook execution	Automated remediation (interface bounce, BGP soft reset, optical power check)
`< 0.60`	Batched	Maintenance-window digest	Aggregated daily summary, suppressed from real-time NOC dashboards

The webhook payload should carry the raw score, the contributing vectors (raw, decay, suppression), and a deterministic trace ID so every routing decision is reproducible during post-incident review.
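A payload of that shape might be assembled as follows. This is a sketch under stated assumptions: the `config_version` argument, the field names, and the choice of a truncated SHA-256 over a canonical JSON encoding as the trace ID are all illustrative, not a fixed contract. The wall-clock `scored_at` value is left out of the hash on purpose so replaying the same decision yields the same ID.

```python
import hashlib
import json


def build_webhook_payload(breakdown: dict, config_version: str) -> dict:
    """Build an ITSM payload with a trace ID derived only from stable inputs."""
    vectors = {key: breakdown[key] for key in ("raw", "decay", "suppression")}
    identity = {
        "event_id": breakdown["event_id"],
        "config_version": config_version,
        "score": breakdown["score"],
        "vectors": vectors,
    }
    # Canonical encoding: sorted keys and fixed separators give stable bytes.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    trace_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]
    return {**identity, "tier": breakdown["tier"], "trace_id": trace_id}


sample = {"event_id": "evt-1", "raw": 1.425, "decay": 0.9, "suppression": 0.0,
          "score": 1.0, "tier": "p1_page", "scored_at": 1700000000.0}
replayed = {**sample, "scored_at": 1700003600.0}

first = build_webhook_payload(sample, "matrix-v7")
second = build_webhook_payload(replayed, "matrix-v7")
other_version = build_webhook_payload(sample, "matrix-v8")
```

Here `first` and `second` share a trace ID despite different scoring times, while a different matrix version produces a different one.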

Mitigation and hardening

Weighted scoring introduces new failure modes that need SRE-grade safeguards. Each path below is a concrete defect a production deployment must handle:

Topology graph staleness. If the CMDB or topology graph is out of sync, W_t becomes inaccurate and a core failure can be scored as access-tier noise. Fall back to a conservative default topology weight (for example 0.60) tagged degraded=true, and alert when graph refresh latency exceeds 5 minutes so routing can apply a wider safety margin rather than trusting a stale weight.
Malformed enrichment to the dead-letter queue. A missing field, a contradictory topology flag, or an out-of-range timestamp must isolate the event to a DLQ carrying the raw payload and the flat errors() list — never default-critical and never crash the worker. Alert when the DLQ rejection rate exceeds 0.5% over five minutes, since a spike usually means a firmware or template change on one element.
Over-suppression masking multi-vector failures. An aggressive C_s can hide a genuine second root cause behind an acknowledged parent. Reset the suppression index for a signature if downstream alarms exceed a configurable count within a 5-minute window, so a true multi-fault event re-escalates instead of staying buried.
Transport resilience to ITSM. Use exponential backoff and a circuit breaker on downstream webhooks, and buffer unacknowledged payloads in a persistent queue (Kafka or Redis Streams) so a platform outage degrades to delayed delivery rather than silent score loss.
Configuration drift between workers. Load the weight matrix from a single versioned source and stamp the config version into every scoring breakdown. A circuit breaker should halt dispatch and page operations if two workers report different versions for the same window, preventing inconsistent tiering from reaching customers.
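The over-suppression safeguard above can be sketched as a sliding-window counter per signature. The class name `SuppressionGuard`, its parameters, and the threshold of 3 are hypothetical. Only the 5-minute window and the "exceeds a configurable count" rule come from the text.

```python
from collections import defaultdict, deque


class SuppressionGuard:
    """Tracks downstream alarms per signature over a sliding time window."""

    def __init__(self, max_downstream: int, window_seconds: float = 300.0) -> None:
        self._max = max_downstream
        self._window = window_seconds
        self._seen: dict[str, deque] = defaultdict(deque)

    def observe(self, signature: str, now: float) -> bool:
        """Record one downstream alarm; return True when suppression should reset."""
        seen = self._seen[signature]
        seen.append(now)
        # Drop observations that have aged out of the window.
        while seen and now - seen[0] > self._window:
            seen.popleft()
        return len(seen) > self._max


guard = SuppressionGuard(max_downstream=3)
# Four alarms inside five minutes: the fourth exceeds the count of 3.
burst = [guard.observe("los:agg-7", t) for t in (0.0, 10.0, 20.0, 30.0)]
# A lone alarm much later finds the window empty again.
later = guard.observe("los:agg-7", 400.0)
```

Timestamps are passed in explicitly so the guard stays deterministic and easy to replay against logged events.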

Operational hardening notes

Tuning this pattern is about keeping the per-event cost flat under storm load:

Strict validation. Strict mode (ConfigDict(strict=True)) removes the coercion path entirely, which both speeds validation and surfaces schema drift as an explicit failure.
Weight caching. Resolve and cache WeightConfig objects per (ne_type, alarm_code, sla_tier) key so the YAML lookup is not repeated for every event in a flood; the matrix changes on a GitOps cadence, not per message.
Decay calibration. Calibrate decay_window_minutes against the median MTTR per element class: if D_t decays too quickly a prolonged-but-legitimate fault drops below its tier prematurely; too slowly and stale alarms linger in P1.
Vendor normalization. Map proprietary MIBs and syslog patterns to canonical alarm classes using the severity model in ITU-T Recommendation X.733 before they reach the scorer, so vendor-specific noise cannot skew W_t and W_s.
Feedback and replay. Export score, raw, decay, and suppression as features for the offline models that recommend weight-matrix adjustments, and keep an immutable log of every decision so a regressive tuning change can be replayed against the previous matrix over the last 24 hours of events.
Scaling. Validation and scoring workers should stay stateless and horizontally scalable, holding the same sub-2-second SLA ticket-creation budget as the rest of the correlation fabric.
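To make the decay calibration concrete, the linear decay formula can be solved for how long a score stays at or above a tier floor. The sketch below assumes a positive tier floor and uses made-up repair durations. The helper names are hypothetical, and the result is a starting point for calibration rather than a recommended value.

```python
from statistics import median


def minutes_above_tier(raw: float, tier_floor: float, decay_window: float,
                       suppression: float = 0.0) -> float:
    """Minutes for which raw * decay - suppression stays at or above tier_floor."""
    needed = tier_floor + suppression
    if raw <= 0.0 or needed >= raw:
        return 0.0
    # Solve raw * (1 - t / window) - suppression >= tier_floor for t.
    return decay_window * (1.0 - needed / raw)


def window_for_target(raw: float, tier_floor: float, target_minutes: float,
                      suppression: float = 0.0) -> float:
    """Decay window needed to hold the tier for target_minutes."""
    needed = tier_floor + suppression
    if raw <= 0.0 or needed >= raw:
        raise ValueError("score never reaches this tier; no window can hold it there")
    return target_minutes / (1.0 - needed / raw)


repair_minutes = [22.0, 35.0, 40.0, 48.0, 95.0]  # hypothetical MTTR samples
target = median(repair_minutes)                   # 40.0

raw = 0.95 * 1.50                                 # core router, enterprise leased line
held = minutes_above_tier(raw, 0.85, 45.0)        # roughly 18.2 minutes
window = window_for_target(raw, 0.85, target)     # roughly 99.1 minutes
```

With the default 45-minute window, this P1-class fault falls below 0.85 after about 18 minutes, which is the kind of premature drop the calibration is meant to catch.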

Frequently Asked Questions

Why multiply the topology and service weights instead of adding them?

Multiplication makes the two dimensions interact the way operations actually reason about impact: a core router (W_t high) carrying enterprise leased lines (W_s high) compounds into a near-ceiling score, while the same core router carrying only internal monitoring traffic is meaningfully discounted. Addition would let a high value on one axis mask a low value on the other, producing false criticals for high-impact customers on trivial elements.
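The masking effect is easy to see with numbers from the weight matrix. The sketch below compares the product against a plain sum for a CPE on an enterprise leased line, using the tier bands from the routing table. The averaged variant is an extra illustration, not something the scorer uses.

```python
def clamp01(value: float) -> float:
    return max(0.0, min(1.0, value))


def tier(score: float) -> str:
    """Same bands as the routing table: >= 0.85 P1, >= 0.60 L2, else batched."""
    if score >= 0.85:
        return "p1_page"
    if score >= 0.60:
        return "l2_runbook"
    return "batched"


w_t, w_s = 0.15, 1.50  # cpe on enterprise_leased_line

multiplied = clamp01(w_t * w_s)        # 0.225
added = clamp01(w_t + w_s)             # 1.65 clamps to 1.0
averaged = clamp01((w_t + w_s) / 2.0)  # 0.825

comparison = {
    "multiplied": tier(multiplied),
    "added": tier(added),
    "averaged": tier(averaged),
}
print(comparison)
```

The product keeps the trivial element in the batched band, while the sum pages L3 for it and even the average lands in the L2 queue.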

How does temporal decay avoid dropping legitimate long-running faults?

The decay factor is max(0.0, 1.0 - elapsed_minutes / decay_window), so the window length is the lever. Use the median MTTR for that element class as the baseline for decay_window_minutes, keeping in mind that linear decay reaches zero at exactly decay_window_minutes: lengthen the window until a genuine outage stays above its tier for the whole expected repair window, and reserve aggressive decay for chronically flapping signatures that have already been acknowledged.

What stops correlation suppression from hiding a second root cause?

Suppression is capped by max_correlation_suppression and is keyed to an acknowledged parent. If downstream alarms for a signature exceed a configured count within a five-minute window, the suppression index resets so a true multi-vector failure re-escalates. The penalty damps duplicate symptom tickets without ever zeroing out an independent fault.

Where should the weight matrix live and how is it changed safely?

Keep weights_matrix.yaml in version control and hot-reload it through a file watcher or GitOps pipeline, stamping the config version into every scoring breakdown. Shadow-test a candidate matrix against ~10% of live traffic and require that it lowers the false-positive routing rate without suppressing legitimate P1 events before promoting it.
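One possible way to pick the roughly 10% shadow slice is to hash the event ID, so the same event is always in or out of the sample regardless of which worker sees it. This is an assumption about how sampling could be done, not a description of an existing mechanism, and `in_shadow_sample` is a hypothetical helper.

```python
import hashlib


def in_shadow_sample(event_id: str, percent: int = 10) -> bool:
    """Deterministically place an event in the shadow slice by hashing its ID."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99
    return bucket < percent


event_ids = [f"evt-{n}" for n in range(10_000)]
sampled = [eid for eid in event_ids if in_shadow_sample(eid)]
fraction = len(sampled) / len(event_ids)
```

Events in the slice would be scored with both the live and the candidate matrix, and only the live result routed, so the two false-positive rates can be compared on identical traffic.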

Related

Up to: Severity Scoring Algorithms — the parent scoring stage this weight matrix feeds
Threshold Tuning Methods — closed-loop recalibration of the dispatch bands
Topology-Aware Correlation — the adjacency resolution that sets the topology weight
Defining Severity Levels for Telecom Faults — the canonical severity taxonomy the base weights map onto
