Implementing Weighted Severity Scoring

Static severity mapping in telecom fault management consistently underperforms during cascading outages. When a core BGP peer or optical transport node fails, downstream access nodes, aggregation switches, and customer premises equipment generate thousands of raw alarms. Traditional rule engines assign identical priority levels to each event, flooding NOC queues and obscuring the true root cause. The operational consequence is a direct increase in MTTR as engineers manually triage noise instead of executing automated remediation. Implementing Weighted Severity Scoring resolves this by transforming raw alarm attributes into a normalized routing score that reflects actual service impact, topology position, and SLA exposure.

Deterministic Scoring Architecture

The scoring model must operate as a deterministic pipeline rather than a heuristic guess. Each incoming fault event is enriched with four primary weight vectors:

  1. Topology Criticality (W_t): Positional weight derived from network hierarchy (core > aggregation > access).
  2. Service Impact Multiplier (W_s): Customer-facing exposure factor pulled from CMDB/SLA tiers.
  3. Temporal Decay Factor (D_t): Time-based reduction that prevents stale alarms from monopolizing routing queues.
  4. Correlation Suppression Index (C_s): Penalty applied when an event is identified as a downstream symptom of an already-acknowledged parent fault.

The final routing score is the product of the three weights, reduced by the suppression penalty and bounded:

Score = clamp((W_t * W_s * D_t) - C_s, 0.0, 1.0)

where D_t = max(0.0, 1.0 - (elapsed_minutes / decay_window)) and clamp() keeps the output within [0.0, 1.0]. Because W_t scales the whole product and C_s is subtracted only from correlated events, upstream infrastructure failures outrank their downstream symptoms, while the decay term prevents score inflation during prolonged fault windows. Real-time execution requires tight integration with your existing Fault Correlation & Rule Engines to ensure topology graphs and service dependency maps are queried before scoring occurs.
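
To make the arithmetic concrete, the following minimal sketch evaluates the scoring formula for two hypothetical events that share the same age and SLA tier. The function names and the input values are illustrative assumptions, not part of any existing service.

```python
def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Bound value to the closed interval [low, high]."""
    return max(low, min(high, value))


def routing_score(w_t: float, w_s: float, elapsed_minutes: float,
                  decay_window: float, c_s: float = 0.0) -> float:
    """Score = clamp(W_t * W_s * D_t - C_s), with D_t as defined in the text."""
    d_t = max(0.0, 1.0 - elapsed_minutes / decay_window)
    return clamp(w_t * w_s * d_t - c_s)


# Hypothetical inputs: an uncorrelated core fault versus an access-layer
# symptom already linked to an acknowledged parent (C_s = 0.25).
core_fault = routing_score(w_t=0.95, w_s=1.00, elapsed_minutes=9, decay_window=45)
symptom = routing_score(w_t=0.35, w_s=1.00, elapsed_minutes=9, decay_window=45, c_s=0.25)

print(f"core fault: {core_fault:.3f}")  # 0.95 * 1.00 * 0.8        = 0.760
print(f"symptom:    {symptom:.3f}")     # 0.35 * 1.00 * 0.8 - 0.25 = 0.030
```

One consequence worth noting: because W_s may exceed 1.0, the product W_t * W_s can exceed the ceiling. With weights of 0.95 and 1.50 the pre-decay value is 1.425, so the clamp holds the score at 1.0 until D_t falls below roughly 0.70.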

Configuration: The Weight Matrix

Deploying the engine begins with a version-controlled YAML manifest that maps network element types, vendor-specific alarm codes, and SLA tiers to base weights. This configuration is hot-reloaded via a file watcher or GitOps pipeline.

# weights_matrix.yaml
topology_criticality:
  core_router: 0.95
  optical_mux: 0.88
  aggregation_switch: 0.70
  access_dslam: 0.35
  cpe: 0.15

service_impact:
  enterprise_leased_line: 1.50
  wholesale_transport: 1.40
  residential_broadband: 1.00
  internal_monitoring: 0.50

vendor_alarm_overrides:
  "CISCO-ENVMON-MIB::envMonFanStatusChange":
    topology_weight: 0.40
    impact_multiplier: 0.80
  "HUAWEI-ALARM-MIB::hwOpticalLossOfSignal":
    topology_weight: 0.92
    impact_multiplier: 1.35

scoring_parameters:
  decay_window_minutes: 45
  max_correlation_suppression: 0.30
  score_floor: 0.0
  score_ceiling: 1.0
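
How the manifest is consulted at scoring time is not spelled out above, so the sketch below shows one plausible lookup order: resolve W_t and W_s from the NE type and SLA tier, then let a matching vendor alarm override replace either value. The precedence rule, the function name, and the fallback defaults are assumptions; the manifest is written as an abbreviated dict literal standing in for the parsed YAML.

```python
from typing import Any, Dict, Tuple

# Abbreviated, already-parsed form of weights_matrix.yaml.
MANIFEST: Dict[str, Any] = {
    "topology_criticality": {"core_router": 0.95, "access_dslam": 0.35},
    "service_impact": {"enterprise_leased_line": 1.50, "residential_broadband": 1.00},
    "vendor_alarm_overrides": {
        "HUAWEI-ALARM-MIB::hwOpticalLossOfSignal": {
            "topology_weight": 0.92,
            "impact_multiplier": 1.35,
        },
    },
}

# Assumed fallbacks for NE types or SLA tiers missing from the manifest.
DEFAULT_TOPOLOGY_WEIGHT = 0.60
DEFAULT_IMPACT_MULTIPLIER = 1.00


def resolve_weights(manifest: Dict[str, Any], ne_type: str,
                    sla_tier: str, alarm_code: str) -> Tuple[float, float]:
    """Return (W_t, W_s); a vendor override wins over the type/tier lookup."""
    w_t = manifest.get("topology_criticality", {}).get(ne_type, DEFAULT_TOPOLOGY_WEIGHT)
    w_s = manifest.get("service_impact", {}).get(sla_tier, DEFAULT_IMPACT_MULTIPLIER)
    override = manifest.get("vendor_alarm_overrides", {}).get(alarm_code, {})
    return (override.get("topology_weight", w_t),
            override.get("impact_multiplier", w_s))


print(resolve_weights(MANIFEST, "core_router", "enterprise_leased_line", "BGP_PEER_DOWN"))
print(resolve_weights(MANIFEST, "access_dslam", "residential_broadband",
                      "HUAWEI-ALARM-MIB::hwOpticalLossOfSignal"))
```

Letting the override win keeps vendor-specific exceptions in one place, at the cost of making the effective weight depend on two sections of the manifest; logging which branch supplied each value helps during audits.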

Python Implementation Pipeline

The scoring service should be lightweight, stateless, and exposed via gRPC or REST. Below is a reference Python implementation that uses pydantic for strict schema validation. The scoring function itself is synchronous and performs no I/O, so it can be called directly from an asyncio-based request handler.

import time
from dataclasses import dataclass
from typing import Optional
from pydantic import BaseModel, Field

class FaultEvent(BaseModel):
    event_id: str = Field(..., alias="eventId")
    ne_type: str = Field(..., alias="neType")
    alarm_code: str = Field(..., alias="alarmCode")
    sla_tier: str = Field(..., alias="slaTier")
    timestamp_epoch: float = Field(..., alias="timestampEpoch")
    correlation_parent_id: Optional[str] = Field(None, alias="correlationParentId")

@dataclass
class WeightConfig:
    topology_weight: float
    impact_multiplier: float
    decay_window: float = 45.0
    max_suppression: float = 0.30

def compute_weighted_score(event: FaultEvent, config: WeightConfig) -> float:
    # 1. Base multiplication
    raw_score = config.topology_weight * config.impact_multiplier
    
    # 2. Temporal decay (negative ages from clock skew are clamped so D_t never exceeds 1.0)
    elapsed_min = max(0.0, (time.time() - event.timestamp_epoch) / 60.0)
    decay_factor = max(0.0, 1.0 - (elapsed_min / config.decay_window))
    decayed_score = raw_score * decay_factor
    
    # 3. Correlation suppression
    suppression = 0.0
    if event.correlation_parent_id:
        # In production, fetch actual parent severity from correlation engine cache
        suppression = config.max_suppression * 0.85  # Example: 85% of max penalty
        
    # 4. Bounded final score
    final_score = max(0.0, min(1.0, decayed_score - suppression))
    return round(final_score, 4)

# Example execution
if __name__ == "__main__":
    evt = FaultEvent(
        eventId="TRAP-99281",
        neType="core_router",
        alarmCode="BGP_PEER_DOWN",
        slaTier="enterprise_leased_line",
        timestampEpoch=time.time() - 120,  # 2 mins ago
        correlationParentId=None
    )
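    cfg = WeightConfig(topology_weight=0.95, impact_multiplier=1.50)
    # 0.95 * 1.50 = 1.425 exceeds the ceiling even after two minutes of decay,
    # so this prints "Routing Score: 1.0"
    print(f"Routing Score: {compute_weighted_score(evt, cfg)}")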

Routing Thresholds & ITSM Integration

The computed score drives deterministic ticket routing via ITSM webhooks (ServiceNow, Jira, or custom platforms). Thresholds are calibrated to align with SLA breach windows and engineering capacity:

Score Range | Routing Action | Automation Trigger
≥ 0.85 | P1 Auto-Assignment to L3 Engineering | Immediate page, topology snapshot attached, runbook execution paused for manual validation
0.60 to < 0.85 | L2 Queue with Runbook Execution | Automated remediation scripts (interface bounce, BGP soft reset, optical power check)
< 0.60 | Batched Maintenance Window | Aggregated daily digest, suppressed from real-time NOC dashboards

Integration with standardized Severity Scoring Algorithms ensures the routing logic remains auditable and compliant with telecom operational baselines. The webhook payload should include the raw score, contributing vectors, and a deterministic trace ID for post-incident review.
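
The threshold mapping and webhook payload described above could be sketched as follows. The tier labels, the payload field names, and the choice of a SHA-256 hash over the event ID and configuration version for the trace ID are all assumptions made for illustration; no specific ITSM schema is implied.

```python
import hashlib
import json
from typing import Dict

# Boundaries taken from the routing thresholds; labels are illustrative.
P1_THRESHOLD = 0.85
L2_THRESHOLD = 0.60


def routing_action(score: float) -> str:
    """Map a score to a routing tier using half-open ranges (no gaps)."""
    if score >= P1_THRESHOLD:
        return "p1_l3_engineering"
    if score >= L2_THRESHOLD:
        return "l2_runbook_queue"
    return "batched_maintenance"


def trace_id(event_id: str, config_version: str) -> str:
    """The same event and config version always hash to the same ID."""
    digest = hashlib.sha256(f"{event_id}|{config_version}".encode("utf-8"))
    return digest.hexdigest()[:16]


def build_payload(event_id: str, score: float,
                  vectors: Dict[str, float], config_version: str) -> str:
    """Serialize the score, its contributing vectors, and the trace ID."""
    body = {
        "eventId": event_id,
        "score": score,
        "vectors": vectors,
        "action": routing_action(score),
        "traceId": trace_id(event_id, config_version),
    }
    return json.dumps(body, sort_keys=True)


print(build_payload("TRAP-99281", 1.0,
                    {"W_t": 0.95, "W_s": 1.50, "C_s": 0.0}, "weights-v1"))
```

Deriving the trace ID from the event and configuration version, rather than generating a random one, means a replayed event produces the same ID and can be matched against the original decision during post-incident review.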

Threshold Tuning & Flood Control

False positive flood control requires continuous calibration. During initial rollout, implement the following mitigation paths:

  1. Decay Window Calibration: Monitor queue dwell time. If D_t decays too quickly, legitimate but prolonged faults may drop below routing thresholds prematurely. Adjust decay_window_minutes based on median MTTR per NE class.
  2. Correlation Suppression Tuning: Over-suppression masks legitimate multi-vector failures. Implement a sliding window that resets C_s if downstream alarms exceed a configurable count threshold within a 5-minute interval.
  3. Vendor Alarm Normalization: Map proprietary MIBs and syslog patterns to canonical alarm classes using IETF YANG models or the severity definitions in ITU-T X.733 (Alarm Reporting Function). This prevents vendor-specific noise from skewing W_t and W_s.
  4. Predictive Fault Modeling Hooks: Feed historical score distributions into time-series forecasting models. When a node’s rolling average score approaches 0.75 without an active alarm, trigger proactive maintenance tickets before cascading failure occurs.
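
The sliding-window reset from item 2 can be sketched with a deque of arrival times per parent fault. The class name, the default count threshold of 50, and the assumption that timestamps arrive in non-decreasing order are illustrative choices, not values from a reference deployment.

```python
from collections import deque
from typing import Deque


class SuppressionWindow:
    """Tracks downstream alarms under one parent fault in a sliding window."""

    def __init__(self, max_alarms: int = 50, window_seconds: float = 300.0) -> None:
        self.max_alarms = max_alarms          # assumed count threshold
        self.window_seconds = window_seconds  # the 5-minute interval
        self._arrivals: Deque[float] = deque()

    def record(self, timestamp: float) -> None:
        """Register one downstream alarm and evict arrivals outside the window."""
        self._arrivals.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while self._arrivals and self._arrivals[0] < cutoff:
            self._arrivals.popleft()

    def suppression(self, c_s: float) -> float:
        """Return c_s normally, or 0.0 once the window holds more than max_alarms."""
        return 0.0 if len(self._arrivals) > self.max_alarms else c_s


# Small threshold so the reset is visible in a few events.
window = SuppressionWindow(max_alarms=3, window_seconds=300.0)
for ts in (0.0, 10.0, 20.0):
    window.record(ts)
print(window.suppression(0.255))  # 0.255: three alarms, threshold not exceeded

window.record(30.0)
print(window.suppression(0.255))  # 0.0: fourth alarm inside the window resets C_s
```

Once the burst ages out of the window, suppression returns to its configured value, so a single noisy interval does not permanently disable correlation for that parent.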

SRE Mitigation & Operational Best Practices

Deploying weighted scoring introduces new failure modes that require SRE-grade safeguards:

  • Topology Graph Staleness: If the CMDB or topology graph is out of sync, W_t becomes inaccurate. Implement a fallback weight matrix with conservative defaults (e.g., 0.60) and alert on graph refresh latency > 5 minutes.
  • gRPC/Transport Resilience: Use exponential backoff and circuit breakers for downstream ITSM webhooks. Queue unacknowledged payloads in a persistent buffer (Kafka or Redis Streams) to prevent score loss during platform outages.
  • AI-Driven Root Cause Analysis Integration: While weighted scoring handles deterministic routing, it should feed structured telemetry into ML pipelines. Export Score, W_t, W_s, and C_s as features for AI-driven RCA models that identify latent topology dependencies and recommend weight matrix adjustments.
  • Audit & Rollback: Maintain immutable logs of every scoring decision. If threshold tuning causes routing regressions, hot-swap the YAML manifest and replay the last 24 hours of events against the previous configuration to validate impact.
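
The replay step in the Audit & Rollback item could look like the sketch below, which rescores logged decisions under two configurations and reports only the events whose routing tier changes. The record fields, function names, and sample values are hypothetical; the sketch assumes the audit log stored each alarm's age at scoring time, so the replay does not depend on the current clock.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LoggedDecision:
    """One immutable audit record; field names are illustrative."""
    event_id: str
    ne_type: str
    sla_tier: str
    elapsed_minutes: float  # alarm age when it was originally scored
    c_s: float              # suppression penalty that was applied


def score(record: LoggedDecision, config: Dict[str, Any]) -> float:
    """Recompute the routing score for a logged event under a given config."""
    w_t = config["topology_criticality"][record.ne_type]
    w_s = config["service_impact"][record.sla_tier]
    window = config["scoring_parameters"]["decay_window_minutes"]
    d_t = max(0.0, 1.0 - record.elapsed_minutes / window)
    return max(0.0, min(1.0, w_t * w_s * d_t - record.c_s))


def tier(value: float) -> str:
    return "P1" if value >= 0.85 else "L2" if value >= 0.60 else "batch"


def replay_diff(log: Iterable[LoggedDecision], previous: Dict[str, Any],
                candidate: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """List (event_id, previous_tier, candidate_tier) for every changed route."""
    changes = []
    for record in log:
        before, after = tier(score(record, previous)), tier(score(record, candidate))
        if before != after:
            changes.append((record.event_id, before, after))
    return changes


base = {"service_impact": {"residential_broadband": 1.00},
        "scoring_parameters": {"decay_window_minutes": 45}}
previous = {**base, "topology_criticality": {"core_router": 0.95, "aggregation_switch": 0.70}}
candidate = {**base, "topology_criticality": {"core_router": 0.95, "aggregation_switch": 0.90}}
log = [LoggedDecision("A-1", "aggregation_switch", "residential_broadband", 0.0, 0.0),
       LoggedDecision("C-1", "core_router", "residential_broadband", 0.0, 0.0)]
print(replay_diff(log, previous, candidate))  # [('A-1', 'L2', 'P1')]
```

An empty diff indicates the candidate manifest routes the replayed window identically; a long one is a signal to review the change before or after the hot-swap.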

Implementing Weighted Severity Scoring transforms fault management from reactive noise suppression to proactive, topology-aware routing. By anchoring decisions in deterministic math, version-controlled configuration, and tight correlation engine integration, telecom operators can reduce MTTR, curb false positive floods, and scale automated remediation across multi-vendor, multi-domain networks.