Dynamic Threshold Adjustment for Latency Alerts

Static latency thresholds in telecom transport and core networks degrade MTTR the moment traffic stops behaving like the day the SLA was written. A fixed 50 ms one-way-delay trigger fires during a legitimate 70% diurnal traffic surge, and the NOC burns acknowledgment minutes validating a false positive that resolves itself by the next polling cycle. The inverse is worse: a sudden 15 ms degradation during an off-peak trough — a real fibre micro-bend or a congesting upstream peer — slips beneath the same fixed ceiling and never raises an event, delaying fault isolation until a customer escalation arrives. Both failure modes share one root cause: the boundary is a constant, but the network underneath it is not.

This page implements the fix as a focused, copy-paste worker: recompute each (element_id, metric) latency boundary continuously from rolling telemetry so that only statistically significant deviations promote into events. Done correctly, a dynamic boundary collapses the diurnal false-positive flood that typically accounts for 30–40% of latency tickets, while catching the off-peak degradations a static ceiling structurally ignores — pulling MTTA down because every alert that fires is one an engineer should actually look at.

Schema Alignment and Taxonomy Anchor

Dynamic threshold adjustment is the latency-specific case of the Threshold Tuning Methods stage inside the Fault Correlation & Rule Engines pipeline. It runs as a pre-correlation normalization layer: strictly downstream of telemetry parsing, strictly upstream of severity resolution and dispatch. What enters is a typed latency time series per network element; what exits is a single ThresholdBreach carrying the offending value, the active computed bounds, and the breach duration — nothing more.

The worker assumes clean input. Every measurement arriving here already carries a canonical element_id, a metric name, a millisecond latency_ms, and an epoch timestamp, exactly as the contracts in Event Schema Design require — this stage does not re-parse gNMI notifications or decode trap OIDs. The boundary it emits is consumed downstream by Severity Scoring Algorithms for final tiering and by Cross-Source Event Linking when a latency breach must be bound to co-occurring control-plane alarms. Tuning computes the boundary; it never opens a ticket.
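
As a rough illustration of that contract, the sketch below models the emitted record as a frozen dataclass. The source only states that a ThresholdBreach carries the offending value, the active computed bounds, and the breach duration; the field names, the `breach_from` helper, and the element name `pe-router-01` are assumptions made up for this example, not a published schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ThresholdBreach:
    """Hypothetical shape; field names are illustrative only."""
    element_id: str
    metric: str
    value_ms: float                         # the offending sample
    upper_bound_ms: float                   # active computed boundary at breach time
    duration_s: float                       # time spent above the boundary so far
    lower_bound_ms: Optional[float] = None  # unused by an upper-bound-only worker


def breach_from(element_id: str, metric: str, value_ms: float,
                threshold_ms: float, first_seen: float, now: float) -> ThresholdBreach:
    """Build a breach record; duration runs from the first over-bound sample."""
    if value_ms <= threshold_ms:
        raise ValueError("value does not exceed the active boundary")
    return ThresholdBreach(element_id, metric, value_ms, threshold_ms,
                           duration_s=max(0.0, now - first_seen))


breach = breach_from("pe-router-01", "one_way_delay", 63.4, 48.2,
                     first_seen=1_700_000_000.0, now=1_700_000_045.0)
```

Keeping the record frozen means downstream stages can pass it around without one consumer mutating what another sees.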

Production Threshold Worker

The worker below maintains a per-element sliding window, computes a time-decayed EWMA blended with a percentile tail buffer, and evaluates each new sample against the recomputed boundary. Configuration is a Pydantic V2 model so decay and safety parameters are runtime-tunable knobs rather than buried constants. The evaluation itself is synchronous and CPU-bound; the next section shows the async hook that keeps it off the event loop.

# dynamic_latency_threshold.py
import time
import logging
from collections import deque
from typing import Deque, Dict, Optional, Tuple

import numpy as np
from pydantic import BaseModel, Field, field_validator

logger = logging.getLogger("dynamic_latency_threshold")


class ThresholdConfig(BaseModel):
    """Runtime-tunable knobs, injected via ConfigMap or control-plane API."""
    model_config = {"frozen": True}

    window_hours: int = Field(168, ge=1)        # 7-day rolling baseline
    sample_interval_s: int = Field(15, ge=1)
    decay_factor: float = Field(0.05, gt=0.0)   # per-hour EWMA decay
    percentile: float = Field(0.95, gt=0.0, lt=1.0)
    min_samples: int = Field(100, ge=10)        # cold-start gate
    safety_multiplier: float = Field(1.2, ge=1.0)
    static_fallback_ms: float = Field(50.0, gt=0.0)
    drift_guard_ratio: float = Field(0.40, gt=0.0)  # circuit-breaker bound

    @field_validator("safety_multiplier")
    @classmethod
    def _sane_safety(cls, v: float) -> float:
        if v > 3.0:
            raise ValueError("safety_multiplier > 3.0 will mask real degradation")
        return v


class DynamicLatencyThreshold:
    """Adaptive per-element latency boundary from time-decayed EWMA + percentile tail."""

    def __init__(self, cfg: ThresholdConfig):
        self.cfg = cfg
        self.max_len = cfg.window_hours * (3600 // cfg.sample_interval_s)
        self.buffer: Deque[Tuple[float, float]] = deque(maxlen=self.max_len)
        self._last_threshold: Optional[float] = None

    def ingest(self, latency_ms: float, timestamp: Optional[float] = None) -> Dict:
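        # NaN or infinite values would poison every later average, so drop them too.
        if not np.isfinite(latency_ms) or latency_ms < 0:
            logger.warning("invalid latency dropped: %sms", latency_ms)
            return {"status": "invalid_input", "threshold_ms": None}
        # `timestamp or time.time()` would discard a legitimate 0.0 timestamp.
        ts = time.time() if timestamp is None else timestamp
        self.buffer.append((ts, latency_ms))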
        return self._evaluate()

    def _evaluate(self) -> Dict:
        if len(self.buffer) < self.cfg.min_samples:
            # Cold start: downstream must use the static SLA ceiling.
            return {
                "status": "cold_start",
                "threshold_ms": self.cfg.static_fallback_ms,
                "sample_count": len(self.buffer),
            }

        latencies = np.fromiter((v for _, v in self.buffer), dtype=float)
        timestamps = np.fromiter((t for t, _ in self.buffer), dtype=float)

        # Recent samples carry more weight; weights normalize to 1.0.
        delta_hours = (timestamps[-1] - timestamps) / 3600.0
        weights = np.exp(-self.cfg.decay_factor * delta_hours)
        weights /= weights.sum()

        ewma = float(np.average(latencies, weights=weights))
        p_tail = float(np.percentile(latencies, self.cfg.percentile * 100))
        candidate = ewma + (p_tail - ewma) * self.cfg.safety_multiplier

        # Drift circuit breaker: reject a boundary that jumps too far in one step.
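        # Skip the ratio when the last bound is 0 to avoid a ZeroDivisionError.
        if self._last_threshold is not None and self._last_threshold > 0:
            jump = abs(candidate - self._last_threshold) / self._last_threshold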
            if jump > self.cfg.drift_guard_ratio:
                logger.error("drift guard tripped: %.1f%% jump, holding last bound",
                             jump * 100)
                return {
                    "status": "drift_held",
                    "threshold_ms": round(self._last_threshold, 2),
                    "sample_count": len(self.buffer),
                }

        self._last_threshold = candidate
        breached = latencies[-1] > candidate
        return {
            "status": "breach" if breached else "active",
            "threshold_ms": round(candidate, 2),
            "ewma_ms": round(ewma, 2),
            "p_tail_ms": round(p_tail, 2),
            "value_ms": round(float(latencies[-1]), 2),
            "sample_count": len(self.buffer),
        }
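
To make the boundary arithmetic easier to check by hand, here is a minimal standard-library sketch of the same blend: a time-decayed weighted mean plus a scaled gap up to the percentile tail. The function names `percentile` and `dynamic_boundary` are hypothetical, the input is assumed non-empty, and the percentile uses linear interpolation, which is also numpy's default method.

```python
import math
from typing import List, Tuple


def percentile(values: List[float], q: float) -> float:
    """Linear-interpolated percentile for q in [0, 100]; values must be non-empty."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def dynamic_boundary(samples: List[Tuple[float, float]],
                     decay_per_hour: float = 0.05,
                     pct: float = 0.95,
                     safety: float = 1.2) -> float:
    """samples are (epoch_seconds, latency_ms) pairs, oldest first."""
    newest = samples[-1][0]
    # Weight decays exponentially with age in hours, relative to the newest sample.
    weights = [math.exp(-decay_per_hour * (newest - t) / 3600.0) for t, _ in samples]
    total = sum(weights)
    weighted_mean = sum(w * v for w, (_, v) in zip(weights, samples)) / total
    tail = percentile([v for _, v in samples], pct * 100)
    # Boundary sits above the mean by the mean-to-tail gap, scaled by the safety factor.
    return weighted_mean + (tail - weighted_mean) * safety


# 101 samples with identical timestamps get equal weight: mean 50, p95 95.
flat = [(0.0, float(v)) for v in range(101)]
boundary = dynamic_boundary(flat)
```

With equal weights the result is 50 + (95 - 50) * 1.2, or about 104 ms, which shows that a safety_multiplier above 1.0 places the boundary beyond the percentile tail rather than on it.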

Async Ingestion Hook

The numpy evaluation is CPU-bound. On a full 7-day window (about 40,320 samples at a 15 s interval) each pass rebuilds two arrays from the deque and computes a percentile, which is more plausibly milliseconds than microseconds per sample, so on a busy event loop it should never block the collectors feeding it. A consumer task drains an asyncio.Queue populated by the telemetry tier, routes each measurement to its per-element worker, and forwards a ThresholdBreach onto the correlation queue only when the boundary is crossed. A hysteresis band keeps a value oscillating around the boundary from flapping the alert open and shut. This is the same non-blocking discipline applied to raw collection in Implementing asyncio for High-Volume SNMP.

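# Assumes this hook lives in the same module as ThresholdConfig,
# DynamicLatencyThreshold and logger above; otherwise import them from
# dynamic_latency_threshold along with Dict and Tuple from typing.
import asyncio
from dataclasses import dataclass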


@dataclass(frozen=True)
class Measurement:
    element_id: str
    metric: str
    latency_ms: float
    timestamp: float


async def run_threshold_worker(
    cfg: ThresholdConfig,
    inbound: "asyncio.Queue[Measurement]",
    breaches: "asyncio.Queue[dict]",
    hysteresis_ms: float = 2.0,
) -> None:
    workers: Dict[Tuple[str, str], DynamicLatencyThreshold] = {}
    armed: Dict[Tuple[str, str], bool] = {}
    while True:
        m = await inbound.get()
        try:
            key = (m.element_id, m.metric)
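            worker = workers.get(key)
            if worker is None:
                # Build lazily; setdefault() would construct a throwaway
                # DynamicLatencyThreshold on every message.
                worker = workers[key] = DynamicLatencyThreshold(cfg)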
            # Offload the numpy reduction so a wide window never stalls ingest.
            result = await asyncio.to_thread(worker.ingest, m.latency_ms, m.timestamp)

            thr = result.get("threshold_ms")
            if result["status"] == "breach" and not armed.get(key):
                armed[key] = True
                await breaches.put({"element_id": m.element_id, "metric": m.metric,
                                    **result})
            elif thr is not None and m.latency_ms < thr - hysteresis_ms:
                armed[key] = False   # re-arm only after clearing the band
        except Exception:
            logger.exception("threshold_eval_failed element=%s", m.element_id)
        finally:
            inbound.task_done()

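Because each (element_id, metric) key owns an isolated worker and the heavy reduction is pushed to a thread, a single chatty interface under a micro-burst storm cannot block the event loop or contaminate another element's baseline. The single consumer still evaluates measurements in arrival order, so a flood from one element delays samples queued behind it; run several consumers sharded by key if that matters. If the breaches queue is created with a maxsize, it provides natural backpressure, draining in micro-batches as described in Async Batch Processing.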
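
The hysteresis rule in the hook can be isolated as a pure function, which makes it easy to reason about and unit-test away from the queues. This is a sketch under the same default band of 2.0 ms; the function name `breach_edges` is hypothetical, and the hook's `armed` flag is called `latched` here because True means a breach has already been emitted.

```python
from typing import Iterable, List, Tuple


def breach_edges(samples: Iterable[Tuple[float, float]],
                 hysteresis_ms: float = 2.0) -> List[int]:
    """Return the indices at which a new breach would be emitted.

    samples are (value_ms, threshold_ms) pairs. A breach fires once when the
    value exceeds the threshold, then stays latched until the value drops
    below threshold - hysteresis_ms.
    """
    latched = False
    fired: List[int] = []
    for i, (value, threshold) in enumerate(samples):
        if value > threshold and not latched:
            latched = True
            fired.append(i)
        elif value < threshold - hysteresis_ms:
            latched = False  # value cleared the band, so the next breach may fire
    return fired


# Value hovers around a 50 ms boundary, clears the band once, then breaches again.
trace = [(49.0, 50.0), (51.0, 50.0), (49.5, 50.0), (51.0, 50.0), (47.0, 50.0), (52.0, 50.0)]
edges = breach_edges(trace)
```

In this trace the dip to 49.5 ms does not clear the 48 ms release level, so the second 51 ms sample is suppressed and only indices 1 and 5 emit.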

Mitigation and Hardening

Dynamic boundaries trade static blind spots for new failure modes, each with an explicit mitigation:

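Cold-start fallback. Before min_samples (100 here) accumulate, the worker returns cold_start and emits the static_fallback_ms ceiling. Every accepted sample therefore carries a numeric boundary (only a rejected invalid_input sample returns None), so a pipeline restart or a freshly provisioned element can be protected by the static SLA until the window fills, roughly 25 minutes at a 15 s interval. The async hook shown earlier forwards only breach results, so enforcing the static ceiling during cold start requires an explicit comparison of each sample against the returned threshold_ms.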
Drift circuit breaker. A boundary that jumps more than drift_guard_ratio (40%) in one step is rejected and the last good value is held. This stops a routing flap, a telemetry-corruption burst, or a clock skew from ratcheting the threshold into a level that masks every real fault behind it.
Hysteresis against flapping. The async hook re-arms an element only after the value falls a full hysteresis_ms band below the boundary, so a value oscillating on the line yields one breach, not a storm of open/close churn that inflates false-positive rate.
Measurement-standard compliance. Asymmetric path sampling or inconsistent probe intervals poison the EWMA. Align probes to a recognized OAM standard such as ITU-T Y.1731 for Ethernet frame-delay measurement so the window reflects real path behaviour, not sampling artefacts.
Security boundary. Telemetry is an untrusted input: a spoofed or replayed measurement can deliberately drag a baseline upward to hide a later attack. Gate the worker behind authenticated transport and reject samples whose timestamp is implausibly old or future-dated before they enter the buffer.
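
The timestamp check from the security item could look like the sketch below, applied before a sample reaches the buffer. The function name and both tolerances (300 s of age, 5 s of future skew) are assumptions chosen for illustration; real limits depend on probe interval and clock discipline in the deployment.

```python
import time
from typing import Optional


def timestamp_is_plausible(sample_ts: float,
                           now: Optional[float] = None,
                           max_age_s: float = 300.0,
                           max_future_s: float = 5.0) -> bool:
    """Accept a sample only if its timestamp is neither stale nor future-dated."""
    now = time.time() if now is None else now
    age = now - sample_ts
    # A NaN timestamp makes both comparisons False, so it is rejected as well.
    return -max_future_s <= age <= max_age_s


checks = [
    timestamp_is_plausible(900.0, now=1000.0),   # 100 s old: accepted
    timestamp_is_plausible(600.0, now=1000.0),   # 400 s old: replay suspect
    timestamp_is_plausible(1010.0, now=1000.0),  # 10 s in the future: rejected
]
```

Passing `now` explicitly keeps the gate deterministic in tests; in the worker it would default to the wall clock.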

Operational Hardening Notes

Performance and accuracy tuning specific to the rolling-baseline pattern:

Size the window to the seasonality, not the disk. A 7-day window (window_hours=168) captures the weekly diurnal cycle; going to 30 days mostly adds memory and dampens legitimate growth response. The deque(maxlen=...) bounds memory at window_hours * 240 (timestamp, latency) pairs per element at a 15 s interval, which is 40,320 pairs for the default window. That is cheap per element, but multiply it by element count before scaling out.
Offload the reduction, don’t inline it. asyncio.to_thread keeps the numpy average/percentile pass off the event loop; inlining it on a 60 k-element fabric is a leading cause of ingest-loop stalls under load.
Tune decay to drift speed, not to taste. A higher decay_factor tracks fast traffic growth but reacts to noise; start at 0.05 per hour and raise it only if the baseline lags real, sustained shifts.
Watch p99 of the boundary itself. Emit threshold_ms, ewma_ms, and drift_held count as metrics. A rising drift_held rate is an early signal of telemetry corruption or a genuine network-wide shift that warrants human review before the SLA acknowledgment clock is at risk.
Warm baselines from cold storage. On restart, rehydrate each element’s buffer from a recent telemetry snapshot rather than re-entering cold_start for 25 minutes across the whole fabric at once.
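
The window and decay knobs above are easier to tune with the arithmetic written out. The helpers below are a hypothetical sketch (the names are not part of the worker); `samples_per_window` mirrors the worker's max_len expression, and the half-life follows directly from the exp(-decay_factor * hours) weighting.

```python
import math


def samples_per_window(window_hours: int, sample_interval_s: int) -> int:
    """Buffer capacity per element, matching the worker's max_len calculation."""
    return window_hours * (3600 // sample_interval_s)


def half_life_hours(decay_factor: float) -> float:
    """Age at which a sample's weight falls to half that of the newest sample."""
    return math.log(2) / decay_factor


def relative_weight(age_hours: float, decay_factor: float) -> float:
    """Weight of a sample relative to the newest one, before normalization."""
    return math.exp(-decay_factor * age_hours)


capacity = samples_per_window(168, 15)
half_life = half_life_hours(0.05)
day_old = relative_weight(24, 0.05)
week_old = relative_weight(168, 0.05)
```

At the defaults the buffer holds 40,320 pairs per element and the half-life is just under 14 hours. A day-old sample keeps roughly 30% of the newest sample's weight and a week-old one well under 0.1%, so the weighted mean is dominated by the last day or two even though the percentile still sees the whole week.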

Frequently Asked Questions

Why blend an EWMA with a percentile tail instead of using either alone?
The EWMA tracks the central trend so the boundary follows legitimate traffic growth, while the p95 tail captures normal burstiness so the boundary sits above routine spikes rather than on top of the mean. Using the EWMA alone would fire on every normal micro-burst; using the percentile alone would lag real, sustained degradation because it has no recency weighting.

How long is the cold-start window, and what protects the element during it?
With min_samples=100 at a 15-second interval, the worker needs about 25 minutes of data before it computes a dynamic boundary. Until then it returns the static_fallback_ms SLA ceiling, so a freshly provisioned element or a restarted pipeline stays covered by the static threshold, provided the consumer compares each sample against that returned ceiling.

What stops a routing flap from ratcheting the threshold up until it hides real faults?
The drift circuit breaker. Any candidate boundary that moves more than drift_guard_ratio (40% here) in a single step is rejected and the last good value is held; the worker logs the event and returns a drift_held status that can be counted as a metric for review. A corruption burst or clock skew therefore cannot walk the boundary into a level that masks subsequent genuine degradations.

Does dynamic adjustment replace the static SLA value entirely?
No. The contractual SLA ceiling remains the cold-start fallback and an absolute backstop. Dynamic adjustment makes the alerting boundary context-aware, but a value that breaches the hard SLA should still escalate regardless of the rolling baseline, since that is a contractual commitment rather than a statistical one.
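
One way to express that backstop is a small decision function that checks the contractual ceiling first and the dynamic boundary second. This is a sketch only: the function name and the reason strings are hypothetical, and the 50 ms default simply reuses the static_fallback_ms value from the configuration.

```python
from typing import Optional


def escalation_reason(value_ms: float,
                      dynamic_threshold_ms: Optional[float],
                      sla_ceiling_ms: float = 50.0) -> Optional[str]:
    """Return why a sample should escalate, or None if it should not."""
    if value_ms > sla_ceiling_ms:
        # The contractual limit wins even if the rolling baseline sits higher.
        return "sla_backstop"
    if dynamic_threshold_ms is not None and value_ms > dynamic_threshold_ms:
        return "dynamic_breach"
    return None


reasons = [
    escalation_reason(55.0, 70.0),  # above SLA although below the dynamic bound
    escalation_reason(30.0, 25.0),  # off-peak degradation under the SLA ceiling
    escalation_reason(30.0, 40.0),  # normal
]
```

Checking the ceiling first also covers the cold-start case, where the returned boundary is the static value itself.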

Related

Up to the parent stage: Threshold Tuning Methods, the closed-loop calibration layer this latency pattern lives inside
Severity Scoring Algorithms: turn a ThresholdBreach into an SLA-aligned dispatch tier
Correlating BGP Flaps with Interface Down Events: bind a latency breach to co-occurring control-plane alarms
Implementing asyncio for High-Volume SNMP: the non-blocking ingest tier that feeds this worker
