Security Boundary Mapping: Jurisdictional Routing for Telecom Fault Correlation

Security Boundary Mapping is the deterministic routing layer that translates already-normalized telemetry into logical network perimeter assignments before a ticket is ever created. It sits inside the Core Architecture & Log Taxonomy pipeline, downstream of protocol normalization and upstream of correlation and dispatch. Its single responsibility is to answer one question for every event: which trust zone owns this fault, and which jurisdictional queue is allowed to receive it? By evaluating each event against a predefined matrix of trust zones, asset ownership, and operational jurisdictions, the mapper isolates fault-propagation paths, enforces least-privilege escalation, and prevents cross-domain ticket noise.

Operational Intent and Scope

The boundary mapper consumes structured events and emits routing metadata — nothing more. What enters is a fully normalized event that has already cleared protocol-specific decoding and schema validation. What exits is the same immutable event annotated with a target_domain, an sla_multiplier, and an explicit escalation path. What is explicitly excluded is re-parsing, severity authorship, correlation, and ticket creation; those belong to neighbouring stages and the mapper must never duplicate them.

This narrow contract is what keeps routing deterministic. Events reaching this stage have already been transformed: traps processed through SNMP Trap Standardization arrive with OID-to-semantic translations resolved, and text streams processed through Syslog Format Parsing arrive as structured key-value records. The mapper treats both as read-only inputs that already conform to the Event Schema Design contract, applying routing predicates without altering any upstream field. This strict separation of concerns eliminates parsing drift and guarantees that a routing decision is auditable during regulatory review or post-incident root-cause analysis.

The boundary that matters here is not a firewall — it is a trust and jurisdiction boundary. A core transport fault, a customer-edge fault, and a management-plane fault may carry identical raw severity yet must reach entirely different teams under entirely different data-sovereignty rules. The mapper encodes that policy as machine-checked rules rather than tribal knowledge.

Pipeline Architecture

Within the broader pipeline, the mapper runs as a stateless stage between normalization and correlation. Internally it executes a three-step flow: zone resolution (which trust zone does the source asset belong to), predicate evaluation (which highest-priority rule matches the zone and asset class), and decision emission (attach the target domain, SLA multiplier, and escalation path). Each step is pure with respect to the event payload, so any worker can process any event without shared state.

Diagram: deterministic boundary rule evaluation and fallback routing.

The decision logic is intentionally a compiled decision table rather than a heuristic or machine-learning classifier. Rules are version-controlled manifests, sorted by priority and compiled into memory at load time, so evaluation is a deterministic walk of strict boolean and regex predicates against normalized fields such as source_zone, asset_class, management_plane_indicator, and topology_parent. The first (highest-priority) match wins; if nothing matches, a penalized fallback applies so that no event is ever silently dropped.

Production-Ready Async Implementation

The implementation below uses Pydantic V2 for schema enforcement and asyncio for non-blocking topology lookups, matching the rest of the pipeline. Predicate evaluation itself is CPU-bound and synchronous, but the topology-cache resolution that enriches topology_parent is awaited so a slow CMDB never blocks the event loop. Regex patterns are compiled once at manifest load, not per event.

import asyncio
import re
import logging
from typing import Any
from pydantic import BaseModel, Field, field_validator

logger = logging.getLogger("boundary_mapper")


class NormalizedEvent(BaseModel):
    """Immutable input — already decoded, schema-validated upstream."""
    model_config = {"frozen": True}

    event_id: str
    source_zone: str
    asset_class: str
    management_plane_indicator: bool = False
    topology_parent: str | None = None
    raw_severity: int
    timestamp_ms: int


class BoundaryRule(BaseModel):
    rule_id: str
    priority: int                       # lower number = higher precedence
    zone_pattern: str
    asset_pattern: str
    target_domain: str
    sla_multiplier: float = 1.0

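    @field_validator("zone_pattern", "asset_pattern")
    @classmethod
    def _compilable(cls, v: str) -> str:
        # Fail fast in CI if a manifest ships an invalid regex. re.error is not
        # a ValueError, so convert it; Pydantic then reports a ValidationError.
        try:
            re.compile(v)
        except re.error as exc:
            raise ValueError(f"invalid regex {v!r}: {exc}") from exc
        return v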


class RoutingDecision(BaseModel):
    event_id: str
    matched_rule_id: str | None
    target_domain: str
    sla_multiplier: float
    escalation_path: list[str]
    metadata: dict[str, Any] = Field(default_factory=dict)


class BoundaryMapper:
    def __init__(self, raw_rules: list[dict[str, Any]], topology):
        # topology is any async cache exposing `await topology.parent(asset)`.
        self._topology = topology
        self._rules: list[BoundaryRule] = sorted(
            (BoundaryRule(**r) for r in raw_rules), key=lambda r: r.priority
        )
        # Pre-compile every predicate once at load time.
        self._compiled: dict[str, tuple[re.Pattern, re.Pattern]] = {
            rule.rule_id: (
                re.compile(rule.zone_pattern, re.IGNORECASE),
                re.compile(rule.asset_pattern, re.IGNORECASE),
            )
            for rule in self._rules
        }

    async def evaluate(self, event: NormalizedEvent) -> RoutingDecision:
        # Resolve topology parent without blocking the loop. A cache miss
        # routes to a dedicated stale-topology lane rather than guessing.
        parent = event.topology_parent
        if parent is None:
            parent = await self._topology.parent(event.asset_class)
            if parent is None:
                return self._stale_topology(event)

        for rule in self._rules:
            zone_re, asset_re = self._compiled[rule.rule_id]
            if zone_re.match(event.source_zone) and asset_re.match(event.asset_class):
                return RoutingDecision(
                    event_id=event.event_id,
                    matched_rule_id=rule.rule_id,
                    target_domain=rule.target_domain,
                    sla_multiplier=rule.sla_multiplier,
                    escalation_path=[
                        f"{rule.target_domain}_tier1",
                        f"{rule.target_domain}_tier2",
                    ],
                    metadata={"rule_priority": rule.priority, "topology_parent": parent},
                )

        return self._fallback(event)

    @staticmethod
    def _fallback(event: NormalizedEvent) -> RoutingDecision:
        # Penalized multiplier forces manifest remediation; nothing is dropped.
        return RoutingDecision(
            event_id=event.event_id,
            matched_rule_id=None,
            target_domain="unclassified",
            sla_multiplier=1.5,
            escalation_path=["noc_general_queue"],
            metadata={"fallback_applied": True},
        )

    @staticmethod
    def _stale_topology(event: NormalizedEvent) -> RoutingDecision:
        return RoutingDecision(
            event_id=event.event_id,
            matched_rule_id=None,
            target_domain="stale_topology",
            sla_multiplier=1.25,
            escalation_path=["topology_review_queue"],
            metadata={"topology_miss": True},
        )

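Schema validation via Pydantic ensures a malformed manifest fails fast in CI rather than causing a silent routing misfire in production, and asyncio ensures that awaiting topology enrichment suspends only the current event's coroutine instead of blocking the event loop. Note that the listing awaits the lookup directly; the lookup timeout described under Configuration and Tuning Parameters has to be enforced around that call or inside the topology client. For the regex semantics used in the predicates, see Python’s re module documentation; for the concurrency model, see the asyncio documentation.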

Topology and Boundary Validation

A routing decision is only trustworthy if the trust zone it assigns is real. The mapper therefore validates two constraints before emitting a decision. First, adjacency: the topology_parent resolved for an asset must belong to the same zone the rule assigns, otherwise the event is suspected of zone spoofing and diverted for review rather than dispatched. This adjacency check is conceptually the same constraint enforced by Topology-Aware Correlation when it validates chassis relationships, applied here at the perimeter instead of the correlation graph.

Second, schema completeness: because the event arrives frozen and pre-validated against the Event Schema Design contract, the mapper can assume every routing field is present and well-typed. If a field the rule depends on is absent — for example a management_plane_indicator that should never be null for a control-plane asset — the mapper treats this as an upstream contract breach and routes to the review lane instead of fabricating a default. Suppressing false positives at this layer is cheaper than unwinding a mis-dispatched P1 hours later: a single misrouted management-plane fault can mask a network-wide risk while NOC engineers triage it in the wrong queue.

Configuration and Tuning Parameters

Boundary behaviour is governed by a small set of tunable parameters, each carrying an operational rationale:

- Topology cache TTL (default 300s): how long a resolved topology_parent stays valid. Shorten to 60s in networks with frequent re-homing; lengthen to 900s for stable transport cores to cut CMDB load. Too long and the mapper routes on stale ownership; too short and the cache-miss rate (and therefore p99 latency) climbs.
- Cache lookup timeout (default 8ms): the await budget for a topology resolution. Exceeding it triggers graceful degradation to static source_zone routing with an elevated multiplier rather than blocking the worker.
- Fallback SLA multiplier (default 1.5): the penalty applied to unclassified events. Keep it above 1.0 so unrouted volume is self-correcting: an operator watching the unclassified queue grow has a direct signal that a manifest needs a new rule.
- SLA multipliers by boundary criticality, applied to downstream auto-escalation timers:
  - Core / transport boundaries: 1.0 (standard SLA window).
  - Customer edge / access boundaries: 0.75 (accelerated; direct subscriber impact).
  - Management plane / control boundaries: 0.5 (immediate P1; network-wide risk).
  - Unclassified / stale topology: 1.5 / 1.25 (penalized, forces remediation).

These multipliers are where boundary mapping and severity policy meet. When combined with Defining Severity Levels for Telecom Faults, a CRITICAL fault in a customer_edge zone resolves to a different escalation matrix than an identical-severity event in a core_transport zone — preventing blanket P1 storms while preserving rapid response for subscriber-impacting domains. Weighting itself is owned by Severity Scoring Algorithms; the mapper only multiplies the resulting SLA window by the boundary coefficient.
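As a worked example of that multiplication, the sketch below scales a severity-derived window by the boundary coefficient. The coefficients are the ones listed in this section; the base windows in BASE_WINDOW_MINUTES and the effective_sla_minutes helper are hypothetical values chosen only to make the arithmetic visible, since severity windows are defined elsewhere.

```python
# Boundary coefficients as listed in this section.
BOUNDARY_MULTIPLIER = {
    "core_transport": 1.0,
    "customer_edge": 0.75,
    "management_plane": 0.5,
    "stale_topology": 1.25,
    "unclassified": 1.5,
}

# Hypothetical severity-derived windows in minutes (illustrative values only).
BASE_WINDOW_MINUTES = {"CRITICAL": 60, "MAJOR": 240, "MINOR": 1440}


def effective_sla_minutes(severity: str, target_domain: str) -> float:
    """Scale the severity-derived window by the boundary coefficient."""
    return BASE_WINDOW_MINUTES[severity] * BOUNDARY_MULTIPLIER[target_domain]


for domain in ("core_transport", "customer_edge", "management_plane"):
    print(domain, effective_sla_minutes("CRITICAL", domain))
```

With the assumed 60-minute CRITICAL window, the same severity yields 60 minutes in the core, 45 at the customer edge, and 30 on the management plane.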

Debugging Workflow and Observability

Deterministic routing demands traceable evaluation paths. When a misclassification occurs, the goal is to isolate whether the failure originated upstream (normalization), in stale topology, or in predicate misalignment. Work the checklist in order:

1. Attach a trace_id to every event and emit it on the routing decision. With DEBUG logging enabled, log the matched rule_id, the resolved topology_parent, and the boolean result of each predicate so the exact decision path can be reconstructed across workers.
2. Run a shadow (dry-run) evaluator that consumes production telemetry but publishes no tickets. Diff its matched_rule_id and sla_multiplier against historical baselines; flag any event whose decision drifts.
3. Alert on the unclassified rate, measured as target_domain == "unclassified" rather than matched_rule_id == null alone, because stale-topology decisions also carry a null matched_rule_id. A rising unclassified ratio (alert above 2% over five minutes) is the earliest signal of a vendor firmware change or an undocumented zone introducing events no rule covers.
4. Track cache hit/miss and a circuit breaker on the topology lookup. A None parent for a known asset should route to stale_topology, never to unclassified; keep the two queues separate so the metric stays diagnostic.
5. Lint manifests before deploy. Reject manifests with overlapping regex patterns, contradictory priority assignments, or unreachable rules.
6. Watch p99 evaluation latency: standardization plus boundary mapping must stay under 5ms per event to avoid backpressure during storms.
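The unclassified-ratio alert from the checklist can be approximated with a sliding window. The sketch below is one possible shape, assuming the caller passes a monotonic timestamp in seconds together with each decision's target_domain; the UnclassifiedRatioMonitor class and its method names are hypothetical, while the 300-second window and 2% threshold mirror the figures given above.

```python
from collections import deque


class UnclassifiedRatioMonitor:
    """Sliding-window ratio of decisions routed to the unclassified queue."""

    def __init__(self, window_s: float = 300.0, threshold: float = 0.02):
        self._window_s = window_s
        self._threshold = threshold
        self._samples: deque[tuple[float, bool]] = deque()
        self._unclassified = 0

    def record(self, now_s: float, target_domain: str) -> None:
        # Count only true fallbacks; stale_topology is tracked separately.
        is_unclassified = target_domain == "unclassified"
        self._samples.append((now_s, is_unclassified))
        self._unclassified += is_unclassified
        self._evict(now_s)

    def _evict(self, now_s: float) -> None:
        # Drop samples that have aged out of the window.
        while self._samples and self._samples[0][0] <= now_s - self._window_s:
            _, was_unclassified = self._samples.popleft()
            self._unclassified -= was_unclassified

    def ratio(self) -> float:
        return self._unclassified / len(self._samples) if self._samples else 0.0

    def should_alert(self) -> bool:
        return self.ratio() > self._threshold
```

In a real deployment this counter would more likely live in the metrics backend than in the worker; the in-process version is shown only because it makes the window and threshold explicit.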

Failure Modes and Mitigation

- CMDB / topology outage. If lookups exceed the timeout threshold, degrade gracefully to static source_zone routing with an elevated sla_multiplier, and log the dependency failure as an operational warning rather than a routing error. The pipeline keeps moving; ownership precision is temporarily relaxed, not lost.
- Manifest poisoning or regression. Validate every new manifest in a sandboxed evaluator and hot-swap the in-memory rule set only after it passes; never restart the process to reload rules. A bad manifest is rejected at the gate, not in production.
- Zone-spoofed telemetry. Events whose claimed source_zone contradicts their resolved topology_parent are isolated to a review lane (a dedicated dead-letter queue) with the full payload and the conflicting fields, so a spoofing attempt becomes a measurable signal instead of a mis-dispatch. This is the boundary-specific case of the DLQ discipline used across the Core Architecture & Log Taxonomy pipeline.
- Worker saturation under storm. Because evaluation is stateless and the event is frozen, scale horizontally behind a load balancer: any instance processes any event. Pair this with a circuit breaker at the queue boundary so that if latency degrades for more than 5% of traffic, the stage fails into static routing rather than collapsing.
- Audit gaps. Append every routing decision (rule ID, evaluated predicates, multiplier, and timestamp) to an append-only store, for example a Kafka topic. This satisfies compliance requirements and enables precise post-incident reconstruction without replaying in-flight events.
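The validate-then-hot-swap mitigation for manifest regressions might look like the following. It is a stdlib-only sketch under stated assumptions: RuleSetHolder, compile_manifest, route, and the golden fixture map are hypothetical names, and plain dataclasses stand in for the Pydantic models used in the main listing.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledRule:
    rule_id: str
    priority: int
    zone_re: re.Pattern[str]
    asset_re: re.Pattern[str]
    target_domain: str


def compile_manifest(raw_rules: list[dict]) -> tuple[CompiledRule, ...]:
    """Compile and priority-sort a manifest (raises on bad regex or missing key)."""
    rules = [
        CompiledRule(
            rule_id=r["rule_id"],
            priority=r["priority"],
            zone_re=re.compile(r["zone_pattern"], re.IGNORECASE),
            asset_re=re.compile(r["asset_pattern"], re.IGNORECASE),
            target_domain=r["target_domain"],
        )
        for r in raw_rules
    ]
    return tuple(sorted(rules, key=lambda rule: rule.priority))


def route(rules: tuple[CompiledRule, ...], source_zone: str, asset_class: str) -> str:
    """First (highest-priority) match wins; no match falls back to unclassified."""
    for rule in rules:
        if rule.zone_re.match(source_zone) and rule.asset_re.match(asset_class):
            return rule.target_domain
    return "unclassified"


class RuleSetHolder:
    """Holds the live rule set and swaps it only after sandbox validation."""

    def __init__(self, raw_rules: list[dict], golden: dict[tuple[str, str], str]):
        # golden maps (source_zone, asset_class) -> expected target_domain.
        self._golden = golden
        self._rules = self._validated(raw_rules)

    def _validated(self, raw_rules: list[dict]) -> tuple[CompiledRule, ...]:
        candidate = compile_manifest(raw_rules)
        for (zone, asset), expected in self._golden.items():
            actual = route(candidate, zone, asset)
            if actual != expected:
                raise ValueError(f"{zone}/{asset}: expected {expected}, got {actual}")
        return candidate

    def swap(self, raw_rules: list[dict]) -> bool:
        """Return True if the candidate passed validation and is now live."""
        try:
            candidate = self._validated(raw_rules)
        except (ValueError, KeyError, re.error):
            return False  # the previous rule set stays in service
        self._rules = candidate  # one reference assignment; no restart needed
        return True

    def route(self, source_zone: str, asset_class: str) -> str:
        return route(self._rules, source_zone, asset_class)
```

Because the live rule set is an immutable tuple replaced by a single assignment, an evaluation that is already iterating the old tuple finishes against a consistent set of rules.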

Frequently Asked Questions

Why map boundaries before correlation instead of after?

Routing jurisdiction is a property of where a fault originates, which is known the moment an event is normalized. Resolving it before correlation means the correlation engine only ever links events that already share a trust domain, which prevents cross-domain ticket noise and keeps each escalation path inside the right team’s scope.

What happens to an event that matches no rule?

It is routed to the unclassified queue with a penalized 1.5 SLA multiplier and an explicit fallback escalation path — never dropped. The penalty makes unrouted volume self-correcting: a growing unclassified queue is a direct, alertable signal that a manifest needs a new rule.

How does boundary mapping change the SLA window?

The mapper attaches an sla_multiplier based on boundary criticality (0.5 for management plane, 0.75 for customer edge, 1.0 for core transport). Downstream ticketing multiplies the severity-derived SLA window by this coefficient, so an identical severity escalates faster in subscriber-impacting zones than in the stable core.

Does the mapper ever block on the CMDB?

No. Topology resolution is awaited with a strict timeout (default 8ms); if the cache is slow or returns nothing for a known asset, the event degrades to static source_zone routing or a stale_topology lane. The event loop is never blocked on a slow dependency.
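One way to enforce that budget is to wrap the lookup in asyncio.wait_for. The sketch below is an assumption about how such a wrapper could look, not the pipeline's actual code: resolve_parent, its status strings, and the SlowTopology stand-in are hypothetical, and the only interface assumed is an awaitable topology.parent(asset) as in the listing above.

```python
import asyncio


async def resolve_parent(topology, asset: str, timeout_s: float = 0.008):
    """Return (parent, status), where status is 'ok', 'miss', or 'timeout'."""
    try:
        parent = await asyncio.wait_for(topology.parent(asset), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Budget exceeded: caller degrades to static source_zone routing.
        return None, "timeout"
    if parent is None:
        # Lookup answered in time but knows no parent: stale_topology lane.
        return None, "miss"
    return parent, "ok"


class SlowTopology:
    """Hypothetical stand-in for a topology cache with a slow backend."""

    async def parent(self, asset: str):
        await asyncio.sleep(0.2)  # far beyond the 8 ms budget
        return "agg-router-01"


print(asyncio.run(resolve_parent(SlowTopology(), "router")))
```

Returning a status alongside the parent lets the caller keep timeouts and genuine cache misses in separate lanes, which is what keeps the stale_topology metric diagnostic.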

Related

- Up to: Core Architecture & Log Taxonomy, the reference pipeline this routing layer belongs to
- SNMP Trap Standardization: normalizing vendor MIBs before boundary evaluation
- Syslog Format Parsing: RFC 5424 normalization that feeds this stage
- Event Schema Design: the immutable event contract every routing field relies on
- Defining Severity Levels for Telecom Faults: how severity pairs with boundary multipliers
- Topology-Aware Correlation: adjacency validation applied in the correlation graph
