How Detection Works
PulseRoute monitors multiple dimensions of processor health simultaneously, going far beyond simple uptime checks or error rate thresholds.
Multi-Signal Health Analysis
Traditional monitoring watches a single metric — usually error rate or uptime. PulseRoute tracks multiple health signals in parallel:
Success and error patterns: PulseRoute tracks per-processor success rates and error rates over a sliding time window. But unlike simple threshold monitoring, it also watches which errors are occurring. A spike in timeouts means something different than a spike in authentication failures — PulseRoute distinguishes between transient errors and systemic degradation.
Latency distribution shifts: A processor doesn't go from "healthy" to "dead" instantly. Degradation usually shows up in latency first — p95 and p99 response times creep up while median latency looks normal. PulseRoute monitors the full latency distribution (p50, p95, p99) and detects when the distribution shape changes, catching degradation that average-based monitoring misses entirely.
Response time variability: Healthy processors have consistent response times. A processor under stress shows increasing variability — some requests return fast, others take 10x longer. PulseRoute tracks this jitter as an early warning signal, often detecting problems before error rates move at all.
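The three signals above can be sketched with the Python standard library alone. This is a minimal illustration, not PulseRoute's implementation: the names `SignalWindow`, `percentile`, `latency_profile`, and `jitter`, the 60-second window, the nearest-rank percentile method, and the use of the coefficient of variation as a jitter measure are all assumptions made for the example.

```python
import math
import statistics
from collections import Counter, deque
from typing import Optional


class SignalWindow:
    """Hypothetical per-processor window of recent request outcomes."""

    def __init__(self, window_s: float = 60.0):  # window length is an assumption
        self.window_s = window_s
        self.events = deque()  # (timestamp, latency_ms, error_code or None)

    def record(self, ts: float, latency_ms: float, error_code: Optional[str] = None) -> None:
        self.events.append((ts, latency_ms, error_code))
        # Drop events that have aged out of the sliding window.
        while self.events and self.events[0][0] <= ts - self.window_s:
            self.events.popleft()

    def success_rate(self) -> float:
        if not self.events:
            return 1.0
        ok = sum(1 for _, _, code in self.events if code is None)
        return ok / len(self.events)

    def error_mix(self) -> Counter:
        # Counting per error code keeps a timeout spike distinct from an auth spike.
        return Counter(code for _, _, code in self.events if code is not None)

    def latencies(self) -> list:
        return [latency for _, latency, _ in self.events]


def percentile(samples, p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_profile(samples) -> dict:
    # The three percentiles named in the text: p50, p95, p99.
    return {p: percentile(samples, p) for p in (50, 95, 99)}


def jitter(samples) -> float:
    """Coefficient of variation (population std dev / mean), one possible jitter measure."""
    mean = statistics.fmean(samples)
    return statistics.pstdev(samples) / mean if mean else 0.0
```

For example, if 90 of 100 requests take 100 ms and 10 take 400 ms, `latency_profile` reports a p50 of 100 ms but a p95 and p99 of 400 ms. The median looks normal while the tail has moved, which is the kind of shift described above.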
Three Levels of Intelligence
Tier 1: Threshold-Based Detection
The foundation. PulseRoute continuously evaluates processor health and compares against configurable thresholds:
- Healthy: Processor is performing within normal parameters
- Degraded: Performance has deteriorated but is still functional
- Failed Over: Processor has crossed the failure threshold and traffic has been moved
When a processor's metrics cross a threshold, PulseRoute automatically shifts traffic to the backup. When the processor recovers, traffic gradually returns.
This is reactive — it responds to problems after they're measurable. Fast (seconds, not minutes), but still reactive.
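As a rough sketch, threshold-based classification can be as simple as comparing one metric against two cutoffs. The example below uses error rate only, and the cutoffs of 10% and 30% are hypothetical values, not PulseRoute defaults; the `State` enum and `classify` function are names invented for this illustration.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED_OVER = "failed_over"


def classify(error_rate: float, degraded_at: float = 0.10, failover_at: float = 0.30) -> State:
    """Maps an error rate to a health state using two hypothetical thresholds."""
    if error_rate >= failover_at:
        return State.FAILED_OVER  # traffic moves to the backup
    if error_rate >= degraded_at:
        return State.DEGRADED  # worse than normal, but still serving traffic
    return State.HEALTHY


for rate in (0.02, 0.12, 0.30):
    print(f"error rate {rate:.0%}: {classify(rate).value}")
```

A real configuration would likely evaluate several metrics, not error rate alone.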
Tier 2: Predictive Detection
This is where PulseRoute goes beyond traditional monitoring.
Our ML model is trained to recognize the subtle signatures that precede processor outages. It monitors multiple health signals including latency distribution shifts, error code frequency changes, and response time variability. When these signals combine in patterns consistent with impending failure, PulseRoute assigns a failure probability to each processor.
When a processor's failure probability exceeds a configurable risk threshold, PulseRoute begins pre-emptively shifting traffic — typically 30 to 60 seconds before transaction success rates visibly decline.
Why 30-60 seconds matters: At 100 transactions per minute, 30 seconds of traffic is 50 transactions, so a 30-second early warning prevents up to 50 failed transactions. At $75 average order value, that's up to $3,750 saved per incident.
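The arithmetic can be checked with a few lines of Python. The function name `avoided_loss` is made up for this example, and the calculation treats every transaction in the warning window as one that would otherwise have failed, so the result is an upper bound.

```python
def avoided_loss(tx_per_minute: float, warning_seconds: float, avg_order_value: float):
    """Returns (transactions at risk, order value at risk) for one early-warning window."""
    tx_at_risk = tx_per_minute * warning_seconds / 60
    return tx_at_risk, tx_at_risk * avg_order_value


tx, value = avoided_loss(tx_per_minute=100, warning_seconds=30, avg_order_value=75)
print(f"{tx:.0f} transactions, ${value:,.0f}")  # 50 transactions, $3,750
```

At the 60-second end of the range, the same inputs give 100 transactions and $7,500.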
The model runs inference in microseconds with no external dependencies — no GPU required, no ML framework, no network calls. It's embedded directly in the routing engine.
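To give a feel for what dependency-free inference can look like, here is a toy logistic scorer written with the standard library only. It is not PulseRoute's model: the feature names, the weights, the bias, and the 0.7 risk threshold are all invented for illustration.

```python
import math

# Hypothetical weights, for illustration only. The real model and its
# features are not described in this document.
WEIGHTS = {"p95_increase": 2.0, "error_rate_increase": 8.0, "jitter_increase": 1.5}
BIAS = -3.0


def failure_probability(signals: dict) -> float:
    """Logistic function over a weighted sum of signal changes."""
    z = BIAS + sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def should_preempt(signals: dict, risk_threshold: float = 0.7) -> bool:
    """True when the probability exceeds the (assumed) configurable risk threshold."""
    return failure_probability(signals) > risk_threshold


calm = {}
stressed = {"p95_increase": 0.8, "error_rate_increase": 0.3, "jitter_increase": 1.0}
print(should_preempt(calm), should_preempt(stressed))
```

A scorer of this shape is a handful of multiplications and one exponential, which is why a model like it needs no GPU, framework, or network call.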
Tier 3: Continuous Optimization
Tier 3 moves beyond binary failover (all-primary or all-secondary) to continuous traffic optimization.
Instead of waiting for a processor to fail, Tier 3 continuously adjusts the traffic split based on real-time performance. If Processor A has a 98% success rate and Processor B has 96%, Tier 3 will route slightly more traffic to A — without any hard failover.
When combined with Tier 2 predictions, Tier 3 makes routing decisions that account for both current performance and predicted future performance. A processor that's healthy now but showing early degradation signals will receive less traffic, even before it crosses any threshold.
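One simple way to fold a prediction into a routing score is to discount the current success rate by the predicted failure probability. The formula and the name `routing_score` are assumptions for this sketch; the document does not specify how PulseRoute combines the two.

```python
def routing_score(success_rate: float, failure_probability: float) -> float:
    """Current success rate discounted by predicted risk (hypothetical formula)."""
    return success_rate * (1.0 - failure_probability)


# A looks slightly better right now, but carries a much higher predicted risk.
score_a = routing_score(success_rate=0.98, failure_probability=0.45)
score_b = routing_score(success_rate=0.96, failure_probability=0.05)
print(score_a < score_b)  # A ends up with the lower score
```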
Key behaviors:
- Gradual, not binary: Traffic shifts smoothly, avoiding the thundering herd problem of sudden cutover
- Self-balancing: Continuously adapts to changing processor performance
- Exploration: Always sends a small percentage of traffic to the secondary processor, so PulseRoute has fresh data on it if traffic needs to fail over
- Fast adaptation: Weights recent observations more heavily than historical ones, tracking real-time conditions
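These behaviors can be approximated with two small functions: an exponentially weighted moving average for recency weighting, and a proportional split with a minimum share per processor for exploration. This is a sketch under assumptions, not the actual Tier 3 algorithm; `alpha=0.3` and the 5% floor are hypothetical values.

```python
def ewma(previous: float, observation: float, alpha: float = 0.3) -> float:
    """Exponentially weighted moving average: recent observations count more."""
    return alpha * observation + (1 - alpha) * previous


def traffic_split(scores: dict, floor: float = 0.05) -> dict:
    """Splits traffic in proportion to score, giving every processor at least `floor`."""
    n = len(scores)
    total = sum(scores.values())
    if total == 0:
        return {name: 1 / n for name in scores}  # no information: split evenly
    spare = 1 - floor * n  # share left over after every processor gets its floor
    return {name: floor + spare * score / total for name, score in scores.items()}


print(traffic_split({"A": 0.98, "B": 0.96}))  # A gets slightly more than B
print(traffic_split({"A": 0.0, "B": 0.96}))   # A keeps only the exploration floor
```

Because the split is proportional, a small difference in score produces a small difference in traffic, which is what keeps shifts gradual.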
What PulseRoute Doesn't Do
It's important to be clear about boundaries:
- PulseRoute doesn't touch payment data. It only sees routing metadata: processor IDs, success/failure flags, latency, and error codes. No card numbers, no amounts, no PII.
- PulseRoute doesn't make payment decisions. It recommends which processor to use. Your system makes the final routing decision and processes the payment.
- PulseRoute doesn't replace your payment infrastructure. It's an intelligence layer that sits alongside your existing payment flow, making it more resilient.
Detection in Practice
Here's what happens during a real degradation event:
Timeline:
─────────────────────────────────────────────────────────────►
t=0    Processor A starts responding slower.
       Error rate: 2% (normal). p95 latency: up 40%.
t=15s  Tier 2 failure probability rises to 0.45.
       Not yet critical, but PulseRoute is watching.
t=30s  p95 latency up 80%. Error rate: 5%. Jitter increasing.
       Tier 2 probability: 0.72, above the risk threshold.
       Tier 3 begins shifting traffic: 70% A → 55% A.
t=45s  Error rate hits 12%. Still below Tier 1 failover threshold.
       Tier 3 has shifted to 30% A, 70% B.
       Most transactions are already on the healthy processor.
t=60s  Error rate hits 30%. Tier 1 threshold breached.
       Tier 3 is now at 5% A, 95% B.
       By the time traditional monitoring would alert,
       PulseRoute has already moved 95% of traffic.
t=90s  Processor A enters full outage.
       Impact: minimal. Only 5% of traffic was still on A.
Without PulseRoute, you'd discover the outage at t=60s (if you have good monitoring) or t=5min+ (if you rely on customer complaints). All traffic would be on the failing processor until someone manually intervenes.
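To put numbers on the comparison, the overall failure rate at any moment is roughly Processor A's error rate weighted by A's share of traffic. The sketch below uses the t=45s and t=60s figures from the timeline above; the function name `blended_error_rate` is made up, and Processor B's error rate is assumed to be zero for simplicity.

```python
def blended_error_rate(a_error: float, a_share: float, b_error: float = 0.0) -> float:
    """Overall error rate when a_share of traffic goes to A and the rest to B."""
    return a_share * a_error + (1 - a_share) * b_error


# (seconds, A's error rate, A's traffic share with PulseRoute), from the timeline
timeline = [(45, 0.12, 0.30), (60, 0.30, 0.05)]

for t, a_error, a_share in timeline:
    with_routing = blended_error_rate(a_error, a_share)
    all_on_a = blended_error_rate(a_error, 1.0)  # no rerouting: everything stays on A
    print(f"t={t}s  with rerouting: {with_routing:.1%}  all traffic on A: {all_on_a:.1%}")
```

Under these assumptions, the overall failure rate at t=60s is about 1.5% with rerouting, compared with 30% if all traffic were still on Processor A.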