# Latency Tail Analysis: Window Feature Extraction Method

## Context
When analyzing GC tuning impact on production services, we needed a robust method to detect latency degradation, especially in P99 tails. Standard averaging doesn't work because:

- Different services have vastly different baselines (P99 from 5 ms to 500 ms)
- Direct averaging of relative changes is skewed by outliers
- We need to identify "jitter" in the tail, not just average shifts
## Analysis Method

### Data Collection
For each service, metric (API latency, STW), and percentile (P50/P95/P99):

1. Pull a 10-minute time series x(t) at 15-second intervals (~40 data points)
2. Extract window features from each series
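The collection loop above can be sketched as follows. The service/metric names and the `fetch_series` stub are hypothetical stand-ins for the real metrics-store query; only the window geometry (10 minutes at 15-second steps, ~40 points) comes from the text.

```python
import random

# Window geometry from the write-up: 10 minutes sampled every 15 seconds
WINDOW_SECONDS = 10 * 60
STEP_SECONDS = 15
POINTS_PER_WINDOW = WINDOW_SECONDS // STEP_SECONDS  # 40 data points

def fetch_series(service, metric, percentile):
    """Placeholder for the real metrics-store query; returns one window
    of synthetic latency samples for illustration."""
    rng = random.Random(f"{service}/{metric}/p{percentile}")
    return [rng.uniform(5.0, 8.0) for _ in range(POINTS_PER_WINDOW)]

# One (service, metric, percentile) combination -> one ~40-point window
series = fetch_series("checkout", "api_latency", 99)
```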
### Window Feature Extraction
```
repr = median(x(t))            # main representative value
hi = quantile(x(t), 0.95)      # worst 5% behavior (more stable than max)
instability = hi / repr        # tail severity metric
```
Key insight: instability identifies P99 tail jitter that simple averages miss.
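A minimal runnable sketch of the three features, using only the standard library (`repr` from the pseudocode is renamed `rep` to avoid shadowing Python's builtin):

```python
from statistics import median, quantiles

def window_features(xs):
    """Per-window features: representative value, high quantile, instability."""
    rep = median(xs)
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    hi = quantiles(xs, n=20, method="inclusive")[18]
    return rep, hi, hi / rep

# 38 quiet samples plus two tail spikes: the median stays flat,
# while hi (and therefore instability) picks up the tail jitter
xs = [5.0] * 38 + [9.0, 10.0]
rep, hi, instability = window_features(xs)
```

Note how a plain mean of `xs` would also barely move here; the `instability` ratio is what surfaces the spikes.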
### Baseline Comparison
```
Δ = after_repr - before_repr                   # absolute change
r = (after_repr - before_repr) / before_repr   # relative change (with epsilon cutoff)
degraded = (r > r0) AND (Δ > Δ0)               # degradation condition
```
The deadband (Δ > Δ0) prevents noise from triggering false positives on small baseline values.
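The comparison can be sketched directly from the pseudocode; the epsilon value is an assumption:

```python
def degraded(before_rep, after_rep, r0, d0, eps=1e-9):
    """Degradation flag: fires only when BOTH the relative change exceeds
    r0 AND the absolute change exceeds the deadband d0."""
    delta = after_rep - before_rep
    r = delta / max(before_rep, eps)   # epsilon cutoff on near-zero baselines
    return r > r0 and delta > d0

# With the P99 thresholds (r0=8%, d0=2 ms):
# 4.0 -> 4.5 ms is +12.5% relative but only +0.5 ms absolute: suppressed
# 50 -> 56 ms is +12% and +6 ms: flagged
```

The first case shows the deadband doing its job: a small-baseline service clears the relative threshold on noise alone, but not the absolute one.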
### Cross-Service Analysis
Use the log ratio of representative values, log(after_repr / before_repr), to handle asymmetric changes:
This addresses the issue where services with vastly different baselines would skew aggregated metrics.
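A sketch of why the log ratio aggregates cleanly across services (the epsilon guard is an assumption):

```python
import math

def log_ratio(before_rep, after_rep, eps=1e-9):
    """Symmetric change measure: a 2x regression (+ln 2) and a 2x
    improvement (-ln 2) cancel when averaged, unlike +100% vs -50%
    in raw relative terms."""
    return math.log(max(after_rep, eps) / max(before_rep, eps))

# Two services with 100x different baselines: one doubled, one halved.
# Raw relative changes would average to +25%; log ratios average to 0.
pairs = [(5.0, 10.0), (500.0, 250.0)]  # (before_rep, after_rep)
mean_lr = sum(log_ratio(b, a) for b, a in pairs) / len(pairs)
```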
### Thresholds (Empirically Chosen)
| Percentile | Relative Threshold (r0) | Absolute Threshold (Δ0) |
|---|---|---|
| P50 | 3% | 0.2 ms |
| P95 | 5% | 1 ms |
| P99 | 8-10% | 2 ms |
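The table restated as a lookup, as it might appear in a detector config. Collapsing the 8-10% P99 band to its 9% midpoint is an assumption, not from the write-up:

```python
# Degradation thresholds per percentile (from the table above)
THRESHOLDS = {
    50: {"r0": 0.03, "d0_ms": 0.2},   # P50: 3% relative, 0.2 ms absolute
    95: {"r0": 0.05, "d0_ms": 1.0},   # P95: 5% relative, 1 ms absolute
    99: {"r0": 0.09, "d0_ms": 2.0},   # P99: 8-10% band -> 9% midpoint, 2 ms
}
```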
### Tail Effect Analysis
Comparing instability before and after the change indicates whether tail behavior became heavier or lighter relative to median behavior.
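One plausible formulation, assuming the tail effect is the ratio of instabilities (the exact formula is not spelled out in the text):

```python
def tail_effect(before_instability, after_instability):
    """Hypothetical tail-effect metric: ratio of instability after vs
    before. > 1 means the tail grew heavier relative to the median;
    < 1 means it got lighter."""
    return after_instability / before_instability

# instability rose from 1.4 to 1.7: the tail got heavier vs the median
tail_effect(1.4, 1.7)  # > 1
```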
## Production Findings

### Results from 67 Services
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| API Latency | Stable | Stable | Stable |
| STW Duration | +10-20% | Stable | Stable |
| Error Rate | Stable | N/A | N/A |
### Key Observations
- P99 remained stable: No degradation in tail latency despite larger heaps
- Individual STW increased: Each pause ~10-20% longer (expected due to larger heap)
- Total STW decreased: Fewer GC cycles → less total pause time
- No incidents: Zero OOM, zero latency-related failures during major sales event
## Why This Method Works
- Median (`repr`): robust to outliers in the time series
- P95 (`hi`): captures tail behavior without being as volatile as max
- Instability ratio: normalizes for baseline differences
- Deadband (`Δ > Δ0`): prevents false positives on tiny absolute changes
- Log ratio: handles asymmetric changes and vast baseline differences
## Related Topics
- GC Mark Theory - STW impact on latency
- RSS vs Heap Target - Memory behavior analysis