Latency Tail Analysis: Window Feature Extraction Method

Context

When analyzing GC tuning impact on production services, we needed a robust method to detect latency degradation, especially in P99 tails. Standard averaging doesn't work because:

  - Different services have vastly different baselines (P99 from 5 ms to 500 ms)
  - Direct averaging of relative changes is skewed by outliers
  - We need to identify "jitter" in the tail, not just average shifts

Analysis Method

Data Collection

For each service, metric (API latency, STW), and percentile (P50/P95/P99):

  1. Pull a 10-minute time series x(t) at 15-second intervals (~40 data points)
  2. Extract window features from each series

Window Feature Extraction

repr = median(x(t))              # Main representative value
hi = quantile(x(t), 0.95)        # Worst 5% behavior (more stable than max)
instability = hi / repr          # Tail severity metric

Key insight: instability identifies P99 tail jitter that simple averages miss.
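
The three features above can be sketched in Python (a minimal sketch; NumPy and the epsilon guard against a zero median are assumptions, not part of the original pipeline):

```python
import numpy as np

def window_features(x, eps=1e-9):
    """Extract window features from one ~40-point latency series."""
    x = np.asarray(x, dtype=float)
    repr_ = float(np.median(x))           # main representative value
    hi = float(np.quantile(x, 0.95))      # worst 5% behavior (more stable than max)
    instability = hi / max(repr_, eps)    # tail severity metric
    return repr_, hi, instability
```

A spiky window yields a high instability ratio even when the median looks healthy, which is exactly the P99 jitter signal.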

Baseline Comparison

Δ = after_repr - before_repr                     # Absolute change
r = (after_repr - before_repr) / before_repr     # Relative change (with epsilon cutoff)
degraded = (r > r0) AND (Δ > Δ0)                 # Degradation condition

The deadband (Δ > Δ0) prevents noise from triggering false positives on small baseline values.
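
A sketch of the combined condition (function and parameter names are illustrative; the epsilon cutoff guards the relative change against tiny baselines):

```python
def is_degraded(before_repr, after_repr, r0, delta0, eps=1e-9):
    """Flag degradation only when BOTH the relative threshold r0
    and the absolute deadband delta0 are exceeded."""
    delta = after_repr - before_repr        # absolute change
    r = delta / max(before_repr, eps)       # relative change, epsilon cutoff
    return r > r0 and delta > delta0
```

On a 0.5 ms baseline, a jump to 1 ms is +100% relatively but only Δ = 0.5 ms absolutely, so a 2 ms deadband suppresses the false positive.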

Cross-Service Analysis

Use log ratio to handle asymmetric changes:

ℓ = ln(after/before)

This addresses the issue where services with vastly different baselines would skew aggregated metrics.
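
A small sketch of why the log ratio aggregates cleanly (the epsilon guard is an assumption):

```python
import math

def log_ratio(before, after, eps=1e-9):
    """ln(after/before): symmetric in direction, additive across windows."""
    return math.log(max(after, eps) / max(before, eps))

# A 2x regression and a 2x improvement cancel exactly under log ratios,
# whereas raw relative changes (+100% vs. -50%) would average to +25%.
net = log_ratio(10.0, 20.0) + log_ratio(10.0, 5.0)   # ~0.0
```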

Thresholds (Empirically Chosen)

Percentile    Relative Threshold (r0)    Absolute Threshold (Δ0)
P50           3%                         0.2 ms
P95           5%                         1 ms
P99           8-10%                      2 ms
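
The thresholds above can be encoded as a lookup (a sketch; the dict layout is illustrative, and picking the conservative 10% end of P99's 8-10% range is an assumption):

```python
# (r0, delta0 in ms) per percentile, matching the thresholds above.
THRESHOLDS = {
    "p50": (0.03, 0.2),
    "p95": (0.05, 1.0),
    "p99": (0.10, 2.0),  # assumption: 10% end of the 8-10% range
}

r0, delta0 = THRESHOLDS["p99"]
```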

Tail Effect Analysis

tail_ratio = repr_p99 / repr_p50

Indicates whether tail behavior became heavier or lighter relative to median behavior.
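
A minimal sketch (variable values are illustrative; the epsilon guard is an assumption):

```python
def tail_ratio(repr_p99, repr_p50, eps=1e-9):
    """Tail heaviness relative to the median."""
    return repr_p99 / max(repr_p50, eps)

# The tail got heavier after the change even though the median held steady.
before = tail_ratio(50.0, 10.0)   # 5.0
after = tail_ratio(80.0, 10.0)    # 8.0
```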

Production Findings

Results from 67 Services

Metric         P50        P95      P99
API Latency    Stable     Stable   Stable
STW Duration   +10-20%    Stable   Stable
Error Rate     Stable     N/A      N/A

Key Observations

  1. P99 remained stable: No degradation in tail latency despite larger heaps
  2. Individual STW increased: Each pause ~10-20% longer (expected due to larger heap)
  3. Total STW decreased: Fewer GC cycles → less total pause time
  4. No incidents: Zero OOM, zero latency-related failures during major sales event

Why This Method Works

  1. Median (repr): Robust to outliers in the time series
  2. P95 (hi): Captures tail behavior without being as volatile as max
  3. Instability ratio: Normalizes for baseline differences
  4. Deadband (Δ > Δ0): Prevents false positives on tiny absolute changes
  5. Log ratio: Handles asymmetric changes and vast baseline differences