Latency Tail Analysis: Window Feature Extraction Method

Context

When analyzing GC tuning impact on production services, we needed a robust method to detect latency degradation, especially in P99 tails. Standard averaging doesn't work because:

  - Different services have vastly different baselines (P99 from 5 ms to 500 ms)
  - Direct averaging of relative changes is skewed by outliers
  - We need to identify "jitter" in the tail, not just average shifts

Analysis Method

Data Collection

For each service, metric (API latency, STW), and percentile (P50/P95/P99):

  1. Pull a 10-minute time series x(t) at 15-second intervals (~40 data points)
  2. Extract window features from each series

Window Feature Extraction

repr = median(x(t))              # Main representative value
hi = quantile(x(t), 0.95)        # Worst 5% behavior (more stable than max)
instability = hi / repr          # Tail severity metric

Key insight: instability identifies P99 tail jitter that simple averages miss.
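
The three features above can be sketched in Python (a minimal sketch; NumPy and the epsilon guard against a zero median are assumptions, not part of the original pipeline):

```python
import numpy as np

def window_features(x, eps=1e-9):
    """Extract window features from one ~40-point latency series."""
    x = np.asarray(x, dtype=float)
    repr_ = float(np.median(x))           # main representative value
    hi = float(np.quantile(x, 0.95))      # worst 5% behavior (more stable than max)
    instability = hi / max(repr_, eps)    # tail severity metric
    return repr_, hi, instability
```

A spiky window yields a high instability ratio even when the median looks healthy, which is exactly the P99 jitter signal.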

Baseline Comparison

Δ = after_repr - before_repr                     # Absolute change
r = (after_repr - before_repr) / before_repr     # Relative change (with epsilon cutoff)
degraded = (r > r0) AND (Δ > Δ0)                 # Degradation condition

The deadband (Δ > Δ0) prevents noise from triggering false positives on small baseline values.
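
A sketch of the combined condition (function and parameter names are illustrative; the epsilon cutoff guards the relative change against tiny baselines):

```python
def is_degraded(before_repr, after_repr, r0, delta0, eps=1e-9):
    """Flag degradation only when BOTH the relative threshold r0
    and the absolute deadband delta0 are exceeded."""
    delta = after_repr - before_repr        # absolute change
    r = delta / max(before_repr, eps)       # relative change, epsilon cutoff
    return r > r0 and delta > delta0
```

On a 0.5 ms baseline, a jump to 1 ms is +100% relatively but only Δ = 0.5 ms absolutely, so a 2 ms deadband suppresses the false positive.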

Cross-Service Analysis

Use log ratio to handle asymmetric changes:

ℓ = ln(after/before)

This addresses the issue where services with vastly different baselines would skew aggregated metrics.
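
A small sketch of why the log ratio aggregates cleanly (the epsilon guard is an assumption):

```python
import math

def log_ratio(before, after, eps=1e-9):
    """ln(after/before): symmetric in direction, additive across windows."""
    return math.log(max(after, eps) / max(before, eps))

# A 2x regression and a 2x improvement cancel exactly under log ratios,
# whereas raw relative changes (+100% vs. -50%) would average to +25%.
net = log_ratio(10.0, 20.0) + log_ratio(10.0, 5.0)   # ~0.0
```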

Thresholds (Empirically Chosen)

Percentile    Relative Threshold (r0)    Absolute Threshold (Δ0)
P50           3%                         0.2 ms
P95           5%                         1 ms
P99           8-10%                      2 ms
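
The thresholds above can be encoded as a lookup (a sketch; the dict layout is illustrative, and picking the conservative 10% end of P99's 8-10% range is an assumption):

```python
# (r0, delta0 in ms) per percentile, matching the thresholds above.
THRESHOLDS = {
    "p50": (0.03, 0.2),
    "p95": (0.05, 1.0),
    "p99": (0.10, 2.0),  # assumption: 10% end of the 8-10% range
}

r0, delta0 = THRESHOLDS["p99"]
```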

Tail Effect Analysis

tail_ratio = repr_p99 / repr_p50

Indicates whether tail behavior became heavier or lighter relative to median behavior.
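
A minimal sketch (variable values are illustrative; the epsilon guard is an assumption):

```python
def tail_ratio(repr_p99, repr_p50, eps=1e-9):
    """Tail heaviness relative to the median."""
    return repr_p99 / max(repr_p50, eps)

# The tail got heavier after the change even though the median held steady.
before = tail_ratio(50.0, 10.0)   # 5.0
after = tail_ratio(80.0, 10.0)    # 8.0
```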

Production Findings

Results from 67 Services

Metric         P50        P95      P99
API Latency    Stable     Stable   Stable
STW Duration   +10-20%    Stable   Stable
Error Rate     Stable     N/A      N/A

Key Observations

  1. P99 remained stable: No degradation in tail latency despite larger heaps
  2. Individual STW increased: Each pause ~10-20% longer (expected due to larger heap)
  3. Total STW decreased: Fewer GC cycles → less total pause time
  4. No incidents: Zero OOM, zero latency-related failures during major sales event

Why This Method Works

  1. Median (repr): Robust to outliers in the time series
  2. P95 (hi): Captures tail behavior without being as volatile as max
  3. Instability ratio: Normalizes for baseline differences
  4. Deadband (Δ > Δ0): Prevents false positives on tiny absolute changes
  5. Log ratio: Handles asymmetric changes and vast baseline differences