STW Paradox: Individual Pauses Increase, Total STW Decreases¶
Context¶
One of the most counterintuitive observations from our GC tuning campaign: Individual STW pauses increased by 10-20%, but total STW time decreased. This "paradox" challenges the intuition that "longer pauses are always worse."
Theoretical Prediction¶
From GC Mark article:
- Larger heap → more objects to scan → longer mark phase per GC cycle
- Higher GOGC → fewer GC cycles → less total GC work
- Net effect: Individual pauses longer, but total pause time may decrease
Production Observation¶
Results from 67 Services¶
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| API Latency | Stable | Stable | Stable |
| STW Duration | +10-20% | Stable | Stable |
| Error Rate | Stable | N/A | N/A |
Key Findings¶
- Individual STW increased: Each pause ~10-20% longer (due to larger heap)
- Total STW decreased: Fewer GC cycles → less total pause time
- P99 stable: No degradation in tail latency
- No incidents: Zero OOM, zero latency-related failures during major sales event
Why This Happens¶
The Math Behind the Paradox¶
Before tuning (GOGC=100):
- Target heap = live_heap × 2
- GC cycles = 100 per hour (illustrative)
- STW per cycle = 2 ms
- Total STW = 100 × 2 ms = 200 ms/hour
After tuning (GOGC=200):
- Target heap = live_heap × 3
- GC cycles = 50 per hour (half as many)
- STW per cycle = 2.2 ms (+10% due to the larger heap)
- Total STW = 50 × 2.2 ms = 110 ms/hour
Result: Individual pauses 10% longer, but total STW time 45% lower.
Why P99 Remains Stable¶
- STW is brief: Even with 20% increase, pauses remain < 5 ms for most services
- Fewer pauses: Halving GC frequency halves the probability of hitting a bad percentile
- Mark termination STW: The second STW remains brief because write barrier work is proportional to pointer modifications, not heap size
Theoretical Validation¶
From GC Mark article:
Mark Phase Cost (First STW)¶
- Scans more objects → longer
- But still brief relative to total cycle time
Mark Termination Cost (Second STW)¶
- Reschedules goroutines
- Finalizes global state
- Proportional to mutator activity, not heap size
This explains why total STW decreases despite larger heaps.
Quantitative Evidence¶
GC Frequency Reduction¶
From production data:
- Average GOGC: 100 → 150-200
- GC frequency: reduced by 30-40%
- Per-cycle STW: increased by 10-20%
- Total STW: decreased by 15-25%
Latency Impact¶
During the major sales event (peak traffic):
- API P50: unchanged
- API P95: unchanged
- API P99: unchanged
- STW P50: +10-20%
- STW P95: stable
- STW P99: stable
Why This Matters¶
Common Fear: "Larger Heap = Worse Latency"¶
Intuition suggests:
- Larger heap → longer GC pauses → worse P99 latency
- Therefore, keep the heap small to minimize pauses
Reality: Total Pause Time Matters More¶
- Frequency matters: Fewer pauses = fewer opportunities for tail events
- Amortized cost: Spreading work over fewer cycles reduces total overhead
- User experience: P99 latency depends on total pause frequency, not individual pause duration
Practical Implications¶
For Latency-Sensitive Services¶
Question: Should I keep GOGC low to minimize STW?
Answer: Not necessarily. Consider:
1. Current GC frequency: if GC runs frequently, a higher GOGC may reduce total STW
2. Per-pause duration: if pauses are already brief (< 5 ms), a 10-20% increase is negligible
3. P99 baseline: if P99 is dominated by application logic (not GC), STW changes won't show
For Tuning Strategy¶
Monitor these metrics:
- gc_pause_duration_seconds (p50, p95, p99)
- gc_duration_seconds (total cycle time)
- gc_cycles_total (frequency)
Optimal tuning point:
- Where frequency × per_pause_duration is minimized
- Not necessarily where per_pause_duration is minimized
For Capacity Planning¶
- CPU vs Latency tradeoff: Higher GOGC reduces CPU but may increase per-pause duration
- Sweet spot: in our data, GOGC 150-200 balanced CPU savings with a stable P99
- Beyond 200: Diminishing returns, higher OOM risk
Edge Cases¶
When STW Paradox Doesn't Hold¶
- Huge live datasets: If heap is dominated by live objects (not garbage), each cycle scans more
- Real-time constraints: Some systems have strict per-pause limits (e.g., 1 ms max)
- Very small heaps: Already at minimum pause time, can't reduce further
When to Prioritize Per-Pause Duration¶
- Hard latency SLAs: E.g., financial trading with sub-millisecond requirements
- Interactive systems: Where individual pause duration is visible (e.g., GUI)
- Real-time guarantees: Systems that cannot tolerate any pause outliers
Related Topics¶
- GC Mark Theory - STW phases and their costs
- GC Pacer Theory - How GOGC affects trigger points and frequency
- Latency Tail Analysis - How we measured STW impact