STW Paradox: Individual Pauses Increase, Total STW Decreases¶
Context¶
One of the most counterintuitive observations from our GC tuning campaign: Individual STW pauses increased by 10-20%, but total STW time decreased. This "paradox" challenges the intuition that "longer pauses are always worse."
Theoretical Prediction¶
From GC Mark article:
- Larger heap → more objects to scan → longer mark phase per GC cycle
- Higher GOGC → fewer GC cycles → less total GC work
- Net effect: Individual pauses longer, but total pause time may decrease
Production Observation¶
Results from 67 Services¶
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| API Latency | Stable | Stable | Stable |
| STW Duration | +10-20% | Stable | Stable |
| Error Rate | Stable | N/A | N/A |
Key Findings¶
- Individual STW increased: Each pause ~10-20% longer (due to larger heap)
- Total STW decreased: Fewer GC cycles → less total pause time
- P99 stable: No degradation in tail latency
- No incidents: Zero OOM, zero latency-related failures during major sales event
Why This Happens¶
The Math Behind the Paradox¶
Before tuning (GOGC=100):
- Target heap = live_heap × 2
- GC cycles = 100 per hour (illustrative)
- STW per cycle = 2 ms
- Total STW = 100 × 2 ms = 200 ms/hour
After tuning (GOGC=200):
- Target heap = live_heap × 3
- GC cycles = 50 per hour (half as many)
- STW per cycle = 2.2 ms (+10% due to the larger heap)
- Total STW = 50 × 2.2 ms = 110 ms/hour
Result: Individual pauses 10% longer, but total STW time 45% lower.
Why P99 Remains Stable¶
- STW is brief: Even with 20% increase, pauses remain < 5 ms for most services
- Fewer pauses: Halving GC frequency halves the probability of hitting a bad percentile
- Mark termination STW: The second STW remains brief because write barrier work is proportional to pointer modifications, not heap size
Theoretical Validation¶
From GC Mark article:
Mark Phase Cost (First STW)¶
- Scans more objects → longer
- But still brief relative to total cycle time
Mark Termination Cost (Second STW)¶
- Reschedules goroutines
- Finalizes global state
- Proportional to mutator activity, not heap size
This explains why total STW decreases despite larger heaps.
Quantitative Evidence¶
GC Frequency Reduction¶
From production data:
- Average GOGC: 100 → 150-200
- GC frequency: reduced by 30-40%
- Per-cycle STW: increased by 10-20%
- Total STW: decreased by 15-25%
Latency Impact¶
During the major sales event (peak traffic):
- API P50: unchanged
- API P95: unchanged
- API P99: unchanged
- STW P50: +10-20%
- STW P95: stable
- STW P99: stable
Why This Matters¶
Common Fear: "Larger Heap = Worse Latency"¶
Intuition suggests:
- Larger heap → longer GC pauses → worse P99 latency
- Therefore, keep the heap small to minimize pauses
Reality: Total Pause Time Matters More¶
- Frequency matters: Fewer pauses = fewer opportunities for tail events
- Amortized cost: Spreading work over fewer cycles reduces total overhead
- User experience: P99 latency depends on total pause frequency, not individual pause duration
Practical Implications¶
For Latency-Sensitive Services¶
Question: Should I keep GOGC low to minimize STW?
Answer: Not necessarily. Consider:
1. Current GC frequency: if GC runs frequently, a higher GOGC may reduce total STW
2. Per-pause duration: if pauses are already brief (< 5 ms), a 10-20% increase is negligible
3. P99 baseline: if P99 is dominated by application logic (not GC), STW changes won't show
For Tuning Strategy¶
Monitor these metrics:
- gc_pause_duration_seconds (p50, p95, p99)
- gc_duration_seconds (total cycle time)
- gc_cycles_total (frequency)
Optimal tuning point:
- Where frequency × per_pause_duration is minimized
- Not necessarily where per_pause_duration is minimized
For Capacity Planning¶
- CPU vs Latency tradeoff: Higher GOGC reduces CPU but may increase per-pause duration
- Sweet spot: in our data, GOGC 150-200 balanced CPU savings with a stable P99
- Beyond 200: Diminishing returns, higher OOM risk
Edge Cases¶
When STW Paradox Doesn't Hold¶
- Huge live datasets: If heap is dominated by live objects (not garbage), each cycle scans more
- Real-time constraints: Some systems have strict per-pause limits (e.g., 1 ms max)
- Very small heaps: Already at minimum pause time, can't reduce further
When to Prioritize Per-Pause Duration¶
- Hard latency SLAs: E.g., financial trading with sub-millisecond requirements
- Interactive systems: Where individual pause duration is visible (e.g., GUI)
- Real-time guarantees: Systems that cannot tolerate any pause outliers
Related Topics¶
- GC Mark Theory - STW phases and their costs
- GC Pacer Theory - How GOGC affects trigger points and frequency
- Latency Tail Analysis - How we measured STW impact