STW Paradox: Individual Pauses Increase, Total STW Decreases

Context

One of the most counterintuitive observations from our GC tuning campaign: Individual STW pauses increased by 10-20%, but total STW time decreased. This "paradox" challenges the intuition that "longer pauses are always worse."

Theoretical Prediction

From the GC Mark article:

  • Larger heap → more objects to scan → longer mark phase per GC cycle
  • Higher GOGC → fewer GC cycles → less total GC work
  • Net effect: Individual pauses longer, but total pause time may decrease

Production Observation

Results from 67 Services

Metric        P50       P95     P99
------------  --------  ------  ------
API Latency   Stable    Stable  Stable
STW Duration  +10-20%   Stable  Stable
Error Rate    Stable    N/A     N/A

Key Findings

  1. Individual STW increased: Each pause ~10-20% longer (due to larger heap)
  2. Total STW decreased: Fewer GC cycles → less total pause time
  3. P99 stable: No degradation in tail latency
  4. No incidents: Zero OOM, zero latency-related failures during major sales event

Why This Happens

The Math Behind the Paradox

Before tuning (GOGC=100):

  • Target heap = live_heap × 2
  • GC cycles = 100 per hour (example)
  • STW per cycle = 2 ms
  • Total STW = 100 × 2 ms = 200 ms/hour

After tuning (GOGC=200):

  • Target heap = live_heap × 3
  • GC cycles = 50 per hour (half as many)
  • STW per cycle = 2.2 ms (+10% due to larger heap)
  • Total STW = 50 × 2.2 ms = 110 ms/hour

Result: Individual pauses 10% longer, but total STW time 45% lower.
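The arithmetic can be checked in a few lines of Go. The cycle counts and pause durations below are the illustrative figures from the text, not measurements:

```go
package main

import "fmt"

func main() {
	// Before tuning: GOGC=100 (illustrative numbers from the text).
	beforeCycles := 100.0 // GC cycles per hour
	beforePause := 2.0    // ms of STW per cycle
	beforeTotal := beforeCycles * beforePause

	// After tuning: GOGC=200 halves the cycle count but
	// lengthens each pause by ~10%.
	afterCycles := 50.0
	afterPause := 2.2
	afterTotal := afterCycles * afterPause

	fmt.Printf("before: %.0f ms/hour\n", beforeTotal) // 200 ms/hour
	fmt.Printf("after:  %.0f ms/hour\n", afterTotal)  // 110 ms/hour
	fmt.Printf("reduction: %.0f%%\n", 100*(1-afterTotal/beforeTotal))
}
```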

Why P99 Remains Stable

  1. STW is brief: Even with 20% increase, pauses remain < 5 ms for most services
  2. Fewer pauses: Halving GC frequency halves the probability of hitting a bad percentile
  3. Mark termination STW: The second STW remains brief because write barrier work is proportional to pointer modifications, not heap size
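Point 2 can be made concrete with a back-of-the-envelope model: assuming uniformly random request arrivals and rare, non-overlapping pauses, the chance that a request of duration d overlaps a pause is roughly frequency × (pause + d) / window. A sketch using the example figures from above (the 10 ms request time is a made-up assumption):

```go
package main

import "fmt"

// overlapProb approximates the probability that a request of duration d (ms)
// overlaps at least one STW pause, given f pauses per hour, each p ms long.
// Assumes uniformly random arrivals and pauses rare enough not to overlap.
func overlapProb(f, p, d float64) float64 {
	const hourMs = 3_600_000.0
	return f * (p + d) / hourMs
}

func main() {
	d := 10.0 // hypothetical request service time, ms
	before := overlapProb(100, 2.0, d) // GOGC=100: 100 pauses/hour, 2 ms each
	after := overlapProb(50, 2.2, d)   // GOGC=200: 50 pauses/hour, 2.2 ms each
	fmt.Printf("P(overlap) before: %.6f\n", before)
	fmt.Printf("P(overlap) after:  %.6f\n", after)
	fmt.Printf("ratio: %.2f\n", after/before)
}
```

Even though each pause is 10% longer, halving the frequency roughly halves the collision probability, which is why the STW tail percentiles stay flat.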

Theoretical Validation

From the GC Mark article:

Mark Phase Cost (First STW)

  • Scans more objects → longer
  • But still brief relative to total cycle time

Mark Termination Cost (Second STW)

  • Reschedules goroutines
  • Finalizes global state
  • Proportional to mutator activity, not heap size

This explains why total STW decreases despite larger heaps.

Quantitative Evidence

GC Frequency Reduction

From production data:

  • Average GOGC: 100 → 150-200
  • GC frequency: reduced by 30-40%
  • Per-cycle STW: increased by 10-20%
  • Total STW: decreased by 15-25%

Latency Impact

During the major sales event (peak traffic):

  • API P50: unchanged
  • API P95: unchanged
  • API P99: unchanged
  • STW P50: +10-20%
  • STW P95: stable
  • STW P99: stable

Why This Matters

Common Fear: "Larger Heap = Worse Latency"

Intuition suggests:

  • Larger heap → longer GC pauses → worse P99 latency
  • Therefore, keep the heap small to minimize pauses

Reality: Total Pause Time Matters More

  • Frequency matters: Fewer pauses = fewer opportunities for tail events
  • Amortized cost: Spreading work over fewer cycles reduces total overhead
  • User experience: P99 latency depends on total pause frequency, not individual pause duration

Practical Implications

For Latency-Sensitive Services

Question: Should I keep GOGC low to minimize STW?

Answer: Not necessarily. Consider:

  1. Current GC frequency: if GC runs frequently, a higher GOGC may reduce total STW
  2. Per-pause duration: if pauses are already brief (< 5 ms), a 10-20% increase is negligible
  3. P99 baseline: if P99 is dominated by application logic (not GC), STW changes won't show up

For Tuning Strategy

Monitor these metrics:

  • gc_pause_duration_seconds (p50, p95, p99)
  • gc_duration_seconds (total cycle time)
  • gc_cycles_total (frequency)

Optimal tuning point:

  • Where frequency × per_pause_duration is minimized
  • Not necessarily where per_pause_duration is minimized

For Capacity Planning

  • CPU vs Latency tradeoff: Higher GOGC reduces CPU but may increase per-pause duration
  • Sweet spot: in production, GOGC 150-200 balanced CPU savings against a stable P99
  • Beyond 200: Diminishing returns, higher OOM risk

Edge Cases

When STW Paradox Doesn't Hold

  1. Huge live datasets: If heap is dominated by live objects (not garbage), each cycle scans more
  2. Real-time constraints: Some systems have strict per-pause limits (e.g., 1 ms max)
  3. Very small heaps: Already at minimum pause time, can't reduce further

When to Prioritize Per-Pause Duration

  • Hard latency SLAs: E.g., financial trading with sub-millisecond requirements
  • Interactive systems: Where individual pause duration is visible (e.g., GUI)
  • Real-time guarantees: Systems that cannot tolerate any pause outliers