Large-scale Go GC Tuning in Production

This article documents a production GC tuning project that achieved CPU savings across hundreds of services.

Project Scale:

  • 67 services selected from 6700+ candidates
  • 49K instances affected (42% coverage out of 116K total)
  • Total CPU quota: 682K cores
  • Total memory: 1.11 PB
  • Statistical window: 1 hour after campaign start (12.12 sales event)

1. Background and Goal

Goal: Reduce CPU usage for services with large CPU quotas.

Hypothesis: If we increase memory utilization by increasing GOGC, we can:

  1. Trigger GC less frequently
  2. Reduce GC CPU overhead
  3. Improve application throughput

2. Memory and CPU

Core Mechanism

GC tuning improves instance memory utilization, triggers fewer GC cycles, and reduces application CPU usage by lowering GC CPU overhead.

Baseline vs Feature

The heatmap below offers a comprehensive overview of memory and CPU usage for all candidate services:

The top two boards show the memory distribution of GC-tuner-enabled services: with the preferred target heap set, most instances' memory usage stays at 70%. Narrowing the heatmap (top-right board), most instances sit at 69-71%.

Memory and CPU Distribution

The CPU usage heatmap shows CPU usage with the feature enabled distributed at 35-45%, versus 40-45% in the baseline: roughly a 5% CPU usage improvement.

Note: the timespan is quite short (3 minutes only) because the distribution graph joins all 116K instances across two separate metric sources (memory/CPU usage and the runtime toggle) to offer a better distribution view.

GC Tuned Memory During Campaign

During the campaign, the GC tuner sets 70% memory usage as the target heap, so the GC tries to keep memory usage under 70% and total container memory usage stays near 70%.
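One way to approximate this "70% target heap" behavior with the stock runtime is Go's soft memory limit (Go 1.19+); this is a sketch under that assumption, not the tuner's actual implementation, and the 4 GiB container size is hypothetical (a real tuner would read it from the cgroup):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// targetBytes computes a 70% memory target from the container limit.
func targetBytes(containerLimit int64) int64 {
	return containerLimit * 70 / 100
}

func main() {
	limit := targetBytes(4 << 30) // hypothetical 4 GiB container
	// The runtime paces GC so total memory stays near the soft limit.
	debug.SetMemoryLimit(limit)
	fmt.Printf("soft memory limit: %d bytes\n", limit)
}
```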

The concrete memory usage distribution below, for all GC-tuner-enabled instances, demonstrates that the Go runtime GC is well able to keep memory usage at a reasonable level.

Before campaign:

Memory Before Campaign

During campaign: Memory locked at preset boundaries. Stability was high.

Memory During Campaign

After campaign:

Memory After Campaign

Memory After Campaign - Additional View 1 Memory After Campaign - Additional View 2

Overall memory trend:

Memory Trend 1 Memory Trend 2 Memory Trend 3 Memory Trend 4 Memory Trend 5

Overall memory changes:

Overall Memory Changes 1 Overall Memory Changes 2 Overall Memory Changes 3

3. STW and App Latency

Campaign execution:

  • Zero incidents: No OOM, no latency-related failures
  • Zero alert fatigue: Alert thresholds proactively adjusted following our SOP
  • Smooth execution: Campaign completed without any alerts

Memory and CPU During Peak

4. Service Selection

Selection criteria:

  • Large CPU quota and sufficient memory
  • Ignore services with 70%+ memory usage (not suitable for GC tuner)
  • Select top 100 services by CPU quota

Rationale: By selecting services meeting these criteria, we balanced maximizing CPU coverage with manageable rollout workload.
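The filter-then-rank selection above can be sketched as follows; the `Service` type, field names, and sample data are illustrative, not the project's actual service inventory:

```go
package main

import (
	"fmt"
	"sort"
)

// Service holds the two selection inputs: CPU quota and current
// memory utilization (as a fraction of the container limit).
type Service struct {
	Name     string
	CPUQuota float64 // cores
	MemUsage float64 // fraction of container memory in use
}

// selectCandidates drops services already running hot on memory
// (70%+ is unsuitable for GC tuning), then keeps the top N by CPU
// quota to maximize CPU coverage per rolled-out service.
func selectCandidates(all []Service, topN int) []Service {
	var out []Service
	for _, s := range all {
		if s.MemUsage < 0.70 {
			out = append(out, s)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].CPUQuota > out[j].CPUQuota })
	if len(out) > topN {
		out = out[:topN]
	}
	return out
}

func main() {
	svcs := []Service{
		{"a", 50700, 0.55},
		{"b", 49500, 0.80}, // excluded: memory already above 70%
		{"c", 47700, 0.60},
	}
	for _, s := range selectCandidates(svcs, 100) {
		fmt.Println(s.Name) // prints "a" then "c"
	}
}
```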

Top Services (Anonymized)

| Service Type          | Memory   | CPU Cores | Instance Count |
|-----------------------|----------|-----------|----------------|
| Pricing Service A     | 99.1 TiB | 50.7 K    | 6.36 K         |
| Promotion Service B   | 96.8 TiB | 49.5 K    | 6.22 K         |
| Price Boundary C      | 93.2 TiB | 47.7 K    | 5.97 K         |
| Listing Aggregation D | 74.4 TiB | 39.1 K    | 4.91 K         |
| Usage API E           | 71.6 TiB | 36.5 K    | 9.18 K         |

5. GC Tuner Design

Mechanism

The GC tuner uses a runtime toggle to enable GC tuning for specific instance percentages. It checks current live heap and memory usage to determine whether to adjust GOGC. A dedicated gc_tuner_status metric indicates the current state.

Safety

  • Additional container-level RSS exposed (cluster IP is container-level, business metrics are instance-level)
  • Go runtime metrics validate GOGC and target_heap match expectations
  • GC tuner is adaptive—all safety protections except manual config changes are automatic

6. Data Collection

Metrics Collected

  • Container CPU usage (cgroup metrics)
  • Memory RSS (container level)
  • Go runtime metrics (go_memstats_*)
  • API latency: P50 / P95 / P99
  • Application error rate
  • OOM / restart count

Latency Analysis Method

Query: Pull 10-minute window of p50/p95/p99 time series

Window processing: Extract "window features" for each service, metric (API latency, STW), and percentile (50/95/99)

Metrics:

  • Main representative: median(x(t))
  • Bad level (p95-of-time): quantile(x(t), 0.95), the worst 5% of time in the 10-minute window
  • Stability/jitter: hi / repr, identifies "p99 tail jitter"
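A minimal sketch of this window-feature extraction; the nearest-rank quantile is an assumption, since the article does not specify the quantile method:

```go
package main

import (
	"fmt"
	"sort"
)

// windowFeatures collapses one window's latency time series into a
// representative value (median), a bad level (the 0.95 quantile over
// time), and a jitter ratio (hi / repr).
func windowFeatures(series []float64) (repr, hi, jitter float64) {
	x := append([]float64(nil), series...)
	sort.Float64s(x)
	repr = x[len(x)/2]                  // median (upper median for even n)
	hi = x[int(0.95*float64(len(x)-1))] // p95-of-time, nearest rank
	jitter = hi / repr
	return
}

func main() {
	series := []float64{10, 11, 10, 12, 10, 11, 10, 10, 30, 11} // one spike
	repr, hi, jitter := windowFeatures(series)
	fmt.Printf("repr=%.1f hi=%.1f jitter=%.2f\n", repr, hi, jitter)
	// prints repr=11.0 hi=12.0 jitter=1.09
}
```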

Comparison:

  • Absolute change: Δ = after_repr - before_repr
  • Relative change: r = (after_repr - before_repr) / before_repr
  • Degradation condition: (r > r0) AND (Δ > Δ0)

Thresholds:

  • p50: r0 = 3%, Δ0 = 0.2 ms
  • p95: r0 = 5%, Δ0 = 1 ms
  • p99: r0 = 8-10%, Δ0 = 2 ms
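The degradation condition translates directly into code; the helper name and the sample latencies below are illustrative:

```go
package main

import "fmt"

// degraded flags a regression only when the relative change exceeds
// r0 AND the absolute change exceeds Δ0, so that small absolute
// shifts on fast endpoints (or noisy percentiles) don't trip the
// check on their own.
func degraded(beforeMs, afterMs, r0, deltaMs float64) bool {
	delta := afterMs - beforeMs
	r := delta / beforeMs
	return r > r0 && delta > deltaMs
}

func main() {
	// p95 thresholds from the article: r0 = 5%, Δ0 = 1 ms.
	fmt.Println(degraded(20.0, 22.0, 0.05, 1.0)) // +10% and +2 ms -> true
	fmt.Println(degraded(2.0, 2.5, 0.05, 1.0))   // +25% but only +0.5 ms -> false
}
```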

7. Production Observations

Throughout this project, we documented several production-only findings—behaviors that can only be discovered by running systems at scale:

| Observation           | Key Finding                                                                       |
|-----------------------|-----------------------------------------------------------------------------------|
| RSS vs Heap Target    | RSS (79.5%) exceeds heap target (70%) due to scavenger buffer + non-heap overhead |
| Latency Tail Analysis | Window feature extraction method for detecting P99 jitter                         |
| Memory Lock Behavior  | Memory usage "locks" at preset bounds during steady state                         |
| Scavenger Elasticity  | Memory scales with load and returns to OS after traffic decreases                 |
| Assist Ratio Spikes   | Transient spikes during traffic surges, not sustained elevation                   |
| STW Paradox           | Individual pauses ↑10-20%, but total STW time decreased                           |

Further Reading

Theory Articles

Production Observations