GC Pacer
Implementation in Go 1.23.12: src/runtime/mgcpacer.go
Target Heap Calculation
Target heap is the heap size GC tries to maintain. It is calculated mainly from GOGC and GOMEMLIMIT, with some minor adjustments.
The stack and global sizes are also multiplied by \(\frac{GOGC}{100}\); this reserves extra runway so the assist ratio won't surge and hurt the mutator.
The target heap calculated from GOGC is a soft goal, while GOMEMLIMIT is a hard upper bound. The default lower bound is \(4\text{MB}\). The final target heap will be:

\[
\text{targetHeap} = \max\Big(4\text{MB},\ \min\big(\text{memoryLimitGoal},\ \text{liveHeap} + (\text{liveHeap} + \text{stack} + \text{globals}) \times \tfrac{GOGC}{100}\big)\Big)
\]
Later we will see that, in some scenarios, the target heap is increased.
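The calculation above can be sketched as a standalone function. This is a hypothetical helper (`heapGoal`, `memLimitGoal` are my names) and a simplified model, not the runtime's actual `heapGoalInternal`:

```go
package main

import "fmt"

// heapGoal sketches how the pacer derives the target heap.
// All inputs are bytes; this is a simplified model, not runtime code.
func heapGoal(liveHeap, stackScan, globalsScan, gogc, memLimitGoal uint64) uint64 {
	// GOGC-based soft goal: the live heap plus GOGC% of the
	// scannable set (live heap + stacks + globals).
	goal := liveHeap + (liveHeap+stackScan+globalsScan)*gogc/100
	// GOMEMLIMIT acts as a hard upper bound.
	if goal > memLimitGoal {
		goal = memLimitGoal
	}
	// The default lower bound is 4MB.
	const heapMinimum = 4 << 20
	if goal < heapMinimum {
		goal = heapMinimum
	}
	return goal
}

func main() {
	// 100MB live heap, 8MB stacks, 2MB globals, GOGC=100, no practical limit.
	fmt.Println(heapGoal(100<<20, 8<<20, 2<<20, 100, 1<<62)) // → 220200960 (210MB)
}
```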
Static Trigger Point Calculation
Go GC tries to keep heap usage under the target heap, so a GC cycle must be triggered while heap usage is still smaller than the target. That position is called the trigger point, and it is determined by the GC pacer from historical data; it is not adaptive to the current scan and allocation speed.
The GC pacer computes the trigger point by deriving a runway from consMark, the ratio between allocation speed and scan speed.
// scan covers heap, goroutine stacks and globals.
// The heap scan is not the full heap; it only covers the live heap from the last cycle.
scanWork = heapScanWork + stackScanWork + globalsScanWork
currentConsMark :=
(
// heapLive is the number of bytes considered live by the GC.
// It keeps increasing until marking is done, as it records the possible live heap size.
float64(c.heapLive.Load()-c.triggered) // allocation size
* (utilization + idleUtilization)
) /
(float64(scanWork) * (1 - utilization))
GC picks the biggest of the 4 most recent consMark values to put less assist pressure on the mutator: it prefers starting the GC cycle earlier over making the mutator assist more.
// mgcpacer.go:660-663
// Update our cons/mark estimate. This is the maximum of the value we just computed and the last
// 4 cons/mark values we measured. The reason we take the maximum here is to bias a noisy
// cons/mark measurement toward fewer assists at the expense of additional GC cycles (starting
// earlier).
Then the runway is calculated by the formula below:
c.runway.Store(uint64((c.consMark * (1 - gcGoalUtilization) / (gcGoalUtilization)) *
float64(c.lastHeapScan+c.lastStackScan.Load()+c.globalsScan.Load())))
GC uses a hardcoded 25% as gcGoalUtilization, hence the formula becomes:

\[
\text{runway} = \text{consMark} \times \frac{1 - 0.25}{0.25} \times \text{lastScan} = 3 \times \text{consMark} \times (\text{lastHeapScan} + \text{lastStackScan} + \text{globalsScan})
\]

Besides, GC clamps the trigger point to guarantee at least a 5% buffer of runway. This follows the same idea as above: reduce assist pressure on the mutator.
const (
// These constants determine the bounds on the GC trigger as a fraction
// of heap bytes allocated between the start of a GC (heapLive == heapMarked)
// and the end of a GC (heapLive == heapGoal).
//
// The constants are obscured in this way for efficiency. The denominator
// of the fraction is always a power-of-two for a quick division, so that
// the numerator is a single constant integer multiplication.
triggerRatioDen = 64
// The minimum trigger constant was chosen empirically: given a sufficiently
// fast/scalable allocator with 48 Ps that could drive the trigger ratio
// to <0.05, this constant causes applications to retain the same peak
// RSS compared to not having this allocator.
minTriggerRatioNum = 45 // ~0.7
// The maximum trigger constant is chosen somewhat arbitrarily, but the
// current constant has served us well over the years.
maxTriggerRatioNum = 61 // ~0.95
)
The trigger point is limited to \([70\%, 95\%]\) of the span between the marked heap and the goal, so the runway is limited to \([5\%, 30\%]\).
- The minimum runway is 64KB; if the computed runway is too small, GC raises the heap goal instead of moving the trigger point.
- Sweeping also imposes a lower bound, ensuring a new GC cycle won't start before sweeping from the previous cycle is done.
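Putting the runway formula and the clamping together, here is a hypothetical sketch (`triggerPoint` is my name; the real logic lives in `gcControllerState.trigger` and is more involved):

```go
package main

import "fmt"

// triggerPoint sketches how the pacer turns consMark into a trigger point.
// All sizes are in bytes; this is a simplified model, not runtime code.
func triggerPoint(heapMarked, heapGoal, lastScanBytes uint64, consMark float64) uint64 {
	// runway = consMark * (1-u)/u * lastScan, with u fixed at 25%, so (1-u)/u = 3.
	runway := uint64(3 * consMark * float64(lastScanBytes))

	// Clamp the trigger into [45/64, 61/64] ~= [70%, 95%] of the span
	// between the marked heap and the goal.
	span := heapGoal - heapMarked
	minTrigger := heapMarked + span*45/64
	maxTrigger := heapMarked + span*61/64

	if runway >= heapGoal-minTrigger {
		return minTrigger // runway too large: trigger as early as allowed
	}
	trigger := heapGoal - runway
	if trigger > maxTrigger {
		trigger = maxTrigger // runway too small: keep at least 5% headroom
	}
	return trigger
}

func main() {
	// Marked heap 100MB, goal 200MB, last cycle scanned 100MB, consMark 0.05:
	// runway = 3 * 0.05 * 100MB = 15MB, so trigger = 200MB - 15MB = 185MB.
	fmt.Println(triggerPoint(100<<20, 200<<20, 100<<20, 0.05)) // → 193986560 (185MB)
}
```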
Possible consMark range
We can derive the formula below:

\[
\text{runway} = 3 \times \text{consMark} \times \text{scanWork}
\]

If \(\text{consMark} \ge \frac{1}{3}\) (with the scannable set comparable to the per-cycle allocation budget), the required runway covers the entire budget. Hence, the scan speed should always be at least 3 times the allocation speed; otherwise GC runs continuously.
However, the trigger point is clamped to \([70\%, 95\%]\), so the runway is limited to \([5\%, 30\%]\).
As a result, the effective \(\text{consMark}\) is limited to \([0.05/3,\ 0.30/3] \approx [0.017, 0.1]\).
Note that between two GC cycles, the trigger point in bytes is static (not adaptive): it was determined from the historical consMark at the end of the previous cycle. If allocation suddenly becomes 100x faster than the historical data, the trigger point (in bytes) won't shrink dynamically; the program simply reaches it much sooner in wall-clock time.
Once the trigger point is reached and the Mark phase starts, the defense mechanism switches. The assistRatio (calculated via c.revise()) becomes highly adaptive. It recalculates based on real-time remaining scan work vs. remaining heap runway. If allocation is still 100x faster, the assistRatio will instantly spike, forcing the mutators to pause and do the scan work themselves.
GC CPU Utilization
By default, GC utilization is set to 25%. For example, with GOMAXPROCS=8 there will be \(8 \times 25\% = 2\) dedicated workers for background GC.
However, if GOMAXPROCS is small and the 25% goal can't be matched by a whole number of dedicated workers, the pacer uses a fractional worker for GC. Concretely, if the absolute value of utilError exceeds \(30\%\), fractional mode is enabled.
The mode is calculated inside startCycle.
totalUtilizationGoal := float64(procs) * gcBackgroundUtilization
dedicatedMarkWorkersNeeded := int64(totalUtilizationGoal + 0.5)
utilError := float64(dedicatedMarkWorkersNeeded)/totalUtilizationGoal - 1
const maxUtilError = 0.3
if utilError < -maxUtilError || utilError > maxUtilError {
	// Rounding moved us >30% away from the goal: drop a dedicated
	// worker if we overshot, and cover the rest with a fractional worker.
	if float64(dedicatedMarkWorkersNeeded) > totalUtilizationGoal {
		dedicatedMarkWorkersNeeded--
	}
	c.fractionalUtilizationGoal = (totalUtilizationGoal - float64(dedicatedMarkWorkersNeeded)) / float64(procs)
}
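To see which GOMAXPROCS values fall into fractional mode, the rounding logic can be replayed in a standalone program (`markWorkerGoals` is a hypothetical name; this mirrors, but is not, the runtime code):

```go
package main

import "fmt"

// markWorkerGoals mirrors the rounding in startCycle: how many dedicated
// mark workers to run, and what per-P fraction a fractional worker must
// cover when rounding drifts more than 30% from the 25% goal.
func markWorkerGoals(procs int) (dedicated int64, fractional float64) {
	const gcBackgroundUtilization = 0.25
	totalUtilizationGoal := float64(procs) * gcBackgroundUtilization
	dedicated = int64(totalUtilizationGoal + 0.5) // round to nearest
	utilError := float64(dedicated)/totalUtilizationGoal - 1
	const maxUtilError = 0.3
	if utilError < -maxUtilError || utilError > maxUtilError {
		if float64(dedicated) > totalUtilizationGoal {
			dedicated-- // overshot: drop one dedicated worker...
		}
		// ...and cover the remainder with a fractional worker.
		fractional = (totalUtilizationGoal - float64(dedicated)) / float64(procs)
	}
	return
}

func main() {
	// GOMAXPROCS=8 rounds cleanly to 2 dedicated workers; GOMAXPROCS=6
	// rounds 1.5 up to 2 (33% error), so it runs 1 dedicated worker plus
	// a fractional worker; GOMAXPROCS=1 gets only a 25% fractional worker.
	for _, procs := range []int{1, 4, 6, 8} {
		d, f := markWorkerGoals(procs)
		fmt.Printf("GOMAXPROCS=%d: dedicated=%d fractional=%.4f\n", procs, d, f)
	}
}
```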
When GC Cycle Starts
A GC cycle starts when heap usage exceeds the trigger point. But who notices that? Is there a background checker?
Not really; the trigger point is checked on the allocation path:
- small objects (<=32KB): checked when refilling a span from mcentral.
- large objects (>32KB): checked on every allocation.
- arena: checked on every new arena.
Besides, a background check runs every 2 minutes, which means the minimum GC frequency is once per 2 minutes.
Note: growth of the stack and global sizes won't trigger a new GC cycle; only the heap does.
GC Assist Ratio: the Adaptive Algorithm
The trigger point is calculated statically, but the goal of GC is to keep heap usage under the target heap, and the allocation speed may differ from previous cycles. The adaptive calculation happens inside the revise function.
To keep heap usage in check, the runtime GC uses an adaptive algorithm to ensure the allocation and mark speeds stay matched. This is called assist: allocations need to assist GC marking when allocation is too fast.
assistRatio is updated on the allocation path when gcBlackenEnabled != 0. The update condition is the same as the GC-cycle start check:
- small objects: every refill
- big objects: every allocation
As we know, GC works in the background with 25% CPU utilization, leaving the other 75% to regular work. Assist means allocation has to pay extra CPU in some scenarios.
The assistRatio is calculated from the remaining scan work and remaining heap size, yielding assistWorkPerByte and assistBytesPerWork.
Logically \(\text{assistWorkPerByte} \times \text{assistBytesPerWork} = 1\), but since the two are updated independently this does not always hold exactly. GC maintains both variables to avoid frequent divisions during small-object allocations: revise updates the assist ratio not on every small-object allocation, but on refill.
Then the assist ratio is calculated as:

\[
\text{assistWorkPerByte} = \frac{\text{scanWorkExpected} - \text{scanWorkDone}}{\text{heapGoal} - \text{heapLive}}, \qquad
\text{assistBytesPerWork} = \frac{\text{heapGoal} - \text{heapLive}}{\text{scanWorkExpected} - \text{scanWorkDone}}
\]
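The core of revise can be sketched as follows. This is a simplified, hypothetical model (`assistRatios` is my name), not the runtime's exact code:

```go
package main

import "fmt"

// assistRatios sketches the heart of revise(): divide the remaining
// expected scan work by the remaining heap runway. Both reciprocal
// forms are kept so the hot allocation path never divides.
func assistRatios(scanWorkExpected, scanWorkDone, heapGoal, heapLive int64) (workPerByte, bytesPerWork float64) {
	scanWorkRemaining := scanWorkExpected - scanWorkDone
	if scanWorkRemaining < 1000 {
		scanWorkRemaining = 1000 // avoid a zero/negative denominator near the end
	}
	heapRemaining := heapGoal - heapLive
	if heapRemaining <= 0 {
		heapRemaining = 1 // already past the goal: maximum assist pressure
	}
	workPerByte = float64(scanWorkRemaining) / float64(heapRemaining)
	bytesPerWork = float64(heapRemaining) / float64(scanWorkRemaining)
	return
}

func main() {
	// 80MB of scan work left but only 40MB of runway left: every
	// allocated byte must pay for 2 bytes of scan work.
	w, b := assistRatios(100<<20, 20<<20, 200<<20, 160<<20)
	fmt.Println(w, b) // 2 0.5
}
```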
GC Assist Ratio: the Cost of Prediction Failure
The GC assist ratio relies on scanWorkExpected to predict the future, as it assumes the future state will be steady.
If everything behaves as before, it works fine. But what if the prediction fails? GC has a dedicated mechanism for that.
When work > scanWorkExpected is true, the prediction is clearly wrong: the scan work done in the current cycle already exceeds the expectation.
Without adjustment, GC would wrongly believe the mutator needs no assist (since assistRatio ≈ 0), and allocation would proceed unassisted. This indeed makes allocation faster, but what is the cost, and is the benefit worth it?
No. Without assist it becomes a disaster for the mutator, and the runaway allocation speed is already bad news.
Once GC is triggered, the first STW enables the write barrier, and every pointer write afterwards is recorded in the write barrier's wbBuf. Without any assist, a huge number of objects pile up in wbBuf and must be flushed during mark termination, which requires STW. As a result, the STW takes much longer than before.
Besides the STW, allocating without assisting makes the heap grow much faster than before, possibly overshooting the target heap by far (\(2\times\), \(3\times\)).
GC therefore extrapolates the heap goal with the formula below:
Consequently, the assist ratio stays unchanged (TODO: add a new section about how the slope is maintained, see https://github.com/golang/proposal/blob/master/design/44167-gc-pacer-redesign.md).
func (c *gcControllerState) revise() {
// ...
// The expected scan work is computed as the amount of bytes scanned last
// GC cycle (both heap and stack), plus our estimate of globals work for this cycle.
scanWorkExpected := int64(c.lastHeapScan + c.lastStackScan.Load() + c.globalsScan.Load())
// ...
maxStackScan := c.maxStackScan.Load()
maxScanWork := int64(scan + maxStackScan + c.globalsScan.Load())
if work > scanWorkExpected {
// We've already done more scan work than expected. Because our expectation
// is based on a steady-state scannable heap size, we assume this means our
// heap is growing. Compute a new heap goal that takes our existing runway
// computed for scanWorkExpected and extrapolates it to maxScanWork, the worst-case
// scan work. This keeps our assist ratio stable if the heap continues to grow.
//
// The effect of this mechanism is that assists stay flat in the face of heap
// growths. It's OK to use more memory this cycle to scan all the live heap,
// because the next GC cycle is inevitably going to use *at least* that much
// memory anyway.
extHeapGoal := int64(
float64(heapGoal-int64(c.triggered)) /
float64(scanWorkExpected)*
float64(maxScanWork)
) + int64(c.triggered)
scanWorkExpected = maxScanWork
// hardGoal is a hard limit on the amount that we're willing to push back the
// heap goal, and that's twice the heap goal (i.e. if GOGC=100 and the heap and/or
// stacks and/or globals grow to twice their size, this limits the current GC cycle's
// growth to 4x the original live heap's size).
//
// This maintains the invariant that we use no more memory than the next GC cycle
// will anyway.
hardGoal := int64((1.0 + float64(gcPercent)/100.0) * float64(heapGoal))
if extHeapGoal > hardGoal {
extHeapGoal = hardGoal
}
heapGoal = extHeapGoal
}
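The extrapolation above can be replayed with concrete numbers. This hypothetical helper (`extrapolateGoal` is my name) repeats the snippet's arithmetic: stretch the runway computed for scanWorkExpected to maxScanWork, then cap at the hard goal:

```go
package main

import "fmt"

// extrapolateGoal stretches the runway (heapGoal - triggered) by the
// ratio maxScanWork/scanWorkExpected, then caps the result at the
// hard goal, goal * (1 + GOGC/100). All sizes are in bytes.
func extrapolateGoal(heapGoal, triggered, scanWorkExpected, maxScanWork, gcPercent int64) int64 {
	ext := int64(float64(heapGoal-triggered)/float64(scanWorkExpected)*float64(maxScanWork)) + triggered
	hard := int64((1.0 + float64(gcPercent)/100.0) * float64(heapGoal))
	if ext > hard {
		ext = hard
	}
	return ext
}

func main() {
	// Goal 200MB, triggered at 185MB, expected 100MB of scan work but the
	// worst case is 150MB: the 15MB runway stretches 1.5x to 22.5MB,
	// so the goal moves to 207.5MB (well under the 400MB hard cap).
	fmt.Println(extrapolateGoal(200<<20, 185<<20, 100<<20, 150<<20, 100) >> 20) // 207 (MB, truncated)
}
```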
GC Assist
Assist happens in mallocgc according to assistRatio.
There is a credit system checked before allocation. The credit unit is bytes: before allocating, the credit is deducted by the desired allocation size; if the balance is insufficient, the goroutine blocks there and assists until finished.
The credit is tracked per G rather than per P, so big allocations by one G do not affect another G, even when they run on the same P.
// malloc.go:1343
assistG.gcAssistBytes -= int64(size)
if assistG.gcAssistBytes < 0 {
gcAssistAlloc(assistG)
}
gcAssistAlloc goes through several branches:
- Try steal -> if a G can steal some credit from the background singleton bgScanCredit (defined in gcControllerState), it can skip assisting and return; otherwise:
- Try assist scan -> spend CPU to assist scanning; otherwise:
- Block -> the G is pushed to the assist queue until rescheduled, then the procedure repeats.
GC enforces a minimum over-assist of \(64\text{KB}\). For example, if the credit is \(1\) byte but the allocation is \(2\) bytes, then to allocate those \(2\) bytes the goroutine must assist \(64\text{KB}\) worth of work; this prevents frequent tiny assists.
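The debt-to-work conversion with the over-assist floor can be sketched as below. This is a simplified, hypothetical model (`assistScanWork` is my name) of the computation at the top of gcAssistAlloc:

```go
package main

import "fmt"

const gcOverAssistWork = 64 << 10 // minimum assist, in scan-work units (bytes)

// assistScanWork converts a G's byte debt into scan work using the
// current assist ratio, then applies the 64KB over-assist floor so
// tiny allocations don't assist over and over again.
func assistScanWork(gcAssistBytes int64, assistWorkPerByte float64) int64 {
	debtBytes := -gcAssistBytes // bytes allocated beyond the credit
	scanWork := int64(assistWorkPerByte * float64(debtBytes))
	if scanWork < gcOverAssistWork {
		scanWork = gcOverAssistWork
	}
	return scanWork
}

func main() {
	// 2 bytes over budget at ratio 2.0: nominally 4 units of scan work,
	// but the over-assist floor raises it to 64KB.
	fmt.Println(assistScanWork(-2, 2.0)) // 65536
}
```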
The assist details will be explained in a later passage about Mark, as they relate closely to how assist marking works. GC assist is one of the possible sources of allocation latency.