GC Mark¶
The mark phase is the heart of Go's garbage collector—tracing the object graph to identify all live objects. Go's implementation is sophisticated, using concurrent marking, write barriers, and carefully synchronized stop-the-world (STW) pauses to minimize application disruption.
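This article explores the GC Mark implementation in Go 1.23.12. The marking code itself (gcDrain, scanobject, root scanning) lives in src/runtime/mgcmark.go, while the orchestration functions gcStart and gcMarkDone live in src/runtime/mgc.go.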
The Two-Phase Mark Design¶
Go's mark phase employs a hybrid approach:
- Concurrent Marking: Most work happens while the application goroutines (mutators) keep running
- Brief STW Pauses: Two carefully scoped STW periods for setup and termination
This design achieves the best of both worlds: correctness guarantees from STW periods, and low latency from concurrent execution.
STW Before GC Scan: gcStart¶
The mark phase entry point is gcStart, which orchestrates the first STW pause:
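    // Simplified from runtime.gcStart; locking, tracing, and stats are omitted.
    func gcStart(trigger gcTrigger) {
        // 0. Make sure the background mark workers exist (they start parked).
        //    This happens before the world is stopped.
        gcBgMarkStartWorkers()

        // 1. Stop the world
        stw := stopTheWorldWithSema(stwGCSweepTerm)

        // 2. Ensure the previous sweep completed
        finishsweep_m() // If sweeping is pending, the STW extends until it is done

        // 3. Clear sync.Pool caches
        clearpools()

        // 4. Check mode. mode is gcBackgroundMode unless GODEBUG=gcstoptheworld
        //    is set; runtime.GC() still marks concurrently.
        if mode != gcBackgroundMode {
            schedEnableUser(false) // keep user goroutines off the Ps while marking
        }

        // 5. Enter the mark phase: enable write barriers, then assists and workers
        setGCPhase(_GCmark)
        atomic.Store(&gcBlackenEnabled, 1)

        // 6. Start the world
        startTheWorldWithSema(0, stw)
    }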
Key operations during STW:
- finishsweep_m(): Ensures the previous sweep cycle is complete
  - Implication: If sweeping is behind, this STW pause extends until it catches up
  - Why?: Marking would overwrite the mark bits that sweeping still needs to read
- clearpools(): Flushes sync.Pool caches
  - Why?: Pooled objects would otherwise be kept alive indefinitely through the pool itself
  - Effect: Each pool's older (victim) cache is dropped and its current cache is demoted to take its place, so an unused pooled object survives at most two cycles
- Background worker preparation: Mark workers are launched ahead of time and parked, ready to run when the world resumes
GC Worker: Concurrent Marking Engine¶
Once the world restarts, background mark workers begin scanning. The runtime keeps up to one worker goroutine per logical processor (P). Idle workers wait in a shared pool, and the scheduler picks one whenever a P should spend time on GC work.
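    // Simplified from the runtime; the startup handshake and bookkeeping are omitted.
    func gcBgMarkStartWorkers() {
        // Ensure there are as many workers as Ps
        for gcBgMarkWorkerCount < gomaxprocs {
            go gcBgMarkWorker(...)
            gcBgMarkWorkerCount++
        }
    }

    func gcBgMarkWorker(...) {
        gp := getg()
        node := new(gcBgMarkWorkerNode)
        for {
            // Park until the scheduler hands this worker a P
            // (see gcController.findRunnableGCWorker)
            gopark(...)

            // Pin the G to its M: disables preemption so the P-local gcWork can be used
            node.m.set(acquirem())
            pp := gp.m.p.ptr()

            // Perform the work on the system stack
            systemstack(func() {
                // Mark this G as waiting so its own stack can be scanned
                casGToWaitingForSuspendG(gp, _Grunning, waitReasonGCWorkerActive)
                switch pp.gcMarkWorkerMode {
                case gcMarkWorkerDedicatedMode:
                    gcDrainMarkWorkerDedicated(&pp.gcw, true)
                case gcMarkWorkerFractionalMode:
                    gcDrainMarkWorkerFractional(&pp.gcw)
                case gcMarkWorkerIdleMode:
                    gcDrainMarkWorkerIdle(&pp.gcw)
                }
                casgstatus(gp, _Gwaiting, _Grunning)
            })
        }
    }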
Worker characteristics:
- Parked when idle: Workers block in gopark(), consuming zero CPU
- Scheduled on demand: The scheduler wakes workers via findRunnableGCWorker()
- Preemption-resistant: Workers run on the system stack to prevent interruption
- Self-describing: Workers mark themselves as active so their stacks can be scanned
Three Worker Modes¶
Workers operate in one of three modes, selected based on GC progress and CPU utilization:
| Mode | Function | Behavior |
|---|---|---|
| Dedicated | gcDrainMarkWorkerDedicated | 100% CPU commitment to marking until work exhausted |
| Fractional | gcDrainMarkWorkerFractional | Runs part-time to cover the remainder when the background CPU goal is not a whole number of Ps |
| Idle | gcDrainMarkWorkerIdle | Mark only when P has no other runnable goroutines |
The Pacer dynamically adjusts which workers run in which mode to balance:
- GC completion: Ensure enough marking happens before heap hits target
- CPU utilization: Reserve headroom for mutators (application code)
- Latency: Avoid oversubscribing CPUs and causing scheduling delays
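As a rough illustration of how a CPU budget turns into worker modes, the sketch below splits a background marking goal of 25% of GOMAXPROCS (the runtime's background utilization target) into whole dedicated workers plus a fractional remainder. The helper name splitWorkers and the plain floor rounding are assumptions of this sketch; the real pacer applies a more careful rounding rule.

```go
package main

import "fmt"

// backgroundUtilization is the share of total CPU reserved for
// background marking in this sketch.
const backgroundUtilization = 0.25

// splitWorkers is a hypothetical helper: it returns how many Ps would run
// dedicated workers and how much of one P is left for a fractional worker.
// Rounding is simplified (plain floor) compared with the real pacer.
func splitWorkers(procs int) (dedicated int, fractional float64) {
	goal := float64(procs) * backgroundUtilization
	dedicated = int(goal)
	fractional = goal - float64(dedicated)
	return dedicated, fractional
}

func main() {
	for _, procs := range []int{1, 2, 4, 6, 8} {
		d, f := splitWorkers(procs)
		fmt.Printf("GOMAXPROCS=%d dedicated=%d fractional=%.2f\n", procs, d, f)
	}
}
```

With GOMAXPROCS=6 the goal is 1.5 Ps, so this sketch assigns one dedicated worker and leaves half a P for a fractional worker. Idle workers are not part of the budget; they only use Ps that would otherwise sit unused.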
gcDrain: The Core Marking Loop¶
All worker modes eventually call into gcDrain, which implements the tri-color marking algorithm:
Color Theory¶
- White: Not yet reached by the marker; presumed dead if still white when marking ends
- Grey: Reachable, but referenced objects not yet scanned (in work queue)
- Black: Reachable and all references scanned (removed from queue)
Critical insight: An object is "live" if it's black or grey. White objects are presumed dead unless later discovered to be reachable.
The algorithm terminates when the grey queue is empty—all reachable objects are black, all unreachable objects remain white.
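The algorithm can be sketched on a toy object graph. This is a single-threaded illustration only: the object type, the slice-based queue, and the mark function are assumptions of the sketch and bear no relation to the runtime's real data structures.

```go
package main

import "fmt"

type color int

const (
	white color = iota // not yet reached
	grey               // reached, still queued for scanning
	black              // reached and fully scanned
)

// object is a toy heap object: a name plus its outgoing pointers.
type object struct {
	name string
	refs []*object
	c    color
}

// mark runs tri-color marking from the given roots and returns the
// number of objects that ended up black.
func mark(roots []*object) int {
	var queue []*object
	shade := func(o *object) { // white -> grey
		if o != nil && o.c == white {
			o.c = grey
			queue = append(queue, o)
		}
	}
	for _, r := range roots {
		shade(r)
	}
	blackened := 0
	for len(queue) > 0 {
		o := queue[len(queue)-1]
		queue = queue[:len(queue)-1]
		for _, ref := range o.refs {
			shade(ref)
		}
		o.c = black // grey -> black once every pointer has been followed
		blackened++
	}
	return blackened
}

func main() {
	c := &object{name: "c"}
	b := &object{name: "b", refs: []*object{c}}
	a := &object{name: "a", refs: []*object{b}}
	orphan := &object{name: "orphan", refs: []*object{c}}

	fmt.Println("blackened:", mark([]*object{a}))
	for _, o := range []*object{a, b, c, orphan} {
		fmt.Printf("%s live=%v\n", o.name, o.c == black)
	}
}
```

The orphan object points at a live object but is itself unreachable from the root, so it stays white: reachability flows from the roots, not from what an object refers to.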
Mark Roots: gcDrain Initialization¶
Marking begins at GC roots—objects always known to be live:
- Global variables: Objects in BSS/data segments
- Stack variables: Objects referenced by goroutine stacks
- Finalizers: Objects with cleanup callbacks (must be kept alive)
- Tiny allocs: Pre-marked small objects (optimization)
Root scanning mechanics:
- Basic unit: Scan in fixed-size "blocks"
- Ptr masks: Compiler generates bitmaps indicating which words are pointers
- Grey object: Each discovered object is added to the worker's queue via gcw.put()
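The role of a pointer mask can be shown with a toy block scanner. It assumes one mask bit per word, with bit i of the mask describing word i; scanBlock and the sample values are hypothetical, and the real scanner additionally checks that each candidate actually points into the heap before greying it.

```go
package main

import "fmt"

// scanBlock is a hypothetical sketch of mask-driven scanning. It returns
// the non-nil values of every word whose mask bit is set; words with a
// clear bit are treated as plain data and skipped, whatever they contain.
func scanBlock(words []uintptr, ptrmask []byte) []uintptr {
	var found []uintptr
	for i, w := range words {
		bit := (ptrmask[i/8] >> (uint(i) % 8)) & 1
		if bit == 1 && w != 0 {
			found = append(found, w)
		}
	}
	return found
}

func main() {
	// Word 1 holds an integer that merely looks like data; word 2 is a
	// nil pointer. Only words 0 and 3 yield candidates.
	words := []uintptr{0x1000, 42, 0, 0x2000}
	mask := []byte{0b1101}
	for _, p := range scanBlock(words, mask) {
		fmt.Printf("candidate pointer: %#x\n", p)
	}
}
```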
Drain Heap: Processing the Grey Queue¶
After roots are marked, workers loop over the grey queue until it is empty or the worker is asked to yield:
    // Simplified from runtime.gcDrain; root marking, work balancing, and
    // assist-credit accounting are omitted.
    func gcDrain(gcw *gcWork, flags gcDrainFlags) {
        for !shouldYield() { // placeholder for the real preemption checks
            // Fast path: take an object from the local buffer
            b := gcw.tryGetFast()
            if b == 0 {
                // Slow path: refill from the global queue
                b = gcw.tryGet()
                if b == 0 {
                    // Flush this P's write barrier buffer; it may hold new grey objects
                    wbBufFlush()
                    b = gcw.tryGet()
                }
            }
            if b == 0 {
                // Still empty: no work left for this worker
                break
            }
            // Scan the object: find each pointer inside it and grey its
            // target (scanobject calls greyobject for every pointer found)
            scanobject(b, gcw)
        }
    }
Process flow:
- Pop object from queue (grey → black transition occurs by removal)
- Scan object: Find all pointer fields within
- Enqueue references: Add each referenced object to queue (white → grey)
Key difference from root scanning:
- Roots: Scan blocks (compiler-generated ptr masks)
- Heap: Scan individual objects (type metadata determines pointer offsets)
Large Object Handling¶
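Large objects receive special treatment so that one huge object cannot monopolize a worker (the runtime scans at most maxObletBytes, 128 KiB, of an object in one step):
- Split scanning: The object is scanned in chunks ("oblets") rather than monolithically
- Continuation tracking: The remaining chunks are queued as separate work items, so other workers can pick them up
- Goal: Bound how long a single scan step runs, which keeps workers preemptible and spreads the work across Ps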
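Chunked scanning boils down to cutting one object into bounded byte ranges that can be queued independently. The sketch below shows only that arithmetic; splitObject, the chunk type, and the chunk size passed in main are assumptions of the example, and maxChunk must be greater than zero.

```go
package main

import "fmt"

// chunk is a half-open byte range [start, end) inside one object.
type chunk struct{ start, end uintptr }

// splitObject is a hypothetical helper that cuts an object of the given
// size into ranges of at most maxChunk bytes each.
func splitObject(size, maxChunk uintptr) []chunk {
	var chunks []chunk
	for start := uintptr(0); start < size; start += maxChunk {
		end := start + maxChunk
		if end > size {
			end = size // the last chunk may be shorter
		}
		chunks = append(chunks, chunk{start, end})
	}
	return chunks
}

func main() {
	const maxChunk = 128 << 10 // chunk size chosen for this sketch
	for _, c := range splitObject(300<<10, maxChunk) {
		fmt.Printf("scan bytes [%d, %d)\n", c.start, c.end)
	}
}
```

Each range can then be treated as its own unit of grey work, so a 300 KiB object becomes three short scan steps instead of one long one.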
Mark Termination: The Second STW¶
When all workers report their queues empty, one worker executes gcMarkDone to coordinate termination. This process is iterative because of concurrency:
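    // Simplified from runtime.gcMarkDone; semaphores and stats are omitted.
    func gcMarkDone() {
    top:
        // 1. Re-check the transition condition
        if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
            return // Work remains: marking continues
        }

        // 2. Flush every P's local buffers and note whether any produced work
        gcMarkDoneFlushed = 0
        forEachP(waitReasonGCMarkTermination, func(pp *p) {
            wbBufFlush1(pp)  // write barrier buffer
            pp.gcw.dispose() // local grey objects move to the global queue
            if pp.gcw.flushedWork {
                atomic.Xadd(&gcMarkDoneFlushed, 1)
                pp.gcw.flushedWork = false
            }
        })
        if gcMarkDoneFlushed != 0 {
            goto top // New grey objects appeared: check again
        }

        // 3. Start STW
        stw := stopTheWorldWithSema(stwGCMarkTerm)

        // 4. Re-check (a write barrier may have run between the flush and the STW)
        for _, p := range allp {
            wbBufFlush1(p)
            if !p.gcw.empty() {
                // Work appeared: restart the world and continue
                startTheWorldWithSema(0, stw)
                goto top
            }
        }

        // 5. Final termination
        gcMarkTermination(stw)
    }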
Why the loop? Between flushing P buffers and stopping the world, mutators might have:
- Created new objects
- Modified pointers
- Triggered write barriers
The STW creates a quiescent state where no further changes can occur, allowing final verification.
gcMarkTermination Cleanup¶
Once termination is confirmed, the second STW performs final bookkeeping:
- Update stack sizes: Calculate new stack watermark for GC pacer
- Disable assist: Turn off gcBlackenEnabled since no more marking occurs
- Wake waiters: gcWakeAllAssists() and gcWakeAllStrongFromWeak()
- Notify pacer: gcController.endCycle() reports metrics for next cycle's planning
- Advance generation: Increment sweepgen by 2, signaling sweep phase can begin
Why Two STW Pauses?¶
The dual-STW design is fundamental to Go's concurrent marking:
| STW | Purpose | What Would Break Without It? |
|---|---|---|
| Mark Start | Clean state, enable barriers | Previous sweep incomplete, new allocations untracked |
| Mark End | Flush barriers, finalize marking | Pointer writes racing with the final "no work left" check would be missed |
Why not eliminate STW entirely?
The main alternative is a fully stop-the-world GC, which scans and marks everything while the application is paused. This is worse because:
- Total pause time is much longer (proportional to the whole live heap, rather than two brief pauses)
- The application is completely unavailable, not just slightly slowed
Advanced Topics¶
Write Barrier Interaction¶
The hybrid write barrier enables concurrent marking by tracking pointer modifications:
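- Dijkstra insertion barrier: On *ptr = new, the newly installed pointer new is shaded
- Yuasa deletion barrier: On *ptr = new, the overwritten pointer *ptr (old) is shaded
- Hybrid barrier: On *ptr = new, the old value is shaded, and new is shaded as well while the writing goroutine's stack has not yet been scanned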
Go's hybrid barrier combines both, providing strong guarantees with minimal overhead.
What happens without write barriers?
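One critical failure, plus a milder effect that is often confused with it:
- Use-after-free: A live object is never marked and gets collected → runtime crash or memory corruption. This is what write barriers prevent
- Floating garbage: An object that dies during marking stays marked → memory not reclaimed until next cycle. This is a cost of conservative barriers, not a correctness problem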
Mutator Assist¶
When allocation outpaces background marking, the Pacer forces allocators to perform marking work:
- Trigger: Heap growing faster than scanning progress
- Cost: Each allocation performs proportional marking work before returning
- Goal: Ensure marking finishes before the heap grows past its target
This ensures backpressure—allocation slowdowns prevent runaway heap growth.
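The proportionality behind assists can be sketched with simple arithmetic: the scan work still outstanding is divided by the bytes the heap may still grow, giving a "work per allocated byte" ratio that prices each allocation. The function names and the clamp used once the heap is past its goal are assumptions of this sketch, loosely modeled on the pacer rather than copied from it.

```go
package main

import "fmt"

// assistRatio is a hypothetical helper: the units of scan work an
// allocating goroutine owes per allocated byte, so that marking would
// complete just as the heap reaches its goal.
func assistRatio(scanWorkRemaining, heapLive, heapGoal int64) float64 {
	heapRemaining := heapGoal - heapLive
	if heapRemaining <= 0 {
		heapRemaining = 1 // past the goal: make the ratio very steep
	}
	return float64(scanWorkRemaining) / float64(heapRemaining)
}

// assistDebt converts an allocation size into owed scan work.
func assistDebt(allocBytes int64, ratio float64) int64 {
	return int64(ratio * float64(allocBytes))
}

func main() {
	const MiB = 1 << 20
	ratio := assistRatio(64*MiB, 96*MiB, 128*MiB)
	fmt.Printf("ratio=%.2f, debt for a 1 KiB allocation=%d units\n",
		ratio, assistDebt(1024, ratio))
}
```

With 64 MiB of scan work left and 32 MiB of headroom, every allocated byte costs two units of scan work. As the headroom shrinks the ratio climbs, which is the backpressure described above.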
Q&A: Common Mark Phase Questions¶
Q1: STW Boundary and Cost¶
"You mentioned gcStart begins with STW (Sweep Term), and gcMarkDone ends with STW (Mark Term). Since Go claims 'concurrent GC', why can't we eliminate these STW pauses? If my service logs show a 50ms STW, what's likely causing it in these two phases?"
Answer: The two STW pauses serve different purposes:
- Mark Start STW: Ensure clean state for marking
  - Extended duration = previous sweep incomplete
  - Cause: Allocation surge outpaced sweep workers
  - Fix: Investigate allocation patterns, consider GOGC tuning
- Mark Termination STW: Flush write barrier buffers and finalize
  - Extended duration = excessive pointer modifications during mark
  - Cause: Write-heavy application (frequent pointer assignment)
- Metrics to check: QPS, allocation rate, GOMAXPROCS, OS CPU throttling
Q2: Mutator Assist Triggers¶
"I see atomic.Store(&gcBlackenEnabled, 1) enables assist marking. My Go service normally uses 10% CPU, but during traffic spikes it hits 100% CPU and latency jumps from 20ms to 200ms. Trace shows many goroutines in gcAssistAlloc. How does GC decide a goroutine has 'debt' to repay? How is this debt calculated? What happens if this mechanism is disabled?"
Answer: (Detailed answer to be provided based on GC Pacer mechanics)
Q3: Write Barrier Essence¶
"You mentioned wbBufFlush processes write barrier buffers. Go uses Hybrid Write Barrier. Why do we need write barriers? If I disable them during Mark for performance, what specifically goes wrong? Provide a concrete pointer assignment example showing how objects get 'lost'."
Answer:
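Without write barriers, the collector can lose a live object. For contrast, the barriers themselves introduce a second, harmless effect:
- Critical (no barrier): Live object collected → crash or memory corruption
- Mild (with barrier): Dead object kept alive → floating garbage (reclaimed by the next GC)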
Concrete example:
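    // During mark: A is black (already scanned), B is grey (queued),
    // C is white and reachable only through B.ptr
    // Concurrently, without a write barrier:
    A.ptr = C   // black A now points to white C; nothing shades C
    B.ptr = nil // the only path the collector would have followed is gone
    // The collector scans B, finds no pointer to C, and never revisits A
    // C stays white and is swept while still reachable via A
    // A later dereference of A.ptr reads freed memory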
Write barriers force C to be shaded (greyed) when it is assigned to A.ptr, so the collector scans it before marking ends.
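The race can be replayed deterministically in a toy simulation: the collector scans A, the mutator then moves the only reference to C from B to A, and marking finishes. The node type, the run function, and the fixed interleaving are all assumptions of this sketch; a real collector runs concurrently rather than in scripted steps.

```go
package main

import "fmt"

// node is a toy object with a single pointer field.
type node struct {
	name   string
	ptr    *node
	marked bool // true once the object is grey or black
}

// run replays one interleaving of collector and mutator and reports
// whether C, which stays reachable throughout, ends up marked.
func run(barrier bool) bool {
	c := &node{name: "C"}
	a := &node{name: "A"}
	b := &node{name: "B", ptr: c}

	var queue []*node
	shade := func(n *node) {
		if n != nil && !n.marked {
			n.marked = true
			queue = append(queue, n)
		}
	}
	write := func(slot **node, val *node) {
		if barrier { // hybrid style: shade old and new values
			shade(*slot)
			shade(val)
		}
		*slot = val
	}
	scanNext := func() { // blacken the object at the head of the queue
		n := queue[0]
		queue = queue[1:]
		shade(n.ptr)
	}

	shade(a) // roots
	shade(b)
	scanNext() // collector scans A; A.ptr is still nil

	write(&a.ptr, c)   // mutator: A.ptr = C
	write(&b.ptr, nil) // mutator: B.ptr = nil

	for len(queue) > 0 {
		scanNext()
	}
	return c.marked
}

func main() {
	fmt.Println("C marked without barrier:", run(false))
	fmt.Println("C marked with barrier:   ", run(true))
}
```

Without the barrier C is never marked even though A still points to it, which is the lost object. With the barrier, the write to A.ptr shades C before the path through B disappears.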
Q4: Scan Density and Performance¶
"You analyzed gcDrain and scanobject. Compare two services:
- Service A: 10GB cache, map[int]*User (many pointers)
- Service B: 10GB cache, map[int][]byte (raw data)
Which has longer GC Mark? Why? What's the implication for writing high-performance Go code?"
Answer: (To be completed)
Q5: Termination Detection Limits¶
"You described gcMarkDone: flush all P's, if work remains goto top. Extreme scenario: P1 flushes and sees empty, P2 still running and creates object in buffer via write barrier. Does P1 think GC is done? How does Go ensure distributed consensus on 'no work left'?"
Answer: (To be completed)
Summary¶
Go's GC Mark phase demonstrates sophisticated concurrent garbage collection:
- Dual-STW design: Minimal pauses without sacrificing correctness
- Per-P workers: Parallel marking with adaptive CPU utilization
- Hybrid write barrier: Enables concurrent marking with low overhead
- Mutator assist: Enforces backpressure when allocation outpaces marking
Understanding mark is essential for diagnosing GC-related latency issues and tuning Go applications for high throughput and low pause times.