# 5. LLM Boundary

## Overview
In the previous post, Tasks and Turns, we explored how Codex processes user input, manages internal states, and ultimately constructs a comprehensive Prompt object containing conversation history, tools, and skills.
This post will cross the boundary between the local client and the external LLM server. We will step into the run_sampling_request function to see how the assembled Prompt is serialized into a network request and, more importantly, how Codex manages the streaming responses from the LLM via a powerful dual-track processing architecture (Data Flow and Control Flow).
## Boundary to LLM: Constructing the Request
The execution continues in codex-rs/core/src/codex.rs:L5366, where run_sampling_request is invoked inside run_turn. This function handles sending the request out and processing the incoming response.
Note: sampling is a server-side term: an LLM generates its output by repeatedly sampling the next token from a probability distribution. Here, sampling simply refers to the process by which the LLM generates and streams a response back to the Codex client.
### The Base Instruction
While the Prompt structure contains the user's history and available tools, it also requires foundational context to tell the LLM who it is and what it can do. This is handled by the BaseInstructions.
Codex ships a fixed default base instruction that gives the LLM this context about its environment, which it cannot know from its training alone. The authors strongly discourage modifying these instructions, as doing so can degrade model performance.
The base instruction is defined in prompts/base_instructions/default.md:
```markdown
You are a coding agent running in the Codex CLI, a terminal-based coding assistant. Codex CLI is an open source project led by OpenAI. You are expected to be precise, safe, and helpful.

Your capabilities:

- Receive user prompts and other context provided by the harness, such as files in the workspace.
- Communicate with the user by streaming thinking & responses, and by making & updating plans.
- Emit function calls to run terminal commands and apply patches. Depending on how this specific run is configured, you can request that these function calls be escalated to the user for approval before running. More on this in the "Sandbox and approvals" section.

Within this context, Codex refers to the open-source agentic coding interface (not the old Codex language model built by OpenAI).

...
```
### The Serialized JSON Request
After combining the Prompt (history, tools, base instructions, etc.), Codex converts it into a request payload compatible with the OpenAI /responses API.
What the LLM server actually receives over the network is a JSON payload similar to this (simplified):
```json
{
  "model": "gpt-5.1",
  "instructions": "You are a coding agent running in the Codex CLI...",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "list files in the current directory"
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "name": "shell",
      "description": "Runs a shell command...",
      "strict": false,
      "parameters": {
        "type": "object",
        "properties": {
          "command": { "type": "string" },
          "workdir": { "type": "string" }
        }
      }
    }
  ],
  "tool_choice": "auto",
  "parallel_tool_calls": false,
  "reasoning": { "effort": "medium", "summary": "concise" },
  "stream": true,
  "store": false,
  "include": ["reasoning.encrypted_content"]
}
```
Notice the `"stream": true` flag. This tells the server to return events incrementally, enabling real-time feedback.
## Response Process: The Dual-Track Architecture
Codex sends the request and receives a ResponseStream. A raw SSE (Server-Sent Events) stream is messy to handle directly, so Codex normalizes it into a ResponseEvent enum that hides the wire-level details and simplifies downstream event consumption.
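As a rough sketch, the abstraction looks something like this (the variant names and payload types below are simplified illustrations, not copied verbatim from codex-rs):

```rust
// Simplified sketch of the kind of enum Codex uses to normalize SSE events.
// Variant names and payloads here are illustrative.
#[derive(Debug)]
enum ResponseEvent {
    OutputItemAdded(String),
    OutputTextDelta(String),
    ReasoningContentDelta(String),
    OutputItemDone(String),
    Completed { response_id: String },
}

// Downstream consumers match on typed variants instead of raw SSE lines.
fn describe(event: &ResponseEvent) -> &'static str {
    match event {
        ResponseEvent::OutputItemAdded(_) => "new item started",
        ResponseEvent::OutputTextDelta(_) => "assistant text delta",
        ResponseEvent::ReasoningContentDelta(_) => "thinking delta",
        ResponseEvent::OutputItemDone(_) => "item finished",
        ResponseEvent::Completed { .. } => "stream completed",
    }
}

fn main() {
    let event = ResponseEvent::OutputTextDelta("Hello".into());
    println!("{}", describe(&event)); // prints "assistant text delta"
}
```

The benefit is that every consumer downstream of the stream parser deals with one well-typed enum rather than with raw `data:` lines.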
### The Streaming Loop
The stream is consumed by a loop (in codex-rs/core/src/codex.rs:L6585-6842) that processes ResponseEvent items one by one.
This loop is the beating heart of Codex's interactivity. Within it, processing happens simultaneously across two distinct dimensions: Data Flow and Control Flow.
### Track 1: Data Flow (Fire-and-Forget)
Data flow handles the real-time user experience. As the LLM generates tokens (for text, thinking, or tool calls), Codex dispatches these deltas to external consumers (like the terminal UI) instantly.
All data leaves the loop through the Session object (sess), which acts as the central hub:
```rust
let outcome: CodexResult<SamplingRequestResult> = loop {
    let event = match stream.next().await { ... };

    match event {
        // UI: Start rendering a new turn item
        ResponseEvent::OutputItemAdded(item) => {
            sess.emit_turn_item_started(...).await;
        }
        // UI: Stream the assistant's message text
        ResponseEvent::OutputTextDelta(delta) => {
            sess.send_event(..., EventMsg::AgentMessageContentDelta(...)).await;
        }
        // UI: Stream the assistant's "thinking" process
        ResponseEvent::ReasoningContentDelta { .. } => {
            sess.send_event(..., EventMsg::ReasoningRawContentDelta(...)).await;
        }
        // ...
    }
};
```
This async dispatching ensures the CLI interface feels alive and responsive, even while a long request is ongoing.
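The fire-and-forget pattern can be illustrated with a plain standard-library channel standing in for Codex's event bus (a simplification: the real code dispatches asynchronously through the Session object, and the function name here is hypothetical):

```rust
use std::sync::mpsc;
use std::thread;

// A std channel stands in for the Session's event bus: the stream loop
// sends each delta and immediately moves on, while a separate "UI"
// thread renders deltas as they arrive.
fn stream_to_ui(deltas: &[&str]) -> String {
    let (tx, rx) = mpsc::channel::<String>();

    // The "UI": renders every delta the moment it is received.
    let ui = thread::spawn(move || {
        let mut rendered = String::new();
        for delta in rx {
            rendered.push_str(&delta);
        }
        rendered
    });

    // The "stream loop": sending is fire-and-forget, it never waits
    // for the UI to catch up.
    for delta in deltas {
        tx.send((*delta).to_string()).unwrap();
    }
    drop(tx); // closing the channel lets the UI thread finish

    ui.join().unwrap()
}

fn main() {
    assert_eq!(stream_to_ui(&["list", "ing ", "files"]), "listing files");
    println!("ok");
}
```

The key property is decoupling: the producer's loop never blocks on rendering, which is what keeps the CLI responsive during a long generation.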
### Track 2: Control Flow (State Accumulation)
Unlike the "fire-and-forget" data flow, the control flow acts as a state accumulator throughout the stream's lifecycle. It silently monitors the incoming events to determine the macroscopic routing of the entire agent execution loop.
This is primarily driven by accumulating the needs_follow_up boolean flag:
```rust
let mut needs_follow_up = false;
let mut last_agent_message = None;

let outcome: CodexResult<SamplingRequestResult> = loop {
    let event = match stream.next().await { ... };

    match event {
        ResponseEvent::OutputItemDone(item) => {
            let output_result = handle_output_item_done(...).await?;
            // Accumulate: Does the model want to call a tool or continue?
            needs_follow_up |= output_result.needs_follow_up;
            if let Some(agent_message) = output_result.last_agent_message {
                last_agent_message = Some(agent_message);
            }
        }
        ResponseEvent::Completed { .. } => {
            // Also check if the user steered new input while we were generating
            needs_follow_up |= sess.has_pending_input().await;
            // Loop finishes, return the simple control signal
            break Ok(SamplingRequestResult {
                needs_follow_up,
                last_agent_message,
            });
        }
        _ => {}
    }
};
```
By using the OR-assignment operator (|=), Codex records whether any event in the entire stream requires further interaction; once the flag is set to true, it stays true. For example, if the LLM emitted a ToolCall, output_result.needs_follow_up becomes true.
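The accumulation pattern in isolation (a hypothetical helper, not taken from the codebase):

```rust
// One boolean accumulates a "keep going" signal across many stream
// events; a single `true` anywhere makes the final result `true`.
// On `bool`, `|=` is a non-short-circuiting logical OR-assignment.
fn accumulate_follow_up(event_signals: &[bool]) -> bool {
    let mut needs_follow_up = false;
    for &signal in event_signals {
        needs_follow_up |= signal; // sticks at true once set
    }
    needs_follow_up
}

fn main() {
    // Plain text deltas only: the turn can end.
    assert!(!accumulate_follow_up(&[false, false, false]));
    // One tool call anywhere in the stream forces a follow-up turn.
    assert!(accumulate_follow_up(&[false, true, false]));
    println!("ok");
}
```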
### The Result of the Loop
Because the Data Flow handled all the heavy lifting of displaying content to the user, the final SamplingRequestResult returned by the function is deceptively simple:
If needs_follow_up is false, the conversation turn ends; the LLM simply provided an answer. If needs_follow_up is true, it signals to the outer run_turn loop that the agent is not done yet. It must now execute the requested tools and send the results back to the LLM.
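That routing decision can be sketched as a synchronous, heavily simplified stand-in for the outer loop (the structure mirrors the description above, but the names, the fake model, and the turn counter are all illustrative):

```rust
// Hypothetical, synchronous sketch of the outer agent loop that
// run_sampling_request's result feeds into.
struct SamplingRequestResult {
    needs_follow_up: bool,
    last_agent_message: Option<String>,
}

// Fake model: requests a tool on the first turn, then answers.
fn run_sampling_request(turn: u32) -> SamplingRequestResult {
    if turn == 0 {
        SamplingRequestResult { needs_follow_up: true, last_agent_message: None }
    } else {
        SamplingRequestResult {
            needs_follow_up: false,
            last_agent_message: Some("done".into()),
        }
    }
}

fn run_turn() -> Option<String> {
    let mut turn = 0;
    loop {
        let result = run_sampling_request(turn);
        if !result.needs_follow_up {
            // No tool calls pending: the turn ends with the final answer.
            return result.last_agent_message;
        }
        // Otherwise: execute the requested tools, append their outputs
        // to the history, and sample again (elided here).
        turn += 1;
    }
}

fn main() {
    assert_eq!(run_turn(), Some("done".to_string()));
    println!("ok");
}
```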
## Conclusion
We have seen how the Prompt crosses the network boundary, and how the complex SSE stream is masterfully split into two tracks:

1. Data Flow: pushing real-time deltas to the UI.
2. Control Flow: accumulating the needs_follow_up flag to drive the autonomous Agent Loop.
But what exactly happens when needs_follow_up becomes true because the LLM requested a Tool Call? How are these tools executed, and how is the loop continued? We will explore this in the next post: Codex Tools and Toolcalls.
(Note: If you are familiar with the OpenAI API, you may have noticed our serialized JSON request lacks several standard fields like temperature or max_tokens. We will discuss the architectural philosophy behind these omissions in API Subset and Architectural Trade-offs.)