
1. Getting Started

This series of articles records my process of reading the Codex source code to deeply understand how an AI code agent works.

I focus on the high-level overview first and intentionally skip some technical details, even where they touch on high-level design, in order to build a basic understanding of an AI code agent.

Prerequisites:

  • You have tried LLMs via chat, such as ChatGPT or Gemini.
  • You have tried AI code agents such as Claude Code, Cursor, or Codex.

The fundamental premise is that an LLM can understand, reason, and produce results from natural-language input, while a code agent provides it with a defined set of abilities (file read and write, shell execution, etc.).

Basic Components and Flow

Codex generally consists of three components: UI, Daemon, and Model.

  • UI: The term "UI" is used to refer to the application driving Codex. This may be the CLI / TUI chat-like interface that users operate, or it may be a GUI interface like a VSCode extension. The UI is external to Codex, as Codex is intended to be operated by arbitrary UI implementations.

  • Daemon: The Daemon contains the real implementation of Codex and handles the core logic of the code agent. It is called a daemon because it is fully decoupled from the UI and can serve a variety of UIs.

  • Model: The Model is the LLM side, which acts as the brain that thinks and makes decisions. It hosts the models (such as gpt-5.3).

The diagram below shows how an AI code agent performs the task "How many lines of Go code are under this project?".

sequenceDiagram
    autonumber
    actor User
    participant Daemon
    participant Model

    User ->>+ Daemon: Query Go LOC

    Note over Daemon: Append prompts & tools
    Daemon ->>+ Model: Forward query
    Model -->>- Daemon: Tool Call: shell (wc)

    Note over Daemon: Exec: find . -name '*.go' | xargs wc -l
    Daemon ->>+ Model: Output: 1006 total

    Model -->>- Daemon: Return formatted response
    Note right of Model: Task finished. There are<br>1006 Go code lines in total<br>under this project.

    Daemon -->>- User: Deliver final message
    Note right of Daemon: There are 1006 Go code<br>lines in total under this project.
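The loop in the diagram can be sketched in a few lines. This is a toy, not Codex's implementation: `fake_model` stands in for the LLM, and the message roles and reply shapes are invented for illustration.

```python
import subprocess

# Stub for the Model: first asks for one shell tool call, then summarizes
# the tool output. A real agent would call the LLM API here instead.
def fake_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        # The real flow would run: find . -name '*.go' | xargs wc -l
        return {"type": "tool_call",
                "arguments": {"command": "echo '1006 total'"}}
    tool_output = [m for m in messages if m["role"] == "tool"][-1]["content"]
    return {"type": "message",
            "content": f"Task finished: {tool_output.strip()} Go code lines."}

def run_task(query):
    """Daemon-side loop: forward the query, execute tool calls, repeat until done."""
    messages = [{"role": "user", "content": query}]
    while True:
        reply = fake_model(messages)
        if reply["type"] == "message":   # final answer: deliver it to the user
            return reply["content"]
        # The model requested a tool: run it locally, feed the output back.
        result = subprocess.run(reply["arguments"]["command"], shell=True,
                                capture_output=True, text=True)
        messages.append({"role": "tool", "content": result.stdout})

print(run_task("How many Go code lines are under this project?"))
```

The key shape to notice: the Daemon owns the loop, and the Model only ever sees text in and text (or tool calls) out.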

Communications Between Components

The UI <-> Daemon and Daemon <-> Model pairs communicate with each other, but the UI and Model never talk directly.

UI and Daemon

  • Custom Protocol for CLI/TUI

    Codex defines protocol v1 for internal usage. It defines an SQ (Submission Queue) for UI-to-Daemon messages and an EQ (Event Queue) for Daemon-to-UI messages. See the docs and code for more details.

  • App-Server for Extensions

    Communication uses JSON-RPC 2.0; see the codex app-server docs for details.
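As a rough illustration of the JSON-RPC 2.0 framing, here is what a request/response pair looks like on the wire. The method name and params below are made up for illustration; the real method names are in the app-server docs.

```python
import json

def make_request(req_id, method, params):
    # JSON-RPC 2.0 request: the "id" lets the caller match the reply.
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})

def make_response(req_id, result):
    # JSON-RPC 2.0 success response, echoing the request id.
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})

# Hypothetical method name, for illustration only.
req = make_request(1, "session/newTurn", {"input": "Query Go LOC"})
resp = make_response(1, {"status": "accepted"})
print(req)
print(resp)
```

The envelope is all JSON-RPC specifies; the extension and app-server agree on the method vocabulary on top of it.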

Daemon and Model

Daemon talks to the Model via OpenAI's Responses API at /v1/responses.

  • HTTP + SSE

    Daemon calls POST /v1/responses (response create API). Request is JSON; response is an SSE stream.

  • WebSocket

    When supported, Daemon may use WebSocket for the same endpoint to reuse connections and send incremental requests. It falls back to HTTP if WebSocket fails.

    See OpenAI Responses API and codex-rs/codex-api/README.md for details.
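An SSE stream is just line-based text: fields like `event:` and `data:` grouped into frames separated by blank lines. A minimal parser sketch follows; the event names are borrowed from the Responses API's streaming events, but the sample payloads are invented.

```python
def parse_sse(raw):
    """Split a raw SSE stream into a list of {field: value} events."""
    events = []
    for chunk in raw.strip().split("\n\n"):   # events are blank-line separated
        event = {}
        for line in chunk.splitlines():
            field, _, value = line.partition(":")  # split on the FIRST colon only
            event[field] = value.lstrip()
        events.append(event)
    return events

stream = (
    "event: response.output_text.delta\n"
    'data: {"delta": "1006 total"}\n'
    "\n"
    "event: response.completed\n"
    "data: {}\n"
)
for ev in parse_sse(stream):
    print(ev["event"], ev["data"])
```

Streaming matters here because the Daemon can surface partial model output (deltas) to the UI long before the full response finishes.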

Model: LLM API Provider

The LLM is the brain: the code agent receives instructions from it and extends its abilities. Codex uses the response create API (POST /v1/responses) to interact with the LLM.

Unlike the chat API, the Responses API allows flexible input and output. Two extra concepts are worth understanding:

  • Tool call: Function calling lets you connect models to external tools and APIs. Instead of generating a text response, the model determines when to call specific functions and provides the necessary parameters to execute real-world actions. This lets the model act as a bridge between natural language and real-world actions and data.

  • Structured response: Structured output asks the LLM to generate responses that adhere to a provided JSON Schema. This ensures predictable, type-safe results and simplifies extracting structured data from unstructured text.
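Assuming the Responses API request shape, a request combining both features might look like the body below. The `shell` tool schema is invented for illustration; check the Responses API reference for the exact field names before relying on this.

```python
request_body = {
    "model": "gpt-5.3",   # model name taken from the text above
    "input": "How many Go code lines are under this project?",
    # Function calling: declare a tool the model may ask the Daemon to run.
    "tools": [{
        "type": "function",
        "name": "shell",
        "description": "Run a shell command and return its output.",
        "parameters": {                      # JSON Schema for the arguments
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }],
    # Structured output: constrain the final answer to a JSON Schema.
    "text": {"format": {
        "type": "json_schema",
        "name": "loc_report",
        "schema": {
            "type": "object",
            "properties": {"total_lines": {"type": "integer"}},
            "required": ["total_lines"],
        },
    }},
}
print(sorted(request_body))
```

Note that the model never executes `shell` itself; it only emits a tool call with arguments, and the Daemon decides whether and how to run it.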

Daemon: Codex Core Engine Flow

The Codex docs have a comprehensive explanation of the terms used inside Codex for developers.

The Daemon's core engine is Codex, and it runs locally, either in a background thread or a separate process. It takes user input, makes requests to the Model, executes commands, and applies patches.

Session, Task, and Turn

Note: Be careful not to confuse the core engine's Turn with the Turn defined in the internal protocol. In Codex's internal protocol representation, a Turn may conceptually imply a user-agent exchange loop, but at the core execution level, Task and Turn are strictly 1:1.

  • Session: One UI window = one Session. Holds the entire conversation context and state.
  • Task: A single user request. It is essentially a tokio::spawn JoinHandle wrapper that tracks the cancellation and lifecycle of a user request.
  • Turn: The actual execution unit driving the LLM interactions. A Turn handles the multi-round sampling loop (prompting, parsing tool calls, executing tools, re-prompting) until the LLM decides the job is complete.

Relationship: \(Session \supset Tasks \equiv Turns\)

Since a Task delegates its entire lifecycle to a single Turn, their relationship is strictly 1:1. When a Turn ends (the sampling loop finishes), the Task finishes. When the user inputs again, it creates a new Task (and thus a new Turn), rather than creating a new Turn inside an existing Task.

sequenceDiagram
    autonumber
    participant User
    participant Codex
    participant Session
    participant Task as Task (JoinHandle)
    participant Turn as Turn (Execution)
    participant Model

    User ->> Codex: Configure
    Codex ->> Session: Initialize Session
    Note over Session: Session is maintained for the current UI state

    User ->> Session: Provide Input 1
    Note over Session: Abort any previous Tasks
    Session ->>+ Task: Spawn Task 1
    Task ->>+ Turn: Start Turn 1

    Note over Turn, Model: Turn 1 manages the multi-round LLM sampling loop

    %% Turn Sampling Loop
    loop Multi-round Tool Call / Sampling
        Turn ->>+ Model: Request (Prompt/Context)
        Model -->>- Turn: Response (Stream: Text or Tool Call)
        opt Tool Called
            Turn ->> Turn: Execute Local Tool (e.g. bash, read_file)
        end
    end

    Turn -->>- Task: Turn 1 completes
    Task -->>- Session: Task 1 ends (JoinHandle resolves)

    Note over Session: Session waits for new user input

    User ->> Session: Provide Input 2
    Note over Session: New user input starts a NEW Task (and NEW Turn)
    Session ->>+ Task: Spawn Task 2
    Task ->>+ Turn: Start Turn 2
    Note over Turn: Executes sampling loop...
    Turn -->>- Task: Turn 2 completes
    Task -->>- Session: Task 2 ends
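The Session/Task/Turn lifecycle can be sketched with asyncio standing in for tokio. The names and shapes below are illustrative, not Codex's actual types: the point is that each user input aborts any running task and spawns a fresh one, and each task drives exactly one turn.

```python
import asyncio

class Session:
    """Holds conversation state for one UI window; owns at most one live task."""
    def __init__(self):
        self.task = None          # JoinHandle-like handle for the active task
        self.log = []

    async def run_turn(self, user_input):
        # Stand-in for the multi-round sampling loop (prompt, tool calls, ...).
        await asyncio.sleep(0)
        self.log.append(f"turn finished for: {user_input}")

    def on_user_input(self, text):
        if self.task and not self.task.done():
            self.task.cancel()    # abort the previous task, per the diagram's note
        # One new Task per input, and exactly one Turn inside it (1:1).
        self.task = asyncio.ensure_future(self.run_turn(text))
        return self.task

async def main():
    session = Session()
    session.on_user_input("Input 1")
    await session.task            # Task 1 resolves when Turn 1 completes
    session.on_user_input("Input 2")
    await session.task            # a NEW Task/Turn, never a Turn inside Task 1
    print(session.log)

asyncio.run(main())
```

The cancel-then-spawn pattern in `on_user_input` is what makes interruption cheap: there is never more than one Task to tear down.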