Two teams. Same Claude Sonnet 4.5. Same task. One ships in a day. One is still stuck at the sketch stage. The model is the same. The harness around it is the difference.

What an agent harness actually is

Anthropic's engineering team offers the closest thing to an official definition. They describe the Claude Agent SDK as "a powerful, general-purpose agent harness, adept at coding, as well as other tasks that require the model to use tools to gather context, plan, and execute". Microsoft's Agent Framework team puts it operationally: an agent harness is "the layer where model reasoning connects to real execution: shell and filesystem access, approval flows, and context management across long-running sessions".

A mental model in plain terms

If the LLM is the engine, the harness is the rest of the car: steering, brakes, fuel system, dashboard, and the rules of the road. The engine produces power; the rest of the car turns power into a journey. The same shape applies to an agent system. The model produces text. The harness turns text into useful action: tool calls, file edits, approvals, persisted state, all of it.

From first principles

At first principles, an LLM is a function that takes text in and produces text out. To turn that function into a system an operator can run a business on, three problems get solved between the input and the output:

  • The right context has to land in the prompt (compaction, retrieval, project state, the conversation so far).
  • The output has to be interpreted as more than text (parsed as a tool call, validated, executed in a controlled environment, with the result fed back).
  • Decisions have to be made about what to do next (continue, stop, ask the user, escalate).
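The three problems can be sketched as three small functions around the model call. This is a minimal illustration, not any real SDK's API; every name here is hypothetical:

```python
# Sketch of the three problems a harness solves between text-in and
# text-out. All function names and formats are illustrative.

def assemble_context(history: list[str], task: str) -> str:
    """Problem 1: land the right context in the prompt."""
    recent = history[-5:]  # naive window: keep only the latest turns
    return "\n".join(recent + [f"Task: {task}"])

def interpret_output(output: str) -> dict:
    """Problem 2: treat model output as more than text."""
    if output.startswith("TOOL:"):
        name, _, arg = output[5:].partition(" ")
        return {"kind": "tool_call", "name": name, "arg": arg}
    return {"kind": "answer", "text": output}

def decide_next(step: dict, turns: int, max_turns: int = 10) -> str:
    """Problem 3: decide what happens next."""
    if step["kind"] == "answer" or turns >= max_turns:
        return "stop"
    return "continue"
```

A real harness replaces each naive choice here (fixed window, string-prefix tool protocol, two-way stop decision) with the opinions the rest of this article describes.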

The harness handles all three. The model handles the reasoning between them. Every production agent system runs on a harness, even when the team building it calls the layer something else.

Five jobs sit inside that layer

  • Prompt assembly. The system prompt, the AGENTS.md chain, project memory, and tool definitions get composed every turn.
  • Tool-use loop. Each tool call is parsed, validated, executed in a sandbox, and fed back into context, with retries on failure.
  • Context management. Compaction, summarization, and "lost in the middle" mitigation keep the working window useful as conversations grow.
  • Permission gates. Approval flows, scope limits, and sandboxing decide which actions the agent runs autonomously and which ask for human approval.
  • Termination logic. The harness owns this decision. The model keeps offering next steps.

Anthropic's own evidence for why these five jobs matter: even a frontier coding model like Opus 4.5, running on the Claude Agent SDK in a loop across multiple context windows, falls short of building a production-quality web app from a high-level prompt. The model is capable. The model alone is incomplete. The harness fills the gap.

The term itself spread fast in early 2026. Mitchell Hashimoto formalized the practice in his February 5 essay My AI Adoption Journey, dedicating Step 5 to Engineer the Harness. Six days later, OpenAI published Harness engineering, documenting how three to seven engineers shipped roughly one million lines of code across approximately 1,500 pull requests, with all code generated by Codex. The vocabulary settled quickly because developers had been building these things for months and only then settled on a name.

Harness vs framework, the distinction that matters

Harrison Chase, creator of LangChain, published a three-layer taxonomy in October 2025:

  • Frameworks are libraries you build with. They give you abstractions and a standard mental model. LangChain, CrewAI, LlamaIndex, Vercel AI SDK, OpenAI Agents SDK, Google ADK, and Mastra fall here. (See our Agent Frameworks directory for the editorial scoring on each.)
  • Runtimes are infrastructure for running agents in production. They handle durable execution, streaming, human-in-the-loop, and persistence. LangGraph, Temporal, and Inngest fall here.
  • Harnesses are batteries-included runtimes you deploy. They come with default prompts, opinionated tool handling, planning support, filesystem access, and more. Claude Code, Codex CLI, DeepAgents, and Cursor's agent fall here.

Chase admits the lines are blurry. "I don't think there is a clear definition of framework vs runtime vs harness," he writes, framing the post as his attempt to add one. A Reddit thread in r/AI_Agents titled Stop calling it an 'agent harness.' It's an Agent Runtime captures how contested the vocabulary remains.

For an operator the boundary is sharper than the academic debate makes it look. Frameworks ship options. Harnesses ship decisions. A framework hands you a tool-use loop you wire together. A harness hands you a tool-use loop already wired, with the approval policy and the context strategy and the sandbox model already chosen.

Framework vs Runtime vs Harness
| Dimension | Framework | Runtime | Harness |
| --- | --- | --- | --- |
| Shape | Library | Engine | Deployable runtime with opinions |
| You bring | Architecture, glue, prompts, tool wiring, context strategy | The agent code (the runtime handles durability) | An LLM, an AGENTS.md, optional MCP / tools |
| Examples | LangChain, CrewAI, OpenAI Agents SDK, Vercel AI SDK, Mastra | LangGraph, Temporal, Inngest | Claude Code, Claude Agent SDK, Codex CLI, DeepAgents, Cursor agent, AWS AgentCore Harness, OpenClaw, Hermes |
| Governance posture | Your problem | Some primitives (HIL, persistence) | First-class (approvals, scopes, audit) |
| Replaceability | High (code is yours) | High | Often the lock-in point |
| Operator decision | Build your own runtime on top | Pick durable execution | Adopt a harness or build one |
TIP: Quick reference

Framework: a library. You glue it together. Runtime: an engine. It runs reliably under load. Harness: a runtime with opinions. It runs your agent the way the harness author thinks it should be run.

Two shapes within the harness layer: interactive and autonomous

Within Chase's harness layer, two deployment shapes serve different buyers.

Interactive harnesses run turn-by-turn alongside a human. Every prompt is human-driven, every step is human-reviewed. Examples: Claude Code's interactive mode, Cursor's agent mode, Aider, Continue, Codex CLI when used at the terminal.

Autonomous agentic harnesses run continuously, often in the background, with the harness pacing each turn on its own rather than waiting on a human. Examples: OpenClaw, Hermes, Devin, Manus, Claude Cowork, Claude Dispatch.

The distinction matters for procurement. Interactive harnesses compete in developer productivity (alongside IDEs and code copilots). Autonomous agentic harnesses compete in workflow automation (alongside Zapier, n8n, and Make). Different buyer, different budget, different governance surface. The same project can offer both shapes. Claude Code does this directly: interactive mode at the CLI and autonomous mode via Cowork and Dispatch.

The rest of this article applies to both shapes. Where the autonomous case carries additional load, typically on permission gates and governance, the relevant section calls it out.

What a harness actually does (the five jobs)

1. Prompt assembly. Every turn, the harness composes the system prompt, the AGENTS.md chain (global plus project-nested), the project memory or progress file, and the tool definitions into one prompt. The model sees the result. The harness owns the recipe.
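A sketch of that recipe, assuming hypothetical file locations and a fixed composition order (the ordering, like everything else here, is an illustrative choice, not a documented one):

```python
from pathlib import Path

def assemble_prompt(system: str, agents_md_chain: list[Path],
                    progress_file: Path, tool_defs: str,
                    conversation: list[str]) -> str:
    """Compose the pieces in a fixed order each turn.
    The harness owns this recipe; the model only sees the result."""
    sections = [system]
    for md in agents_md_chain:          # global first, then project-nested
        if md.exists():
            sections.append(md.read_text())
    if progress_file.exists():          # project memory / progress log
        sections.append(progress_file.read_text())
    sections.append(tool_defs)          # tool definitions
    sections.extend(conversation)       # the conversation so far
    return "\n\n".join(sections)
```

Missing files simply drop out of the composition, which is why a nested AGENTS.md chain degrades gracefully when a directory has no file of its own.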

2. Tool-use loop. When the model emits a tool call, the harness validates the parameters, runs the call in a controlled environment, captures stdout, stderr, and exit code, and feeds the result back into context. Microsoft's Agent Framework documents one common pattern with @tool(approval_mode="always_require"), where every shell command pauses for explicit approval before executing.
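A stripped-down sketch of one loop body, using an illustrative allowlist in place of a real approval flow (the policy and names here are assumptions, not any SDK's API):

```python
import subprocess

ALLOWED = {"echo", "ls"}  # illustrative allowlist, not a real policy

def run_tool_call(command: list[str]) -> dict:
    """Validate, execute, and capture everything the model needs
    to reason about the outcome on the next turn."""
    if not command or command[0] not in ALLOWED:
        # The harness refuses; the refusal itself is fed back as context.
        return {"ok": False, "error": f"command not permitted: {command[:1]}"}
    proc = subprocess.run(command, capture_output=True, text=True, timeout=30)
    return {"ok": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "exit_code": proc.returncode}
```

The returned dict, success or failure, goes back into the next prompt; retries are just the model reading a non-zero exit code and trying again.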

3. Context management. Long-running tasks fill the context window. The harness handles compaction (summarizing older turns), retrieval (pulling in relevant docs only at the right step), and middle-of-context positioning (Liu et al. showed model performance degrades when key content sits in the middle of a long prompt). Anthropic's engineering team also notes a small detail with large consequences: JSON works better than Markdown for state files because models edit JSON less casually.
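One naive compaction strategy as a sketch, where the `summarize` callable stands in for a model-backed summarizer (real harnesses use token counts and smarter retention, not character budgets):

```python
def compact(turns: list[str], budget: int, summarize) -> list[str]:
    """Fold the oldest turns into a summary when the window exceeds
    budget. Keeping the newest turns verbatim also pushes key content
    toward the end of the prompt, away from the degraded middle."""
    while sum(len(t) for t in turns) > budget and len(turns) > 2:
        oldest, turns = turns[:2], turns[2:]
        turns = [summarize(oldest)] + turns
    return turns
```

The `len(turns) > 2` guard is the naive version of a retention floor: compaction must never summarize away the turn the model is currently acting on.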

4. Permission gates. Approvals, scope limits, sandboxing. The harness has the authority to refuse a tool call. The model has the authority to suggest one.

5. Termination logic. When does the loop end? On task completion, on max-turns, on a verification check failing, on a budget exhaustion. The harness decides. The model keeps offering next steps.

Why harness design is where governance lives

Two teams, same model, different harness. The model is the cheap commodity in this equation. The harness is the expensive opinion.

Three pieces of evidence make the case.

Cursor's reasoning-trace experiment. Cursor reports that dropping reasoning traces between turns caused a 30% performance drop on their internal Cursor Bench evaluation of GPT-5-Codex. OpenAI observed only a 3% degradation for GPT-5 on SWE-bench under the same change. Same change at the harness layer, an order-of-magnitude difference in measured impact. The gap speaks for itself.

Anthropic's third-party harness restrictions. In November 2025, Anthropic restricted API access for third-party agent harnesses such as OpenClaw. Pro and Max plan subscribers found that those plans now cover Claude Code (Anthropic's own harness) and require direct API billing for third-party wrappers around Claude. The signal is plain: the harness is where behavior gets shaped, and Anthropic decided that letting unbounded third-party harnesses sit between subscribers and the model was a problem worth solving at the commercial layer.

Anthropic's own example. With their initializer-plus-coding-agent harness, Opus 4.5 completed the same web-app task it had previously left half-built across context windows. The harness consisted of an init.sh, a claude-progress.txt log, a feature list in JSON, and an explicit incremental-progress prompt. Same model. Different harness. Different outcome.

The throughline: the model is one component, the harness is the rest of the system, and the rest of the system determines whether the system is governable.

How to choose a harness

For an autonomous agent harness, work through five checks:

1. Decide deploy-and-run vs build-and-extend. Claude Code, Codex CLI, and AWS AgentCore Harness give you a working harness on day one. DeepAgents and equivalent libraries give you the pieces to build your own on top of a framework.

2. Audit the governance surface. Score the harness on three pillars: security and governance, integration and interoperability, and observability. Each one is a make-or-break factor for production deployment.

3. Check tool-call semantics. MCP support? Custom tools? What is the retry behavior? What does the sandbox actually allow? The answers shape what your agent can and cannot do safely.

4. Look at context management. What is the compaction strategy? Are there memory primitives? Does state persist as JSON or Markdown? Anthropic's own data favors JSON for state files.

5. Evaluate the operator surface. Logs, traces, audit trails. OpenAI's term for this is agent legibility: the application UI, logs, and metrics should be readable by the agent itself, in addition to a human reviewer.

Examples in the wild

A snapshot of the agent harness landscape in May 2026: interactive, autonomous, model-bound, and orchestrator-adjacent shapes.
  • Claude Code / Claude Agent SDK (Anthropic): the canonical coding harness. The initializer-plus-coding-agent pattern is documented in detail in Anthropic's engineering post.
  • Codex CLI (OpenAI): model-bound harness. AGENTS.md is first-class in its discovery rules, and the Harness engineering case study runs on it.
  • Cursor: harness embedded in an IDE. Tunes per model: shell-forward instructions for Codex, lint-reading prompts, reasoning-trace preservation.
  • DeepAgents (LangChain): a harness on top of the LangChain framework. Default prompts, tool handling, filesystem access.
  • AWS AgentCore Harness (Preview): managed cloud harness. Trace-by-default observability via AgentCore Observability.
  • Hermes Agent (Nous Research): self-improving harness with a built-in learning loop. Ships pre-built.
  • OpenClaw: third-party harness around Claude. Lost subscription-plan API access in November 2025; now requires direct API billing.
  • Paperclip: an agent orchestrator. It sits one layer above harnesses, dispatching work to Claude Code instances through tickets and an immutable audit log. The orchestrator layer is distinct from the harness layer: the harness is what runs each Claude Code instance underneath. The forthcoming OpenClaw vs Hermes vs Paperclip teardown unpacks the distinction in detail.
INFO: Harness teardown coming soon

A side-by-side teardown of OpenClaw, Hermes, and Paperclip is on the way. Subscribe to the newsletter to hear when it ships.

You are probably already running a harness

Most operators encounter an agent harness before they encounter the word. If you have used Claude Code, Cursor, Codex CLI, or any one of the autonomous coding agents that shipped in 2025 or 2026, you have run one. The harness is what loaded your CLAUDE.md or AGENTS.md at session start, what asked you to approve a shell command, what compacted your context window when the conversation got long, what decided when the loop should stop. All of that comes from the harness layer.

The point of knowing what a harness is, then, is twofold:

It shows where the leverage is. When two members of your team get different output quality from the same model, the gap is at the harness layer. The fix is also there: better prompts, better tool semantics, better permission gates. The model is the same in both cases.

It puts procurement on the right axis. The buyer for an interactive harness like Cursor wants developer velocity. The buyer for an autonomous agentic harness like Hermes or OpenClaw wants reliable, governable, repeatable workflow execution. The two are different products competing in different markets, and treating them as the same category is the source of most procurement mismatches on either side.

The model is the cheap, swappable, commodity layer. The harness is where the system gets its behavior. That is where the engineering time and the procurement attention go.

AGENTS.md: The Universal Agent Contract Explained

The companion explainer in the AI Agents cluster. Where the harness is the runtime, AGENTS.md is the contract every harness reads. Read this next to see how the two layer together.

The model is commodity. The harness is the opinion.