Two teams. Same Claude Sonnet 4.5. Same task. One ships in a day. One is still stuck at the sketch stage. The model is the same. The harness around it is the difference.

What an agent harness actually is

Anthropic's engineering team offers the closest thing to an official definition. They describe the Claude Agent SDK as "a powerful, general-purpose agent harness, adept at coding, as well as other tasks that require the model to use tools to gather context, plan, and execute". Microsoft's Agent Framework team puts it operationally: an agent harness is "the layer where model reasoning connects to real execution: shell and filesystem access, approval flows, and context management across long-running sessions".

A mental model in plain terms

If the LLM is the engine, the harness is the rest of the car: steering, brakes, fuel system, dashboard, and the rules of the road. The engine produces power; the rest of the car turns power into a journey. The same shape applies to an agent system. The model produces text. The harness turns text into useful action: tool calls, file edits, approvals, persisted state, all of it.

From first principles

At first principles, an LLM is a function that takes text in and produces text out. To turn that function into a system an operator can run a business on, three problems get solved between the input and the output:

  • The right context has to land in the prompt (compaction, retrieval, project state, the conversation so far).
  • The output has to be interpreted as more than text (parsed as a tool call, validated, executed in a controlled environment, with the result fed back).
  • Decisions have to be made about what to do next (continue, stop, ask the user, escalate).
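The three problems can be sketched as three small functions around the model call. This is a minimal illustration, not any real SDK's API; every name here is hypothetical:

```python
# Sketch of the three problems a harness solves between text-in and
# text-out. All function names and formats are illustrative.

def assemble_context(history: list[str], task: str) -> str:
    """Problem 1: land the right context in the prompt."""
    recent = history[-5:]  # naive window: keep only the latest turns
    return "\n".join(recent + [f"Task: {task}"])

def interpret_output(output: str) -> dict:
    """Problem 2: treat model output as more than text."""
    if output.startswith("TOOL:"):
        name, _, arg = output[5:].partition(" ")
        return {"kind": "tool_call", "name": name, "arg": arg}
    return {"kind": "answer", "text": output}

def decide_next(step: dict, turns: int, max_turns: int = 10) -> str:
    """Problem 3: decide what happens next."""
    if step["kind"] == "answer" or turns >= max_turns:
        return "stop"
    return "continue"
```

A real harness replaces each naive choice here (fixed window, string-prefix tool protocol, two-way stop decision) with the opinions the rest of this article describes.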

The harness handles all three. The model handles the reasoning between them. Every production agent system runs on a harness, even when the team building it calls the layer something else.

Five jobs sit inside that layer

  • Prompt assembly. The system prompt, the AGENTS.md chain, project memory, and tool definitions get composed every turn.
  • Tool-use loop. Each tool call is parsed, validated, executed in a sandbox, and fed back into context, with retries on failure.
  • Context management. Compaction, summarization, and "lost in the middle" mitigation keep the working window useful as conversations grow.
  • Permission gates. Approval flows, scope limits, and sandboxing decide which actions the agent runs autonomously and which ask for human approval.
  • Termination logic. The harness owns this decision. The model keeps offering next steps.

Anthropic's own evidence for why these five jobs matter: even a frontier coding model like Opus 4.5, running on the Claude Agent SDK in a loop across multiple context windows, falls short of building a production-quality web app from a high-level prompt. The model is capable. The model alone is incomplete. The harness fills the gap.

The term itself spread fast in early 2026. Mitchell Hashimoto formalized the practice in his February 5 essay My AI Adoption Journey, dedicating Step 5 to Engineer the Harness. Six days later, OpenAI published Harness engineering, documenting how three to seven engineers shipped roughly one million lines of code across approximately 1,500 pull requests, with all code generated by Codex. The vocabulary settled quickly because developers had been building these things for months and only then settled on a name.

Harness vs framework, the distinction that matters

Harrison Chase, creator of LangChain, published a three-layer taxonomy in October 2025:

  • Frameworks are libraries you build with. They give you abstractions and a standard mental model. LangChain, CrewAI, LlamaIndex, Vercel AI SDK, OpenAI Agents SDK, Google ADK, and Mastra fall here. (See our Agent Frameworks directory for the editorial scoring on each.)
  • Runtimes are infrastructure for running agents in production. They handle durable execution, streaming, human-in-the-loop, and persistence. LangGraph, Temporal, and Inngest fall here.
  • Harnesses are batteries-included runtimes you deploy. They come with default prompts, opinionated tool handling, planning support, filesystem access, and more. Claude Code, Codex CLI, DeepAgents, and Cursor's agent fall here.

Chase admits the lines are blurry. "I don't think there is a clear definition of framework vs runtime vs harness," he writes, framing the post as his attempt to add one. A Reddit thread in r/AI_Agents titled Stop calling it an 'agent harness.' It's an Agent Runtime captures how contested the vocabulary remains.

For an operator the boundary is sharper than the academic debate makes it look. Frameworks ship options. Harnesses ship decisions. A framework hands you a tool-use loop you wire together. A harness hands you a tool-use loop already wired, with the approval policy and the context strategy and the sandbox model already chosen.

Framework vs Runtime vs Harness
| Dimension | Framework | Runtime | Harness |
| --- | --- | --- | --- |
| Shape | Library | Engine | Deployable runtime with opinions |
| You bring | Architecture, glue, prompts, tool wiring, context strategy | The agent code (the runtime handles durability) | An LLM, an AGENTS.md, optional MCP / tools |
| Examples | LangChain, CrewAI, OpenAI Agents SDK, Vercel AI SDK, Mastra | LangGraph, Temporal, Inngest | Claude Code, Claude Agent SDK, Codex CLI, DeepAgents, Cursor agent, AWS AgentCore Harness, OpenClaw, Hermes |
| Governance posture | Your problem | Some primitives (HIL, persistence) | First-class (approvals, scopes, audit) |
| Replaceability | High (code is yours) | High | Often the lock-in point |
| Operator decision | Build your own runtime on top | Pick durable execution | Adopt a harness or build one |
TIP: Quick reference

Framework: a library. You glue it together. Runtime: an engine. It runs reliably under load. Harness: a runtime with opinions. It runs your agent the way the harness author thinks it should be run.

Two shapes within the harness layer: interactive and autonomous

Within Chase's harness layer, two deployment shapes serve different buyers.

Interactive harnesses run turn-by-turn alongside a human. Every prompt is human-driven, every step is human-reviewed. Examples: Claude Code's interactive mode, Cursor's agent mode, Aider, Continue, Codex CLI when used at the terminal.

Autonomous agentic harnesses run continuously, often in the background, with the harness pacing each turn on its own rather than waiting on a human. Examples: OpenClaw, Hermes, Devin, Manus, Claude Cowork, Claude Dispatch.

The distinction matters for procurement. Interactive harnesses compete in developer productivity (alongside IDEs and code copilots). Autonomous agentic harnesses compete in workflow automation (alongside Zapier, n8n, and Make). Different buyer, different budget, different governance surface. The same project can offer both shapes. Claude Code does this directly: interactive mode at the CLI and autonomous mode via Cowork and Dispatch.

The rest of this article applies to both shapes. Where the autonomous case carries additional load, typically on permission gates and governance, the relevant section calls it out.

What a harness actually does (the five jobs)

1. Prompt assembly. Every turn, the harness composes the system prompt, the AGENTS.md chain (global plus project-nested), the project memory or progress file, and the tool definitions into one prompt. The model sees the result. The harness owns the recipe.
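A sketch of that recipe, assuming hypothetical file locations and a fixed composition order (the ordering, like everything else here, is an illustrative choice, not a documented one):

```python
from pathlib import Path

def assemble_prompt(system: str, agents_md_chain: list[Path],
                    progress_file: Path, tool_defs: str,
                    conversation: list[str]) -> str:
    """Compose the pieces in a fixed order each turn.
    The harness owns this recipe; the model only sees the result."""
    sections = [system]
    for md in agents_md_chain:          # global first, then project-nested
        if md.exists():
            sections.append(md.read_text())
    if progress_file.exists():          # project memory / progress log
        sections.append(progress_file.read_text())
    sections.append(tool_defs)          # tool definitions
    sections.extend(conversation)       # the conversation so far
    return "\n\n".join(sections)
```

Missing files simply drop out of the composition, which is why a nested AGENTS.md chain degrades gracefully when a directory has no file of its own.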

2. Tool-use loop. When the model emits a tool call, the harness validates the parameters, runs the call in a controlled environment, captures stdout, stderr, and exit code, and feeds the result back into context. Microsoft's Agent Framework documents one common pattern with @tool(approval_mode="always_require"), where every shell command pauses for explicit approval before executing.
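A stripped-down sketch of one loop body, using an illustrative allowlist in place of a real approval flow (the policy and names here are assumptions, not any SDK's API):

```python
import subprocess

ALLOWED = {"echo", "ls"}  # illustrative allowlist, not a real policy

def run_tool_call(command: list[str]) -> dict:
    """Validate, execute, and capture everything the model needs
    to reason about the outcome on the next turn."""
    if not command or command[0] not in ALLOWED:
        # The harness refuses; the refusal itself is fed back as context.
        return {"ok": False, "error": f"command not permitted: {command[:1]}"}
    proc = subprocess.run(command, capture_output=True, text=True, timeout=30)
    return {"ok": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "exit_code": proc.returncode}
```

The returned dict, success or failure, goes back into the next prompt; retries are just the model reading a non-zero exit code and trying again.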

3. Context management. Long-running tasks fill the context window. The harness handles compaction (summarizing older turns), retrieval (pulling in relevant docs only at the right step), and middle-of-context positioning (Liu et al. showed model performance degrades when key content sits in the middle of a long prompt). Anthropic's engineering team also notes a small detail with large consequences: JSON works better than Markdown for state files because models edit JSON less casually.
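One naive compaction strategy as a sketch, where the `summarize` callable stands in for a model-backed summarizer (real harnesses use token counts and smarter retention, not character budgets):

```python
def compact(turns: list[str], budget: int, summarize) -> list[str]:
    """Fold the oldest turns into a summary when the window exceeds
    budget. Keeping the newest turns verbatim also pushes key content
    toward the end of the prompt, away from the degraded middle."""
    while sum(len(t) for t in turns) > budget and len(turns) > 2:
        oldest, turns = turns[:2], turns[2:]
        turns = [summarize(oldest)] + turns
    return turns
```

The `len(turns) > 2` guard is the naive version of a retention floor: compaction must never summarize away the turn the model is currently acting on.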

4. Permission gates. Approvals, scope limits, sandboxing. The harness has the authority to refuse a tool call. The model has the authority to suggest one.

5. Termination logic. When does the loop end? On task completion, on max-turns, on a verification check failing, on a budget exhaustion. The harness decides. The model keeps offering next steps.

Why harness design is where governance lives

Two teams, same model, different harness. The model is the cheap commodity in this equation. The harness is the expensive opinion.

Three pieces of evidence make the case.

Cursor's reasoning-trace experiment. Cursor reports that dropping reasoning traces between turns caused a 30% performance drop on their internal Cursor Bench evaluation of GPT-5-Codex. OpenAI observed only a 3% degradation for GPT-5 on SWE-bench under the same change. Same change at the harness layer, an order-of-magnitude difference in measured impact. The gap speaks for itself.

Anthropic's third-party harness restrictions. In November 2025, Anthropic restricted API access for third-party agent harnesses such as OpenClaw. Pro and Max plan subscribers found that those plans now cover Claude Code (Anthropic's own harness) and require direct API billing for third-party wrappers around Claude. The signal is plain: the harness is where behavior gets shaped, and Anthropic decided that letting unbounded third-party harnesses sit between subscribers and the model was a problem worth solving at the commercial layer.

Anthropic's own example. With their initializer-plus-coding-agent harness, Opus 4.5 completed the same web-app task it had previously left half-built across context windows. The harness consisted of an init.sh, a claude-progress.txt log, a feature list in JSON, and an explicit incremental-progress prompt. Same model. Different harness. Different outcome.

The throughline: the model is one component, the harness is the rest of the system, and the rest of the system determines whether the system is governable.

How to choose a harness

For an autonomous agent harness, work through five checks:

1. Decide deploy-and-run vs build-and-extend. Claude Code, Codex CLI, and AWS AgentCore Harness give you a working harness on day one. DeepAgents and equivalent libraries give you the pieces to build your own on top of a framework.

2. Audit the governance surface. Score the harness on three pillars: security and governance, integration and interoperability, and observability. Each one is a make-or-break factor for production deployment.

3. Check tool-call semantics. MCP support? Custom tools? What is the retry behavior? What does the sandbox actually allow? The answers shape what your agent can and cannot do safely.

4. Look at context management. What is the compaction strategy? Are there memory primitives? Does state persist as JSON or Markdown? Anthropic's own data favors JSON for state files.

5. Evaluate the operator surface. Logs, traces, audit trails. OpenAI's term for this is agent legibility: the application UI, logs, and metrics should be readable by the agent itself, in addition to a human reviewer.

Examples in the wild

A snapshot of the agent harness landscape in May 2026: interactive, autonomous, model-bound, and orchestrator-adjacent shapes.
  • Claude Code / Claude Agent SDK (Anthropic): the canonical coding harness. The initializer-plus-coding-agent pattern is documented in detail in Anthropic's engineering post.
  • Codex CLI (OpenAI): model-bound harness. AGENTS.md is first-class in its discovery rules, and the Harness engineering case study runs on it.
  • Cursor: harness embedded in an IDE. Tunes per model: shell-forward instructions for Codex, lint-reading prompts, reasoning-trace preservation.
  • DeepAgents (LangChain): a harness on top of the LangChain framework. Default prompts, tool handling, filesystem access.
  • AWS AgentCore Harness (Preview): managed cloud harness. Trace-by-default observability via AgentCore Observability.
  • Hermes Agent (Nous Research): self-improving harness with a built-in learning loop. Ships pre-built.
  • OpenClaw: third-party harness around Claude. Lost subscription-plan API access in November 2025; now requires direct API billing.
  • Paperclip: an agent orchestrator. It sits one layer above harnesses, dispatching work to Claude Code instances through tickets and an immutable audit log. The orchestrator layer is distinct from the harness layer: the harness is what runs each Claude Code instance underneath. The forthcoming OpenClaw vs Hermes vs Paperclip teardown unpacks the distinction in detail.
INFO: Harness teardown coming soon

A side-by-side teardown of OpenClaw, Hermes, and Paperclip is on the way. Subscribe to the newsletter to hear when it ships.

You are probably already running a harness

Most operators encounter an agent harness before they encounter the word. If you have used Claude Code, Cursor, Codex CLI, or any one of the autonomous coding agents that shipped in 2025 or 2026, you have run one. The harness is what loaded your CLAUDE.md or AGENTS.md at session start, what asked you to approve a shell command, what compacted your context window when the conversation got long, what decided when the loop should stop. All of that comes from the harness layer.

The point of knowing what a harness is, then, is twofold:

It shows where the leverage is. When two members of your team get different output quality from the same model, the gap is at the harness layer. The fix is also there: better prompts, better tool semantics, better permission gates. The model is the same in both cases.

It puts procurement on the right axis. The buyer for an interactive harness like Cursor wants developer velocity. The buyer for an autonomous agentic harness like Hermes or OpenClaw wants reliable, governable, repeatable workflow execution. The two are different products competing in different markets, and treating them as the same category is the source of most procurement mismatches on either side.

The model is the cheap, swappable, commodity layer. The harness is where the system gets its behavior. That is where the engineering time and the procurement attention go.

AGENTS.md: The Universal Agent Contract Explained

The companion explainer in the AI Agents cluster. Where the harness is the runtime, AGENTS.md is the contract every harness reads. Read this next to see how the two layer together.

The model is commodity. The harness is the opinion.