
20 minute read
Harness Engineering: From AI Agent Demos to Production

Francisco García Sierra

Summary
AI agents are stateless and systematically overconfident. That's why most enterprise AI workflows look amazing in demos and quietly stall in production. The fix is not a smarter model; it's harness engineering: the discipline of designing the environment, state, verification, and recovery paths that surround the model. This post explains what a harness is in plain language, walks through eight practical rails any team can apply, and shows the engineering detail behind a working Zylon example where AI agents grow our frontend end-to-end coverage every single day. It's also why we run our own internal AI workloads on local models, on our own infrastructure: once the harness is right, the model becomes a swappable component rather than a vendor dependency.

Before Jumping Into The Code
Before the code, the ledgers, and the validation pipelines, there's one idea you need to hold in your head — and it applies whether you're a CIO, an operations lead, a compliance officer, or a product manager wondering why your "AI initiative" stalled after the first demo.
An AI agent is not a colleague who learns. It is a colleague with no memory.
Every conversation with an AI agent starts from a blank page. It does not remember what it built yesterday. It does not remember the bug it hit last week. It does not remember the policy your CFO insisted on three months ago. If that knowledge isn't written down somewhere the agent can read at the start of every session, it does not exist.
That single fact is the reason most AI projects look like miracles in week one and like disappointments by quarter two. The first session works because a senior engineer or analyst is in the loop — answering questions, correcting mistakes, holding the context in their head. The tenth session works the same way. The hundredth session is the same person, doing the same correction, paying the same orientation tax, getting the same partial answer. The agent never gets better at your business, because nothing about your business persists between sessions.
The discipline that fixes this has a name: harness engineering. The model is not the harness. The harness is everything around the model — the instructions it reads on startup, the environment it operates in, the state it leaves behind for the next run, the verification it must pass before it can claim something is done, and the boundaries that stop it from doing more than it should.
A useful analogy from the harness engineering literature: a great chef with no kitchen is not going to cook you a great meal. They need recipes (instructions), knives and pans (tools), a stove that actually works (a reliable environment), a prep station that holds yesterday's mise en place (state), and a window where someone tastes the food before it goes out (verification). Take any one of those away and even a Michelin chef ends up serving you a sandwich. Most enterprise AI deployments are exactly this: world-class ingredients, no kitchen.
The rest of this post is what a kitchen looks like.
We're going to walk through it twice — first as principles you can use to evaluate any AI workflow in your organization, then as a concrete engineering example where this discipline is already paying us back at Zylon. If you're non-technical, the principles are enough. If you're an engineer, the example will show you what the principles look like in code.
Why Capable Agents Still Fail
There's a phenomenon every team that has tried to put an agent into production has lived through, and it has a clean technical name. Researchers call it the verification gap: the systematic distance between an agent's confidence that a task is done and whether the task is actually correct. A 2017 calibration paper from Guo et al. showed that modern neural networks are systematically overconfident — they report higher confidence than their actual accuracy warrants. Nine years later, the same is true of the agents built on top of them. Agents are not lying when they say "done." They are mistaken in a structured, predictable way.
This matters operationally because most teams design their AI workflows around the agent's confidence. The agent says it's done; we ship it. The agent says it tested its work; we trust it. The agent says it followed the policy; we move on. Every one of those judgments is a self-grade, and self-grades are systematically generous.
Anthropic's research on long-running agents goes further. When the same agent both generates work and evaluates it, the evaluation is biased — even on tasks where correctness is objectively measurable. The fix is structural, not cognitive: you separate the worker from the checker. The same model can play both roles, but the harness must put them in different sessions, different contexts, different prompts. A student should not grade their own exam.
This is the load-bearing insight that makes harness engineering different from "prompt engineering." Prompt engineering tries to make the model say better things. Harness engineering accepts that the model's self-reports are unreliable and builds the environment to verify them externally.
What A Harness Actually Is
A practical harness has five subsystems. You can think of them as the five functional areas of a kitchen.
Instructions. A short, durable file that tells the agent what this project is, what stack it runs on, what conventions are non-negotiable, and where to look for more detail. In a code repository this is AGENTS.md or CLAUDE.md. In a finance ops workflow it might be a one-page operating doc. The discipline is the same: every session begins by loading this. Around 100 lines is the OpenAI guideline; if it doesn't fit, link out.
Tools. The set of things the agent is actually allowed to do. Not too many — least privilege still applies. Not too few — an agent that cannot run pip install or query a sandboxed database cannot do real work. Most teams err on one extreme or the other. The cure is to write down what tools are needed for this kind of task, and grant exactly that.
Environment. A self-describing, reproducible runtime. Locked dependencies, pinned versions, a known starting state. For non-engineers, the operational equivalent is known data sources, known access scopes, known service availability. The agent should be able to prove it has what it needs before it starts producing output.
State. A durable record of what's been done, what's in progress, what's blocked, and what comes next. This is the file (or set of files) that turns a stateless agent into a system that compounds. Without state, every session is the first session.
Feedback. The verification loop. Explicit commands that tell the agent — and the team — whether the work is actually correct. In code, that's tests, lints, type checks, end-to-end runs. In knowledge work, it's evaluation rubrics, human spot-checks, golden datasets, regression suites. The single highest-ROI investment in any harness, by a wide margin, is making feedback specific, executable, and external to the agent.
A real-world story from the harness engineering literature: a team running GPT-4o on a 20,000-line TypeScript app went from a 20% success rate on agent tasks to near 100% — without changing the model. They added the instruction file. Then they added the verification commands. Then they added a progress file. Four iterations of harness, no model upgrade, five times the success rate. The kitchen got organized.
The Eight Rails
These eight rails are how the five subsystems show up in day-to-day operation. Read them as universal principles first; the technical illustrations come from a Zylon project where AI agents grow our frontend end-to-end test coverage every day, but the rails apply just as cleanly to a contracts review workflow, a regulatory filing pipeline, or a customer-support automation. Anywhere an agent is asked to do real work, repeatedly, against a moving target, these rails apply.
Rail 1 — Startup Is A Real Phase, Not A Formality
The principle: the first thing an agent does in any session is not the work. It is loading the context that lets the work be productive. Most failed AI workflows skip this — the agent is dropped into a task and burns half its useful context budget rediscovering what it already knew yesterday.
Operationally, this means every session begins with a structured initialization: read the instruction file, run the startup script, load the state files. The output of startup is not deliverable work; it's a proven starting position. For a knowledge-work agent, this might be: "I am working on the Q3 close. The last journal entry posted was 4,127. The reconciliation queue has 19 items pending. Three items are blocked on counterparty confirmation." That sentence is the difference between productive work and exploratory work.
In our test automation example, startup loads:
AGENTS.md — the entry point, read first every session.
progress.md — what was done in the current line of work.
session-handoff.md — concrete restart path for the next session.
test_migration_list.md — the migration ledger.
feature_list.json — completed work with evidence.
docs/e2e-test.md — rules derived from past mistakes.
Once those are loaded, progress.md delivers the agent directly to where work stopped:
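In practice that looks roughly like this; the case IDs other than BO-25 are hypothetical, used only to show the shape:

```markdown
<!-- progress.md, illustrative excerpt -->
## Current batch: backoffice user management (5 cases)
- BO-21 Create user: automated, validated 3x, recorded in feature_list.json
- BO-22 Edit user role: automated, validated 3x
- BO-23 Deactivate user: in progress, confirm-dialog locator still failing
- BO-24 Search users: queued
- BO-25 Duplicate email rejected: blocked, see session-handoff.md

Next action: replace the text-based confirm-dialog locator in BO-23 with getByTestId.
```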
That is a fundamentally different starting point than "read the repo and figure it out." The first action the agent takes is productive, not exploratory.
Rail 2 — The Environment Must Be Proven, Not Assumed
The principle: agents will happily build castles on sand. They will write logic that references a system that's offline, query a database that's not provisioned, or generate a report from data that hasn't been refreshed. The work looks correct because the model has no way to know the foundation is broken.
The fix is to make environment verification a mandatory phase before any production work. Two questions must be answered with evidence before the agent moves on: can the system start, and is it reachable where we expect it? In a regulated workflow, the questions might be do I have access to the right data sources, and are they current? In either case, the answers are recorded as facts in the state files, not re-asked every session.
In code, this is concrete:
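The sketch below shows the shape; the base URL, port, and build commands are assumptions for illustration, not the real startup script.

```typescript
// startup-check.ts (illustrative sketch: the real script, port, and commands differ)
import { execSync } from "node:child_process";
import { appendFileSync } from "node:fs";

const BASE_URL = process.env.E2E_BASE_URL ?? "http://localhost:3000";

// Fact 1: can the system start? Pinned dependencies install and the build compiles.
execSync("npm ci && npm run build", { stdio: "inherit" });

// Fact 2: is it reachable where the tests expect it?
const res = await fetch(BASE_URL);
if (!res.ok) {
  throw new Error(`app not reachable at ${BASE_URL}: HTTP ${res.status}`);
}

// Record both facts in the state file instead of re-asking next session.
appendFileSync(
  "progress.md",
  `\n- env check: build OK, ${BASE_URL} answered ${res.status} (${new Date().toISOString()})\n`
);
```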
Two checks. Two recorded facts. No implementation begins until those facts are green. When a check fails — a missing package, a non-responding port — the agent raises it immediately rather than working around it and discovering the problem three layers later.
This rail is where most enterprise AI pilots quietly die. The demo was built against a clean environment with someone in the loop fixing things. The pilot is run against a messy environment with no one in the loop. Without proven environment as an explicit phase, the agent cannot tell the difference, and neither can the team.
Rail 3 — A Source Of Truth, A Migration Ledger, And Tracked State
The principle: the agent does not invent the work. It pulls the work from an authoritative source, checks what has already been done against a tracked ledger, and produces evidence as it goes.
For a software team, the source of truth might be a test management system, a backlog, an issue tracker, a feature spec. For a back-office team, it might be a list of regulations to comply with, a queue of tickets, a portfolio of contracts. The tool is incidental. What matters is that the agent is not free to imagine what the work is, because imagination is where confidence calibration bias does the most damage.
A migration ledger then records, for every item in the source of truth, the current status: queued, in progress, blocked, automated, deferred. This is the second-most important file in any agent workflow, after the instruction file. Without it, the agent will repeat work, skip work, or produce output that drifts from the underlying intent.
For our E2E project, Qase is the source of truth for manual test intent and test_migration_list.md is the ledger:
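An excerpt of its shape; the IDs and titles here are hypothetical, the status vocabulary is the real one:

```markdown
<!-- test_migration_list.md, illustrative excerpt -->
| Qase ID | Title               | Status      | Evidence                           |
|---------|---------------------|-------------|------------------------------------|
| BO-18   | Login with SSO      | automated   | e2e/auth/sso-login.spec.ts         |
| BO-21   | Create user         | automated   | e2e/users/create-user.spec.ts      |
| BO-23   | Deactivate user     | in progress | branch noted in session-handoff.md |
| BO-26   | Reset user password | queued      |                                    |
| BO-30   | Bulk import users   | deferred    | depends on fixture data, see notes |
```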
Same shape works for a contract review workflow, a KYC pipeline, a compliance evidence log. The artifact is generic; the discipline is what's load-bearing.
Rail 4 — Mistakes Become Rules, Not Just Fixes
The principle: a one-time correction is invisible to a stateless agent. The same mistake will recur in the next session, and the next, and the next. The only way to make a fix durable is to convert it into a written rule that the agent loads at startup.
This is one of the highest-leverage rails in harness engineering, and one of the most underused. Every team that works with agents has a moment where they fix the same problem twice. The fix-twice signal is information: there's a missing rule. Write it down, put it in the rules file, move on.
We hit this with fragile UI locators — selectors that broke silently when the interface translated a string differently. The patch was easy. The rule made it durable:
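The shape of the fix, with hypothetical selectors:

```typescript
// Before (illustrative): breaks silently the moment the UI string is translated or reworded.
await page.click("text=Create user");

// After: anchored to a stable, language-independent attribute.
await page.getByTestId("create-user-button").click();
```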
And the rule went into docs/e2e-test.md:
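Roughly in this form (illustrative wording):

```markdown
<!-- docs/e2e-test.md, illustrative excerpt -->
## Locators
- Never locate elements by visible text; strings change with i18n and copy edits.
- Prefer data-testid (page.getByTestId) or a role with a stable accessible name.
- If a component has no stable attribute, adding one is part of the test task.
```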
In a regulated workflow, the equivalent might be: the agent attempted to send PII to an external service, we blocked it, we wrote a rule about which data classifications can leave the network. Once the rule is in the harness, the lesson is permanent. This is how a reactive review comment becomes proactive policy.
Rail 5 — Scope Keeps The Agent Honest
The principle: the larger the task, the more the agent loses fidelity. Context fills, earlier decisions get summarized away, the agent's behavior drifts toward whatever is most recent and prominent. This is not a model flaw; it's a property of how long context windows behave in practice. The antidote is small, bounded scope.
Operationally: do not ask the agent to "automate all the tests" or "review all the contracts." Ask it to do five tests, or three contracts, or one quarter of one workflow. Each unit complete, validated, recorded. The next batch is a new session with the previous evidence already on disk.
Our example does not try to cover 500 cases. It targets one batch:
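The scope is written down, not implied; the wording below is illustrative and the IDs are hypothetical:

```markdown
<!-- session scope, illustrative -->
Automate exactly 5 cases from test_migration_list.md this session, in ledger order:
BO-21, BO-22, BO-23, BO-24, BO-25.
Do not start a sixth. When the fifth case is validated and recorded,
update the handoff files and stop.
```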
Each case is a single, scoped, individually validated test:
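For example, a single case looks like this (spec name, route, and test IDs are assumptions):

```typescript
// e2e/users/bo-22-edit-user-role.spec.ts (illustrative)
import { test, expect } from "@playwright/test";

test("BO-22: admin can change a user's role", async ({ page }) => {
  await page.goto("/backoffice/users");

  // One Qase case, one spec file, one validation target.
  const row = page.getByTestId("user-row-jane");
  await row.getByTestId("edit-user").click();
  await page.getByTestId("role-select").selectOption("admin");
  await page.getByTestId("save-user").click();

  await expect(row).toContainText("admin");
});
```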
Small scope makes the agent more accurate. Recorded state makes the next session able to continue. The combination is what lets coverage grow incrementally instead of stalling on ambition.
Rail 6 — Blockers Are First-Class Outcomes
The principle: not every task should pass, and a deterministic system has to be able to express that. If the only options are "done" and "failed," agents will edit until something looks done — even when the right answer is "this cannot be done because the underlying system is wrong."
This is where the verification gap becomes most expensive. An agent that cannot say "blocked" with evidence will instead say "done" without it. False green is more expensive than honest red. False green moves on; honest red surfaces the underlying issue and routes it.
A clean lifecycle is the antidote:
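The ledger statuses form that lifecycle:

```
queued → in progress → automated  (validated, evidence recorded)
                     → blocked    (cannot pass; reason and evidence recorded)
                     → deferred   (intentionally postponed; reason recorded)
```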
In our run, BO-25 — verify that duplicate email is rejected — turned up a backend that didn't actually reject duplicates. The test could not pass, because the product was wrong. The agent's right answer was to flag it:
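In practice that means the spec stays in the repo, explicitly marked as blocked rather than forced green; the selectors below are hypothetical, the blocker is the real one:

```typescript
// e2e/users/bo-25-duplicate-email.spec.ts (illustrative)
import { test, expect } from "@playwright/test";

// BLOCKED: the backend accepts a duplicate email instead of rejecting it.
// Product bug, not a test bug. Evidence and status live in session-handoff.md
// and test_migration_list.md; do not "fix" the assertion to make this pass.
test.fixme("BO-25: creating a user with a duplicate email is rejected", async ({ page }) => {
  await page.goto("/backoffice/users/new");
  await page.getByTestId("email-input").fill("existing.user@example.com");
  await page.getByTestId("create-user").click();
  await expect(page.getByTestId("duplicate-email-error")).toBeVisible();
});
```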
And the ledger records it:
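With a status and evidence the next session can act on (illustrative wording):

```markdown
| BO-25 | Duplicate email is rejected | blocked | Backend accepts duplicates; flagged as product bug, spec parked as test.fixme |
```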
Blocked with evidence is product information. A backlog of well-documented blockers is a quality signal, not a hole. The same is true in any agentic workflow — an "I cannot do this and here's why" output is far more valuable than a fabricated success.
Rail 7 — Validation Must Be Specified, Not Improvised
The principle: if the agent picks how to validate its own work, it will pick the validation that makes the work look done. The harness must specify which checks run, in which order, and what counts as passing. The agent executes; it does not choose.
This is the formalization of the worker/checker separation at the workflow level. The work is the agent's. The verification rubric is the harness's. They are not the same artifact, and they are not authored at the same time.
Concretely, validation in our example is prescribed and non-negotiable:
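A sketch of what that prescription looks like; the commands are standard Playwright and TypeScript tooling, the spec path is a hypothetical example:

```markdown
<!-- AGENTS.md, validation section (illustrative; spec path is hypothetical) -->
## Validation: run all of these, in order, before a case may be marked automated
1. `npx tsc --noEmit`   (the spec compiles)
2. `npm run lint`   (project rules, including the locator rules)
3. `npx playwright test e2e/users/bo-22-edit-user-role.spec.ts --repeat-each=3`
   (the new spec passes three consecutive runs, not one lucky run)
4. `npx playwright test`   (the full suite stays green: no regressions)
Record each command and its output in progress.md.
```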
The commands are prescribed. The outputs are recorded. The next session inherits both. In a knowledge-work setting, this becomes a rubric and an evaluator — sometimes the evaluator is a separate agent run with a different prompt and a "be picky" instruction, sometimes it's a human reviewer, sometimes it's a deterministic check against known-good data. The shape doesn't matter. What matters is that the validation is external to the agent that did the work.
Rail 8 — Clean Exit Is What Makes Automation Compound
The principle: every session must leave the workspace in a state where the next session can resume immediately, with full context, without asking. This is not housekeeping. This is the entire mechanism by which daily automation accumulates instead of decaying.
Entropy is the default. Without active cleanup discipline, every session adds stale artifacts, breaks implicit assumptions, and leaves unstated context that the next session has to re-infer. Within a few iterations, the rediscovery cost overwhelms the productive work, and the workflow looks like it's getting slower over time even though nothing in the model has changed.
Clean exit is part of the definition of done — not separate from it. A session is not complete when the work is complete. It is complete when the work is complete and the handoff files are current.
For us, that means:
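An illustrative version of that exit checklist (the exact wording lives in the repo, not here):

```markdown
<!-- clean-exit checklist, illustrative -->
- progress.md states what was finished, what is half-done, and why.
- test_migration_list.md statuses updated for every case touched this session.
- feature_list.json has an entry (spec path plus validation output) for each newly automated case.
- No stray branches, commented-out tests, or temporary fixtures left behind.
- session-handoff.md names the single next action and any open blockers.
```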
And the explicit restart path for the next agent:
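Something like this, with the case IDs as placeholders:

```markdown
<!-- session-handoff.md, illustrative excerpt -->
## Restart path
1. Read AGENTS.md, then this file, then progress.md.
2. Run the startup checks; both environment facts must be green before touching specs.
3. Resume at BO-26, the next queued item in test_migration_list.md.
4. BO-25 stays blocked on the duplicate-email backend bug; do not retry until the fix ships.
```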
The next session does not start from blank chat. It starts from repo-owned state, known validation, known blockers, and known file boundaries.
The Full Flow

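Condensed, the eight rails chain into one repeating loop:

```
startup: load AGENTS.md + state files
  → prove the environment (two recorded facts)
  → pull the next small batch from the ledger
  → for each case: implement → run the prescribed validation → record evidence or a blocker
  → clean exit: update progress.md, the ledger, and session-handoff.md
  → next session resumes from repo-owned state
```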
Why This Matters For Private AI
There's a quiet assumption in most enterprise AI conversations that goes something like this: serious work requires the latest frontier model, served from someone else's cloud, behind someone else's API. Anything less is a toy.
Our experience says the opposite. The hardest part of getting AI to do useful work in production is not the model. It's everything around the model. And once you accept that, the entire cost-and-sovereignty equation changes.
We build Zylon on Zylon. The platform that our customers run on their own servers — to keep their data inside their walls, to satisfy regulators, to stay independent of any single model vendor — is the same platform we use internally to build the product itself. The agents that grow our test coverage every day, draft our internal tooling, and help us ship features are running against local models on our own infrastructure. Not the newest model on the market. Not the most expensive endpoint. Local models, on hardware we control.
This works because the harness does most of the heavy lifting. A well-instructed agent with a small, fast, locally-hosted model that loads the right state, proves its environment, operates in scoped tasks, and validates its work against external commands will outperform a frontier model dropped into a bare repository with a vague prompt. Every time. The model is a component. The harness is the system.
That has two consequences worth sitting with.
The first is economic. Frontier-model APIs are priced as if model capability is the scarce resource. For a lot of enterprise work, it isn't — context, state, and verification are. A team that invests in its harness can run most of its agentic work on smaller, cheaper, local models and reserve the frontier endpoints for the narrow cases that genuinely need them. The cost curve of AI in production starts to look very different when the harness is doing the work that prompts are doing in less mature setups.
The second is strategic. A workflow that depends on the newest model from a single vendor inherits that vendor's roadmap, pricing, deprecation schedule, and geopolitics. A workflow that runs against local models on infrastructure you control inherits none of that. The harness is portable. The state is yours. The instructions are yours. When a better local model comes out next quarter, you swap it in. When a vendor changes terms, you don't notice. This is what real model independence looks like, and it's why the same discipline that makes our internal automation compound is the discipline that makes a private AI platform viable for a regulated bank, hospital, or agency in the first place.
We didn't set out to prove this point. We just kept finding that the harness mattered more than the model — and that local models, well-harnessed, were enough for almost everything we needed to do internally. The test coverage growing every day is one piece of evidence. The product itself, built on the same infrastructure we sell, is the bigger one.
If your AI strategy is currently a bet on which frontier model to use, it might be worth asking a different question: what would your work look like if the harness were the thing you got right, and the model were the part you could swap?
Sources
OpenAI, Harness engineering: leveraging Codex in an agent-first world — https://openai.com/index/harness-engineering/
Anthropic, Effective harnesses for long-running agents — https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
Anthropic, Harness design for long-running application development — https://www.anthropic.com/engineering/harness-design-long-running-apps
Walking Labs, Learn Harness Engineering (lecture series) — https://walkinglabs.github.io/learn-harness-engineering/en/
Guo et al., On Calibration of Modern Neural Networks, ICML 2017 — https://arxiv.org/abs/1706.04599
Awesome Harness Engineering — https://github.com/walkinglabs/awesome-harness-engineering
Author: Francisco Garcia Sierra, FullStack Developer at Zylon
Published: May 2026
Francisco is a FullStack Developer at Zylon working across product, infrastructure, and AI-powered developer workflows. He has built enterprise products end to end, from backend systems and APIs to frontend experiences and production integrations, with a strong focus on building reliable systems that scale in real-world environments. His background also includes blockchain and cryptography, which shapes the way he approaches security, system design, and trust in software. At Zylon, he works on turning advanced AI capabilities into practical tools that improve developer experience, reduce operational friction, and help teams ship faster with more control.


