I have been building digital products for 20 years. Now my harness builds for me. Here's how I do it.
Over those 20 years of building for clients, some projects were small and fast: two-month builds for $30k. Others were multi-million dollar engagements with engineers, designers, product managers, QA, delivery leads, security reviewers, and stakeholders who needed the software to work in production.
After enough of those projects, I have learnt a simple lesson:
Quality software rarely comes from one person delivering brilliant instructions. It comes from many roles contributing to the system around the work.
That is why I think most people are approaching AI agents from the wrong direction.
The teams I see struggling to master an AI product workflow are usually trying to write the perfect prompt. We've all done it.
You add more examples. You add more constraints. You add "act like a senior engineer." You add "don't over-engineer." You add "follow existing patterns." You add "write production-ready code."
The prompt gets longer, but the result doesn't get better.
You're asking one paragraph to do the job of an engineering team.
Real teams do not work that way.
They have requirements, design rules, review processes, testing standards, release gates, handoff notes, and shared memory.
An AI agent needs the same kind of operating system around it.
This article covers the system I have built and use daily to make AI agent work more scoped, testable, reviewable, and repeatable.
Read on to learn how to build your own, or just bookmark it and point your agent at it to do it for you. You do you.
What the heck is a harness and why should you care?
An AI agent harness is the system around an AI agent that makes its work scoped, testable, reviewable, and repeatable.
A prompt tells the agent what to do.
A harness gives the agent the conditions to do the work properly: context before it starts, boundaries while it works, proof before it finishes, and memory after it leaves.
If you are using Claude, Cursor, Codex, or another AI coding agent to touch real software, the prompt is only the start.
The reliable output comes from the system around the model: the files it reads, the rules it follows, the checks it must run, the evidence it must capture, and the project memory it leaves behind.
It is the difference between an impressive demo and software you can trust in production.
What an AI agent harness actually does
A good AI agent harness does four jobs.
First, it gives the agent context before it starts, so it understands the project instead of relying only on the current chat.
Second, it gives the agent boundaries while it works, so a small change does not quietly become a refactor.
Third, it defines proof before the agent finishes, so "done" means more than "the agent says it is done."
Fourth, it leaves memory behind, so the next session does not start from zero.
The model still matters. The prompt still matters. But for serious work, the system around the agent matters more.
Prompt-first vs harness-first
Think about it like this:
| Prompt-first workflow | Harness-first workflow |
| --- | --- |
| Relies on one long instruction | Uses persistent project rules |
| Agent guesses the real scope | Scope is written down before work starts |
| Tests are often added after implementation | Acceptance scenarios exist before implementation |
| "Done" means the agent says it is done | "Done" means evidence exists |
| Context lives in chat | Context lives in project state |
| Each session starts cold | Each session compounds |
| The agent optimises for output | The system optimises for reliable delivery |
A prompt can start the work.
A harness makes the work reliable.
The operating loop
At a high level, my harness moves every meaningful change through the same loop.
    Request
       |
       v
    Requirements
       |
       v
    Hardening panel
       |
       v
    Acceptance scenarios
       |
       v
    Work record
       |
       v
    Implementation
       |
       v
    Verification
       |
       v
    Runtime validation
       |
       v
    Evidence bundle
       |
       v
    Project memory
       |
       +-- back into the next request
This is the part most prompts try to fake.
A prompt might say "build this properly." The workflow defines what properly means.
The agent is not allowed to jump straight from request to code. First, the work has to become clear enough to build. Then it has to become testable. Then it has to be scoped. Then it can be implemented. Then it has to prove itself.
The hardening phase is the most important part of that loop.
A draft requirement is passed through a panel of specialist reviewers. Each one looks at it through a different professional lens: product, architecture, security, QA, frontend, backend, design, and copy.
They are not all trying to be agreeable. That is the point.
The product reviewer cares about clarity and measurable outcomes. The architect cares about whether the shape fits the existing system. The security reviewer is intentionally paranoid. The QA reviewer looks for ambiguity. The designer looks for interaction problems. The copywriter looks for language that sounds generic, misleading, or off-brand.
That tension is what you want. It is the same tension you get in a real delivery team when the PM, architect, designer, QA lead, and security reviewer all look at the same feature and see different risks.
The hardening panel turns that into a repeatable step.
The output is not a brainstorm. It is a hard gate.
A minimum hardening result should answer four questions:
Required fixes: what must change before implementation?
Nice-to-haves: what would improve the work but should not block it?
Risks accepted: what are we knowingly carrying forward?
Readiness verdict: pass, proceed with concerns, or stop.
Required fixes block the work. Nice-to-haves are recorded but do not block. Once the required fixes are resolved, the requirement becomes hardened and the agent can move on.
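As a rough illustration, the gate described above can be sketched in a few lines of Python. The class and function names here (`HardeningResult`, `Verdict`, `merge`) are hypothetical, and deriving the verdict mechanically from the three lists is an assumption made for the sketch; a real panel may set the verdict by judgment.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    CONCERNS = "proceed with concerns"
    STOP = "stop"


@dataclass
class HardeningResult:
    required_fixes: list[str] = field(default_factory=list)  # block implementation
    nice_to_haves: list[str] = field(default_factory=list)   # recorded, never block
    risks_accepted: list[str] = field(default_factory=list)  # knowingly carried forward

    def verdict(self) -> Verdict:
        # Assumption: any unresolved required fix stops the work.
        if self.required_fixes:
            return Verdict.STOP
        # No blockers, but accepted risks have to travel with the work.
        if self.risks_accepted:
            return Verdict.CONCERNS
        return Verdict.PASS


def merge(results: list[HardeningResult]) -> HardeningResult:
    """Combine the findings of several reviewers into one panel result."""
    merged = HardeningResult()
    for result in results:
        merged.required_fixes += result.required_fixes
        merged.nice_to_haves += result.nice_to_haves
        merged.risks_accepted += result.risks_accepted
    return merged
```

Note that nice-to-haves are stored but never consulted by `verdict()`, which mirrors the rule that they are recorded without blocking.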
That is where a lot of cost gets saved.
The expensive failure is not bad code. It is good code for a weak requirement.
The five parts of the harness
The harness I have built has five core parts.
    Instructions -> what the agent reads
    State        -> what the agent remembers
    Scope        -> what the agent is allowed to touch
    Verification -> what proves the work
    Lifecycle    -> how the work compounds
You can name these differently. The names do not matter much. The shape does.
1. Instructions
Instructions are the agent's onboarding.
Most people try to solve this with one giant system prompt. I do not think that works well for serious projects. A huge prompt becomes an employee handbook the agent is expected to memorise under pressure.
A better pattern is a small top-level file that acts as a map.
It tells the agent what kind of project this is, what rules are non-negotiable, and where deeper guidance lives. Planning has its own doc. Verification has its own doc. Runtime validation has its own doc. Design and content rules live separately.
The top-level instruction file should not contain everything. It should tell the agent where everything is.
A simple first version might say:
    Read this first.
    Understand the task before editing.
    Find existing patterns before creating new ones.
    Create or update a work record for meaningful changes.
    Do not expand scope silently.
    Run checks.
    Validate runtime behavior when it matters.
    Capture evidence before calling the work done.
To build this yourself, start with one short file in the repo root. Keep it readable. Add the rules that should apply every session. Then split deeper process into separate documents as the system grows.
The goal is simple: when the agent starts, it knows how to orient itself before touching the work.
2. State
Chat history is not state.
State is what survives when the session closes.
In a real team, this is obvious. Requirements live somewhere. Decisions live somewhere. Known risks live somewhere. Test evidence lives somewhere. Nobody serious runs a production project from memory and Slack scrolling alone.
Agents make this even more important because the next session may be a fresh context.
A useful state layer includes requirements, acceptance criteria, implementation records, known debt, runtime evidence, and handoff notes.
The key rule is this: if the agent is changing something meaningful, there is a live work record.
That work record says what the task is, what is in scope, what is out of scope, what checks should run, and what evidence will prove the work is complete.
This prevents the agent from silently deciding what the task means.
It also means the next session can pick up the work without you narrating the entire history again.
3. Scope
Scope is where agents create a lot of hidden cost.
If the task is vague, the agent will often choose the broadest useful interpretation. A small bug fix becomes a refactor. A copy change becomes a component redesign. A UI adjustment becomes a design system change.
The model is trying to help.
The harness needs to give it walls.
Every work record should include a readiness verdict before implementation starts.
    Work request
       |
       v
    Work record
       |
       v
    Readiness verdict
       |
       +-- Pass     -> proceed
       +-- Concerns -> proceed with named risks
       +-- Fail     -> stop and clarify
Pass means the task is clear, scoped, and verifiable.
Concerns means the task can proceed, but the risks need to travel with the work.
Fail means stop before code.
This is not bureaucracy. It is cheaper than building the wrong thing well.
Scope control should also define what the agent is allowed to edit. If the task is a checkout bug, the agent should not casually rewrite the auth model. If the task is a visual polish pass, it should not invent new design tokens unless that is explicitly part of the work.
The harness should make scope expansion visible.
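One way to make scope expansion visible is to compare the files a change touched against the paths the work record allows. The sketch below is only an illustration: the function name `out_of_scope`, the glob patterns, and the file paths are all hypothetical, and it assumes the allowed surface can be expressed as simple glob patterns.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_globs: list[str]) -> list[str]:
    """Return the changed files that match none of the allowed patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_globs)
    ]


# Hypothetical checkout bug fix: only checkout code and its tests are in scope.
violations = out_of_scope(
    ["src/checkout/cart.py", "src/auth/session.py"],
    ["src/checkout/*", "tests/checkout/*"],
)
print(violations)  # -> ['src/auth/session.py']
```

In practice the list of changed files could come from version control, and a non-empty result would prompt a stop-and-clarify rather than a silent continue.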
4. Verification
Verification is the part that turns agent output into software you can trust.
The agent saying "done" is not enough. The question is what proved it.
A useful verification stack has several layers, but the important thing is that verification starts before code. Before the agent implements anything, you want a behavioral contract that describes what the system must do.
This is where Acceptance Test-Driven Development (ATDD) fits nicely.
ATDD forces the expected behavior to be written before implementation, in product language. A good acceptance test should not say "call this endpoint" or "write to this table." It should say what happens from the user's point of view.
That distinction matters.
Implementation details pull the agent toward a technical guess. Acceptance tests pull it toward observable behavior.
This gives you a second stream of truth. The agent cannot just write code and then write tests that bless its own assumptions. The acceptance scenarios come from the requirement, not from the implementation.
Anything a script can enforce should be enforced by a script. Do not ask the agent to remember things that the repo can check automatically.
The next layer is review.
Different reviewers catch different failures. Architecture checks system fit. Security checks exposure. QA checks whether the behavior is testable. Design checks usability. Copy checks whether the language sounds like the product.
The deepest layer is runtime validation. I have found this is what prompt-only approaches miss the most.
Tests prove what the implementer thought to test. Runtime validation proves what actually happened when the software ran.
For a web app, the agent should open the page, interact with the flow, check the console, and capture evidence. For a service, it should send a real request and inspect the response. For a CLI, it should run the command and capture output.
"It compiles" is not the same as "it works."
Production lives at runtime.
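For the CLI case, a minimal sketch of "run the command and capture output" might look like the following. The function name `capture_run` and the shape of the returned dictionary are assumptions for illustration, and the "CLI" being exercised is just the current Python interpreter standing in for a real tool.

```python
import subprocess
import sys
from datetime import datetime, timezone


def capture_run(command: list[str], timeout: float = 60.0) -> dict:
    """Run a command and return what actually happened as an evidence entry."""
    started = datetime.now(timezone.utc).isoformat()
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return {
        "command": command,
        "started_utc": started,
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# Stand-in command: a one-line Python program instead of a real CLI.
evidence = capture_run([sys.executable, "-c", "print('reset email queued')"])
```

The point of returning the raw exit code and output is that the evidence record then holds what the program did, not what the agent believes it did.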
5. Lifecycle
Lifecycle is how the work compounds.
It covers how a project starts, how each session starts, how discovery happens, how work closes, and how the next agent continues from the state left behind.
A good lifecycle starts with bootstrap.
The project records its stack, commands, repo structure, validation method, design rules, content rules, and known risks. Until that exists, the agent is guessing.
Then every session follows a repeatable rhythm.
Understand the request. Read current state. Confirm scope. Execute. Verify. Review. Capture evidence. Update state.
The other lifecycle piece is context hygiene.
Large searches, noisy logs, and broad discovery should not pollute the main working context. Use a scout pass for that.
The scout reads widely and returns a compact report: files to inspect, likely edit surfaces, risks, and recommended next step. The main agent then works from the distilled findings instead of carrying every dead end.
Clean context improves output.
A lot.
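A scout report can be as simple as a small structured record that renders to a few lines of text. This is a sketch under assumptions: the `ScoutReport` class, its field names, and the cap of five items per list are illustrative choices, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class ScoutReport:
    files_to_inspect: list[str] = field(default_factory=list)
    likely_edit_surfaces: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    recommended_next_step: str = ""

    def render(self, max_items: int = 5) -> str:
        """Render a compact summary; long lists are capped to keep context small."""
        sections = [
            ("Files to inspect", self.files_to_inspect),
            ("Likely edit surfaces", self.likely_edit_surfaces),
            ("Risks", self.risks),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines += [f"- {item}" for item in items[:max_items]]
            if len(items) > max_items:
                lines.append(f"- (+{len(items) - max_items} more)")
        lines.append(f"Next step: {self.recommended_next_step}")
        return "\n".join(lines)
```

The main agent would receive only the rendered text, so the dead ends the scout explored never enter its working context.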
Build the minimum viable AI agent harness
You do not need to build the full version on day one.
Start with a small harness that creates the right behavior.
The public version of the contract can be simple. Your private advantage can live in the exact reviewers, scripts, checks, and project-specific rules. The first public contract only needs to describe the shape clearly enough that a fresh agent can build it without guessing.
Define meaningful work
Define "meaningful work" first.
Meaningful work is any code, user-facing behavior, architecture, security, design, content, or process change that could affect future work.
Tiny typo fixes probably do not need the full loop.
The task slug can come from a ticket, a short feature name, or a dated change name. The naming convention matters less than consistency.
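If you want slugs to be consistent without thinking about it, a tiny helper is enough. The function `task_slug` below is a hypothetical sketch; the lowercase, hyphen-separated style and the optional date prefix are assumptions, chosen only because they are easy to keep consistent.

```python
import re
from datetime import date
from typing import Optional


def task_slug(name: str, on: Optional[date] = None) -> str:
    """Turn a ticket title or feature name into a lowercase, hyphenated slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    # Dated change names get an ISO date prefix so they sort chronologically.
    return f"{on.isoformat()}-{slug}" if on else slug


print(task_slug("Password reset: confirmation copy"))
# -> password-reset-confirmation-copy
```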
Create AGENTS.md
Start with a single AGENTS.md at the root of the repo.
It can be simple:
    # AGENTS.md

    This file is the operating guide for AI agents working in this repository.
    Read this before making changes.

    ## Rules

    - Understand the task before editing.
    - Find existing patterns before creating new ones.
    - Create or update a work record for meaningful changes.
    - Do not expand scope silently.
    - Write acceptance scenarios before implementation when behavior changes.
    - Run the relevant checks.
    - Validate runtime behavior when it matters.
    - Capture evidence before calling the work done.

    ## Definition Of Done

    Work is not done until:

    - the requested change is implemented
    - relevant checks have run
    - runtime behavior has been validated or explicitly marked not applicable
    - skipped checks are named with reasons
    - evidence is recorded
    - remaining risks are documented

    ## Handoff

    Leave enough written state for a fresh agent to continue without reading the chat.
    Chat history is not project memory.
This file will evolve, but it is enough to change the agent's behavior on day one.
Create the file contracts
The instruction file tells the agent how to behave.
The work record tells it what it is allowed to change.
The acceptance record tells it what behavior must be true before implementation starts.
The evidence record tells it what proof must exist before the task can be called complete.
The decision log stops the same architectural choices being rediscovered every session.
Work record contract
The work record should include the goal, scope, out of scope, readiness verdict, risks, planned checks, and expected evidence.
Create this as docs/work-records/_template.md:
    ---
    task: <task-slug>
    status: draft
    owner_agent: <agent-or-human>
    created: <YYYY-MM-DD>
    last_updated: <YYYY-MM-DD>
    ---

    # <Task Title>

    ## Goal

    <What outcome should exist when this work is complete?>

    ## In Scope

    - <What the agent is allowed to change>

    ## Out Of Scope

    - <What the agent must not touch, even if it looks related>

    ## Readiness Verdict

    **Verdict:** `<PASS | CONCERNS | FAIL>`

    **Findings:**
    - <finding>

    **Blocking Items:**
    - <item or none>

    **Concerns To Carry Forward:**
    - <risk or none>

    ## Planned Checks

    - [ ] <test, lint, build, review, or runtime check>

    ## Expected Evidence

    - <command output
Acceptance scenario contract
The acceptance record should include scenarios, observable behavior, edge cases, and non-goals.
Create this as docs/acceptance/_template.md:
    # Acceptance Scenarios

    Feature: <feature or task name>
    Primary user or system: <who observes the behavior>

    ## Scenarios

    ### AC-1 - <one-line behavior summary>

    GIVEN <precondition stated in user-domain language>.
    WHEN <action the user or system takes>.
    THEN <observable outcome>.

    ### AC-2 - <one-line behavior summary>

    GIVEN <precondition stated in user-domain language>.
    WHEN <action the user or system takes>.
    THEN <observable outcome>.

    ## Edge Cases

    - <boundary condition, unhappy path, or failure mode>

    ## Non-Goals

    - <behavior this change should not solve>
Keep this file in product language. If the acceptance scenario names a component, table, endpoint, implementation file, or framework detail, check whether it has drifted out of behavior and into design-by-accident.
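That drift check can be partly automated. The sketch below flags GIVEN/WHEN/THEN lines that mention implementation vocabulary. The function name `implementation_leaks` and the term list are hypothetical starting points, and a hit is a prompt for a human or reviewer agent to look, not proof of a problem, since a word like "table" can be legitimate product language.

```python
import re

# Hypothetical starter list; tune it per project.
IMPLEMENTATION_TERMS = ["endpoint", "table", "component", "database", "api", "sql"]

_TERM_PATTERN = re.compile(
    r"\b(" + "|".join(IMPLEMENTATION_TERMS) + r")s?\b", re.IGNORECASE
)
_STEP_PATTERN = re.compile(r"\s*(GIVEN|WHEN|THEN)\b", re.IGNORECASE)


def implementation_leaks(scenario_text: str) -> list[tuple[int, str]]:
    """Return (line_number, term) pairs for scenario steps that name implementation details."""
    hits = []
    for number, line in enumerate(scenario_text.splitlines(), start=1):
        if _STEP_PATTERN.match(line):
            for match in _TERM_PATTERN.finditer(line):
                hits.append((number, match.group(1).lower()))
    return hits
```

Only step lines are scanned, so headings and notes elsewhere in the file can still mention technical terms without triggering a hit.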
Evidence record contract
The evidence record should include commands run, results, skipped checks with reasons, runtime proof, changes made, and remaining risks.
The decision record should capture choices that future agents should not rediscover.
Not every implementation detail needs a decision record. Use it when a choice affects architecture, security, product behavior, migration, compatibility, or repeated future work.
Every task should move through the same simple lifecycle.
    Draft -> Ready -> In Progress -> Verified -> Closed
Draft means the request exists but is not ready.
Ready means the work is clear, scoped, and testable.
In Progress means the agent is implementing inside the work record.
Verified means checks and runtime validation are complete.
Closed means evidence has been captured and project memory has been updated.
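A lifecycle like this is easy to enforce with a tiny state check. The sketch below assumes the strictest reading, in which a task may only move one step forward at a time; the function name `can_transition` is hypothetical, and a real harness might also allow moving back (for example from Verified to In Progress when a check fails).

```python
STATES = ["Draft", "Ready", "In Progress", "Verified", "Closed"]


def can_transition(current: str, target: str) -> bool:
    """Allow only a move to the next state in the lifecycle, never a skip."""
    # Unknown state names raise ValueError, which surfaces typos early.
    return STATES.index(target) == STATES.index(current) + 1


print(can_transition("Draft", "Ready"))        # -> True
print(can_transition("Draft", "In Progress"))  # -> False
```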
Add the first guardrails
The first scripts can be plain.
Your check-work-record should fail if meaningful work has no active work record or no readiness verdict.
Active means the work record matches the current task and is not closed.
It can start with rules like this:
    fail if meaningful work has no work record
    fail if status is Ready or later but there is no readiness verdict
    fail if behavior work has no acceptance record
    warn if risks are listed but no planned checks address them
    fail if scope is empty or out of scope is empty
Your check-evidence should fail if completed work has no evidence record, no checks listed, or skipped checks without reasons.
It can start with rules like this:
    fail if status is Verified or Closed but no evidence record exists
    fail if no checks are listed
    fail if skipped checks have no reasons
    fail if runtime validation is required but missing
    warn if remaining risks are blank
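The evidence gate can be sketched the same way. Because the evidence record's file format is left open here, this example works on an in-memory structure instead of parsing a file; `Check`, `EvidenceRecord`, and `check_evidence` are hypothetical names, and whether runtime validation is required is passed in as a flag.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Check:
    name: str
    skipped: bool = False
    skip_reason: str = ""


@dataclass
class EvidenceRecord:
    checks: list[Check] = field(default_factory=list)
    runtime_proof: str = ""  # e.g. a screenshot path or a runtime note
    remaining_risks: list[str] = field(default_factory=list)


def check_evidence(
    status: str,
    evidence: Optional[EvidenceRecord],
    runtime_required: bool = True,
) -> list[str]:
    """Apply the starter rules; returns 'FAIL: ...' and 'WARN: ...' messages."""
    if status not in {"Verified", "Closed"}:
        return []  # the evidence gate only applies to completed work
    if evidence is None:
        return ["FAIL: status is Verified or Closed but no evidence record exists"]
    problems = []
    if not evidence.checks:
        problems.append("FAIL: no checks are listed")
    for check in evidence.checks:
        if check.skipped and not check.skip_reason.strip():
            problems.append(f"FAIL: skipped check '{check.name}' has no reason")
    if runtime_required and not evidence.runtime_proof.strip():
        problems.append("FAIL: runtime validation is required but missing")
    if not evidence.remaining_risks:
        problems.append("WARN: remaining risks are blank")
    return problems
```

Writing "none" as a remaining risk satisfies the last rule, which is the intent: the agent has to state that no risks remain rather than leave the field empty.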
Those rules are not magic. They are guardrails. Their job is to make the agent stop at the same points a good delivery lead would stop.
That is enough for a fresh agent to build a real first version.
Not a toy prompt file. A small system with rules, state, gates, and proof.
Example: a small password reset change
Imagine the task is small:
Add confirmation copy after a user requests a password reset.
The work record says the goal is to show clear confirmation after the reset request. In scope is the success message and any existing tests around that state. Out of scope is changing authentication, changing email delivery, or redesigning the whole login flow.
The acceptance scenario says what the user should observe.
The decision record explains why the copy must not reveal whether the account exists.
The evidence record proves the change was checked.
Now the agent has walls.
If it starts rewriting the auth flow, the work record says no. If it writes copy but never opens the screen, the evidence gate says no. If it changes behavior that was never described in the acceptance scenario, the review step has something concrete to push against.
Work record example
    # Work Record

    Status: Ready
    Task: password-reset-confirmation-copy
    Owner: agent
    Created: 2026-06-04

    ## Goal

    After a user requests a password reset, they see clear confirmation copy that tells them what to check next.

    ## In Scope

    - Update the confirmation copy shown after password reset request submission.
    - Update existing tests or snapshots for that state if they exist.
    - Validate the password reset screen in the running app.

    ## Out Of Scope

    - Do not change authentication logic.
    - Do not change email delivery.
    - Do not redesign the login or password reset flow.
    - Do not expose whether the submitted email belongs to an account.

    ## Readiness Verdict

    Pass.

    ## Risks

    - Copy could accidentally reveal account existence.
    - Agent could expand scope into auth behavior.

    ## Planned Checks

    - Run relevant UI or acceptance tests.
    - Open the password reset flow and submit an email.
    - Confirm the browser console has no new errors.

    ## Expected Evidence

    - Test command and result.
    - Screenshot or runtime note showing the confirmation state.
    - Note confirming no auth or email delivery changes were made.
Acceptance example
    # Acceptance Scenarios

    Feature: Password reset confirmation
    Primary user or system: User who has requested a password reset

    ## Scenarios

    ### User sees confirmation after submitting an email

    Given a user is on the password reset screen.
    When they submit an email address.
    Then they see a confirmation message telling them to check their email for next steps.

    ### Confirmation does not expose account existence

    Given a user submits an email address.
    When the password reset request is accepted.
    Then the confirmation message does not reveal whether the email belongs to an account.

    ## Edge Cases

    - The confirmation works for known and unknown email addresses.
    - The confirmation is visible without requiring a page refresh.
    - The user can still navigate back to sign in.

    ## Non-Goals

    - Do not change authentication rules.
    - Do not change email delivery.
    - Do not redesign the login flow.
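To show how the second scenario can become an executable check, here is a small sketch. Everything in it is a stand-in: `request_password_reset`, `KNOWN_ACCOUNTS`, and the email addresses are hypothetical, and a real test would drive the actual screen or service. The copy string is the neutral wording from this example's decision record.

```python
NEUTRAL_COPY = (
    "Check your email for password reset instructions. "
    "If an account exists for that address, instructions will arrive shortly."
)

KNOWN_ACCOUNTS = {"ada@example.com"}  # hypothetical stand-in for a real account lookup


def request_password_reset(email: str) -> str:
    """Hypothetical handler: returns the confirmation copy shown to the user."""
    if email in KNOWN_ACCOUNTS:
        pass  # a real system would queue the reset email here
    # Same message on both paths, so the response never reveals account existence.
    return NEUTRAL_COPY


def test_confirmation_does_not_expose_account_existence():
    known = request_password_reset("ada@example.com")
    unknown = request_password_reset("nobody@example.com")
    # The user-visible outcome must be identical for known and unknown addresses.
    assert known == unknown
    assert "not registered" not in known.lower()


test_confirmation_does_not_expose_account_existence()
```

The test asserts only on what the user sees, which keeps it aligned with the acceptance scenario instead of the implementation.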
Decision example
# Decision Record
Date: 2026-06-04
Decision: Use neutral password reset confirmation copy
Status: Accepted
## Context
The password reset flow needs confirmation copy after a user submits an email address. The message must guide the user without revealing whether the submitted email belongs to an account.
## Decision
Use neutral copy: "Check your email for password reset instructions. If an account exists for that address, instructions will arrive shortly."
## Why
This supports the user's next step while avoiding account-enumeration risk.
## Alternatives Considered
- "We sent you an email." Rejected because it implies the account exists.
- "That email is not registered." Rejected because it exposes account existence.
## Impact
The agent can update copy and related tests, but must not change authentication, account lookup behavior, or email delivery.
Evidence example
# Evidence Record
Work record: docs/work-records/password-reset-confirmation-copy.md
Date: 2026-06-04
Agent/session: codex
## Checks Run
Command: npm test -- password-reset
Result: Passed
Command: npm run lint
Result: Passed
## Runtime Validation
Opened the password reset screen in the browser, submitted a test email address, and confirmed the neutral confirmation message appeared.
Browser console showed no new errors.
## Skipped Checks
Email delivery test skipped because this change does not alter email sending behavior.
## Changes Made
- Updated password reset confirmation copy.
- Updated the existing test expectation for the confirmation state.
## Remaining Risks
None known.
## Handoff
Future agents should not change the confirmation into account-specific language unless the security decision is revisited.
That is the point of the harness.
Not more ceremony.
Less guessing.
How to grow the harness
Once the minimum loop works, add the next layer.
Don't start by automating everything.
Start by making the correct path obvious.
Then automate the parts the agent keeps getting wrong.
1. Split the top-level guidance into deeper docs
Once AGENTS.md starts getting long, stop adding everything to one file.
Keep the root file as a map, then split deeper guidance into focused docs:
planning rules
verification rules
runtime validation rules
design and content rules
security rules
handoff rules
The top-level file should tell the agent where to look. It should not become another giant prompt.
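If you want the repo to hold the root file to that rule, a small check can do it. This is a rough Python sketch, not part of any tool. It assumes the root file links to deeper docs with ordinary markdown links, and the 80-line budget and the name `check_root_guidance` are placeholders I picked for illustration.

```python
import re
from pathlib import Path


def check_root_guidance(root_file: Path, max_lines: int = 80) -> list[str]:
    """Return problems with the root guidance file; an empty list means it passes.

    Assumptions (hypothetical, adjust to your repo): deeper docs are linked as
    [text](relative/path.md), and 80 lines is the point where the map is too long.
    """
    text = root_file.read_text(encoding="utf-8")
    problems = []

    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(
            f"{root_file.name} has {line_count} lines (budget {max_lines}); split it"
        )

    # Every relative .md link in the map must point at a file that exists.
    for target in re.findall(r"\]\(([^)#\s]+\.md)\)", text):
        if target.startswith(("http://", "https://")):
            continue  # external links are out of scope for this check
        if not (root_file.parent / target).is_file():
            problems.append(f"linked doc missing: {target}")

    return problems
```

Run it against AGENTS.md in CI or a pre-commit hook and fail the build when the list is not empty.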
2. Add a hardening checklist
Create a simple hardening step before implementation.
Take the requirement and review it through the lenses of product, architecture, security, QA, design, and copy.
Ask:
What is unclear?
What is unsafe?
What is untestable?
What could cause rework later?
What must be fixed before implementation?
What risks can proceed if they are named?
At first, this can be a manual checklist. Later, it can become specialist review agents.
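One way to make the last two questions concrete is to record each finding with a flag that says whether it blocks implementation. The Python below is only a sketch of that idea. `Finding`, `hardening_gate`, and the field names are hypothetical, and the lens list simply mirrors the six lenses named above.

```python
from dataclasses import dataclass

LENSES = ("product", "architecture", "security", "qa", "design", "copy")


@dataclass
class Finding:
    lens: str               # one of LENSES
    issue: str              # what is unclear, unsafe, or untestable
    must_fix: bool          # True: fix before implementation; False: proceed as a named risk
    resolved: bool = False  # set to True once a must-fix issue has been addressed


def hardening_gate(findings: list[Finding]) -> tuple[bool, list[str], list[str]]:
    """Return (ready, blockers, named_risks) for one requirement."""
    unknown = [f.lens for f in findings if f.lens not in LENSES]
    if unknown:
        raise ValueError(f"unknown review lens: {unknown}")

    # Unresolved must-fix findings stop the work; everything else is carried as a named risk.
    blockers = [f"{f.lens}: {f.issue}" for f in findings if f.must_fix and not f.resolved]
    named_risks = [f"{f.lens}: {f.issue}" for f in findings if not f.must_fix]
    return (not blockers, blockers, named_risks)
```

The named risks are the part worth copying into the work record, so the next agent sees what was accepted on purpose.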
3. Add deterministic checks
Anything a script can enforce should become a script.
Start small:
block completed work with no evidence
block behavior work with no acceptance scenario
block work records with empty scope
warn when risks have no planned checks
block skipped checks with no reason
Do not ask the agent to remember rules the repo can check.
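Here is a minimal sketch of three of those checks in Python. It assumes a record layout I am making up for illustration: a `Status:` line, plus `## Scope`, `## Checks Run`, and `## Skipped Checks` sections in one markdown file. Looking for the word "because" is a deliberately crude stand-in for "a reason was given". Adapt the names and rules to your own templates.

```python
import re


def split_sections(markdown: str) -> dict[str, str]:
    """Map each '## Heading' (lowercased) to the text beneath it."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(.*)$", line)
        if match:
            current = match.group(1).strip().lower()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {name: body.strip() for name, body in sections.items()}


def check_record(markdown: str) -> list[str]:
    """Return blocking problems for one record; an empty list means it passes."""
    sections = split_sections(markdown)
    problems = []

    # Hypothetical convention: finished work carries "Status: Done" or "Status: Completed".
    done = re.search(r"^Status:\s*(done|completed)\s*$", markdown, re.I | re.M)

    if not sections.get("scope"):
        problems.append("scope is empty")
    if done and not sections.get("checks run"):
        problems.append("marked done with no evidence of checks run")

    skipped = sections.get("skipped checks", "")
    if skipped and "because" not in skipped.lower():
        problems.append("skipped checks listed with no reason")

    return problems
```

Wire it into whatever already gates your merges. The point is that the agent gets a failing check, not a reminder.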
4. Add runtime validation for your stack
Keep it practical.
For web apps, open the page, interact with the flow, inspect the console, and capture screenshots or notes.
For services, send a real request and inspect the response.
For CLIs, run the command and capture output.
For mobile, use the simulator or target device when the change affects real interaction.
Runtime validation should match the thing you are actually building.
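For the CLI case, capturing output can be as small as the sketch below. It is Python used purely for illustration, it assumes the command signals failure with a non-zero exit code, and `run_check` and `format_evidence` are hypothetical names. The Command and Result lines follow the shape of the evidence example earlier in this article.

```python
import subprocess


def run_check(command: list[str], timeout: int = 120) -> dict:
    """Run one command and return an evidence entry.

    Raises subprocess.TimeoutExpired if the command runs longer than `timeout` seconds.
    """
    completed = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    output = (completed.stdout + completed.stderr).strip()
    return {
        "command": " ".join(command),
        "result": "Passed" if completed.returncode == 0 else "Failed",
        "exit_code": completed.returncode,
        # Keep only the tail so a noisy command cannot flood the record.
        "output_tail": output[-500:],
    }


def format_evidence(entry: dict) -> str:
    """Render an entry as the Command/Result pair used in an evidence record."""
    return f"Command: {entry['command']}\nResult: {entry['result']}"
```

For example, `run_check(["npm", "run", "lint"])` would return an entry you can paste into the record, assuming npm is installed where the check runs.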
5. Add specialist review passes
Add reviewers where mistakes are expensive.
You do not need a panel for every tiny change.
But when work touches product behavior, architecture, security, frontend interaction, design, or public copy, separate review lenses catch different failures.
The useful question is not "how many agents can I add?"
The useful question is "which failures are expensive enough to deserve a repeatable reviewer?"
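One simple way to answer that question in code is to route reviewers by what the change touches. The sketch below is Python and entirely illustrative. The path patterns in `REVIEW_RULES` are invented, so replace them with the areas of your own repo where mistakes cost the most.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns: each lens lists the areas where its mistakes are expensive.
REVIEW_RULES: dict[str, list[str]] = {
    "security": ["src/auth/*", "src/payments/*"],
    "design": ["src/components/*", "*.css"],
    "copy": ["src/locales/*", "docs/public/*"],
}


def reviewers_for(changed_files: list[str]) -> list[str]:
    """Return the review lenses triggered by a change, sorted for stable output."""
    needed = set()
    for path in changed_files:
        for lens, patterns in REVIEW_RULES.items():
            # fnmatchcase's "*" also matches "/", so "src/auth/*" covers nested files.
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                needed.add(lens)
    return sorted(needed)
```

A change that touches nothing on the list gets no extra reviewers, which keeps tiny changes cheap.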
6. Add context scouting
Large searches, log reviews, dependency audits, and impact mapping can pollute the main working context.
Use a scout workflow when discovery is broad.
The scout should return:
files to inspect
likely edit surfaces
relevant existing patterns
risks
recommended next step
The main agent should work from that compact report, not from every raw search result.
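The compact report is easier to enforce when it has a fixed shape. Here is one possible shape as a Python dataclass. The class name, field names, and the cap of five items per list are my assumptions, chosen to match the five items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ScoutReport:
    question: str
    files_to_inspect: list[str] = field(default_factory=list)
    likely_edit_surfaces: list[str] = field(default_factory=list)
    existing_patterns: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    recommended_next_step: str = ""

    def render(self, max_items: int = 5) -> str:
        """Render a short brief, capping each list so raw search output stays out."""
        lines = [f"# Scout Report: {self.question}"]
        sections = [
            ("Files to inspect", self.files_to_inspect),
            ("Likely edit surfaces", self.likely_edit_surfaces),
            ("Relevant existing patterns", self.existing_patterns),
            ("Risks", self.risks),
        ]
        for title, items in sections:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items[:max_items])
            if len(items) > max_items:
                lines.append(f"- (+{len(items) - max_items} more omitted)")
        lines.append("## Recommended next step")
        lines.append(self.recommended_next_step or "None given.")
        return "\n".join(lines)
```

The cap is the design choice that matters. If the scout found forty files, the main agent still receives five and a count, and can ask for more when it needs them.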
7. Add project memory and handoff
Every finished task should leave behind enough state for a fresh agent to continue.
That means:
work record updated
acceptance scenario updated
evidence recorded
decisions logged
risks named
follow-up work separated from completed work
A prompt resets.
A harness compounds.
Common mistakes when building AI agent workflows
The most common mistake is trying to fix unreliable agent work with a bigger prompt.
A few others show up quickly:
keeping project memory in chat history
letting agents start coding before the requirement is testable
treating passing tests as proof when no runtime flow was checked
letting agents expand scope silently
writing acceptance tests from the implementation instead of the desired behavior
calling work done without evidence
adding reviewers before you know which mistakes are actually expensive
automating the whole process before the manual loop works
creating so much process that small work becomes slower than human review
Most of these are not model problems.
They are operating system problems.
Why this matters
The real advantage is not that agents write code quickly.
That part is already obvious.
The advantage is that the system around the agent can compound.
Every finished task leaves behind requirements, acceptance scenarios, decisions, checks, evidence, and lessons that make the next task easier.
A prompt resets. A harness compounds.
That is why I think the next serious wave of AI software work will not be about who has the best prompt library. It will be about who builds the best operating system around their agents.
The model will keep changing.
The tools will keep changing.
The prompt tricks will keep expiring.
But the system around the work will keep getting stronger.
After nearly 20 years building digital products, this just feels like the same lesson in a new form.
Reliable production software does not come from asking one person, human or machine, to hold the whole system in their head.
It comes from building a system that makes good work the default path.