How To Build Your Own AI Harness
8 June 2026

I have been building digital products for 20 years. Now my harness builds for me. Here's how I do it.


Over those 20 years of building for clients, some projects were small and fast: two-month builds for $30k. Others were multi-million dollar engagements with engineers, designers, product managers, QA, delivery leads, security reviewers, and stakeholders who needed the software to work in production.

After enough of those projects, I have learnt a simple lesson:

Quality software rarely comes from one person delivering brilliant instructions. It comes from many roles contributing to the system around the work.

That is why I think most people are approaching AI agents from the wrong direction.

The teams I see struggling to master an AI product workflow are usually trying to write the perfect prompt. We've all done it.

You add more examples. You add more constraints. You add "act like a senior engineer." You add "don't over-engineer." You add "follow existing patterns." You add "write production-ready code."

The prompt gets longer, but the result doesn't get better.

You're asking one paragraph to do the job of an engineering team.

Real teams do not work that way.

They have requirements, design rules, review processes, testing standards, release gates, handoff notes, and shared memory.

An AI agent needs the same kind of operating system around it.

This article covers the system I have built and use daily to make AI agent work more scoped, testable, reviewable, and repeatable.

Read on to learn how to build your own, or just bookmark it and point your agent at it to do it for you. You do you.


What the heck is a harness and why should you care?

An AI agent harness is the system around an AI agent that makes its work scoped, testable, reviewable, and repeatable.

A prompt tells the agent what to do.

A harness gives the agent the conditions to do the work properly: context before it starts, boundaries while it works, proof before it finishes, and memory after it leaves.

If you are using Claude, Cursor, Codex, or another AI coding agent to touch real software, the prompt is only the start.

The reliable output comes from the system around the model: the files it reads, the rules it follows, the checks it must run, the evidence it must capture, and the project memory it leaves behind.

It is the difference between an impressive demo and software you can trust in production.


What an AI agent harness actually does

A good AI agent harness does four jobs.

First, it gives the agent context before it starts, so it understands the project instead of relying only on the current chat.

Second, it gives the agent boundaries while it works, so a small change does not quietly become a refactor.

Third, it defines proof before the agent finishes, so "done" means more than "the agent says it is done."

Fourth, it leaves memory behind, so the next session does not start from zero.

The model still matters. The prompt still matters. But for serious work, the system around the agent matters more.


Prompt-first vs harness-first

Think about it like this:


| Prompt-first workflow | Harness-first workflow |
| --- | --- |
| Relies on one long instruction | Uses persistent project rules |
| Agent guesses the real scope | Scope is written down before work starts |
| Tests are often added after implementation | Acceptance scenarios exist before implementation |
| "Done" means the agent says it is done | "Done" means evidence exists |
| Context lives in chat | Context lives in project state |
| Each session starts cold | Each session compounds |
| The agent optimises for output | The system optimises for reliable delivery |

A prompt can start the work.

A harness makes the work reliable.


The operating loop

At a high level, my harness moves every meaningful change through the same loop.

Request
  |
  v
Requirements
  |
  v
Hardening panel
  |
  v
Acceptance scenarios
  |
  v
Work record
  |
  v
Implementation
  |
  v
Verification
  |
  v
Runtime validation
  |
  v
Evidence bundle
  |
  v
Project memory
  |
  +-- back into the next request

This is the part most prompts try to fake.

A prompt might say "build this properly." The workflow defines what properly means.

The agent is not allowed to jump straight from request to code. First, the work has to become clear enough to build. Then it has to become testable. Then it has to be scoped. Then it can be implemented. Then it has to prove itself.

The hardening phase is the most important part of that loop.

A draft requirement is passed through a panel of specialist reviewers. Each one looks at it through a different professional lens: product, architecture, security, QA, frontend, backend, design, and copy.

They are not all trying to be agreeable. That is the point.

The product reviewer cares about clarity and measurable outcomes. The architect cares about whether the shape fits the existing system. The security reviewer is intentionally paranoid. The QA reviewer looks for ambiguity. The designer looks for interaction problems. The copywriter looks for language that sounds generic, misleading, or off-brand.

That tension is what you want. It is the same tension you get in a real delivery team when the PM, architect, designer, QA lead, and security reviewer all look at the same feature and see different risks.

The hardening panel turns that into a repeatable step.

The output is not a brainstorm. It is a hard gate.

A minimum hardening result should answer four questions:

  • Required fixes: what must change before implementation?

  • Nice-to-haves: what would improve the work but should not block it?

  • Risks accepted: what are we knowingly carrying forward?

  • Readiness verdict: pass, proceed with concerns, or stop.


Required fixes block the work. Nice-to-haves are recorded but do not block. Once the required fixes are resolved, the requirement becomes hardened and the agent can move on.

That is where a lot of cost gets saved.

The expensive failure is not bad code. It is good code for a weak requirement.
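A minimal sketch of how a hardening result could be recorded so the gate is checked by code rather than remembered. The class and field names are hypothetical, and deriving "proceed with concerns" from the presence of accepted risks is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HardeningResult:
    """One panel outcome for a draft requirement (names are illustrative)."""
    required_fixes: list[str] = field(default_factory=list)  # must change before implementation
    resolved_fixes: set[str] = field(default_factory=set)    # required fixes already addressed
    nice_to_haves: list[str] = field(default_factory=list)   # recorded, never blocking
    risks_accepted: list[str] = field(default_factory=list)  # knowingly carried forward

    def open_fixes(self) -> list[str]:
        """Required fixes that still block the work."""
        return [fix for fix in self.required_fixes if fix not in self.resolved_fixes]

    def verdict(self) -> str:
        """Derive the readiness verdict from the recorded findings."""
        if self.open_fixes():
            return "stop"
        # Assumption: any accepted risk downgrades a pass to "proceed with concerns".
        if self.risks_accepted:
            return "proceed with concerns"
        return "pass"
```

Nice-to-haves never influence the verdict here, which mirrors the rule that they are recorded but do not block.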


The five parts of the harness

The harness I have built has five core parts.

Instructions -> what the agent reads
State -> what the agent remembers
Scope -> what the agent is allowed to touch
Verification -> what proves the work
Lifecycle -> how the work compounds

You can name these differently. The names do not matter much. The shape does.


1. Instructions

Instructions are the agent's onboarding.

Most people try to solve this with one giant system prompt. I do not think that works well for serious projects. A huge prompt becomes an employee handbook the agent is expected to memorise under pressure.

A better pattern is a small top-level file that acts as a map.

It tells the agent what kind of project this is, what rules are non-negotiable, and where deeper guidance lives. Planning has its own doc. Verification has its own doc. Runtime validation has its own doc. Design and content rules live separately.

The top-level instruction file should not contain everything. It should tell the agent where everything is.

A simple first version might say:

Read this first.
Understand the task before editing.
Find existing patterns before creating new ones.
Create or update a work record for meaningful changes.
Do not expand scope silently.
Run checks.
Validate runtime behavior when it matters.
Capture evidence before calling the work done.


To build this yourself, start with one short file in the repo root. Keep it readable. Add the rules that should apply every session. Then split deeper process into separate documents as the system grows.

The goal is simple: when the agent starts, it knows how to orient itself before touching the work.


2. State

Chat history is not state.

State is what survives when the session closes.

In a real team, this is obvious. Requirements live somewhere. Decisions live somewhere. Known risks live somewhere. Test evidence lives somewhere. Nobody serious runs a production project from memory and Slack scrolling alone.

Agents make this even more important because the next session may be a fresh context.

A useful state layer includes requirements, acceptance criteria, implementation records, known debt, runtime evidence, and handoff notes.

The key rule is this: if the agent is changing something meaningful, there is a live work record.

That work record says what the task is, what is in scope, what is out of scope, what checks should run, and what evidence will prove the work is complete.

This prevents the agent from silently deciding what the task means.

It also means the next session can pick up the work without you narrating the entire history again.


3. Scope

Scope is where agents create a lot of hidden cost.

If the task is vague, the agent will often choose the broadest useful interpretation. A small bug fix becomes a refactor. A copy change becomes a component redesign. A UI adjustment becomes a design system change.

The model is trying to help.

The harness needs to give it walls.

Every work record should include a readiness verdict before implementation starts.

Work request
  |
  v
Work record
  |
  v
Readiness verdict
  |
  +-- Pass      -> proceed
  +-- Concerns  -> proceed with named risks
  +-- Fail      -> stop and clarify

Pass means the task is clear, scoped, and verifiable.

Concerns means the task can proceed, but the risks need to travel with the work.

Fail means stop before code.

This is not bureaucracy. It is cheaper than building the wrong thing well.
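The three verdicts can be expressed as a small gate. This is a sketch under assumptions: the `Verdict` enum and `may_implement` function are hypothetical names, and requiring at least one written risk before a Concerns verdict may proceed is one possible reading of "the risks need to travel with the work."

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    CONCERNS = "concerns"
    FAIL = "fail"


def may_implement(verdict: Verdict, named_risks: list[str]) -> bool:
    """Return True when implementation may start under the given verdict."""
    if verdict is Verdict.FAIL:
        return False  # stop and clarify before any code
    if verdict is Verdict.CONCERNS:
        # Concerns may proceed only if the risks are written down to travel with the work.
        return len(named_risks) > 0
    return True
```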

Scope control should also define what the agent is allowed to edit. If the task is a checkout bug, the agent should not casually rewrite the auth model. If the task is a visual polish pass, it should not invent new design tokens unless that is explicitly part of the work.

The harness should make scope expansion visible.
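One way to make scope expansion visible is to compare the files an agent changed against glob patterns listed in the work record. The sketch below assumes the changed file list comes from your version control tool and that the work record carries a list of allowed patterns; both the function name and the patterns are hypothetical.

```python
from fnmatch import fnmatch


def out_of_scope_files(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]
```

For a checkout bug, allowed patterns such as `src/checkout/*` would flag an edit to `src/auth/session.py` instead of letting it pass silently. Note that `fnmatch` lets `*` match across directory separators, so one pattern covers nested folders.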


4. Verification

Verification is the part that turns agent output into software you can trust.

The agent saying "done" is not enough. The question is what proved it.

A useful verification stack has several layers, but the important thing is that verification starts before code. Before the agent implements anything, you want a behavioral contract that describes what the system must do.

This is where Acceptance Test Driven Development fits nicely.

ATDD forces the expected behavior to be written before implementation, in product language. A good acceptance test should not say "call this endpoint" or "write to this table." It should say what happens from the user's point of view.

That distinction matters.

Implementation details pull the agent toward a technical guess. Acceptance tests pull it toward observable behavior.

This gives you a second stream of truth. The agent cannot just write code and then write tests that bless its own assumptions. The acceptance scenarios come from the requirement, not from the implementation.

After that, verification continues in layers.

Acceptance scenarios
  |
  v
Deterministic checks
  |
  v
Specialist review
  |
  v
Runtime validation
  |
  v
Evidence bundle

The cheapest layer is deterministic checks.

Anything a script can enforce should be enforced by a script. Do not ask the agent to remember things that the repo can check automatically.
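A deterministic check layer can start as a thin runner that executes each command and records what happened. This is a minimal sketch; `CheckResult`, `run_check`, and `run_checks` are hypothetical names, and the commands you pass in are whatever lint, test, or build steps your project already has.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    command: list[str]
    passed: bool
    output: str


def run_check(command: list[str]) -> CheckResult:
    """Run one command and record whether it exited cleanly, plus its combined output."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return CheckResult(
        command=command,
        passed=completed.returncode == 0,
        output=(completed.stdout + completed.stderr).strip(),
    )


def run_checks(commands: list[list[str]]) -> list[CheckResult]:
    """Run every check; a failure is recorded, and later checks still run."""
    return [run_check(command) for command in commands]


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace with real project checks.
    for result in run_checks([[sys.executable, "-c", "print('lint ok')"]]):
        print("passed" if result.passed else "failed", result.command, result.output)
```

The captured output is what later goes into the evidence record, so the agent does not have to describe a result it can simply attach.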

The next layer is review.

Different reviewers catch different failures. Architecture checks system fit. Security checks exposure. QA checks whether the behavior is testable. Design checks usability. Copy checks whether the language sounds like the product.

The deepest layer is runtime validation. I have found this is what prompt-only approaches miss the most.

Tests prove what the implementer thought to test. Runtime validation proves what actually happened when the software ran.

For a web app, the agent should open the page, interact with the flow, check the console, and capture evidence. For a service, it should send a real request and inspect the response. For a CLI, it should run the command and capture output.

"It compiles" is not the same as "it works."

Production lives at runtime.


5. Lifecycle

Lifecycle is how the work compounds.

It covers how a project starts, how each session starts, how discovery happens, how work closes, and how the next agent continues from the state left behind.

A good lifecycle starts with bootstrap.

The project records its stack, commands, repo structure, validation method, design rules, content rules, and known risks. Until that exists, the agent is guessing.

Then every session follows a repeatable rhythm.

Understand the request. Read current state. Confirm scope. Execute. Verify. Review. Capture evidence. Update state.

The other lifecycle piece is context hygiene.

Large searches, noisy logs, and broad discovery should not pollute the main working context. Use a scout pass for that.

The scout reads widely and returns a compact report: files to inspect, likely edit surfaces, risks, and recommended next step. The main agent then works from the distilled findings instead of carrying every dead end.

Clean context improves output.

A lot.
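The scout report described above can be given a fixed shape so the main agent always receives the same compact structure. This is a sketch only; the `ScoutReport` class and its field names are assumptions that mirror the four items the report should contain.

```python
from dataclasses import dataclass, field


@dataclass
class ScoutReport:
    files_to_inspect: list[str] = field(default_factory=list)
    likely_edit_surfaces: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    recommended_next_step: str = ""

    def to_text(self) -> str:
        """Render the compact report handed to the main agent."""
        sections = [
            ("Files to inspect", self.files_to_inspect),
            ("Likely edit surfaces", self.likely_edit_surfaces),
            ("Risks", self.risks),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Empty sections are shown explicitly so nothing looks forgotten.
            lines.extend(f"- {item}" for item in (items or ["none"]))
        lines.append(f"Next step: {self.recommended_next_step or 'none'}")
        return "\n".join(lines)
```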


Build the minimum viable AI agent harness

You do not need to build the full version on day one.

Start with a small harness that creates the right behavior.

The public version of the contract can be simple. Your private advantage can live in the exact reviewers, scripts, checks, and project-specific rules. The first public contract only needs to describe the shape clearly enough that a fresh agent can build it without guessing.


Define meaningful work

Define "meaningful work" first.

Meaningful work is any code, user-facing behavior, architecture, security, design, content, or process change that could affect future work.

Tiny typo fixes probably do not need the full loop.

Everything else does.


Create predictable paths

A minimum version can be as small as this:

your-repo/
  AGENTS.md
  docs/
    work-records/
      _template.md
    acceptance/
      _template.md
    evidence/
      _template.md
    decisions/
      _template.md
  scripts/
    check-work-record
    check-evidence

Use predictable names:

docs/work-records/<task-slug>.md
docs/acceptance/<task-slug>.md
docs/evidence/<task-slug>.md
docs/decisions/YYYY-MM-DD-<decision-slug>


The task slug can come from a ticket, a short feature name, or a dated change name. The naming convention matters less than consistency.
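Consistency is easier to keep when the paths are derived rather than typed. The sketch below builds the three per-task record paths from a slug. The `slugify` rule (lowercase, hyphen-joined alphanumeric runs) and both function names are assumptions, not part of the convention above.

```python
import re
from pathlib import PurePosixPath


def slugify(title: str) -> str:
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def record_paths(task_slug: str) -> dict[str, PurePosixPath]:
    """Return the work, acceptance, and evidence paths for one task slug."""
    docs = PurePosixPath("docs")
    return {
        "work_record": docs / "work-records" / f"{task_slug}.md",
        "acceptance": docs / "acceptance" / f"{task_slug}.md",
        "evidence": docs / "evidence" / f"{task_slug}.md",
    }
```

Because all three paths share one slug, a script can find the acceptance and evidence records for any work record without searching.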


Create AGENTS.md

Start with a single AGENTS.md at the root of the repo.

It can be simple:

# AGENTS.md

This file is the operating guide for AI agents working in this repository.

Read this before making changes.

## Rules

- Understand the task before editing.
- Find existing patterns before creating new ones.
- Create or update a work record for meaningful changes.
- Do not expand scope silently.
- Write acceptance scenarios before implementation when behavior changes.
- Run the relevant checks.
- Validate runtime behavior when it matters.
- Capture evidence before calling the work done.

## Definition Of Done

Work is not done until:

- the requested change is implemented
- relevant checks have run
- runtime behavior has been validated or explicitly marked not applicable
- skipped checks are named with reasons
- evidence is recorded
- remaining risks are documented

## Handoff

Leave enough written state for a fresh agent to continue without reading the chat.

Chat history is not project memory.


This file will evolve, but it is enough to change the agent's behavior on day one.


Create the file contracts

The instruction file tells the agent how to behave.

The work record tells it what it is allowed to change.

The acceptance record tells it what behavior must be true before implementation starts.

The evidence record tells it what proof must exist before the task can be called complete.

The decision log stops the same architectural choices being rediscovered every session.


Work record contract

The work record should include the goal, scope, out of scope, readiness verdict, risks, planned checks, and expected evidence.

Create this as docs/work-records/_template.md:

---
task: <task-slug>
status: draft
owner_agent: <agent-or-human>
created: <YYYY-MM-DD>
last_updated: <YYYY-MM-DD>
---

# <Task Title>

## Goal

<What outcome should exist when this work is complete?>

## In Scope

- <What the agent is allowed to change>

## Out Of Scope

- <What the agent must not touch, even if it looks related>

## Readiness Verdict

**Verdict:** `<PASS | CONCERNS | FAIL>`

**Findings:**
- <finding>

**Blocking Items:**
- <item or none>

**Concerns To Carry Forward:**
- <risk or none>

## Planned Checks

- [ ] <test, lint, build, review, or runtime check>

## Expected Evidence

- <command output



Acceptance scenario contract

The acceptance record should include scenarios, observable behavior, edge cases, and non-goals.

Create this as docs/acceptance/_template.md:

# Acceptance Scenarios

Feature: <feature or task name>
Primary user or system: <who observes the behavior>

## Scenarios

### AC-1 - <one-line behavior summary>

GIVEN <precondition stated in user-domain language>.
WHEN <action the user or system takes>.
THEN <observable outcome>.

### AC-2 - <one-line behavior summary>

GIVEN <precondition stated in user-domain language>.
WHEN <action the user or system takes>.
THEN <observable outcome>.

## Edge Cases

- <boundary condition, unhappy path, or failure mode>

## Non-Goals

- <behavior this change should not solve>

Keep this file in product language. If the acceptance scenario names a component, table, endpoint, implementation file, or framework detail, check whether it has drifted out of behavior and into design-by-accident.
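That drift check can be partly automated. The sketch below flags scenario lines that contain implementation vocabulary. The word list is purely illustrative and will produce false positives (a "pricing table" is product language), so treat a hit as a prompt to look, not as a failure.

```python
import re

# Illustrative word list only; a real project would tune this to its own stack.
IMPLEMENTATION_TERMS = ["endpoint", "table", "component", "database", "api", "sql", "controller"]


def implementation_leaks(scenario_text: str) -> list[tuple[int, str]]:
    """Return (line number, term) pairs where a scenario line uses an implementation term."""
    leaks = []
    for number, line in enumerate(scenario_text.splitlines(), start=1):
        words = set(re.findall(r"[a-z]+", line.lower()))
        for term in IMPLEMENTATION_TERMS:
            if term in words:
                leaks.append((number, term))
    return leaks
```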


Evidence record contract

The evidence record should include commands run, results, skipped checks with reasons, runtime proof, changes made, and remaining risks.

Create this as docs/evidence/_template.md:

---
work_record: docs/work-records/<task-slug>.md
acceptance: docs/acceptance/<task-slug>.md
date: <YYYY-MM-DD>
agent_session: <agent-or-session-id>
---

# Evidence Record

## Checks Run

- [ ] `<command or check>` - <passed | failed | warning> - <short result>

## Runtime Validation

<What was opened, clicked, requested, rendered, executed, or inspected?>

## Skipped Checks

- <check> - <reason, or `none`>

## Changes Made

- <high-level change>

## Evidence Artifacts

- <path to screenshot, trace, log, response sample, test report, or `none required - reason`>

## Remaining Risks

- <risk or `none`>

## Handoff

<What a fresh agent needs to know to continue without reading the chat?>



Decision record contract

The decision record should capture choices that future agents should not rediscover.

Not every implementation detail needs a decision record. Use it when a choice affects architecture, security, product behavior, migration, compatibility, or repeated future work.

Create this as docs/decisions/_template.md:

---
decision: <short-decision-slug>
status: proposed
created: <YYYY-MM-DD>
decided_by: <agent-or-human>
replaces: <older decision path or none>
---

# <Decision Title>

## Context

<What problem, risk, tradeoff, or repeated question forced this decision?>

## Decision

<What did we choose?>

## Why

<Why is this the right choice for now?>

## Alternatives Considered

- <Option>
  - Rejected because: <reason>

## Impact

- Existing work: <impact>
- Future agents: <what they should know before changing this again>
- Verification: <checks or evidence this decision implies>

## Review Trigger

<When should this decision be revisited?>



Give every task a lifecycle

Every task should move through the same simple lifecycle.

Draft -> Ready -> In Progress -> Verified -> Closed

Draft means the request exists but is not ready.

Ready means the work is clear, scoped, and testable.

In Progress means the agent is implementing inside the work record.

Verified means checks and runtime validation are complete.

Closed means evidence has been captured and project memory has been updated.
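The lifecycle can be enforced as a tiny state machine. This sketch assumes tasks only ever move forward one stage at a time; whether a task may move backwards (for example from Ready to Draft) is left open above, so that restriction is an assumption, as are the names `Status` and `advance`.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    READY = "Ready"
    IN_PROGRESS = "In Progress"
    VERIFIED = "Verified"
    CLOSED = "Closed"


ORDER = list(Status)  # Enum iteration preserves definition order


def advance(current: Status, target: Status) -> Status:
    """Move a task forward exactly one stage; anything else raises ValueError."""
    if ORDER.index(target) != ORDER.index(current) + 1:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Rejecting skipped stages is what stops a task from jumping from Ready straight to Verified without any recorded implementation step.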


Add the first guardrails

The first scripts can be plain.

Your check-work-record should fail if meaningful work has no active work record or no readiness verdict.

Active means the work record matches the current task and is not closed.

It can start with rules like this:

fail if meaningful work has no work record
fail if status is Ready or later but there is no readiness verdict
fail if behavior work has no acceptance record
warn if risks are listed but no planned checks address them
fail if scope is empty or out of scope is empty
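As a rough sketch, the rules above could look like this in Python when run against a work record that follows the template from earlier. Everything here is an assumption for illustration: the function names, the regex parsing, and the idea that the caller says whether the task changes behavior and whether an acceptance record exists. The "no work record" rule is left to the caller (a missing file), and the risk warning is omitted.

```python
import re

STATUSES = ["draft", "ready", "in progress", "verified", "closed"]


def section_items(text: str, heading: str) -> list[str]:
    """Return the non-empty bullet items under a '## <heading>' section."""
    match = re.search(rf"^## {re.escape(heading)}\s*\n(.*?)(?=^## |\Z)", text, re.M | re.S)
    if not match:
        return []
    items = [line[2:].strip() for line in match.group(1).splitlines() if line.startswith("- ")]
    return [item for item in items if item]


def check_work_record(text: str, changes_behavior: bool, has_acceptance_record: bool) -> list[str]:
    """Return failure messages for one work record; an empty list means it passes."""
    failures = []
    status = re.search(r"^status:\s*(.+)$", text, re.M | re.I)
    stage = status.group(1).strip().lower() if status else "draft"
    # A verdict only counts once the placeholder is replaced by one concrete value.
    verdict = re.search(r"\*\*Verdict:\*\*\s*`?(PASS|CONCERNS|FAIL)`?", text)
    if stage in STATUSES[1:] and not verdict:
        failures.append("status is Ready or later but there is no readiness verdict")
    if changes_behavior and not has_acceptance_record:
        failures.append("behavior work has no acceptance record")
    if not section_items(text, "In Scope") or not section_items(text, "Out Of Scope"):
        failures.append("scope is empty or out of scope is empty")
    return failures
```

A wrapper script would read the file, call `check_work_record`, print the failures, and exit non-zero if the list is not empty.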

Your check-evidence should fail if completed work has no evidence record, no checks listed, or skipped checks without reasons.

It can start with rules like this:

fail if status is Verified or Closed but no evidence record exists
fail if no checks are listed
fail if skipped checks have no reasons
fail if runtime validation is required but missing
warn if remaining risks are blank
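A comparable sketch for the evidence rules, again assuming the record follows the evidence template from earlier. The function names and parsing approach are hypothetical. The caller decides whether runtime validation is required, the "no evidence record exists" rule is a missing-file check left to the caller, and the remaining-risks warning is omitted.

```python
import re


def section_body(text: str, heading: str) -> str:
    """Return the stripped text under a '## <heading>' section, or '' if absent."""
    match = re.search(rf"^## {re.escape(heading)}\s*\n(.*?)(?=^## |\Z)", text, re.M | re.S)
    return match.group(1).strip() if match else ""


def check_evidence(text: str, runtime_required: bool) -> list[str]:
    """Return failure messages for one evidence record; an empty list means it passes."""
    failures = []
    checks = [
        line for line in section_body(text, "Checks Run").splitlines()
        if line.startswith("- [")
    ]
    if not checks:
        failures.append("no checks are listed")
    for line in section_body(text, "Skipped Checks").splitlines():
        # Expected shape: "- <check> - <reason>", or "- none" when nothing was skipped.
        entry = line[2:].strip() if line.startswith("- ") else ""
        if entry and entry.strip("`") != "none":
            parts = entry.split(" - ", 1)
            if len(parts) < 2 or not parts[1].strip():
                failures.append("skipped checks have no reasons")
                break
    if runtime_required and not section_body(text, "Runtime Validation"):
        failures.append("runtime validation is required but missing")
    return failures
```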

Those rules are not magic. They are guardrails. Their job is to make the agent stop at the same points a good delivery lead would stop.

That is enough for a fresh agent to build a real first version.

Not a toy prompt file. A small system with rules, state, gates, and proof.


Example: a small password reset change

Imagine the task is small:

Add confirmation copy after a user requests a password reset.

The work record says the goal is to show clear confirmation after the reset request. In scope is the success message and any existing tests around that state. Out of scope is changing authentication, changing email delivery, or redesigning the whole login flow.

The acceptance scenario says what the user should observe.

The decision record explains why the copy must not reveal whether the account exists.

The evidence record proves the change was checked.

Now the agent has walls.

If it starts rewriting the auth flow, the work record says no. If it writes copy but never opens the screen, the evidence gate says no. If it changes behavior that was never described in the acceptance scenario, the review step has something concrete to push against.


Work record example

# Work Record

Status: Ready
Task: password-reset-confirmation-copy
Owner: agent
Created: 2026-06-04

## Goal

After a user requests a password reset, they see clear confirmation copy that tells them what to check next.

## In Scope

- Update the confirmation copy shown after password reset request submission.
- Update existing tests or snapshots for that state if they exist.
- Validate the password reset screen in the running app.

## Out Of Scope

- Do not change authentication logic.
- Do not change email delivery.
- Do not redesign the login or password reset flow.
- Do not expose whether the submitted email belongs to an account.

## Readiness Verdict

Pass.

## Risks

- Copy could accidentally reveal account existence.
- Agent could expand scope into auth behavior.

## Planned Checks

- Run relevant UI or acceptance tests.
- Open the password reset flow and submit an email.
- Confirm the browser console has no new errors.

## Expected Evidence

- Test command and result.
- Screenshot or runtime note showing the confirmation state.
- Note confirming no auth or email delivery changes were made.


Acceptance example

# Acceptance Scenarios

Feature: Password reset confirmation
Primary user or system: User who has requested a password reset

## Scenarios

### User sees confirmation after submitting an email

Given a user is on the password reset screen.
When they submit an email address.
Then they see a confirmation message telling them to check their email for next steps.

### Confirmation does not expose account existence

Given a user submits an email address.
When the password reset request is accepted.
Then the confirmation message does not reveal whether the email belongs to an account.

## Edge Cases

- The confirmation works for known and unknown email addresses.
- The confirmation is visible without requiring a page refresh.
- The user can still navigate back to sign in.

## Non-Goals

- Do not change authentication rules.
- Do not change email delivery.
- Do not redesign the login flow.


Decision example

# Decision Record

Date: 2026-06-04
Decision: Use neutral password reset confirmation copy
Status: Accepted

## Context

The password reset flow needs confirmation copy after a user submits an email address. The message must guide the user without revealing whether the submitted email belongs to an account.

## Decision

Use neutral copy:

"Check your email for password reset instructions. If an account exists for that address, instructions will arrive shortly."

## Why

This supports the user's next step while avoiding account-enumeration risk.

## Alternatives Considered

- "We sent you an email."
  Rejected because it implies the account exists.

- "That email is not registered."
  Rejected because it exposes account existence.

## Impact

The agent can update copy and related tests, but must not change authentication, account lookup behavior, or email delivery.
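
One way to keep this decision from drifting is a test that pins the copy and asserts it is the same for known and unknown addresses. The sketch below is hypothetical: `confirmation_message` stands in for whatever function or component your app actually uses, and only the copy string comes from the decision record.

```python
# Copy taken from the decision record.
NEUTRAL_COPY = (
    "Check your email for password reset instructions. "
    "If an account exists for that address, instructions will arrive shortly."
)


def confirmation_message(email: str) -> str:
    """Return the copy shown after a password reset request.

    The email is deliberately ignored so the message cannot vary by account.
    """
    return NEUTRAL_COPY


def test_confirmation_does_not_expose_account_existence():
    # Hypothetical addresses: one that would be registered, one that would not.
    known = confirmation_message("known@example.com")
    unknown = confirmation_message("nobody@example.com")
    assert known == unknown == NEUTRAL_COPY
```

If a later change makes the message depend on account lookup, this test fails before a reviewer has to notice.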


Evidence example

# Evidence Record

Work record: docs/work-records/password-reset-confirmation-copy.md
Date: 2026-06-04
Agent/session: codex

## Checks Run

Command: npm test -- password-reset
Result: Passed

Command: npm run lint
Result: Passed

## Runtime Validation

Opened the password reset screen in the browser, submitted a test email address, and confirmed the neutral confirmation message appeared.

Browser console showed no new errors.

## Skipped Checks

Email delivery test skipped because this change does not alter email sending behavior.

## Changes Made

- Updated password reset confirmation copy.
- Updated the existing test expectation for the confirmation state.

## Remaining Risks

None known.

## Handoff

Future agents should not change the confirmation into account-specific language unless the security decision is revisited.

That is the point of the harness.

Not more ceremony.

Less guessing.


How to grow the harness

Once the minimum loop works, add the next layer.

Don't start by automating everything.

Start by making the correct path obvious.

Then automate the parts the agent keeps getting wrong.


1. Split the top-level guidance into deeper docs

Once AGENTS.md starts getting long, stop adding everything to one file.

Keep the root file as a map, then split deeper guidance into focused docs:

  • planning rules

  • verification rules

  • runtime validation rules

  • design and content rules

  • security rules

  • handoff rules


The top-level file should tell the agent where to look. It should not become another giant prompt.


2. Add a hardening checklist

Create a simple hardening step before implementation.

Take the requirement and review it through the lenses of product, architecture, security, QA, design, and copy.

Ask:

  • What is unclear?

  • What is unsafe?

  • What is untestable?

  • What could cause rework later?

  • What must be fixed before implementation?

  • What risks can proceed if they are named?

At first, this can be a manual checklist. Later, it can become specialist review agents.
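
Even as a manual checklist, the output can be structured so a script or a later review agent can read it. This is a rough sketch, not a prescribed format: the `Finding` shape and the "Fail" verdict are assumptions (the work record example only shows "Pass").

```python
from dataclasses import dataclass

# The six lenses named in the text.
LENSES = ("product", "architecture", "security", "QA", "design", "copy")


@dataclass
class Finding:
    lens: str       # which lens raised it
    note: str       # what is unclear, unsafe, untestable, or likely to cause rework
    blocking: bool  # True: fix before implementation; False: a named risk that can proceed


def readiness_verdict(findings):
    """Return (verdict, blockers, named_risks) for one hardening review."""
    for finding in findings:
        if finding.lens not in LENSES:
            raise ValueError(f"unknown lens: {finding.lens}")
    blockers = [f for f in findings if f.blocking]
    named_risks = [f for f in findings if not f.blocking]
    return ("Fail" if blockers else "Pass"), blockers, named_risks


findings = [
    Finding("security", "Copy could reveal account existence.", blocking=True),
    Finding("QA", "No snapshot test covers the confirmation state.", blocking=False),
]
verdict, blockers, named_risks = readiness_verdict(findings)
print(verdict, len(blockers), len(named_risks))
```

The split mirrors the last two questions in the list: blockers must be fixed first, while named risks travel into the work record.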


3. Add deterministic checks

Anything a script can enforce should become a script.

Start small:

  • block completed work with no evidence

  • block behavior work with no acceptance scenario

  • block work records with empty scope

  • warn when risks have no planned checks

  • block skipped checks with no reason

Do not ask the agent to remember rules the repo can check.
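
A small runner can turn those rules into a gate by mapping failures to a non-zero exit code and letting warnings pass. The sketch below assumes each check is a callable returning `(failures, warnings)` as lists of strings. The check names and the lambdas are placeholders for real checks.

```python
def run_checks(checks):
    """Run every check, print findings, and return a process exit code.

    `checks` maps a check name to a zero-argument callable that returns
    (failures, warnings) as two lists of strings.
    """
    failed = False
    for name, check in checks.items():
        failures, warnings = check()
        for message in warnings:
            print(f"WARN  {name}: {message}")
        for message in failures:
            print(f"FAIL  {name}: {message}")
            failed = True
    # Warnings never block; any failure does.
    return 1 if failed else 0


# Placeholder checks for illustration only.
exit_code = run_checks({
    "check-work-record": lambda: ([], []),
    "check-evidence": lambda: (["skipped checks have no reasons"], ["remaining risks are blank"]),
})
print("exit code:", exit_code)
```

Passing the returned code to `sys.exit` in a pre-commit hook or CI step is one way to make the repo enforce the rule instead of the agent's memory.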


4. Add runtime validation for your stack

Keep it practical.

For web apps, open the page, interact with the flow, inspect the console, and capture screenshots or notes.

For services, send a real request and inspect the response.

For CLIs, run the command and capture output.

For mobile, use the simulator or target device when the change affects real interaction.

Runtime validation should match the thing you are actually building.
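
For the CLI case, the capture step can be a few lines. This is an illustrative sketch: the returned dictionary shape is an assumption, loosely mirroring the `Command:` and `Result:` lines in the evidence example, and the Python interpreter stands in for your own CLI.

```python
import subprocess
import sys


def run_and_capture(command, timeout=60):
    """Run a command and return a small entry suitable for an evidence record."""
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return {
        "command": " ".join(command),
        "result": "Passed" if completed.returncode == 0 else "Failed",
        "exit_code": completed.returncode,
        "output": (completed.stdout + completed.stderr).strip(),
    }


# Stand-in command: the current Python interpreter printing one line.
entry = run_and_capture([sys.executable, "-c", "print('reset email queued')"])
print(f"Command: {entry['command']}")
print(f"Result: {entry['result']}")
```

Keeping the raw output alongside the pass or fail result gives a reviewer something to inspect beyond the agent's summary.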


5. Add specialist review passes

Add reviewers where mistakes are expensive.

You do not need a panel for every tiny change.

But when work touches product behavior, architecture, security, frontend interaction, design, or public copy, separate review lenses catch different failures.

The useful question is not "how many agents can I add?"

The useful question is "which failures are expensive enough to deserve a repeatable reviewer?"


6. Add context scouting

Large searches, log reviews, dependency audits, and impact mapping can pollute the main working context.

Use a scout workflow when discovery is broad.

The scout should return:

  • files to inspect

  • likely edit surfaces

  • relevant existing patterns

  • risks

  • recommended next step

The main agent should work from that compact report, not from every raw search result.
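
The compact report can be given a fixed shape so every scout returns the same five things. The class below is a sketch under that assumption. The field names follow the list above, while the markdown layout and the file path in the usage example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScoutReport:
    files_to_inspect: list = field(default_factory=list)
    likely_edit_surfaces: list = field(default_factory=list)
    existing_patterns: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    recommended_next_step: str = ""

    def to_markdown(self):
        """Render the report as a short markdown note for the main agent."""
        parts = ["# Scout Report"]
        list_sections = [
            ("Files To Inspect", self.files_to_inspect),
            ("Likely Edit Surfaces", self.likely_edit_surfaces),
            ("Relevant Existing Patterns", self.existing_patterns),
            ("Risks", self.risks),
        ]
        for title, items in list_sections:
            parts.append(f"\n## {title}\n")
            # An empty section is stated explicitly rather than left blank.
            parts.extend(f"- {item}" for item in (items or ["None found."]))
        parts.append(f"\n## Recommended Next Step\n\n{self.recommended_next_step}")
        return "\n".join(parts)


report = ScoutReport(
    files_to_inspect=["src/auth/reset-request-form.tsx"],  # hypothetical path
    risks=["Copy could reveal account existence."],
    recommended_next_step="Update the confirmation copy and its existing test.",
)
print(report.to_markdown())
```

Writing "None found." for empty sections distinguishes a scout that looked and found nothing from one that skipped the question.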


7. Add project memory and handoff

Every finished task should leave behind enough state for a fresh agent to continue.

That means:

  • work record updated

  • acceptance scenario updated

  • evidence recorded

  • decisions logged

  • risks named

  • follow-up work separated from completed work


A prompt resets.

A harness compounds.


Common mistakes when building AI agent workflows

The most common mistake is trying to fix unreliable agent work with a bigger prompt.

A few others show up quickly:

  • keeping project memory in chat history

  • letting agents start coding before the requirement is testable

  • treating passing tests as proof when no runtime flow was checked

  • letting agents expand scope silently

  • writing acceptance tests from the implementation instead of the desired behavior

  • calling work done without evidence

  • adding reviewers before you know which mistakes are actually expensive

  • automating the whole process before the manual loop works

  • creating so much process that small work becomes slower than human review

Most of these are not model problems.

They are operating system problems.


Why this matters

The real advantage is not that agents write code quickly.

That part is already obvious.

The advantage is that the system around the agent can compound.

Every finished task leaves behind requirements, acceptance scenarios, decisions, checks, evidence, and lessons that make the next task easier.

A prompt resets. A harness compounds.

That is why I think the next serious wave of AI software work will not be about who has the best prompt library. It will be about who builds the best operating system around their agents.

The model will keep changing.

The tools will keep changing.

The prompt tricks will keep expiring.

But the system around the work will keep getting stronger.

After nearly 20 years building digital products, this just feels like the same lesson in a new form.

Reliable production software does not come from asking one person, human or machine, to hold the whole system in their head.

It comes from building a system that makes good work the default path.

STRATEGIC
SOFTWARE
DESIGN

Let's see if we're a good fit.

We'd love to have a chat about your needs and are happy to meet you at a time that suits.

OVRFLO SSD DIVISION

©

2026

生き甲斐

headquartered in Fremantle, western australia
