6 June 2026

Fixing intent drift: a spec + governance layer for AI-assisted development

Every serious user of AI coding agents knows this moment: three hours into a session, the agent confidently implements something you explicitly ruled out an hour earlier. It didn't get dumber. The decision just scrolled out of relevance. Chat history is where intent goes to die.

I call this failure mode intent drift, and after hitting it enough times (ask me how I know it takes "enough times") I stopped treating it as a prompt-engineering problem. It's a process problem. Human teams solved it decades ago with specs, acceptance criteria and audit trails. AI-assisted development mostly hasn't — so I built AppMaker, an open-source governance layer that sits on top of the agent runtime.

The three bets

Bet one: artifacts beat transcripts. Every feature gets a durable folder — interview, PRD, decomposition, per-slice execution records, retro. The agent reads artifacts before acting, not the conversation's vibes. A session can die, compact or hallucinate; the folder doesn't.
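To make "reads artifacts before acting" concrete, here is a minimal Python sketch of loading a feature folder into a context dictionary. The file names and the slices subfolder are assumptions for illustration; the list above names the artifact types, not their paths, and this is not AppMaker's actual loader.

```python
from pathlib import Path

# Hypothetical file names, one per artifact type named in the post.
ARTIFACTS = ["interview.md", "prd.md", "decomposition.md", "retro.md"]


def load_context(feature_dir: Path) -> dict[str, str]:
    """Read whichever artifacts exist, in a fixed order, keyed by relative name."""
    context: dict[str, str] = {}
    for name in ARTIFACTS:
        path = feature_dir / name
        if path.is_file():
            context[name] = path.read_text(encoding="utf-8")

    # Per-slice execution records, assumed to live in a "slices" subfolder.
    slices_dir = feature_dir / "slices"
    if slices_dir.is_dir():
        for path in sorted(slices_dir.glob("*.md")):
            context[f"slices/{path.name}"] = path.read_text(encoding="utf-8")
    return context
```

Missing artifacts are skipped rather than treated as errors, since a feature that is still in progress will not have a retro yet.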

Bet two: traceability beats trust. Every PRD criterion carries a stable ID (in the style of pcrit-007) that threads through decomposition → acceptance criteria → test names → production code. The payoff is blunt: drift becomes a broken link that a bash script can detect. My checklist gate doesn't ask the model whether requirements are covered; it greps.

Bet three: judgment is a scarce resource, so spend it last. Checks run in tiers: deterministic scripts first, documented human criteria second, LLM judgment only where the first two can't reach. And some judgments never get delegated at all: decisions about identity, money, brand and anything irreversible are structurally marked human_required, so the system routes them to a person instead of guessing.
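The routing order can be sketched as a small function. Only the human_required marker comes from the system described here; the Check fields, the Tier names and the choice to test human_required before everything else are assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HUMAN_REQUIRED = "human_required"            # never delegated
    SCRIPT = "deterministic script"              # tier one
    DOCUMENTED_CRITERION = "documented human criterion"  # tier two
    LLM_JUDGMENT = "llm judgment"                # tier three, last resort


@dataclass(frozen=True)
class Check:
    """Hypothetical description of one check and which tiers can cover it."""
    name: str
    human_required: bool = False
    has_script: bool = False
    has_documented_criterion: bool = False


def route(check: Check) -> Tier:
    """Pick the cheapest tier that can handle the check; LLM judgment comes last."""
    if check.human_required:
        return Tier.HUMAN_REQUIRED
    if check.has_script:
        return Tier.SCRIPT
    if check.has_documented_criterion:
        return Tier.DOCUMENTED_CRITERION
    return Tier.LLM_JUDGMENT
```

Because human_required is tested first, a check marked that way goes to a person even if a script or a written criterion exists for it.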

What happened when it met production

The first real test was a clinical case-management platform. One session, about five and a half hours, took a feature (biopsychosocial risk scoring) from idea to production: 7 slices decomposed, 21 unit tests through honest RED→GREEN cycles, 37 acceptance checkboxes ticked, 4 library deploys. The glossary populated itself with domain terms carrying file:line references. The production code's comments cite the PRD they implement, which sounds like a small thing until you're reading that code six months later.

The part I'm proudest of isn't in that list. The session's case-study file records what broke (skills that claimed to write files but didn't, a zsh variable collision, an outdated Python version) and what was not yet validated, including the full lifecycle through archive. Weeks later that hypothesis list is closed: the same platform has now pushed 6 features through the complete lifecycle, with 22 slices in production. Publishing your unvalidated hypotheses and then closing them beats any benchmark I could quote.

The meta-proof

This website, the one you're reading, was built with AppMaker too: grilled, PRD-ed with 16 testable criteria, decomposed into 9 slices, test-driven, gated. The artifacts sit in the site's own git history. When the agent that built it starts answering your questions about it (coming soon), the loop closes.

Intent drift isn't solved by smarter models. It's solved by giving the model the same thing we give human engineers: a system where decisions are written down, linked, and checked. The difference is that the model actually reads the spec every time.