Case study

AppMaker

A spec + governance layer for AI-assisted development — PRD-to-test traceability, constitution-enforced rules, and deterministic gates, shipped as an open-source Claude Code plugin.

AI coding agents are phenomenal at writing code and terrible at remembering why. AppMaker is my answer: an open-source plugin that gives an AI agent the thing it's missing — an engineering process with receipts.

View the source on GitHub →

Problem

AI-assisted development suffers from intent drift. Business rules decided early in a conversation become invisible later. Edge cases raised once live only in chat transcripts. Acceptance criteria don't exist because nobody wrote them down — so the AI improvises into the gaps, confidently.

The result: code that works in the demo and surprises you in production, with no audit trail of what was decided, by whom, or why. Everyone using AI agents seriously has felt this. Almost nobody has tooling for it.

What I built

AppMaker is a spec + governance layer that sits on top of Claude Code's runtime — 24 single-purpose skills orchestrating the full development lifecycle, from relentless requirement-grilling to archived retros.

The design bet that makes it work: don't fight the runtime, govern it. When Claude Code ships a better primitive, AppMaker delegates to it instead of re-implementing. The durable value is everything the runtime doesn't do:

Artifacts over chat history. Every feature lives in a durable folder: interview, PRD, decomposition, per-slice execution records, retro. The source of truth is on disk and in git — not in a transcript that dies with the session.
Traceable intent, end to end. Every PRD criterion gets a stable ID linked through decomposition → acceptance criteria → tests → production code. Drift stops being a nuance someone forgot and becomes a broken link a script can catch.
Determinism over judgment. Checks run in tiers: bash scripts first, documented human criteria second, LLM judgment last. The checklist gate emits PASS/FAIL/WARN with file-level evidence — never vibes.
The judgments that must stay human, stay human. Decisions touching identity, money, brand or anything irreversible are structurally marked human_required, and the system routes them to a person instead of guessing. There's even a four-voice decision council in which three fresh subagents argue the Skeptic, Pragmatist and Critic positions with zero access to the conversation, so they can't be anchored by what the main agent already believes.
Memory that compiles. Retros distill lessons into a project wiki that every later workflow reads before planning, so the agent on feature five starts with more project context than it had on feature one.
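To make the traceability and determinism ideas concrete, here is a minimal sketch of a broken-link check. It is not AppMaker's actual script: the `AC-012` style ID format, the `tests/` path convention, and the names `check_traceability` and `Finding` are all assumptions made for illustration. It shows how a plain script, with no LLM involved, could turn a forgotten criterion into a FAIL with file-level evidence.

```python
import re
from dataclasses import dataclass

# Hypothetical ID format (e.g. "AC-012"); AppMaker's real scheme may differ.
ID_PATTERN = re.compile(r"\bAC-\d{3}\b")


@dataclass
class Finding:
    criterion: str
    status: str     # "PASS", "WARN" or "FAIL"
    evidence: list  # paths of files that mention the criterion


def check_traceability(prd_text, files):
    """Check every criterion ID in the PRD against the files that cite it.

    `files` maps a path to that file's text. Paths under "tests/" are
    treated as tests (an assumed layout); everything else is production code.
    """
    findings = []
    for criterion in sorted(set(ID_PATTERN.findall(prd_text))):
        hits = sorted(path for path, text in files.items() if criterion in text)
        tests = [path for path in hits if path.startswith("tests/")]
        if tests and len(hits) > len(tests):
            status = "PASS"  # cited by at least one test and by production code
        elif tests:
            status = "WARN"  # tested, but no production code cites it
        else:
            status = "FAIL"  # no test cites it: the link is broken
        findings.append(Finding(criterion, status, hits))
    return findings


# Illustrative data only, not taken from a real project.
prd = (
    "AC-001 Score is capped at 100.\n"
    "AC-002 Missing answers count as zero.\n"
    "AC-003 Scores are recalculated on edit.\n"
)
files = {
    "tests/test_score.py": "# covers AC-001\n# covers AC-002\n",
    "src/score.py": "# implements AC-001\n",
}
findings = check_traceability(prd, files)
for finding in findings:
    print(finding.status, finding.criterion, finding.evidence)
```

The point of the design is that the verdict depends only on what is on disk. A criterion nobody tested cannot pass because a model believes it was handled.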

Plays well with others

AppMaker is deliberately a thin layer with sharp integrations, each opt-in:

Claude Code runtime — skills, hooks, subagents, plugin system; AppMaker adds lifecycle, not infrastructure
Graphify (third-party knowledge graph) — instead of grepping blindly, the agent queries a dependency graph of modules, communities and docs; AppMaker consumes it strictly read-only and persists small versioned context packets
A real browser (gstack runtime) — UI changes ship with screenshot evidence, responsive checks and emulated-media tests, not "looks fine to me"
GitHub CLI — backlog can live as issues; research gates can pull from real repos
Ref / documentation MCP — architecture decisions require an options matrix sourced from current official docs, not the model's training data
Security scanners — a dedicated gate runs whatever you have (npm audit, gitleaks, semgrep) before anything ships, deterministic facts first, LLM overlay second
A local Studio cockpit — a small web UI over the project's JSON state: feature progress, evidence, parallel-execution waves
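The "runs whatever you have" behaviour of the security gate can be sketched roughly as follows. This is an assumption-laden illustration, not the plugin's implementation: the function name `run_security_gate`, the PASS/FAIL/WARN mapping, and the exact command lines are my own choices, and the commands may need adjusting for the scanner versions you have installed. It only covers the deterministic tier, where exit codes decide the verdict before any LLM overlay is applied.

```python
import shutil
import subprocess

# Assumed command lines; check them against your installed versions.
SCANNERS = {
    "npm audit": ["npm", "audit"],
    "gitleaks": ["gitleaks", "detect", "--source", "."],
    "semgrep": ["semgrep", "scan", "--config", "auto", "--error"],
}


def run_security_gate(scanners=SCANNERS, which=shutil.which, run=subprocess.run):
    """Run each installed scanner and derive a verdict from exit codes alone.

    `which` and `run` are injectable so the gate can be exercised in tests
    without any scanner being installed.
    """
    results = {}
    for name, cmd in scanners.items():
        if which(cmd[0]) is None:
            results[name] = "WARN"  # not installed: skipped, but kept visible
            continue
        proc = run(cmd, capture_output=True, text=True)
        # A non-zero exit code means findings or a tool error; both block the gate.
        results[name] = "PASS" if proc.returncode == 0 else "FAIL"
    verdict = "FAIL" if "FAIL" in results.values() else "PASS"
    return verdict, results
```

Calling `run_security_gate()` from a project root returns an overall verdict plus a per-scanner status. Reporting a missing scanner as WARN rather than silently skipping it keeps the gap on the record.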

Proof

It runs a real production system. Cassie — a clinical case-management platform — is developed entirely with AppMaker: 6 features through the complete lifecycle (grill → PRD → slices → TDD → review → archive) and 22 production slices shipped in its first weeks. The very first feature (biopsychosocial risk scoring) went from idea to production in a single ~5.5-hour session: 7 slices decomposed, 21 unit tests through real RED→GREEN cycles, 37/37 acceptance criteria checked off, 4 library deploys — exercising every plugin command available at the time.
Side effects you can audit. The glossary self-populated with domain terms carrying file:line references; production code comments cite the PRD artifacts they implement — the why survives long after the session ends.
Tested like a product. 29 smoke-test suites (~2,000 lines of shell) cover hooks, gates, materialization, traceability checks and doc-drift detection.
Audited, with the findings published. The repo carries external audit reports (a method-vs-plugin gap analysis and an adoption review), with findings tracked openly as numbered design decisions instead of buried in a changelog. The case-study file even lists what is not yet validated. I find that more convincing than any benchmark.
Dogfooded three times over. AppMaker develops itself: its own feature folders, backlogs and retros live in the repo. Cassie is built with it and runs in production. And this very website was grilled, PRD-ed, decomposed and test-driven with AppMaker; the artifacts are part of the site's own git history.
Open source, MIT, self-contained. One init command materializes the whole governance tree — constitution, glossary, memory wiki, templates, hooks — into any project, greenfield or brownfield.

github.com/paweldobrzynski/AppMaker →