Four Agentic Design Workflows in Production: What Actually Happened

Section 01

Why traced narratives beat tool reviews

Most writing about agentic design work falls into one of two genres. The first is the tool review: a feature list, a pricing table, and a verdict written after an afternoon of poking at a demo. The second is the capability claim: a screenshot of one good output presented as if it were the median output. Both genres are easy to produce and almost useless for the question a working designer actually has, which is what happens when you run one of these workflows for real — across days, with deadlines, with the boring parts included.

This article is an attempt at a third genre. It traces four agentic design workflows that ran to completion, or to a shippable state, in the same window of just over two weeks in May and June 2026: the editorial site you are reading, a multi-agent pipeline that produced a full-length book, a pipeline that turns book chapters into narrated videos, and the design-system layer that keeps the site's surfaces consistent. All four are the same practitioner's own projects. None of them involves a client, a budget approval, or anyone else's confidential material, which is exactly why they can be traced honestly: the git history, the configuration files, the audit output, and the failure record are all available to quote.

The four cases cover different media on purpose. A long-form content site, a print-and-EPUB document, a rendered video, and a token system are about as different as design deliverables get, and yet — this is the article's central finding — the workflow shape underneath them barely changes. Each one starts with a written contract, generates against it, accumulates executable gates where things break, and keeps a human approval step in writing. The differences are in the tools plugged into each layer, not in the layers themselves.

If you want the deeper how-to for any single layer, this site already has it: the harness pattern, the design-system maintenance loop, the design-to-code pipeline, and the research-packet workflow each have their own article. This one is the proof layer — what those patterns look like when they are not an example but a Tuesday.

Projects to inspect

Build a Design Harness Before You Prompt
The file-by-file walkthrough of the harness pattern that case 1 and case 4 rely on.

Research Packets for Agentic Design
The research-and-evidence workflow this article itself was produced with.

Section 02

The evidence rules this article follows

Before the cases, the rules — because a case study is only as good as what it refuses to claim. Every timeline in this article comes from git history: calendar spans, commit counts, commit subjects, and the dates on audit files and review documents. Git records when work was committed and what it touched. It does not record hours of attention, the number of agent sessions, model or API spend, or how much of any file a human rewrote by hand. Those numbers are not in this article, because they are not in the evidence. Where a claim depends on something only the author can confirm, it is either omitted or labeled as such.

Failures are quoted from the record, not reconstructed for narrative effect: deploy-fix commits, a content-correction commit, an export-fix cluster, a template audit that flagged broken templates, an architecture review that counted hard-coded colors, and an integration that returned an HTTP 403 and forced a fallback. Numbers from point-in-time audits are date-stamped, because they describe the project on that day, not forever.

Two scope notes. First, the editorial plan behind this series originally listed a branded slide deck as one of the four cases; no deck project with a traceable history exists in this material, so the second case is the book and document pipeline that actually shipped. Second, the book pipeline and the video pipeline live in the author's own private repositories. They are described and excerpted here as first-party projects — sanitized, with no local paths or credentials — but they are not public repositories you can clone, and this article does not pretend otherwise.

Timelines: calendar spans and commit counts from git history only.
Failures: real fix commits, audit findings, and rejected integrations, quoted with dates.
Costs and hours: not claimed, because the evidence does not contain them.
No client, revenue, or commercial framing — none of the four projects has any.
Sibling projects are described as first-party work, not as public repositories.
Point-in-time numbers (template audits, compliance counts) carry their capture date.
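
The timeline rule above can be sketched in a few lines. This is a minimal Python illustration, not the tooling used for this article: it assumes log lines in the shape produced by `git log --date=short --pretty=format:'%ad|%s'`, assumes fix commits start with the word "fix", and runs on made-up sample lines rather than any of the four repositories.

```python
from datetime import date

# Illustrative sample only: invented lines in the shape produced by
#   git log --date=short --pretty=format:'%ad|%s'
SAMPLE_LOG = """\
2026-01-03|fix: repair build on host
2026-01-02|feat: add article system
2026-01-02|docs: record verification report
2026-01-01|feat: placeholder page
"""


def summarize(log_text: str) -> dict:
    """Return calendar span, commit count, and fix count from dated subjects."""
    rows = [line.split("|", 1) for line in log_text.splitlines() if line.strip()]
    days = sorted(date.fromisoformat(day) for day, _ in rows)
    # Assumption: a fix commit is any subject that starts with "fix".
    fixes = [subject for _, subject in rows if subject.lower().startswith("fix")]
    return {
        "commits": len(rows),
        "first_day": days[0].isoformat(),
        "last_day": days[-1].isoformat(),
        "calendar_days": (days[-1] - days[0]).days + 1,
        "fix_commits": len(fixes),
    }


summary = summarize(SAMPLE_LOG)
print(summary)
```

Everything the function returns is something git actually records. Hours, sessions, and spend have no field here for the same reason they have no number in this article.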

Section 03

Case 1: The editorial site — placeholder to deployed in three calendar days

The first case is the most self-referential and the easiest to verify: it is this site. The repository's first commit is a placeholder page on May 31 at 22:27. Twenty-seven commits later, by the morning of June 2, the main branch held a deployed Next.js App Router site with Tailwind CSS v4 and shadcn/ui, a shared content manifest, a long-form article system, the first batch of articles, a Buttondown newsletter integration, a static verification script, and a 97-file scaffold that brought in the design-system tooling and skill catalog. A larger editorial layer — more article modules, research-run folders, and canvas-sync scripts — existed in the working tree at the time of writing and is called out separately wherever this article relies on it.

The tool choices were boring on purpose. Coding agents driven from the terminal did the generation; Next.js, Tailwind v4, and shadcn/ui were chosen because the agent tooling and the component scaffold already understood them; Railway hosts the deployment; Buttondown handles email. The interesting choice is what was written before any page existed: a short AGENTS.md with the stack, the source-of-truth pointers, and the component rules, and a DESIGN.md with exact OKLCH tokens, typography, spacing, radius ceilings, and anti-pattern rules. The morning of June 1 — before the visual scaffold landed — also added a plain Node verification script, so the structural contract of the site was executable from roughly the seventh commit onward.

The timeline strip below shows the texture. The first cluster, on the morning of June 1, is information architecture, content, and verification. Midday brings the 97-file scaffold integration and, immediately after it, three consecutive fix commits — the deploy did not survive first contact with the hosting platform. The evening cluster is the article system and the first five long-form pieces. The next morning starts with a fix commit titled "Stabilize local dev startup." That is what a fast agent-built project actually looks like in the log: dense feature bursts with small fix clusters stitched between them, not a clean staircase of features.

The pivotal artifact is AGENTS.md. It is short — under forty lines — and almost everything in it is either a fact the agent needs or a rule that is cheap to state and expensive to violate. The design-relevant excerpt below is the real file; the deeper walkthrough of the full harness, including DESIGN.md and the skills catalog, lives in the harness article and is not repeated here.

AGENTS.md (design-relevant excerpt from this repository)

## Project UI Stack

- The application stack is Next.js App Router, Tailwind CSS v4, shadcn/ui New York style, and lucide-react icons.
- DESIGN.md is the design source of truth. Keep color, typography, spacing, radius, shadow, and layout decisions aligned to it.
- .agentic-designer.json declares the active library and brand for agentic-designer tooling.
- Prefer shared data from content/site.ts instead of duplicating navigation, books, tracks, and newsletter details in page code.

## shadcn Rules

- Use shadcn source components from components/ui before custom primitives.
- Use semantic Tailwind tokens such as bg-primary, text-muted-foreground, border, and bg-card.
- Use gap-* for layout spacing; avoid space-x-* and space-y-*.
- Use Badge for labels/status, Card composition for content modules, Button for actions, Input and Label for forms, and Separator for dividers.

Diagram: Four production timelines, from git history

Calendar spans, commit counts, fix clusters, and gate events for all four cases, reconstructed from each repository's git history. Hours and costs are absent because git does not record them.

Section 04

Case 1: what broke, what caught it, what would change

The site's failure record is small but instructive, and all of it is in the log. The three consecutive deploy fixes after the scaffold integration were not design failures; they were packaging failures — a dependency that needed to come from the registry rather than a local path, and two native binaries (the CSS compiler and the Tailwind engine) that the hosting platform's build environment did not have. Nothing in the harness could have predicted them, and no local gate caught them, because the failing environment was the deployment platform itself. They were caught the only way they could be: by deploying early, while the site was still small enough that a broken build cost minutes.

The second class of failure is more interesting for designers: a commit on the evening of June 1 titled "Correct agent harness file guidance." One of the freshly generated articles described the project's own configuration files inaccurately — confidently, fluently, and wrong. The verification script passed it, the type checker passed it, the audit passed it, because none of them check whether prose is true. It was caught by the human editorial pass, which is the gate this site cannot automate. The next-day "Stabilize local dev startup" fix is the same lesson from the tooling side: agent-assembled infrastructure works until the first cold start on a different morning.

What would change on a rerun: deploy to the real platform before integrating the large scaffold, not after, so packaging surprises arrive one at a time; and commit the editorial layer in smaller, reviewable slices instead of letting article modules and research folders accumulate in the working tree. A fast project that leaves its most valuable content uncommitted has, in evidence terms, not produced it yet.

Section 05

Case 2: The book pipeline — thirteen roles in one configuration file

The second case is a document pipeline rather than the slide deck this series originally planned, because the document pipeline is the one that actually exists and shipped. The project, book-writer, is a multi-agent framework whose entire definition of a book lives in one JSON file. For the book it produced — a fourteen-chapter trade book on agentic design with three appendices — that file names thirteen agent roles, from architect and researcher through drafter, reviewer, reviser, and consistency checker, to a design planner, visual designer, assembler, exporter, and three screenshot roles for storyboarding, capturing, and integrating images. Every role runs on the OpenCode platform with the same model assigned in config, and the review loop is parameterized in the same file: a chapter must score at least seven from the reviewer, with at most three revision passes.

The output tree is the kind of evidence a case study wants: per-chapter HTML files, an assembled HTML edition, a Paged.js-rendered PDF, an EPUB, a generated cover, per-chapter audiobook MP3s, and a first-chapter course video. The git history covers fifty-three commits between May 18 and May 30, in two distinct clusters. The first, May 18 to 21, builds the framework, the cover system, the export path, and — this is where the fix commits concentrate — the publication gates. The second, May 28 to 30, adds something that was not in the original design at all: an approval-gated pipeline for folding new source clips into an already-published book, with an explicit run manifest and a written rule that nothing is ever published directly from raw clips.

The pivotal artifact is the configuration excerpt below, sanitized to the fields that matter for the workflow shape. Read it less as a recipe and more as a statement about where the author decided judgment should live: the roles, the pass threshold, and the iteration cap are all in config, which means changing the editorial standard is a reviewable diff rather than a vibe.

book-config.json (sanitized excerpt: roles, platform, review thresholds)

{
  "platform": "opencode",
  "models": {
    "architect": "glm-5.1",
    "researcher": "glm-5.1",
    "drafter": "glm-5.1",
    "reviewer": "glm-5.1",
    "reviser": "glm-5.1",
    "consistency_checker": "glm-5.1",
    "design_planner": "glm-5.1",
    "visual_designer": "glm-5.1",
    "assembler": "glm-5.1",
    "exporter": "glm-5.1",
    "screenshot_storyboarder": "glm-5.1",
    "screenshot_capturer": "glm-5.1",
    "screenshot_integrator": "glm-5.1"
  },
  "formats": ["pdf", "html", "epub"],
  "chapters": 16,
  "word_target_per_chapter": 3000,
  "max_review_iterations": 3,
  "review_pass_score": 7,
  "screenshots": { "enabled": true, "max_per_chapter": 8, "min_per_chapter": 1 }
}
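
To make the two review fields concrete, here is a hedged Python sketch of how a loop driven by `review_pass_score` and `max_review_iterations` might behave. It is not the book-writer implementation. The function name, the reviewer and reviser callbacks, and the reading of the cap as "at most three revision passes after the first review" are all assumptions made for illustration.

```python
from typing import Callable


def review_loop(
    draft: str,
    review: Callable[[str], int],
    revise: Callable[[str], str],
    pass_score: int = 7,
    max_revisions: int = 3,
) -> tuple[str, int, int, bool]:
    """Review once, then revise and re-review until the score passes or the cap is hit.

    Returns (final_text, final_score, revisions_used, passed).
    """
    text = draft
    score = review(text)
    revisions = 0
    while score < pass_score and revisions < max_revisions:
        text = revise(text)
        revisions += 1
        score = review(text)
    return text, score, revisions, score >= pass_score


# Stand-in reviewer and reviser so the sketch runs without any model calls.
def toy_review(text: str) -> int:
    return min(10, 4 + text.count("[revised]"))


def toy_revise(text: str) -> str:
    return text + " [revised]"


final_text, final_score, revisions_used, passed = review_loop(
    "chapter draft", toy_review, toy_revise
)
print(final_score, revisions_used, passed)
```

Because the threshold and the cap are parameters rather than prompt wording, tightening the editorial standard means changing two numbers in a file that shows up in a diff.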

Section 06

Case 2: the hard part was never the prose

If you read only the commit subjects, you would struggle to tell this was a writing project. Eighteen of the fifty-three commits begin with "fix," and almost none of them are about chapters. They are about the export and publication boundary: PDF images that did not render, duplicate footers in the paged output, pagination that would not stabilize, a preflight check for missing images, a guardrail to publish only final release artifacts, a cover contract for the public edition, stripped base hrefs in the published HTML. The drafting agents produced chapter prose that passed its review threshold inside the configured loop; the thing that consumed two days of fixes was turning that prose into a PDF a person would accept.

The second failure is quieter and more important. The May 28–30 cluster — the content-update pipeline for folding new clipped material into the book — exists because the first attempts at updating a published book showed how easily generated updates drift into source phrasing, repeated boilerplate, and internal workflow metadata leaking into public prose. The response was not a better prompt. It was governance written into the repository's AGENTS.md: a six-step flow in which clips are analyzed into candidates, a human reviews and approves them, only approved candidates are applied to drafts, and only then are artifacts exported, published, and archived. The guardrail list even bans specific words and metadata patterns from public prose. Approval gates were added after the failure, in writing, where every future agent session reads them.
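
The core of that governance, that only approved candidates reach a draft and that banned terms block public prose, can be sketched as a small filter. This Python sketch is an illustration under assumptions: the `Candidate` type, the `apply_approved` function, and the two banned terms are hypothetical placeholders, not the project's real data model or ban list.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    clip_id: str
    proposed_text: str
    approved: bool = False  # set only by a human reviewer


# Hypothetical placeholder terms; the real ban list lives in the project's AGENTS.md.
BANNED_TERMS = ("run manifest", "candidate id")


def apply_approved(draft: str, candidates: list[Candidate]) -> str:
    """Append only human-approved candidates that contain no banned terms."""
    for candidate in candidates:
        if not candidate.approved:
            continue  # unapproved material never reaches the draft
        lowered = candidate.proposed_text.lower()
        leaked = [term for term in BANNED_TERMS if term in lowered]
        if leaked:
            raise ValueError(f"{candidate.clip_id}: banned terms in public prose: {leaked}")
        draft = draft + "\n" + candidate.proposed_text
    return draft


updated = apply_approved(
    "Chapter 1.",
    [
        Candidate("clip-1", "New approved paragraph.", approved=True),
        Candidate("clip-2", "Unreviewed paragraph."),
    ],
)
print(updated)
```

The design choice worth noticing is that the default is rejection: a candidate with no recorded human decision is skipped, so forgetting to review something cannot publish it.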

What would change on a rerun: build the export preflight and the publication guardrails before generating fourteen chapters, not after. The pipeline's review loop guarded prose quality from the start; nothing guarded the boundary where prose becomes a deliverable until that boundary had already broken a few times.

Projects to inspect

From Canvas to Production: The Design-to-Code Pipeline, Honestly
The same lesson (gates cluster at the boundary where artifacts become deliverables) applied to UI code.

Section 07

Case 3: The book-to-video pipeline and its almost one-to-one fix ratio

The third case turns the same book into video. The agentic-video project is a pipeline plus an editor: pipeline stages for ingesting chapter content, matching it to templates, composing shots, generating narration and text-to-speech, selecting music, producing subtitles, rendering, and validating the result; a library of 190 HTML-and-GSAP video templates; a music library of generated underscore tracks whose creation prompts are stored alongside the audio; and a timeline editor for assembling and reviewing the result. Rendering happens by driving the HTML templates in a headless browser and assembling frames with ffmpeg. The project deliberately stayed HTML-native rather than adopting a React-based video framework; the trade-offs of that choice belong to the upcoming motion article and are not re-litigated here.

The git history is six days, May 27 to June 1, and seventy-eight commits — thirty-five of which start with "fix" against thirty-four that start with "feat." That near one-to-one ratio is not a confession; it is what building a render pipeline and an editor simultaneously looks like, with dated plan documents in the repository recording three successive rebuilds of the editor's timeline model along the way. The pivotal artifact is the project's DESIGN.md, because it does something the other cases do not need: it defines two complete brand themes — a dark primary and a light secondary — as CSS custom properties, so any of the 190 templates can switch theme at render time, and the editor chrome reuses the same tokens.

The reason that file is the pivotal artifact is also the reason this case has the best failure data of the four, covered in the next section: the project wrote down its own design contract and then measured itself against it.

agentic-video DESIGN.md (sanitized excerpt: the two-brand token contract)

## 1. Identity

- Base System: shadcn (Tailwind v4) + Hyperframes templates
- Description: Book-to-video pipeline + studio editor. Two brand themes ship
  together: a premium dark primary theme and a clean light secondary theme.
  The editor chrome follows the dark theme; the light theme is used for
  rendered video scenes that prefer a light aesthetic.
- Philosophy:
  - Pill-style rounded buttons for primary CTAs.
  - No gradients on text; reserve gradients for background drift fields only.
  - Real seeded data in every state — empty states are bug-hiders.

## 2. Color

The video templates use CSS custom properties (--primary, --accent, etc.)
so that any template can switch themes at render time.

| Token              | Dark theme value           | Role                        |
| ------------------ | -------------------------- | --------------------------- |
| --bg               | #0b0b0f                    | Background, video canvas    |
| --text             | #f7f4ef                    | Primary foreground          |
| --text-secondary   | #a8a29e                    | Captions, labels            |
| --primary          | #d1e7dd                    | Primary accent              |
| --accent           | #e0a3c9                    | Secondary accent            |
| --accent-secondary | #00d9ff                    | Tertiary highlight          |

Section 08

Case 3: the audit that graded its own template library

On May 30 the project ran an automated audit over its template library: render each template headlessly, check the frames for signal, and record the result. The output file, generated that evening, counts 156 templates as ok, 26 as empty, and 8 as having dead animations — meaning roughly one in six templates in the library would have produced a blank or static segment if the matching stage had selected it. Each entry carries the measured evidence: frame-difference scores and the percentage of dark or background-only pixels. Without that gate, the failure mode is brutal because it is silent — a finished video with a few seconds of nothing in the middle, discovered after rendering, narration, and music are already assembled.
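
A classification step like the one described above might look roughly like the following Python sketch. The thresholds, field names, and status labels are hypothetical, because the project's real cutoffs are not quoted here. The counts at the bottom are the ones reported above, used only to check the "roughly one in six" arithmetic.

```python
from dataclasses import dataclass


@dataclass
class FrameMetrics:
    frame_difference: float      # mean change between sampled frames, 0..1
    background_only_pct: float   # share of background-only pixels, 0..100


# Hypothetical thresholds for illustration only.
EMPTY_BACKGROUND_PCT = 99.0
DEAD_FRAME_DIFFERENCE = 0.001


def classify(metrics: FrameMetrics) -> str:
    """Label a rendered template as empty, dead_animation, or ok."""
    if metrics.background_only_pct >= EMPTY_BACKGROUND_PCT:
        return "empty"
    if metrics.frame_difference <= DEAD_FRAME_DIFFERENCE:
        return "dead_animation"
    return "ok"


# Counts reported for the 2026-05-30 audit in the text above.
counts = {"ok": 156, "empty": 26, "dead_animation": 8}
total = sum(counts.values())
failing_share = (counts["empty"] + counts["dead_animation"]) / total
print(total, round(failing_share * 100, 1))
```

With those counts, 34 of 190 templates fail, which is about 17.9 percent of the library.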

Two days later, on June 1, an architecture review document recorded the slower-burning problems: three diverging copies of the timeline type definitions, a hard-coded substitution table that fails silently when a binding is missing, and — most relevant to this site's readers — the gap between the project's design contract and its implementation. DESIGN.md defines the tokens, but the templates carry 115 hard-coded color literals as fallbacks plus raw hex values, hard-coded spacing and radii, and dozens of override rules per template, leaving token compliance at roughly twenty percent. The review's first recommendation is the obvious one: a single token source feeding the templates, the editor theme, and DESIGN.md, with a lint gate to keep it that way. Both numbers are point-in-time measurements from those dates and will drift as the project moves.

What would change on a rerun: tokenize the template library from the first template, not the hundred-and-ninetieth, and define the timeline type once. Both findings repeat the meta-pattern from case 2: generation was never the bottleneck. A library of 190 templates and a multi-stage pipeline existed within days; the rework went into the boundaries where generated parts have to agree with each other.

Table: What the May 30 template audit recorded

Templates audited: 190
Passing (ok): 156
Empty output: 26
Dead animation: 8
Render errors: (no value given)
Evidence per entry: frame-difference score, dark-pixel %, background-only %

Counts from the project's own template-audit output, generated 2026-05-30. Each template entry carries measured frame evidence; the audit ran before the templates were trusted in composed videos.

Section 09

Case 4: The design-system layer and the gate that caught the error

The fourth case is the layer underneath case 1, and it is the smallest in scope but the clearest demonstration of why gates matter. The site's design system is not a Figma library; it is a set of repository artifacts: DESIGN.md with exact OKLCH tokens and anti-pattern rules, the semantic CSS variables in globals.css, the shadcn primitives plus the site's own editorial card components, a machine-readable component inventory, and two executable commands — the structural verify script and a token audit that fails on hard-coded colors. On top of that sit canvas-sync scripts that regenerate design-canvas artifacts from the codebase, so the inspectable design surfaces follow the code rather than the other way around.

Two pieces of evidence anchor this case. The first is an integration failure: the plan was to push the generated canvas payload to Figma through the remote MCP endpoint, and in this environment, at the time of writing, that endpoint rejected the OAuth registration with an HTTP 403. The fallback, documented in the repository's design-canvas notes, is a small local Figma plugin that reads the same generated payload and writes native frames. The lesson is not about any one vendor; it is that an agentic workflow needs a degraded path for every external integration, because the integration's availability is not under your control.
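
The degraded-path idea reduces to a very small pattern. The Python sketch below is generic and hypothetical: `deliver`, `IntegrationError`, and the two stand-in functions are invented for illustration and do not call Figma or any real endpoint.

```python
from typing import Callable


class IntegrationError(Exception):
    """Raised by a delivery step that could not reach its target."""


def deliver(
    payload: dict,
    primary: Callable[[dict], str],
    fallback: Callable[[dict], str],
) -> tuple[str, str]:
    """Try the primary integration; on failure, record why and use the fallback."""
    try:
        return "primary", primary(payload)
    except IntegrationError as error:
        print(f"primary path failed ({error}); using fallback")
        return "fallback", fallback(payload)


# Stand-ins: a remote push that is rejected and a local export that works.
def remote_push(payload: dict) -> str:
    raise IntegrationError("HTTP 403")


def local_export(payload: dict) -> str:
    return f"wrote {len(payload['frames'])} frames locally"


path, result = deliver({"frames": ["home", "article"]}, remote_push, local_export)
print(path, result)
```

Both paths consume the same generated payload, which is what keeps the fallback cheap: only the last hop changes.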

The second piece of evidence is an executed token-propagation run, performed on a working copy on June 1 and written up in full in the design-system maintenance article. The accent token was changed in DESIGN.md and globals.css; an agent propagated the change with explicit scope rules; then the run was repeated with one rule removed, and the agent did the helpful, wrong thing — it wrote the new color directly into a component as a hard-coded hex value. The audit gate flagged the file and line, named the expected fix, and exited non-zero. The same run also surfaced the gate's blind spot: an equivalent raw value written in a different color syntax passed at current settings. The output below is quoted from that captured run.

Projects to inspect

Design Systems That Maintain Themselves (Almost)
The full write-up of the token-propagation case, including the prompt, the diff, and the gate's blind spot.

Audit gate output from the executed propagation run (working copy, 2026-06-01)

$ npx @imehr/agentic-designer audit app --fail-on error

app/workflows/page.tsx
  L67: [token:color] #1f6e57
    Expected: A CSS variable reference (e.g., var(--color-primary))

Total: 1 violations (0 slop, 1 token)
exit code: 1

# after reverting the stripe to the lab-green token reference

$ npx @imehr/agentic-designer audit app --fail-on error
No violations found.

$ npx @imehr/agentic-designer audit components --fail-on error
No violations found.
exit code: 0
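
To show why a gate like this has a blind spot, here is a minimal Python sketch of a hex-literal check. It is not the agentic-designer implementation. The regular expression, the function names, and the choice of an rgb() form as the "different color syntax" are assumptions made for illustration.

```python
import re

# Matches 3-, 4-, 6-, or 8-digit hex color literals such as #1f6e57.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b")


def audit(path: str, source: str) -> list[str]:
    """Return one finding per hard-coded hex color, with file and line."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOR.finditer(line):
            findings.append(f"{path} L{number}: [token:color] {match.group(0)}")
    return findings


def exit_code(findings: list[str]) -> int:
    """Non-zero when anything was flagged, so a script or CI job can fail on it."""
    return 1 if findings else 0


hex_version = '<div style={{ borderColor: "#1f6e57" }} />'
rgb_version = '<div style={{ borderColor: "rgb(31 110 87)" }} />'  # same color, other syntax
token_version = '<div className="border-primary" />'

for label, source in [("hex", hex_version), ("rgb", rgb_version), ("token", token_version)]:
    findings = audit("example.tsx", source)
    print(label, exit_code(findings), findings)
```

The hex version is flagged and exits 1. The rgb version is the same raw color and exits 0, exactly like the token version. A check that only knows one syntax cannot tell a compliant file from one that merely avoids that syntax.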

Section 10

The tool-selection pattern: same four layers, four times over

Lay the four cases side by side and the tool choices stop looking like preferences and start looking like a pattern. Every project, regardless of medium, filled the same four slots: a harness or contract layer that is written before generation starts; a generation layer where agents do the volume work; a review-gate layer of executable checks that accumulate where things have already broken; and a delivery or export layer where the artifact leaves the repository and meets the world.

The matrix below names what filled each slot in each case, and every cell points at something inspectable — a file, a command, or an output artifact. That inspectability is the selection criterion that mattered most in practice. Tools earned their place by being legible to an agent and checkable by a script: Markdown contracts over tribal knowledge, JSON configuration over UI settings, CLI gates over manual checklists, and file-based outputs over platform-locked ones. Where a tool could not be made inspectable or kept failing — the remote canvas push is the clearest example — it was replaced by something more boring that could be.

The matrix is also a useful planning device in reverse. If you are starting a new agentic workflow and cannot say what will fill the contract slot and the gate slot, the project is not ready for an agent to do volume work in it; the volume will arrive, and nothing will be in place to tell you which parts of it to trust.

Table: Tool-selection pattern matrix. A four-by-four matrix mapping the editorial site, book pipeline, video pipeline, and design-system layer against four workflow layers (harness and contract, generation, review gates, and delivery), with the actual files, commands, and artifacts named in each cell.

Section 11

Meta-patterns: what repeated across all four

The first repeated pattern is harness before content. Every project's earliest substantive commits are contract files, not deliverables: AGENTS.md and DESIGN.md before the site had pages, the book configuration and framework before any chapter, the video project's design contract and template metadata alongside its first templates, the token layer before the components that consume it. The contract is also the artifact that survives: sessions end, context windows reset, but the next session reads the same files and starts from the same rules. Output quality across the four projects tracked the specificity of those files far more closely than it tracked anything about prompting technique.

The second pattern is gates, not vibes. Each pipeline grew executable checks exactly where it had already failed: a structural verify script and a token audit on the site, image preflight and publication guardrails on the book, a template audit and a validation stage on the video pipeline, the audit-plus-typecheck pair on the design system. None of these gates was installed speculatively on day one; each was added after a real failure, and each then ran on every subsequent change. The corollary showed up in every case too: a passing gate only means the encoded checks found nothing. The site's gates passed prose that was factually wrong; the design-system audit missed a raw value in an unanticipated syntax; the video audit grades signal, not taste. Human review stayed in the loop in all four, and in writing.

The third pattern is the fix-commit reality. Across roughly 160 commits in the three repositories with full histories, fix commits make up between a third and a half, and they cluster at export, render, and deploy boundaries — the places where generated work meets an environment that does not negotiate. Drafting, code generation, and template generation were almost never the bottleneck. If you are budgeting an agentic project, budget for the boundaries.
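
The "between a third and a half" range can be checked against the two fix counts this article states explicitly. The Python lines below use only those figures. The site's fix count is not given as a single number above, so it is left out rather than estimated.

```python
# Commit and fix counts as stated in the case sections above.
repos = {
    "book pipeline": {"commits": 53, "fix": 18},
    "video pipeline": {"commits": 78, "fix": 35},
}

# Fix share per repository, as a percentage rounded to one decimal place.
shares = {
    name: round(100 * counts["fix"] / counts["commits"], 1)
    for name, counts in repos.items()
}
print(shares)
```

That works out to about 34.0 percent for the book pipeline and 44.9 percent for the video pipeline.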

The fourth pattern is compounding reuse. The book feeds the video pipeline and this site's article series; the design-system layer serves both the site and its canvas artifacts; the music library is built to be shared across video projects; the research-packet workflow that produced this article was itself built in case 1. None of the four projects would justify its harness investment in isolation. Together, each one starts further ahead because the previous one exists.

Harness before content: the contract files are the first substantive commits in every project.
Gates accumulate where failures already happened, then run on every change after.
A passing gate is not a design review — every project kept a written human approval step.
Fix commits cluster at export, render, and deploy boundaries, not at generation.
Plans, audits, and verification reports live in the repository, which is why these timelines could be reconstructed at all.
Artifacts compound across projects: harnesses, gates, libraries, and research carry forward.

Section 12

What did not work, and what the evidence cannot tell you

An honest accounting has to include the costs of the approach itself, not just the failures inside it. The most visible cost is rework concentrated where no local check could reach: deployment environments, PDF rendering engines, video render paths, remote integrations. The three deploy fixes, the two-day export-fix cluster, the rejected canvas push, and the template audit's twenty-six empty templates are all the same shape of cost — generated volume arriving faster than the boundaries could absorb it. The approach does not remove that cost; it moves it later and makes it visible in the log.

The second cost is governance debt. Both the book pipeline and the site added their approval rules after the first scare, not before. That is probably unavoidable — you cannot anticipate every failure mode of a workflow you have not run — but it means the early output of any new agentic pipeline should be treated as suspect by default, because the rules that will eventually protect it have not been written yet. The third cost is contract drift, and the video project is the cautionary example: having a DESIGN.md is not the same as complying with it, and at twenty percent token compliance the contract was closer to an aspiration than a constraint until the architecture review forced the issue.

Then there is what this article cannot tell you, because the evidence cannot. Git history shows calendar spans and commit density; it does not show hours of human attention, how many agent sessions ran, what the model usage cost, or which paragraphs and components the author rewrote by hand. Those would be genuinely useful numbers, and publishing them without measurement would be exactly the kind of claim this article exists to avoid. They are flagged for a follow-up where they can be measured rather than remembered. Finally, all four projects share a property that limits generalization: one practitioner, no client, no handoffs, no compliance regime. The patterns are real; the speed should not be quoted into a context that has constraints these projects did not.

Rework concentrates at boundaries no local gate can reach — budget for it.
Approval rules tend to be written after the first scare; treat early pipeline output accordingly.
A design contract without a compliance gate drifts — the video project measured its own drift at ~20% token compliance before correcting course.
Hours, session counts, and costs are not claimed here; they are unmeasured, not zero.
Single-practitioner, no-client conditions inflate apparent speed relative to team settings.

Section 13

How to run your own traced workflow

The reason these four narratives could be written at all is that the projects left evidence behind as a side effect of how they worked. You can set up the same property in an afternoon, and it costs almost nothing during the project — the payoff comes later, when you need to explain, audit, or repeat what happened. The practice is less about documentation discipline and more about defaults: keep the contract in the repository, keep plans and reviews as dated files, make at least one check executable, and write commit subjects that distinguish features from fixes, because that one habit is what makes the rework visible later.

The board below is the capture list this article was effectively written from, generalized. Treat it as the definition of done for any agentic project you might one day want to write up — for a portfolio case study, a team retrospective, or an internal argument about whether the workflow is worth keeping.

Start the trace before the project starts, not after it succeeds. The projects that fail or stall are the ones whose traces you will learn the most from, and they are exactly the ones nobody reconstructs from memory.

Production trace checklist (copy into your repo as docs/trace.md)

# Production trace — capture as you go

## Contract
- [ ] Agent instructions (AGENTS.md / CLAUDE.md) committed before volume work starts
- [ ] Design or quality contract (DESIGN.md, config thresholds) committed and dated

## Process evidence
- [ ] Plans and verification reports as dated files in the repo
- [ ] Commit subjects distinguish feat / fix / docs
- [ ] Pivotal prompts and configs saved verbatim (sanitized)

## Gates
- [ ] At least one executable check with a real exit code
- [ ] Each new failure class gets a gate or a written rule, with a date
- [ ] Known gate blind spots written down next to the gate

## Boundaries
- [ ] First deploy / export / render done early, while failures are cheap
- [ ] Every external integration has a documented fallback

## Human decisions
- [ ] Approval gates written down, not implied
- [ ] What only a human reviewed (taste, truth, hierarchy) is named
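
The checklist above can double as the "at least one executable check" it asks for. The following is a sketch under stated assumptions: the file lives at `docs/trace.md`, items use the `- [ ]` / `- [x]` task-list form shown above, and section headings start with `## `. The function names are hypothetical.

```python
import re
from pathlib import Path

# A task-list item: "- [ ] text" is open, "- [x] text" is done.
ITEM = re.compile(r"^\s*-\s\[( |x|X)\]\s+(.*)$")


def open_items(markdown):
    """Return (section, text) for every unchecked item, tracking '## ' headings."""
    section, missing = "", []
    for line in markdown.splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
            continue
        match = ITEM.match(line)
        if match and match.group(1) == " ":
            missing.append((section, match.group(2).strip()))
    return missing


def main(path="docs/trace.md"):
    """Print open items and return an exit code: 0 when nothing is open."""
    missing = open_items(Path(path).read_text(encoding="utf-8"))
    for section, text in missing:
        print(f"OPEN [{section}] {text}")
    return 1 if missing else 0
```

Ending a script with `raise SystemExit(main())` turns the return value into a real exit code that a pre-handoff hook or CI step can act on. During a project the gate is expected to fail; its use is at write-up time, when an open item means the evidence was never captured.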

Production trace evidence board

Contract files in the repo: AGENTS.md or equivalent, DESIGN.md or a config that encodes the standard

Dated plan and review documents committed alongside the work, not kept in chat history

At least one executable gate with a real exit code, and the habit of running it before handoff

Commit subjects that distinguish feat from fix, so rework is countable later

Point-in-time audit outputs saved as files with their generation date

A note for every external integration failure and the fallback that replaced it

The pivotal prompt or config for each phase, saved verbatim and sanitized

A short list of what was decided by a human, and where that approval is recorded

What to capture, as you go, so a future case study can be written from evidence instead of memory. Every item here exists in at least one of the four traced projects.

Section 14

Where this sits in the wider conversation

Public accounts of comparable workflows exist, and they are worth reading next to this one — partly for corroboration, partly for contrast in rigor. A two-person agency has written about running its content and client-site work with a coding agent as a de facto third team member; an independent developer publishes a build-log-style account of constructing a production site with AI coding tools; practitioners have documented single-author book-writing agents and looser multi-agent book experiments; the programmatic-video ecosystem has official documentation for driving renders from coding agents and at least one open-source agentic video production system; and design-system practitioners have published accounts of wiring design tokens, component code, and agent context together. Their numbers are theirs and are not generalized here, but the pattern overlap — written contracts, executable checks, rework at the boundaries — is hard to miss.

What is still rare in public is the failure record: the fix-commit ratios, the audits that flagged a sixth of a template library, the integration that returned a 403 on the day it was needed. That gap is most of the reason this article exists. If the only traces that get published are the ones that flatter the workflow, the field will keep mistaking demos for evidence.

The four projects continue, and their numbers will drift away from the snapshots quoted here — that is what the date stamps are for. The follow-ups that would most improve this account are the ones flagged above: measured effort and cost, and a traced project run under team and client constraints rather than solo ones.

Projects to inspect

Build a Design Harness Before You Prompt
The reference walkthrough for the contract layer all four cases start with.
Design Systems That Maintain Themselves (Almost)
Case 4 in full: the maintenance loop, the propagation prompt, and the gate evidence.
From Canvas to Production: The Design-to-Code Pipeline, Honestly
The gate-by-gate version of the export-boundary lesson for UI work.
Research Packets for Agentic Design
How the evidence behind articles like this one is collected and reviewed before drafting.

Sources

Sources & further reading

agents.md
The open AGENTS.md convention for project-level agent instructions used by the site and book projects.
anthropics/skills
Anthropic's public SKILL.md repository — the pattern behind the site's on-demand skills catalog.
shadcn/ui MCP server documentation
Official documentation for making the shadcn component registry legible to agents, used in cases 1 and 4.
Style Dictionary
The token transform pipeline relevant to the tokenization fix recommended in case 3's architecture review.
Remotion: prompting videos with coding agents
Official docs for agent-driven programmatic video — public context for the video pipeline's design space.
OpenMontage
An open-source agentic video production system; a public comparable for case 3.
RSL/A: running a marketing agency on Claude Code
A two-person agency's account of agent-assisted content and site work; their numbers are theirs.
Developers Digest: building a real site with AI coding tools
A public build-log case study of a production site built with AI coding tools.
A book-writing agent in practice
A practitioner's single-author multi-stage book-writing agent — public context for case 2.
Designing with Claude Code and Codex CLI (UX Collective)
A practitioner account of agent-driven design-system workflows — public context for case 4.
