Agentic Design School
Module 5 of 6
45–55 minutes

Orchestrating Design Agent Teams

Canvas-to-Production Pipelines

The full pipeline from approved canvas design to production code: stage gates, parity checks, token sync, and the handoffs where intent historically leaked — now run as an orchestrated workflow with evidence at every stage.

Duration: 45–55 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Define the pipeline stages from canvas approval to merged production code.
  • Place gates with owners and evidence requirements at each stage.
  • Measure where defects enter the pipeline and tighten the responsible stage.

Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Canvas-to-Production Pipelines

Orchestrating Design Agent Teams · Module 5 of 6

  • The pipeline replaces the handoff
  • Stages and gates from approved canvas to merged code
  • Token and component sync as steps, not assumptions
  • Measuring where defects enter, instead of blaming the end

Generation is the easy stage. The pipeline exists for everything that happens after the demo looks right.

Slide notes

This module takes the orchestration patterns from earlier in the course and applies them to the workflow teams care about most: getting an approved design into production code without the intent leaking on the way. The framing to establish up front is that the pipeline replaces the handoff. In the old model, a designer finished a picture of the product and a developer rebuilt it, and the gap between those two artifacts is where spacing drifted, states got dropped, and the empty state nobody specified got invented under deadline pressure. In the pipeline model, the design artifact feeds an agent that builds in the product's real materials, and a sequence of gates verifies the result against the artifact and the intent.

Be clear about what this module assumes. It assumes the promotion decision has already been made — that this design should become production code at all — and it assumes a harness exists: a design-system file, semantic tokens, and at least one executable audit. Module 3 covered the MCP chains that connect canvas to code; this module covers the stages and gates those chains run through.

Also set the honest tone early. The export step has become genuinely good by mid-2026, and it is still the smallest part of the work. Most of the session time in the worked example later in this module went into gates, rebuilding on real primitives, and review — not generation. If participants leave with one habit, it should be accounting for that time honestly.

Narration for this slide

Welcome to Module 5. This is the module about the workflow everyone asks for: how an approved design on a canvas becomes production code, with agents doing the production work and humans holding the gates. The short version of the argument is that the pipeline replaces the handoff. Instead of a designer finishing a picture and a developer rebuilding it from scratch — losing detail at every step — the artifact feeds an agent, the agent builds on a branch, and a sequence of gates checks the result against the design and the intent. Generation is the easy part. The pipeline exists for everything after the demo looks right.

Slide 2 of 13 · 16:9

Six stages, two owners

Every transition between stages is a place intent can leak. The pipeline names each one and gives it an owner.

  • Approved canvas — human-owned: structure, components, tokens, and written behaviour
  • Spec extraction — tokens and components resolved against the real library
  • Implementation — agent builds on a branch, against the harness, outside production paths
  • Parity and quality checks — type, tokens, accessibility, breakpoints, against the canvas
  • Review — a human judges severity against intent, not just whether checks passed
  • Merge — a human decision, with the defect log kept by stage

The agent owns the mechanical middle. The human owns the artifact going in and the judgment coming out.

Slide notes

Walk the six stages and emphasise the ownership split rather than the stage names, because the names vary by team and the split does not. The human owns the two ends: the artifact and intent going in, and the judgment at review plus the merge decision coming out. The agent owns the mechanical middle: resolving tokens and components, generating the implementation, producing breakpoint variants, and running every check that has a command line.

Stage one deserves the most attention because it is the cheapest place to buy quality. An artifact that carries structure — layout built with auto-layout or flex, real component references, token names rather than raw values, realistic content — gives the generator something to be faithful to. A flat picture gives it something to imitate, and imitation is where the plausible-but-wrong output comes from. Every credible teardown of design-to-code tooling this year converges on the same finding: artifact structure and component mapping decide more of the output quality than the choice of generator.

Note what is deliberately absent from the stages: there is no stage called handoff. The artifact does not get reinterpreted by a person; it gets read by an agent and verified by gates. That is the structural change, and it is also why the gates matter more here than in a single-agent workflow — nobody is manually rebuilding the design, so nobody is manually noticing what the rebuild missed.
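One way to keep the ownership split visible is to write it down as data in the repository. The sketch below is a minimal illustration, not a prescribed format: the type and function names (`PipelineStage`, `stagesOwnedBy`, `humansHoldBothEnds`) are hypothetical, and the stage names are simply the ones from this slide.

```typescript
// Hypothetical model of the six stages named on the slide.
// Stage names vary by team; the ownership split is the part to keep.
type StageOwner = "human" | "agent";

interface PipelineStage {
  name: string;
  owner: StageOwner;
}

const PIPELINE: PipelineStage[] = [
  { name: "Approved canvas", owner: "human" },
  { name: "Spec extraction", owner: "agent" },
  { name: "Implementation", owner: "agent" },
  // The agent runs the checks; judging the captured evidence happens at review.
  { name: "Parity and quality checks", owner: "agent" },
  { name: "Review", owner: "human" },
  { name: "Merge", owner: "human" },
];

// Returns the stage names owned by one side of the split, in pipeline order.
function stagesOwnedBy(owner: StageOwner): string[] {
  return PIPELINE.filter((stage) => stage.owner === owner).map((stage) => stage.name);
}

// True when the first and last stages are both human-owned.
function humansHoldBothEnds(stages: PipelineStage[]): boolean {
  if (stages.length < 2) return false;
  const first = stages[0];
  const last = stages[stages.length - 1];
  return first !== undefined && last !== undefined &&
    first.owner === "human" && last.owner === "human";
}

console.log(stagesOwnedBy("agent").join(" -> "));
console.log(humansHoldBothEnds(PIPELINE));
```

Note that the list has no entry called handoff; a team adapting this would rename stages freely but keep the human-owned ends.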

Narration for this slide

Here are the six stages. An approved canvas, owned by a human, carrying real structure — components, tokens, and behaviour written down. Spec extraction, where canvas values get resolved against the actual token and component library. Implementation, where the agent builds on a branch against the harness. Then the checks: types, tokens, accessibility, and breakpoints, compared against the canvas. Then review, where a human judges what the checks cannot see. And finally merge, which is always a human decision. Notice the ownership split: the agent owns the mechanical middle, and you own the artifact going in and the judgment coming out. There is no stage called handoff — that is the point.

Slide 3 of 13 · 16:9

The pipeline, end to end

Agent-run stages move the work forward. Gates decide whether it gets to.

Diagram: a pipeline with six stages. An approved canvas, a human-led stage, feeds token and component resolution and then code generation on a branch, both agent-run. The branch passes through an automated gate of executable checks and visual QA, with a dashed defect-return line back to the generation stage. Work that passes reaches the human-led PR review gate and then the production merge, which is marked as a human decision. A legend marks human-led stages in blue, agent-run stages in yellow, gates in ink, and production in green.

The approved canvas, the PR review gate, and the merge decision are human territory; resolution, generation, and check execution are agent-run. The dashed line is the rule that makes the pipeline honest: defects go back to the agent, never forward with a note attached.

Work that fails a gate goes back to the agent and re-enters at the gate it failed. It does not move forward with a note attached.

Slide notes

Walk the diagram in order and name the owner of each card. The approved canvas is human: it carries the structure, the component references, and the written behaviour. Token and component resolution is agent-run, and it is a real step — the agent maps canvas values to the project's semantic tokens and frames to the real component library, and flags anything it cannot map rather than inventing a substitute. Code generation happens on a branch, outside production paths, against the harness. The checks-and-visual-QA card is the automated gate: type check, token audit, accessibility scan, and breakpoint screenshots compared against the canvas. The PR review gate is human-led and uses the same P0–P3 severity vocabulary as the critique module earlier in the curriculum. The merge is a person's decision.

Spend time on the dashed return line, because it is the difference between a pipeline and a conveyor belt. Work that fails a gate does not continue with a comment attached; it returns to the agent, gets fixed or regenerated, and re-enters at the gate it failed. The earlier gates are re-run as well, because a fix can introduce a new defect that only an earlier gate would catch.
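The return rule can be sketched as a loop. This is a minimal illustration under stated assumptions: `runGates`, `Gate`, and the `fix` callback (standing in for the agent) are hypothetical names, `maxAttempts` is an assumed safety limit, and the sketch restarts from the first gate after every fix, which is one simple way to guarantee that earlier gates are re-run.

```typescript
// A gate inspects the work and returns the defects it found (empty array = pass).
interface Gate<W> {
  name: string;
  check: (work: W) => string[];
}

interface RunResult<W> {
  work: W;
  passed: boolean;
  attempts: number;
  history: string[]; // one entry per gate execution, e.g. "type: pass"
}

// Runs gates in order. On a failure the work goes back to `fix` and the
// sequence restarts from the first gate, so earlier gates see the fixed work.
function runGates<W>(
  work: W,
  gates: Gate<W>[],
  fix: (work: W, gate: string, defects: string[]) => W,
  maxAttempts = 5
): RunResult<W> {
  const history: string[] = [];
  let current = work;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let failed = false;
    for (const gate of gates) {
      const defects = gate.check(current);
      history.push(gate.name + ": " + (defects.length === 0 ? "pass" : "fail"));
      if (defects.length > 0) {
        current = fix(current, gate.name, defects);
        failed = true;
        break; // failed work never moves forward with a note attached
      }
    }
    if (!failed) return { work: current, passed: true, attempts: attempt, history };
  }
  return { work: current, passed: false, attempts: maxAttempts, history };
}

// Usage with two toy gates; the Section shape is invented for the example.
interface Section { typeErrors: number; hardcodedColours: number }

const gates: Gate<Section>[] = [
  { name: "type", check: (s) => (s.typeErrors > 0 ? ["type error"] : []) },
  { name: "tokens", check: (s) => (s.hardcodedColours > 0 ? ["hardcoded colours"] : []) },
];

const result = runGates<Section>(
  { typeErrors: 1, hardcodedColours: 11 },
  gates,
  (s, gate) => (gate === "type" ? { ...s, typeErrors: 0 } : { ...s, hardcodedColours: 0 })
);

console.log(result.history.join(" | "));
// type: fail | type: pass | tokens: fail | type: pass | tokens: pass
```

The history shows the type gate running three times: once to fail, and once more after each fix. That repetition is the cost of the rule, and it is cheap for gates that have a command line.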

The last bullet on the production card previews the measurement discipline later in the module: the defect log is kept by stage, so that recurring defects tighten the stage responsible for them instead of becoming permanent review burden at the end.

Narration for this slide

Here is the whole pipeline on one picture. The approved canvas on the left is yours — it carries the structure and the intent. Then the agent takes over the middle: it resolves tokens and components against your real library, generates code on a branch, and runs the executable checks and visual QA against the canvas. Then the gates hand control back to you: the PR review judges what the scans cannot see, and the merge is always a human decision. The most important line on this diagram is the dashed one. Work that fails a gate goes back to the agent and re-enters at the gate it failed. It never moves forward with a note attached.

Slide 4 of 13 · 16:9

Gates: owner, evidence, and what each one exists to catch

A gate is a check with three properties: a tool or procedure that runs it, an owner who acts on the result, and a named defect class it exists to catch.

| Gate | Evidence | Owner | Defect class it catches |
| --- | --- | --- | --- |
| Type and structure | tsc, lint, structural verify output | Agent runs and fixes | Invented props, broken wiring, runtime bugs |
| Token and design system | Audit report, zero violations | Agent runs and fixes | Hardcoded values, parallel visual systems |
| Responsive parity | Screenshots at 360 / 768 / 1280 vs the canvas | Agent captures, human judges | Task order lost, fixed widths, hidden content |
| Accessibility | Automated scan plus a manual heuristic pass | Shared | Missing names, heading logic, keyboard traps |
| Human review | P0–P3 findings against the canvas and the user task | Human | Everything the automation cannot express |

If a failed gate changes nothing, it is not a gate. It is theatre with logging.

Slide notes

This table is the working definition of the module. A gate is not a vibe check and not a dashboard; it is a check with a tool that runs it, an owner who acts on the result, and a named defect class it exists to catch. Naming the defect class in advance is what keeps gates honest — when nobody can say what a gate is for, it drifts into ritual, and ritual gates are how P1 defects get waved through alongside the noise.

Point out the owner column. The first two gates are fully agent-owned: the agent runs the type check and the token audit, and it fixes what they find without waiting to be asked. The responsive gate splits the work: the agent captures the evidence — screenshots at the agreed widths — and a human judges whether task order survived, not just whether things stacked without overflowing. The accessibility gate is shared because the automated scan is necessary and not sufficient: practitioner analyses consistently estimate that automated tools catch on the order of a third of real WCAG issues, so the gate has to be scan plus heuristics. The final review gate is human, full stop.

The highlight line is the cultural test to give teams: if the audit fails and the work merges anyway, or the scan is green and nobody does the heuristic pass, the pipeline exists on paper only. The fix is rarely more tooling — it is deciding, in advance, that a failed gate stops the work.
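The three-property definition and the "a failed gate stops the work" rule can both be expressed as small checks. The sketch below is illustrative only: `GateDefinition`, `missingProperties`, and `blockingGates` are hypothetical names, and the rows are copied from the gate table on this slide.

```typescript
interface GateDefinition {
  name: string;
  evidence: string;    // what the tool or procedure produces
  owner: string;       // who acts on the result
  defectClass: string; // what the gate exists to catch
}

const GATES: GateDefinition[] = [
  { name: "Type and structure", evidence: "tsc, lint, structural verify output", owner: "Agent runs and fixes", defectClass: "Invented props, broken wiring, runtime bugs" },
  { name: "Token and design system", evidence: "Audit report, zero violations", owner: "Agent runs and fixes", defectClass: "Hardcoded values, parallel visual systems" },
  { name: "Responsive parity", evidence: "Screenshots at 360 / 768 / 1280 vs the canvas", owner: "Agent captures, human judges", defectClass: "Task order lost, fixed widths, hidden content" },
  { name: "Accessibility", evidence: "Automated scan plus a manual heuristic pass", owner: "Shared", defectClass: "Missing names, heading logic, keyboard traps" },
  { name: "Human review", evidence: "P0-P3 findings against the canvas and the user task", owner: "Human", defectClass: "Everything the automation cannot express" },
];

// Names the properties a gate definition lacks; an empty list means it is fully specified.
function missingProperties(gate: GateDefinition): string[] {
  const missing: string[] = [];
  if (gate.evidence.trim() === "") missing.push("evidence");
  if (gate.owner.trim() === "") missing.push("owner");
  if (gate.defectClass.trim() === "") missing.push("defectClass");
  return missing;
}

// Lists every gate without a recorded pass. A gate that never ran blocks
// the work just like one that failed.
function blockingGates(gates: GateDefinition[], passed: string[]): string[] {
  return gates.filter((g) => passed.indexOf(g.name) === -1).map((g) => g.name);
}

const blocking = blockingGates(GATES, ["Type and structure", "Token and design system"]);
console.log(blocking.length === 0 ? "may proceed to merge decision" : "blocked by: " + blocking.join(", "));
```

Treating "not run" the same as "failed" is a deliberate choice in this sketch: it covers the case where the scan is green and nobody did the heuristic pass.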

Narration for this slide

Let's define a gate properly. A gate has three properties: a tool or procedure that runs it, an owner who acts on the result, and a named defect class it exists to catch. The type and token gates are agent-owned — the agent runs them and fixes what they find. The responsive gate splits the work: the agent captures screenshots at three widths, and you judge whether the task order survived. Accessibility is shared, because the automated scan only sees part of the standard. And the final review is yours. Here is the test of whether you really have gates: when one fails, does anything change? If not, you have theatre with logging.

Slide 5 of 13 · 16:9

Token and component sync are pipeline steps, not assumptions

The most common silent failure is a pipeline that assumes the canvas and the codebase already agree about tokens and components. They rarely do.

  • Resolve every canvas value to a named token before generation — flag what does not map
  • Map frames to the real component library; lookalike markup is the defect, not the deliverable
  • Investment in mapping (Code Connect, .pen variables, canvas conventions) beats generator choice
  • Unmapped values are a finding for a human, never a value for the agent to invent
  • Re-run the resolution step when the design system changes — sync decays

Every credible teardown lands in the same place: structured, mapped artifacts export well; unstructured ones export plausibly and degrade silently.

Slide notes

This slide covers the stage most teams skip and then pay for at review. The canvas and the codebase each have an opinion about what the design system is, and unless something actively reconciles them, the agent will resolve the disagreement silently — usually by hardcoding whatever value the canvas happens to show. That is how a pipeline produces code that compiles, looks right, and contains eleven hex colours standing in for semantic tokens, which is exactly what the worked example later in this module found on its first iteration.

The mechanics differ by stack but the principle is constant. With Figma's Dev Mode MCP server, the mapping investment is Code Connect: when components are mapped, the exported code imports the team's actual components with their real props instead of lookalike markup, and independent reviews consistently identify that mapping work — not the generator — as the difference between code you keep and code you rewrite. With connected canvases the mapping is closer to free: Pencil's .pen variables map to CSS custom properties, and Paper's HTML-native canvas means export is serialisation rather than translation. Either way, the pipeline should treat resolution as an explicit step with an output a human can read: which values mapped, which did not, and what the agent proposes to do about the gaps.

The last bullet is the operational point for teams running this at scale: sync decays. Tokens get renamed, components get deprecated, and a mapping that was complete in March is partial by June. Re-running resolution when the system changes is much cheaper than rediscovering the drift one review at a time.
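The resolution step, with its human-readable output, might look like the following sketch. Everything here is an assumption for illustration: the token names and values are invented, `resolveValues` is a hypothetical helper, and matching is by exact normalised value. A real project would read its tokens from the design-system source rather than from an inline list.

```typescript
interface DesignToken {
  name: string;  // semantic token name
  value: string; // raw value the token resolves to
}

interface Resolution {
  mapped: { value: string; token: string }[];
  unmapped: string[]; // findings for a human, never values for the agent to invent
}

// Hypothetical token table for the example.
const TOKENS: DesignToken[] = [
  { name: "color.surface", value: "#ffffff" },
  { name: "color.text.primary", value: "#1a1a1a" },
  { name: "radius.card", value: "8px" },
];

// Resolves each canvas value to a named token by exact match on the
// trimmed, lower-cased value. The first matching token wins.
function resolveValues(canvasValues: string[], tokens: DesignToken[]): Resolution {
  const result: Resolution = { mapped: [], unmapped: [] };
  for (const raw of canvasValues) {
    const normalised = raw.trim().toLowerCase();
    let match: DesignToken | undefined;
    for (const token of tokens) {
      if (token.value.toLowerCase() === normalised) {
        match = token;
        break;
      }
    }
    if (match) {
      result.mapped.push({ value: raw, token: match.name });
    } else {
      result.unmapped.push(raw);
    }
  }
  return result;
}

const resolution = resolveValues(["#FFFFFF", "#ff00aa", "8px"], TOKENS);
console.log("mapped:", resolution.mapped.map((m) => m.value + " -> " + m.token).join(", "));
console.log("needs a human:", resolution.unmapped.join(", "));
```

Because the function only reports, re-running it after a design-system change is cheap, and a growing `unmapped` list is the early signal that sync has decayed. If two tokens share a raw value, this sketch silently picks the first; a stricter version would report the ambiguity as a finding too.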

Narration for this slide

Here is the silent failure that wrecks most canvas-to-code attempts: assuming the canvas and the codebase already agree about tokens and components. They rarely do, and when they disagree, the agent resolves it silently — usually by hardcoding whatever the canvas shows. So make the sync an explicit step. Resolve every canvas value to a named token before generation, and flag what does not map instead of inventing a substitute. Map frames to your real component library — with Figma that means Code Connect, with connected canvases the mapping is closer to free. The evidence keeps landing in the same place: the mapping work, not the generator, decides whether you keep the code or rewrite it.

Slide 6 of 13 · 16:9

Parity checks: comparing the build against the canvas, with evidence

Parity is not a feeling. It is screenshots of the built page set against the approved frames, at the widths the design committed to.

  • Agent captures screenshots at 360, 768, and 1280 and pairs them with the canvas frames
  • The human judges intent: task order, hierarchy, and states — not pixel-for-pixel sameness
  • Document deliberate deviations; the canvas is the source of truth for appearance, not behaviour
  • Known fidelity gaps in mid-2026 exports: drifted colours and radii, placeholder data, unfinished interactions
  • Findings get the P0–P3 severity treatment, with evidence attached to each one

The agent captures the evidence. The human supplies the judgment about which differences matter.

Slide notes

Parity checks answer one question with evidence: does the built page match what was approved, at the sizes the design committed to? The mechanism is straightforward — the agent drives a browser, captures screenshots at the agreed widths, and pairs each one with the corresponding canvas frame so the reviewer is comparing like with like. What keeps this from becoming busywork is being clear about what kind of match you are checking. Pixel-for-pixel sameness is the wrong target: fonts render differently, content is real rather than placeholder, and some divergence is deliberate. The reviewer is checking intent — is the hierarchy the same, did the task order survive on the small width, do the states the canvas showed actually exist in the build.

It helps to know what to expect from the export step, because the fidelity gaps are well documented. Independent hands-on tests of the strongest export paths in 2026 still report drifted colours and border radii, the wrong number of repeated elements, placeholder data wiring, and interactions left unfinished. None of these are reasons to abandon the pipeline; all of them are reasons the parity gate exists. The agent can also do a useful first pass on the comparison itself — flagging obvious geometric and colour differences — but the severity call stays human, because only a human knows which deviation is a defect and which is a sensible adaptation to real content.

Findings from this gate use the same P0–P3 vocabulary as the rest of the curriculum, with evidence attached: the screenshot, the frame, and one sentence on why it matters. That keeps the return trip to the agent cheap and unambiguous.
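A finding with its evidence attached can be modelled as a small record, which makes "is this ready to send back to the agent?" a mechanical question. The sketch is illustrative: `ParityFinding`, `missingEvidence`, and `bySeverity` are hypothetical names, and the file paths and frame names in the example are invented.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

const PARITY_WIDTHS: number[] = [360, 768, 1280];
const SEVERITY_ORDER: Severity[] = ["P0", "P1", "P2", "P3"];

interface ParityFinding {
  severity: Severity;
  width: number;
  screenshot: string;  // path to the captured build screenshot
  canvasFrame: string; // name or link of the approved frame
  why: string;         // one sentence on why it matters
}

// Lists what is missing before a finding can go back to the agent.
function missingEvidence(finding: ParityFinding): string[] {
  const missing: string[] = [];
  if (PARITY_WIDTHS.indexOf(finding.width) === -1) missing.push("agreed width");
  if (finding.screenshot.trim() === "") missing.push("screenshot");
  if (finding.canvasFrame.trim() === "") missing.push("canvas frame");
  if (finding.why.trim() === "") missing.push("one-sentence rationale");
  return missing;
}

// Orders findings most severe first without mutating the input.
function bySeverity(findings: ParityFinding[]): ParityFinding[] {
  return findings.slice().sort(
    (a, b) => SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity)
  );
}

// Example findings; paths and frame names are made up.
const findings: ParityFinding[] = [
  { severity: "P3", width: 768, screenshot: "shots/section-768.png", canvasFrame: "Section / Tablet", why: "Third card sits alone on its own row." },
  { severity: "P1", width: 360, screenshot: "shots/section-360.png", canvasFrame: "Section / Mobile", why: "Fixed-width wrapper forces horizontal scrolling." },
];

for (const finding of bySeverity(findings)) {
  const gaps = missingEvidence(finding);
  console.log(finding.severity, finding.width, gaps.length === 0 ? "ready" : "missing: " + gaps.join(", "));
}
```

The severity itself stays a human input in this sketch; the code only checks that the evidence travels with it.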

Narration for this slide

Parity is not a feeling, it is evidence. The agent captures screenshots of the built page at three-sixty, seven-sixty-eight, and twelve-eighty, and pairs each one with the approved canvas frame. Then you judge — and you are judging intent, not pixels. Did the hierarchy survive? Did the task order survive on mobile? Do the states the canvas showed actually exist? Expect the documented gaps: drifted colours, placeholder data, unfinished interactions. That is what this gate is for. The agent gathers the evidence and can flag obvious differences; you decide which differences matter, and every finding gets a severity and a screenshot attached.

Slide 7 of 13 · 16:9

Plausible but wrong: the defect classes you can name in advance

The defects are predictable enough to plan gates around. That is the only reason gates are worth building.

  • Code that compiles but ignores your tokens — caught by the audit, prevented by the harness
  • Markup that looks right and reads wrong — headings, names, and structure a glance never checks
  • Output that passes the scan and traps the keyboard — automated checks see roughly a third of WCAG issues
  • Responsive variants that stack DOM order instead of preserving task order
  • Product rules silently invented from whatever a static frame happens to show

Generated UI fails accessibility by default — not because the model is careless, but because nothing in "make it look like this frame" asks for semantics.

Slide notes

This slide is the evidence base for the gate sequence, and it is worth giving the numbers with their caveats. On accessibility, a practitioner benchmark published in 2025 and updated since had frontier models generate UI components with no accessibility instructions in the prompt and ran automated checks against the output: the average pass rate was roughly 12%. The same benchmark found that loading accessibility instructions or skills changed the result substantially — which is an argument for the harness as much as for the gate. Treat it as a practitioner benchmark rather than a peer-reviewed study, but the direction matches what accessibility teams report.

The second documented problem is that the automated checks themselves only see part of the standard: Deque's analyses and independent practitioner write-ups estimate automated tools such as axe-core catch on the order of 30–40% of real WCAG issues. Heading logic, focus order coherence, link text in context, and keyboard usability mostly are not in that set, which is why the accessibility gate is scan plus heuristics, never scan alone.

There is also a cost that never shows up in the generated files: review. A 2025 randomized study by METR found experienced open-source developers were a net 19% slower with AI assistance on their own mature codebases, largely because reviewing plausible-but-incorrect output ate the time generation saved. That population is not designers running this pipeline, so do not quote it as a design-to-code statistic — but the mechanism it documents, confident output that costs more to verify than to admire, is exactly the failure mode this pipeline is built to manage. The last bullet, product rules invented from a static frame, is the one no gate fully catches; it is prevented by writing behaviour down in the brief rather than letting the agent infer it.

Narration for this slide

The reason gates are worth building is that the defects are predictable. Code that compiles but ignores your tokens. Markup that looks right and reads wrong. Output that passes the automated scan and still traps the keyboard — because the scanners only see roughly a third of the standard. Responsive variants that just stack things instead of preserving the task order. And product rules quietly invented from whatever the static frame happened to show. Generated UI fails accessibility by default — not from carelessness, but because nothing in 'make it look like this frame' asks for semantics. Each of these classes maps to a gate. That is not a coincidence; it is the design of the pipeline.

Slide 8 of 13 · 16:9

Where defects enter: a real defect log, by stage

From an executed run on this school's own repository (2026-06-01): one section, three iterations, every gate that could run in the environment.

| Defect | Gate that caught it | Caught by | Severity |
| --- | --- | --- | --- |
| Date string treated as a Date object | Type check | Automated | P2 |
| 11 hardcoded hex colours instead of tokens | Token and design-system audit | Automated | P1 |
| Fixed 1200px wrapper, rigid 3-column grid | Responsive review | Human | P1 |
| Icon-only link with no accessible name | Accessibility heuristics | Human | P1 |
| Card titles rendered as divs, not headings | Human review (no automated check flagged it) | Human | P2 |
| Orphan third card at 768px | Human review | Human | P3 (accepted) |

Automated gates caught two of the seven defect classes — quickly and cheaply. Everything else needed a checklist or a human with the intent in mind.

Slide notes

This defect log comes from a documented run in the school's own repository on 2026-06-01: a small lab-sessions section built through the pipeline, with the type check and the design-system audit executed for real and the responsive and accessibility findings coming from a checklist-driven manual review of the markup. It is one section and one session of work, so do not present it as a statistic — present it as a shape, because the shape matches the larger published picture. The automated gates caught the cheap, mechanical defects fast: the type error that would have thrown at runtime, and the eleven hardcoded colours. Everything else needed a heuristic checklist or a human looking at the result with the original intent in mind.

Two rows carry the lesson. The hardcoded-colour row is the one a decent harness largely prevents — the same audit run on the site's real component directories passes, because the harness teaches the agent to use semantic tokens in the first place. So a recurring defect at that gate is a signal to tighten the harness, not to review harder. The div-title row is the opposite case: a defect introduced by doing the right thing — reusing the site's card primitive, whose title slot renders as a div — invisible to the type checker, the token audit, and most automated accessibility scans, and meaningful to exactly the users least likely to be in the room when the demo happens. A pipeline that stopped at "both commands exited zero" would have shipped it.

The measurement discipline this slide introduces is the one to take back to a team: keep the defect log by stage. When you know which gate catches which class of defect, and how often, you know which stage to tighten — the harness, the artifact, the mapping, or the review — instead of letting every defect become permanent review burden at the end of the pipeline.
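Keeping the log by stage only pays off if it is easy to count. The sketch below shows one possible shape: the rows are copied from the defect log table on this slide, while `countByGate`, `gatesToTighten`, and the threshold of two are hypothetical choices. In practice the log would accumulate across many runs before a threshold means anything.

```typescript
interface DefectEntry {
  defect: string;
  gate: string;
  caughtBy: "Automated" | "Human";
  severity: string;
}

// Rows copied from the defect log table on this slide.
const DEFECT_LOG: DefectEntry[] = [
  { defect: "Date string treated as a Date object", gate: "Type check", caughtBy: "Automated", severity: "P2" },
  { defect: "11 hardcoded hex colours instead of tokens", gate: "Token and design-system audit", caughtBy: "Automated", severity: "P1" },
  { defect: "Fixed 1200px wrapper, rigid 3-column grid", gate: "Responsive review", caughtBy: "Human", severity: "P1" },
  { defect: "Icon-only link with no accessible name", gate: "Accessibility heuristics", caughtBy: "Human", severity: "P1" },
  { defect: "Card titles rendered as divs, not headings", gate: "Human review", caughtBy: "Human", severity: "P2" },
  { defect: "Orphan third card at 768px", gate: "Human review", caughtBy: "Human", severity: "P3" },
];

// Counts log entries per gate, keeping gates in first-seen order.
function countByGate(log: DefectEntry[]): { gate: string; count: number }[] {
  const counts: { gate: string; count: number }[] = [];
  for (const entry of log) {
    let row = counts.filter((c) => c.gate === entry.gate)[0];
    if (!row) {
      row = { gate: entry.gate, count: 0 };
      counts.push(row);
    }
    row.count += 1;
  }
  return counts;
}

// Gates whose defect count reaches the threshold: candidates for tightening
// the stage responsible, rather than reviewing harder at the end.
function gatesToTighten(log: DefectEntry[], threshold: number): string[] {
  return countByGate(log).filter((c) => c.count >= threshold).map((c) => c.gate);
}

for (const row of countByGate(DEFECT_LOG)) {
  console.log(row.gate + ": " + row.count);
}
console.log("tighten:", gatesToTighten(DEFECT_LOG, 2).join(", "));
```

The count says where defects were caught; deciding which upstream stage owns the fix (harness, artifact, mapping, or review) is still a judgment the team makes from the entries themselves.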

Narration for this slide

Let's look at a real defect log. This is from an executed run in this school's own repository: one section, three iterations. The type check caught a real runtime bug. The token audit caught eleven hardcoded colours. Then the human checks took over: a fixed-width wrapper that would break on phones, a link with no accessible name, a heading-level problem — and the most interesting one, card titles that were not headings at all, which no automated check flagged. Automated gates caught two of seven defect classes. The point of logging by stage is what you do next: recurring colour violations mean tighten the harness, not review harder. Measurement tells you which stage owns the fix.

Slide 9 of 13 · 16:9

Partial automation: which stages run unattended

The question is not whether the pipeline can run unattended. It is which stages are safe to, and which evidence the unattended stages must leave behind.

  • Safe to run unattended: token and component resolution, generation on a branch, every executable check, evidence capture
  • Never unattended: the artifact going in, severity judgments, accepted findings, the merge
  • Unattended stages must leave evidence: check output, screenshots, and a report of what was flagged
  • CI is the natural home for the automated gates — budget assertions that fail the build, not reports nobody reads
  • Scope decides eligibility: content sections and simple surfaces, yes; checkout flows and permission screens, no

Automate the stages whose failures are cheap and detectable. Keep humans on the stages whose failures are expensive and silent.

Slide notes

Once a pipeline runs reliably with a human watching every stage, the next question is which stages can run without one. The useful dividing line is the cost and detectability of failure. Token resolution, generation on a branch, the executable checks, and evidence capture all fail cheaply and visibly: a failed audit or a missing screenshot blocks the work and tells you why. Those are safe to run unattended, and CI is their natural home — Lighthouse budgets, accessibility scans, and audits expressed as assertions that fail the build, rather than reports that depend on someone noticing. The stages that must keep a human are the ones whose failures are expensive and silent: a wrong artifact going in, a severity judgment made too generously, an accepted finding that never got written down, and the merge itself.

The condition for unattended stages is evidence. An unattended stage that leaves no artifact behind is just a gap in the pipeline with extra steps. The agent should leave the check output, the captured screenshots, and a short report of anything it flagged or could not resolve, so the human gates downstream are reviewing evidence rather than reconstructing what happened.

Scope is the other half of the eligibility decision, and it connects back to Module 1's cost accounting. Content and marketing surfaces, settings panels, and simple lists are good candidates for the mostly automated path. High-stakes flows — checkout, permissions, anything where the artifact cannot express the product rules — are places where the generated code is the least important part of the work, and where the gates cannot save you from a wrong understanding of the problem. Keep those on the slow path deliberately.
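The dividing line for unattended stages can be written as a predicate. This is a sketch under assumptions: `StageProfile` and `mayRunUnattended` are hypothetical names, the cost and signal labels are the two-way split described above, and the evidence entries are examples rather than a required list.

```typescript
interface StageProfile {
  stage: string;
  failureCost: "cheap" | "expensive";
  failureSignal: "detectable" | "silent";
  evidence: string[]; // artifacts the stage leaves behind for the human gates
}

// A stage may run unattended only when failure is cheap, failure is
// detectable, and the stage leaves at least one piece of evidence behind.
function mayRunUnattended(profile: StageProfile): boolean {
  return (
    profile.failureCost === "cheap" &&
    profile.failureSignal === "detectable" &&
    profile.evidence.length > 0
  );
}

const PROFILES: StageProfile[] = [
  { stage: "Token and component resolution", failureCost: "cheap", failureSignal: "detectable", evidence: ["resolution report"] },
  { stage: "Generation on a branch", failureCost: "cheap", failureSignal: "detectable", evidence: ["branch diff"] },
  { stage: "Executable checks", failureCost: "cheap", failureSignal: "detectable", evidence: ["check output"] },
  { stage: "Evidence capture", failureCost: "cheap", failureSignal: "detectable", evidence: ["screenshots"] },
  { stage: "Severity judgment", failureCost: "expensive", failureSignal: "silent", evidence: ["P0-P3 findings"] },
  { stage: "Merge", failureCost: "expensive", failureSignal: "silent", evidence: ["defect log"] },
];

for (const profile of PROFILES) {
  console.log(profile.stage + ": " + (mayRunUnattended(profile) ? "unattended" : "human"));
}
```

The evidence condition is what turns a cheap, detectable stage with no output back into a human-attended one: without an artifact, the downstream gates would be reconstructing what happened instead of reviewing it.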

Narration for this slide

So how much of this can run without you? Use one dividing line: automate the stages whose failures are cheap and detectable, keep humans on the stages whose failures are expensive and silent. Resolution, generation on a branch, the executable checks, and screenshot capture can all run unattended — ideally in CI, as assertions that fail the build. The artifact going in, the severity calls, the accepted findings, and the merge stay human. And every unattended stage has to leave evidence behind: outputs, screenshots, and a report of what it flagged. One more filter: simple surfaces are good candidates for the fast path. Checkout flows and permission screens are not.

Slide 10 of 13 · 16:9

Worked example: one section through the full pipeline

The same executed run, traced as a pipeline with timings. Generation was minutes per pass; the gates and the rebuild were the session.

| Stage | What happened | Time |
| --- | --- | --- |
| Artifact and brief | Section frame plus a brief naming primitives, tokens, gates, and what not to invent | ~10 min |
| Iteration 1: generate + gates | Export-style first pass; type error, 11 token violations, fixed width, missing name, heading skip | Minutes to generate; gates sent it back |
| Iteration 2: rebuild + gates | Rebuilt on site primitives with semantic tokens; both automated gates passed | ~20 min |
| Human review | Caught card titles rendered as divs, invisible to every automated check | Part of review pass |
| Iteration 3: fix + re-run | h3 added inside card titles; gates re-run and passing; orphan card accepted as P3 | Minutes |
| Outcome | Three iterations end to end; gates and review consumed most of the session | One working session |

Report generation time and gate-plus-review time separately. Counting only the five-minute generation books the review hours as free.

Slide notes

This trace is the same run as the defect log, viewed as a sequence with timings, and the honest accounting is the lesson. Generation was minutes per iteration. The brief took about ten minutes to write. The rebuild on real primitives took roughly twenty. The review passes, the heuristic checklists, and the re-runs consumed the rest of a working session. That ratio — fast generation, slow verification — matches the published evidence about where time actually goes with AI-assisted work, and it is why the time you report for design-to-code work has to include gate and review time, not just the satisfying first generation.
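Reporting the two kinds of time separately takes only a few lines of arithmetic. The sketch below is illustrative, and the numbers in it are not measurements from the run: the 10 and 20 minute entries echo the approximate figures in the trace, while the 5 and 60 minute entries are assumed values chosen to show the calculation. `TimeEntry` and `reportTime` are hypothetical names.

```typescript
type TimeKind = "brief" | "generation" | "rebuild" | "gates-and-review";

interface TimeEntry {
  label: string;
  kind: TimeKind;
  minutes: number;
}

interface TimeReport {
  generationMinutes: number;
  otherMinutes: number;   // everything that is not generation
  totalMinutes: number;
  generationShare: number; // generation as a fraction of the total, 0 to 1
}

// Splits a session into generation time and everything else.
function reportTime(entries: TimeEntry[]): TimeReport {
  let generation = 0;
  let total = 0;
  for (const entry of entries) {
    total += entry.minutes;
    if (entry.kind === "generation") generation += entry.minutes;
  }
  return {
    generationMinutes: generation,
    otherMinutes: total - generation,
    totalMinutes: total,
    generationShare: total === 0 ? 0 : generation / total,
  };
}

// Illustrative values only; see the note above about which are assumed.
const SESSION: TimeEntry[] = [
  { label: "Brief", kind: "brief", minutes: 10 },
  { label: "Generation passes", kind: "generation", minutes: 5 },
  { label: "Rebuild on site primitives", kind: "rebuild", minutes: 20 },
  { label: "Gates, review, re-runs", kind: "gates-and-review", minutes: 60 },
];

const report = reportTime(SESSION);
console.log("generation: " + report.generationMinutes + " min of " + report.totalMinutes);
console.log("generation share: " + Math.round(report.generationShare * 100) + "%");
```

With these example numbers, generation is 5 of 95 minutes. Quoting only the 5 would leave the other 90 off the books.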

Walk the iterations briefly. Iteration one is what an unharnessed export looks like: self-contained markup, inline colour values, its own card structure. It compiled and looked broadly right, and the gates disagreed — which is the pipeline doing its job. Iteration two rebuilt the same content on the site's own primitives with semantic tokens, and both automated gates passed cleanly. Then human review caught the defect that justifies the whole final gate: card titles that were no longer headings at all because the card primitive renders its title slot as a div. Every automated gate was satisfied; a screen-reader user navigating by headings would have found nothing inside the section. Iteration three fixed it, re-ran the gates, and recorded one accepted finding — the orphan card at 768px — as a P3 rather than silently ignoring it.

If the gates routinely cost more than building the section by hand would have, treat that as a real signal rather than a failure of discipline: sometimes the answer is a better artifact or a tighter harness, and sometimes the answer is that this surface was not a good candidate for the pipeline.

Narration for this slide

Here is the same run traced as a pipeline with timings. The brief took about ten minutes. The first generation took minutes — and failed four gates: a type error, eleven token violations, a fixed-width layout, and accessibility problems. The rebuild on real primitives took about twenty minutes and passed both automated gates. Then human review caught the one that matters: card titles that were no longer headings at all. Every command had exited zero. The fix took minutes, the gates re-ran, and one finding was accepted and written down. Three iterations, one working session — and most of it was gates and review, not generation. Report your time that way, or you are booking the review hours as free.

Slide 11 of 13 · 16:9

The pipeline as a checklist the agent can run

The gate sequence lives in the repository, not in someone's head. Include it in the brief, and have the agent run every gate it can run itself.

  • The checklist is a contract with the agent as much as with the team
  • The agent stops and reports at the marked decision points — it does not fix and merge in one motion
Gate checklist excerpt (adapt the commands to your stack; keep the order)
## 0. Entry: promotion decided; artifact structured; harness in place
## 1. Generation (agent) — output lands outside production paths
## 2. Type + design-system gate (agent runs, agent fixes)
   - npx tsc --noEmit · design-system audit · structural verify
## 3. Responsive gate (agent captures, human judges)
   - evidence at 360 / 768 / 1280 px · task order preserved
## 4. Accessibility gate (shared)
   - automated scan · manual heuristics · accepted findings recorded
## 5. Performance gate (agent runs, budget decides)
## 6. Human review gate — P0–P3 against the canvas; ship / rebuild / keep as prototype
## 7. After merge — defect log by stage; recurring defects encoded into the harness

Re-verify the tool claims behind the checklist periodically — export capabilities, scan coverage, and CI syntax all change. As of June 2026, this is the shape that holds.

Slide notes

The pipeline only becomes a team capability when it stops living in one person's head. Writing the gate sequence into the repository — as a checklist file the brief points at — does three things at once. It makes the sequence the same regardless of which designer or which agent runs it. It lets the agent run every gate that has a command line without being asked, and report results instead of waiting for someone to remember the audit exists. And it creates the place where accepted findings and after-merge notes accumulate, which is what feeds the measurement discipline from the defect-log slide.

Two usage notes matter in practice. First, the checklist is a contract with the agent: include it in the brief, and instruct the agent to stop and report at the marked decision points — after the automated gates, and before anything merges — rather than fixing and merging in one motion. The stop-and-report behaviour is what keeps the human gates real when the agent is doing most of the running. Second, keep the commands yours: the excerpt on the slide uses this site's tools, and the right version for any team substitutes its own type checker, audit, scan, and budget tooling while keeping the order and the ownership.
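
The stop-and-report rule can be sketched as a runner that walks the gates in order and returns a report instead of continuing past a failure, a decision point, or a gate only a human can run. This is a sketch under stated assumptions: `Gate`, `runChecklist`, and the report shape are hypothetical names, and each `run` function stands in for whatever command your stack uses. The evidence paths are invented for illustration.

```typescript
// Sketch only: all names are hypothetical; `run` wraps your own tooling.
type Owner = "agent" | "shared" | "human";

interface GateResult {
  passed: boolean;
  evidence: string; // where the output, screenshots, or report landed
}

interface Gate {
  name: string;
  owner: Owner;
  defectClass: string; // the named defect class this gate exists to catch
  stopAfter: boolean; // true marks a decision point: report, do not continue
  run?: () => GateResult; // omitted for gates only a human can run
}

type StopReason = "failed" | "needs-human" | "decision-point" | "complete";

interface ChecklistReport {
  results: { gate: string; passed: boolean; evidence: string }[];
  stoppedAt: string | null;
  reason: StopReason;
}

function runChecklist(gates: Gate[]): ChecklistReport {
  const results: ChecklistReport["results"] = [];
  for (const gate of gates) {
    const run = gate.run;
    // Human-owned gates are never executed by the runner; it hands control back.
    if (gate.owner === "human" || run === undefined) {
      return { results, stoppedAt: gate.name, reason: "needs-human" };
    }
    const result = run();
    results.push({ gate: gate.name, passed: result.passed, evidence: result.evidence });
    // A failed gate stops the run; later gates are not attempted.
    if (!result.passed) {
      return { results, stoppedAt: gate.name, reason: "failed" };
    }
    // A marked decision point stops the run even when the gate passed.
    if (gate.stopAfter) {
      return { results, stoppedAt: gate.name, reason: "decision-point" };
    }
  }
  return { results, stoppedAt: null, reason: "complete" };
}

// Illustrative gates with stubbed results and invented evidence paths.
const exampleGates: Gate[] = [
  {
    name: "type + design-system",
    owner: "agent",
    defectClass: "type errors and hardcoded values",
    stopAfter: false,
    run: () => ({ passed: true, evidence: "reports/types.txt" }),
  },
  {
    name: "accessibility scan",
    owner: "shared",
    defectClass: "machine-detectable accessibility failures",
    stopAfter: true,
    run: () => ({ passed: true, evidence: "reports/a11y.json" }),
  },
  {
    name: "human review",
    owner: "human",
    defectClass: "intent lost against the canvas",
    stopAfter: true,
  },
];

const checklistReport = runChecklist(exampleGates);
// Stops at "accessibility scan" with reason "decision-point"
```

The fix-and-re-run loop is deliberately outside the runner: after a failure the work re-enters at the gate it failed, which in this sketch means calling `runChecklist` again with the remaining gates.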

Date-stamp the claims the checklist relies on. Export capabilities, scanner coverage, and CI syntax all change quickly in this space; the shape shown here was last verified against the underlying tools in June 2026, and a team adopting it should re-check before relying on the specifics. The structure — entry conditions, agent-run gates, shared gates, human review, after-merge notes — is the durable part.
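
One way to keep the date stamps honest is to store each claim with the date it was last verified and list the ones that have aged out. The sketch below assumes a hypothetical `ToolClaim` record and a `staleClaims` helper; the 90-day window and the specific days in the example are arbitrary illustrations, not recommendations from this module.

```typescript
// Hypothetical record: the claim text and its last verification date.
interface ToolClaim {
  claim: string;
  verifiedOn: string; // ISO date, YYYY-MM-DD
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return the claims whose last verification is older than maxAgeDays.
function staleClaims(claims: ToolClaim[], today: string, maxAgeDays: number): ToolClaim[] {
  const now = Date.parse(today);
  return claims.filter((c) => {
    const verified = Date.parse(c.verifiedOn);
    // An unparseable date counts as stale so that someone looks at it.
    if (isNaN(verified)) {
      return true;
    }
    return (now - verified) / MS_PER_DAY > maxAgeDays;
  });
}

// Illustrative entries: the claim wording and the days are assumptions.
const exampleClaims: ToolClaim[] = [
  { claim: "export capabilities", verifiedOn: "2026-06-01" },
  { claim: "CI syntax", verifiedOn: "2026-11-01" },
];

const needsRecheck = staleClaims(exampleClaims, "2026-12-01", 90);
// Only "export capabilities" is older than 90 days on 2026-12-01
```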

Narration for this slide

To make this a team capability rather than a personal habit, the gate sequence has to live in the repository. Here is the shape: entry conditions, generation outside production paths, the type and design-system gate the agent runs and fixes, the responsive gate the agent captures and you judge, the shared accessibility gate, the performance budget, the human review, and the after-merge notes. Two rules make it work. The checklist goes into the brief, so the agent runs every gate it can run itself. And the agent stops and reports at the decision points — it never fixes and merges in one motion. Adapt the commands to your stack; keep the order.

Slide 12 of 13 · 16:9

Exercise: draw your current path from canvas to production

Map the path one recent design actually took from approved canvas to production, then mark where this module's pipeline differs. Do it on paper before you change any tooling.

  • Pick one recently shipped design and list every step it actually went through, with who did each one
  • Mark the gates that existed — and for each, what evidence it required and what happened when it failed
  • Mark the gaps: transitions with no check, checks with no owner, and findings that were waved through
  • Note where the defects in that work were actually found — in review, in QA, or in production
  • Choose the one stage you would tighten first, and write down what evidence its gate would require

Most teams discover their pipeline already exists — informally, with missing gates and one overloaded reviewer at the end. Drawing it is the first fix.

Slide notes

The exercise is deliberately retrospective: rather than designing an ideal pipeline, participants reconstruct the path one real piece of work actually took. That keeps the answers honest, because the gaps show up as specific events — the audit that was red but merged anyway, the responsive check that was someone resizing a browser once, the accessibility issue found by a user rather than a gate. Steer people towards a recently shipped piece of work that was big enough to involve more than one person but small enough to trace in fifteen minutes.

The fourth bullet is the one that produces the most useful discussion: where were the defects actually found? Defects discovered at the end — in final review, in QA, or in production — are almost always defects that a named earlier gate would have caught more cheaply, and mapping them backwards to the stage that should have owned them is the measurement habit this module is trying to build. The fifth bullet forces a single choice. Most maps reveal several gaps at once, and the instinct is to fix everything; the discipline is to tighten one stage, give its gate an owner and an evidence requirement, and run the next piece of work through it before adding more.

If running this with a team rather than individually, have two people map the same piece of work independently and compare. The differences between the two maps are usually the undocumented steps — and the undocumented steps are exactly the ones an agent-run pipeline cannot inherit, because nobody ever wrote them down.

Narration for this slide

Time to map your own pipeline. Pick one design that recently shipped and trace the path it actually took — every step, and who did each one. Mark the gates that existed, and be honest about what each one required and what happened when it failed. Then mark the gaps: transitions with no check, checks with no owner, findings that were waved through. Now the key question: where were the defects in that work actually found? Anything found at the end probably belonged to an earlier gate. Finish by choosing the one stage you would tighten first, and write down what evidence its gate would require. One stage. Then run the next piece of work through it.

Slide 13 of 13 · 16:9

Summary, and what comes next

  • The pipeline replaces the handoff: agents own the mechanical middle, humans own the artifact and the judgment
  • A gate is a tool, an owner, and a named defect class — and a failed gate stops the work
  • Token and component sync, and parity against the canvas, are explicit steps with evidence — not assumptions
  • Automated checks catch the cheap defects; the expensive ones need heuristics and a human with the intent in mind
  • Measure defects by stage and tighten the responsible stage, instead of piling review onto the end

Module 6 zooms out from one pipeline to the whole operation: cost, permissions, audit trails, and the review capacity that has to scale with everything your agents now produce.

Slide notes

Recap by connecting the bullets rather than repeating them. The pipeline replaces the handoff, which is why the ownership split matters: the agent runs resolution, generation, and every executable check, and the human owns the artifact going in, the severity calls, and the merge. Gates are what make that split safe — a tool, an owner, and a named defect class, with the rule that failed work returns to the agent rather than moving forward annotated. The sync and parity steps are where most informal pipelines leak, because both get assumed rather than executed. And the worked example showed the honest division of labour: automated checks caught the cheap defects fast, and the defect that mattered most needed a human asking a question no command knows how to ask.

The measurement habit is the bridge out of the module: a defect log kept by stage tells you which stage to tighten — harness, artifact, mapping, or review — and that is an operational practice, not a tooling one.
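
A defect log kept by stage needs very little structure to answer the question of which stage to tighten. The sketch below is one possible shape, not the log format used in the worked example: `DefectEntry`, `countByStage`, and `stageToTighten` are hypothetical names, the four stages are the ones named above, and the entries and their stage assignments are illustrative assumptions.

```typescript
// The four stages named in the notes; everything else here is hypothetical.
type Stage = "harness" | "artifact" | "mapping" | "review";

const STAGES: Stage[] = ["harness", "artifact", "mapping", "review"];

interface DefectEntry {
  description: string;
  foundBy: string; // the gate or place where the defect surfaced
  owningStage: Stage; // the stage that should have prevented it
}

// Count defects against the stage that should have owned them.
function countByStage(log: DefectEntry[]): Record<Stage, number> {
  const counts: Record<Stage, number> = { harness: 0, artifact: 0, mapping: 0, review: 0 };
  for (const defect of log) {
    counts[defect.owningStage] += 1;
  }
  return counts;
}

// Pick the stage with the most defects; ties go to the earlier entry in STAGES.
function stageToTighten(log: DefectEntry[]): Stage | null {
  const counts = countByStage(log);
  let best: Stage | null = null;
  for (const stage of STAGES) {
    if (counts[stage] > 0 && (best === null || counts[stage] > counts[best])) {
      best = stage;
    }
  }
  return best;
}

// Illustrative entries; the stage assignments are assumptions.
const exampleLog: DefectEntry[] = [
  { description: "hardcoded colour values", foundBy: "token audit", owningStage: "harness" },
  { description: "fixed-width wrapper", foundBy: "responsive check", owningStage: "harness" },
  { description: "empty state not specified", foundBy: "human review", owningStage: "artifact" },
];

const nextStage = stageToTighten(exampleLog);
// "harness": two of the three illustrative defects belong to it
```

Recording `foundBy` and `owningStage` as separate fields is the point of the sketch: the gap between where a defect was found and where it should have been prevented is what the log exists to show.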

Which is exactly where Module 6 goes. One pipeline run by one designer is a workflow; several pipelines run by a team of designers and agents is an operation, and operations need cost visibility, permissions, audit trails, onboarding, and review capacity that scales with the volume the agents produce. The final module covers how to run the whole thing as a durable team capability rather than a collection of impressive individual setups.

Narration for this slide

Let's close the module. The pipeline replaces the handoff: the agent owns the mechanical middle, you own the artifact going in and the judgment coming out. A gate is a tool, an owner, and a named defect class — and when it fails, the work goes back, not forward. Token sync and parity checks are explicit steps with evidence, never assumptions. The automated checks catch the cheap defects; the expensive ones need a checklist and a human with the intent in mind. And measure defects by stage, so you tighten the stage responsible instead of piling review onto the end. Module 6 zooms out to the whole operation — cost, permissions, audit trails, and review capacity. See you there.

Module transcript
Module 5, narrated slide by slide

Slide 1 · Canvas-to-Production Pipelines

Welcome to Module 5. This is the module about the workflow everyone asks for: how an approved design on a canvas becomes production code, with agents doing the production work and humans holding the gates. The short version of the argument is that the pipeline replaces the handoff. Instead of a designer finishing a picture and a developer rebuilding it from scratch — losing detail at every step — the artifact feeds an agent, the agent builds on a branch, and a sequence of gates checks the result against the design and the intent. Generation is the easy part. The pipeline exists for everything after the demo looks right.

Slide 2 · Six stages, two owners

Here are the six stages. An approved canvas, owned by a human, carrying real structure — components, tokens, and behaviour written down. Spec extraction, where canvas values get resolved against the actual token and component library. Implementation, where the agent builds on a branch against the harness. Then the checks: types, tokens, accessibility, and breakpoints, compared against the canvas. Then review, where a human judges what the checks cannot see. And finally merge, which is always a human decision. Notice the ownership split: the agent owns the mechanical middle, and you own the artifact going in and the judgment coming out. There is no stage called handoff — that is the point.

Slide 3 · The pipeline, end to end

Here is the whole pipeline on one picture. The approved canvas on the left is yours — it carries the structure and the intent. Then the agent takes over the middle: it resolves tokens and components against your real library, generates code on a branch, and runs the executable checks and visual QA against the canvas. Then the gates hand control back to you: the PR review judges what the scans cannot see, and the merge is always a human decision. The most important line on this diagram is the dashed one. Work that fails a gate goes back to the agent and re-enters at the gate it failed. It never moves forward with a note attached.

Slide 4 · Gates: owner, evidence, and what each one exists to catch

Let's define a gate properly. A gate has three properties: a tool or procedure that runs it, an owner who acts on the result, and a named defect class it exists to catch. The type and token gates are agent-owned — the agent runs them and fixes what they find. The responsive gate splits the work: the agent captures screenshots at three widths, and you judge whether the task order survived. Accessibility is shared, because the automated scan only sees part of the standard. And the final review is yours. Here is the test of whether you really have gates: when one fails, does anything change? If not, you have theatre with logging.

Slide 5 · Token and component sync are pipeline steps, not assumptions

Here is the silent failure that wrecks most canvas-to-code attempts: assuming the canvas and the codebase already agree about tokens and components. They rarely do, and when they disagree, the agent resolves it silently — usually by hardcoding whatever the canvas shows. So make the sync an explicit step. Resolve every canvas value to a named token before generation, and flag what does not map instead of inventing a substitute. Map frames to your real component library — with Figma that means Code Connect, with connected canvases the mapping is closer to free. The evidence keeps landing in the same place: the mapping work, not the generator, decides whether you keep the code or rewrite it.

Slide 6 · Parity checks: comparing the build against the canvas, with evidence

Parity is not a feeling, it is evidence. The agent captures screenshots of the built page at three-sixty, seven-sixty-eight, and twelve-eighty, and pairs each one with the approved canvas frame. Then you judge — and you are judging intent, not pixels. Did the hierarchy survive? Did the task order survive on mobile? Do the states the canvas showed actually exist? Expect the documented gaps: drifted colours, placeholder data, unfinished interactions. That is what this gate is for. The agent gathers the evidence and can flag obvious differences; you decide which differences matter, and every finding gets a severity and a screenshot attached.

Slide 7 · Plausible but wrong: the defect classes you can name in advance

The reason gates are worth building is that the defects are predictable. Code that compiles but ignores your tokens. Markup that looks right and reads wrong. Output that passes the automated scan and still traps the keyboard — because the scanners only see roughly a third of the standard. Responsive variants that just stack things instead of preserving the task order. And product rules quietly invented from whatever the static frame happened to show. Generated UI fails accessibility by default — not from carelessness, but because nothing in 'make it look like this frame' asks for semantics. Each of these classes maps to a gate. That is not a coincidence; it is the design of the pipeline.

Slide 8 · Where defects enter: a real defect log, by stage

Let's look at a real defect log. This is from an executed run in this school's own repository: one section, three iterations. The type check caught a real runtime bug. The token audit caught eleven hardcoded colours. Then the human checks took over: a fixed-width wrapper that would break on phones, a link with no accessible name, a heading-level problem — and the most interesting one, card titles that were not headings at all, which no automated check flagged. Automated gates caught two of seven defect classes. The point of logging by stage is what you do next: recurring colour violations mean tighten the harness, not review harder. Measurement tells you which stage owns the fix.

Slide 9 · Partial automation: which stages run unattended

So how much of this can run without you? Use one dividing line: automate the stages whose failures are cheap and detectable, keep humans on the stages whose failures are expensive and silent. Resolution, generation on a branch, the executable checks, and screenshot capture can all run unattended — ideally in CI, as assertions that fail the build. The artifact going in, the severity calls, the accepted findings, and the merge stay human. And every unattended stage has to leave evidence behind: outputs, screenshots, and a report of what it flagged. One more filter: simple surfaces are good candidates for the fast path. Checkout flows and permission screens are not.

Slide 10 · Worked example: one section through the full pipeline

Here is the same run traced as a pipeline with timings. The brief took about ten minutes. The first generation took minutes — and failed four gates: a type error, eleven token violations, a fixed-width layout, and accessibility problems. The rebuild on real primitives took about twenty minutes and passed both automated gates. Then human review caught the one that matters: card titles that were no longer headings at all. Every command had exited zero. The fix took minutes, the gates re-ran, and one finding was accepted and written down. Three iterations, one working session — and most of it was gates and review, not generation. Report your time that way, or you are booking the review hours as free.

Slide 11 · The pipeline as a checklist the agent can run

To make this a team capability rather than a personal habit, the gate sequence has to live in the repository. Here is the shape: entry conditions, generation outside production paths, the type and design-system gate the agent runs and fixes, the responsive gate the agent captures and you judge, the shared accessibility gate, the performance budget, the human review, and the after-merge notes. Two rules make it work. The checklist goes into the brief, so the agent runs every gate it can run itself. And the agent stops and reports at the decision points — it never fixes and merges in one motion. Adapt the commands to your stack; keep the order.

Slide 12 · Exercise: draw your current path from canvas to production

Time to map your own pipeline. Pick one design that recently shipped and trace the path it actually took — every step, and who did each one. Mark the gates that existed, and be honest about what each one required and what happened when it failed. Then mark the gaps: transitions with no check, checks with no owner, findings that were waved through. Now the key question: where were the defects in that work actually found? Anything found at the end probably belonged to an earlier gate. Finish by choosing the one stage you would tighten first, and write down what evidence its gate would require. One stage. Then run the next piece of work through it.

Slide 13 · Summary, and what comes next

Let's close the module. The pipeline replaces the handoff: the agent owns the mechanical middle, you own the artifact going in and the judgment coming out. A gate is a tool, an owner, and a named defect class — and when it fails, the work goes back, not forward. Token sync and parity checks are explicit steps with evidence, never assumptions. The automated checks catch the cheap defects; the expensive ones need a checklist and a human with the intent in mind. And measure defects by stage, so you tighten the stage responsible instead of piling review onto the end. Module 6 zooms out to the whole operation — cost, permissions, audit trails, and review capacity. See you there.