Agentic Design School
Module 4 of 6
40–50 minutes

Agentic Prototyping

Screenshot to Implementation Parity

When the design exists and the implementation must match it: parity checks with screenshot evidence, the spacing and state details agents reliably miss, and an iteration loop that converges instead of oscillating.

Duration: 40–50 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Set up a parity workflow: reference, implementation, and diff evidence side by side.
  • Identify the failure classes agents repeat: spacing drift, font fallbacks, missing states.
  • Run a convergent iteration loop with one named gap per round.
Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Screenshot to Implementation Parity

Agentic Prototyping · Module 4 of 6

  • Parity is measured against evidence, not asserted in a chat reply
  • The setup: reference, spec, implementation capture, and the diff between them
  • The drift agents repeat: spacing, type details, missing states
  • A loop that converges — one named gap per round, exit only through the gate

The previous modules explored directions. This one is about discipline: the design exists, and the build must match it in ways you can show.

Slide notes

Position this module against the two before it. Modules 2 and 3 were about generating from references and exploring directions, where the reference was an influence to borrow from. This module is the opposite situation: the design already exists — an approved Figma frame, a signed-off mockup, a page being rebuilt during a migration — and the implementation has to match it. The skill is no longer interpretation; it is verification.

Name the core failure up front: agents are good at producing something that looks roughly right and confidently calling it done. The phrase "the build closely matches the design" is not evidence. Parity in this module means a comparison a third person could check: the reference screenshot, the implementation screenshot, and a list of mismatches with severities and fixes.

Set expectations on tooling. The worked examples assume a coding agent that can run a browser capture script — Playwright in our examples — but the loop itself is tool-agnostic. If the audience cannot script captures yet, the manual version of the same loop still works: capture by hand, compare side by side, log mismatches in a file. The discipline is the content; the automation is the convenience.

Narration for this slide

Welcome to Module 4. The last two modules were about exploring — turning references into briefs, running several directions at once. This module is about the opposite moment: the design already exists, it has been approved, and the implementation has to match it. The trap here is that agents are very good at producing something that looks roughly right and then declaring success. So the theme of this module is simple: parity is measured, not asserted. We will set up a workflow where the reference, the implementation, and the differences between them sit side by side as evidence, and the loop only ends when a gate passes. Let's start with the setup.

Slide 2 of 13 · 16:9

The parity setup: reference, spec, capture, diff

Drift happens because the agent builds from an impression of the image. The setup forces it to build from a specification and be measured against the reference.

  • Reference: the image at full resolution, plus the viewport it was designed at
  • Spec: regions, grid, type scale, spacing rhythm, colour roles — extracted before any code
  • Capture: scripted screenshots of the build at the same widths, every round
  • Diff: a parity report naming observable mismatches, not an overall impression

Two structural fixes carry the workflow: a spec extracted before the build, and a parity gate after it.

Slide notes

Walk the four artifacts and why each one exists. The reference alone is not enough, because an image cannot be argued with precisely: people end up debating whether the spacing feels right. The spec turns the image into something checkable: layout regions, the grid, the type scale, the spacing rhythm, colour roles, and the components the agent can identify. Force measurements where they can be measured and ranges where they cannot; a spec that says "generous padding" will drift, while a spec that says "24px vertical rhythm with 32px section gaps" will not.
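One way to make "measurements where they can be measured and ranges where they cannot" concrete is to store each spec entry as either an exact value or a range, and check measurements against it. This is a minimal sketch, not the school's spec format: the `SpecValue` class, its field names, and the hero-height range are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpecValue:
    """One checkable spec entry: an exact value, or a low/high range (hypothetical shape)."""
    name: str
    exact: Optional[float] = None
    low: Optional[float] = None
    high: Optional[float] = None

    def check(self, measured: float, tolerance: float = 0.0) -> Optional[str]:
        """Return None when the measurement satisfies the spec, else a mismatch description."""
        if self.exact is not None:
            if abs(measured - self.exact) <= tolerance:
                return None
            return f"{self.name} is {measured:g}px; spec requires {self.exact:g}px"
        if self.low is not None and self.high is not None:
            if self.low <= measured <= self.high:
                return None
            return f"{self.name} is {measured:g}px; spec allows {self.low:g}-{self.high:g}px"
        # An entry like "generous padding" carries no number, so it cannot be checked.
        return f"{self.name} has no checkable value in the spec"


# The section gap comes from the notes; the hero-height range is an invented example.
section_gap = SpecValue("section gap", exact=32)
hero_height = SpecValue("hero height", low=480, high=560)

print(section_gap.check(40))
print(hero_height.check(500))
```

An entry with neither an exact value nor a range is reported as uncheckable, which is the code-level version of "a spec that says generous padding will drift".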

The spec is reviewed by a human before the build. That review is cheap — ten minutes on a text file — and it is where misreadings get caught. In one of the worked runs later in this module, the spec review caught that a plan card's emphasis was a border plus a background shift, not just a badge. Catching that in the spec cost one comment; catching it after the build would have cost a round.

The capture must be scripted so the comparison is like-for-like on every iteration: same route, same viewport sizes, same wait conditions. Hand-taken screenshots at slightly different widths produce noise that the parity check then reports as mismatches. And the diff is a report, not a feeling — the next slides define exactly what goes in it.
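A scripted capture could look roughly like the sketch below, which assumes Playwright for Python is installed. The base URL, the `/pricing` route, the `reference` file prefix, and the tablet width are assumptions for illustration; only the `parity/build/` folder and the `-1440.png` naming follow the file names used elsewhere in this module. The plan is built separately from the browser run so the same list of routes, widths, and file names is reused every round.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CaptureTarget:
    url: str
    width: int
    height: int
    path: Path


def build_capture_plan(base_url, route, name, widths, out_dir="parity/build", height=900):
    """Return one target per width, widest first, with a deterministic file name."""
    url = base_url.rstrip("/") + "/" + route.lstrip("/")
    return [
        CaptureTarget(url=url, width=w, height=height, path=Path(out_dir) / f"{name}-{w}.png")
        for w in sorted(set(widths), reverse=True)
    ]


def run_capture_plan(plan):
    """Take the screenshots. Needs Playwright for Python and an installed Chromium."""
    from playwright.sync_api import sync_playwright  # imported lazily so the plan works without it

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            for target in plan:
                target.path.parent.mkdir(parents=True, exist_ok=True)
                page = browser.new_page(viewport={"width": target.width, "height": target.height})
                # Same wait condition every round; "networkidle" is one possible choice.
                page.goto(target.url, wait_until="networkidle")
                page.screenshot(path=str(target.path), full_page=True)
                page.close()
        finally:
            browser.close()


# Hypothetical local dev server and route.
plan = build_capture_plan("http://localhost:3000", "/pricing", "reference", [1440, 768, 390])
for target in plan:
    print(target.width, target.path.as_posix())
```

Calling `run_capture_plan(plan)` after each patch pass keeps the route, widths, and wait condition fixed, so differences between rounds come from the build and not from the capture.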

Narration for this slide

Here is the setup that stops drift. Four artifacts. The reference image, at full resolution, with the viewport it was designed at written down. A spec the agent extracts from that image before writing any code: regions, grid, type scale, spacing rhythm, colour roles. You review the spec — that is a ten-minute read of a text file, and it is where misinterpretations get caught cheaply. Then a scripted capture of the build, taken the same way every round so comparisons are like for like. And finally the diff: a parity report that names observable mismatches, instead of an overall impression. Spec before the build, gate after it. Those two fixes carry the whole workflow.

Slide 3 of 13 · 16:9

The parity check loop

One agent, one session, one gate. The loop only exits through a passing verdict or an explicit human decision.

Loop diagram of the parity check workflow. A reference screenshot with its extracted spec and a scripted capture of the built implementation both feed an agent-run parity check, which compares them side by side and logs observable mismatches with severities from P0 to P3 and the smallest patch for each. A verdict gate decides the outcome: a pass or pass with accepted differences goes to a human sign-off where the parity report is kept with the build, while a fail sends the work to an agent patch pass that fixes one layer per round — structure, then spacing and type, then colour and states — recaptures at the same widths, and re-runs the check.
The reference and the build capture feed the parity check; mismatches are logged with severity and the smallest patch. The gate is the only exit: pass goes to human sign-off, fail goes to a patch pass and a recapture.

The agent runs the comparison and the patches. The human owns the spec, the accepted-differences list, and the sign-off.

Slide notes

Walk the diagram from the left. The reference and spec are the human-owned contract. The build capture is agent-run and scripted, taken at the reference viewport and the other standard widths. Both feed the parity check, which is also agent-run: it compares the captures against the reference and the spec and reports observable mismatches only — a region in the wrong place, a type size off the scale, a gap that breaks the rhythm, a colour that left its role. Each mismatch gets a severity and a proposed smallest patch.

The verdict gate is the structural point of the diagram. The gate passes when no P0 or P1 mismatches remain and every P2 has either been fixed or accepted by a human. Differences that come from deliberate token mapping are listed as accepted, not hidden. Crucially, the agent never declares parity itself; it produces the report and the verdict recommendation, and a human signs.
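The gate rule in the paragraph above can be written down as a small function. This is a sketch under assumptions: the `Mismatch` record and its `status` values are hypothetical, and treating P3 items as non-blocking is an interpretation of "P3 judgment calls go to a human". The function only recommends a verdict; the sign-off stays with a person.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    PASS = "PASS"
    PASS_WITH_ACCEPTED = "PASS WITH ACCEPTED DIFFERENCES"
    FAIL = "FAIL"


@dataclass(frozen=True)
class Mismatch:
    region: str
    detail: str
    severity: str          # "P0", "P1", "P2", or "P3"
    status: str = "open"   # "open", "fixed", or "accepted" (accepted by a human)


def recommend_verdict(mismatches):
    """Return (verdict, items): blocking items on FAIL, accepted items otherwise."""
    blocking = [
        m for m in mismatches
        # P0 and P1 must be fixed; a P2 may be fixed or accepted.
        if (m.severity in ("P0", "P1") and m.status != "fixed")
        or (m.severity == "P2" and m.status == "open")
    ]
    if blocking:
        return Verdict.FAIL, blocking
    accepted = [m for m in mismatches if m.status == "accepted"]
    if accepted:
        return Verdict.PASS_WITH_ACCEPTED, accepted
    return Verdict.PASS, []


report = [
    Mismatch("pricing cards", "padding is 24px; spec requires 32px", "P1"),
    Mismatch("plan card", "shadow mapped to nearest elevation token", "P2", "accepted"),
]
verdict, items = recommend_verdict(report)
print(verdict.value, [m.region for m in items])
```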

The fail path goes to a patch pass — one layer per round, which the convergence slide covers — then a recapture with the same script and a re-run of the check. This is a single-agent loop because the steps are sequential: the spec informs the build, the build produces the captures, the captures feed the check, the check decides whether to loop. There is nothing to parallelise for one screen; if you later need this across many screens, that becomes an orchestration problem outside this module's scope.

Narration for this slide

This is the loop the rest of the module fills in. On the left, the two inputs: the reference with its spec — that is your contract — and a scripted capture of the build, taken the same way every round. Both feed the parity check, where the agent compares them side by side and logs observable mismatches: region, value, severity, and the smallest patch that would fix it. Then the verdict gate. Pass, or pass with accepted differences, goes to your sign-off, and the report stays with the build. Fail goes to a patch pass — one layer per round — a recapture, and another check. The agent runs the comparison and the fixes. You own the spec, the accepted differences, and the signature.

Slide 4 of 13 · 16:9

What agents miss: the predictable failure classes

Screenshot builds fail in repeatable ways. Knowing the classes makes the parity check faster to read and the spec easier to write.

  • Spacing drift — padding and gaps relax to something more comfortable than the reference
  • Type drift — sizes off the scale, wrong weights, fallback fonts that flatten the hierarchy
  • State gaps — hover, focus, loading, empty, error, and disabled states the image never showed
  • Hierarchy drift — secondary actions get louder, primary content slips below supporting content
  • System drift — local one-off components and invented values instead of the team's tokens
  • Responsive drift — mobile stacks in DOM order instead of task order

These are not random mistakes. They are the centre of the training distribution pulling the build towards generic — which is exactly what the spec and the gate exist to resist.

Slide notes

Each of these classes shows up consistently in traced runs, which is good news: predictable failures are checkable failures. Spacing drift is the most common — row heights, section gaps, and card padding become more comfortable than the reference, and the density that made the design work disappears. In one of the worked runs, card padding came back at 24px against a 32px spec, which compressed the price block and weakened the emphasised plan. Type drift is similar: an 18px body where the spec maps body to the 16px token quietly breaks vertical rhythm everywhere.

State gaps deserve their own emphasis because the reference physically cannot show them. A screenshot is one moment of one state. Hover, focus, loading, empty, error, disabled, and long-content behaviour all have to come from the spec's "not visible in the reference" section, or the agent will either omit them or invent them. Either way, that is a decision a human should have made.

Hierarchy and system drift are the subtler ones. The build can match colours while changing what reads first; it can look right while being made of local components instead of the design system's, which costs you later when the prototype graduates. Responsive drift — mobile preserving visual stacking but losing task priority — is covered properly on the responsive slide, but it belongs in this list because it is the one that most often survives review on a desktop monitor.

Narration for this slide

Agents fail at screenshot builds in predictable ways, and predictable is good — it means you can check for them. Spacing drift: padding relaxes, gaps loosen, and the density that made the design work evaporates. Type drift: sizes that are off the scale, fallback fonts, weights that flatten the hierarchy. State gaps: the screenshot shows one moment of one state, so hover, focus, loading, empty, and error either go missing or get invented. Hierarchy drift: secondary things get louder. System drift: one-off components and made-up values instead of your tokens. And responsive drift: mobile stacks in DOM order instead of task order. None of this is bad luck. It is the pull towards generic, and the spec and the gate are how you resist it.

Slide 5 of 13 · 16:9

Tokens and components are the parity backbone

Parity of intent, not a pixel clone. Every value in the reference either maps to the existing system or becomes a recorded human decision.

Reference value | What happens to it
Body text colour #1A1A2E | Maps cleanly to text.primary
17px body copy | Snaps to the 16px body token
12-column grid at 1280px | Needs a decision: our system grid is 1200px
Card shadow with no token equivalent | Needs a decision: nearest elevation token or accept the gap
Reference brand colours and photography | Out of scope: brand questions, not build questions

The mapping step is what makes the workflow safe on competitor-inspired references: structure and density carry over, the brand does not.

Slide notes

This is the slide that separates parity from cloning. The goal is parity of intent: the structure, density, hierarchy, and rhythm of the reference, expressed in the team's own materials. Colours map to semantic tokens, type sizes snap to the nearest step in the existing scale, spacing snaps to the existing rhythm. Where the reference cannot be expressed in the current system, the spec records the gap as a decision for a human rather than letting the agent invent a one-off value.

Walk the table rows as the three possible outcomes. "Maps cleanly" is the easy case and should be most of the rows. "Snaps to scale" involves a small, deliberate difference from the reference, and that difference will later appear in the parity report as an accepted difference, which is exactly right: it is visible and signed for, not hidden. "Needs a decision" is the valuable category: a grid width mismatch or a missing elevation token is a real system question that the screenshot has surfaced, and someone with authority over the system should answer it, not the agent mid-build.

The last row matters for Module 2's clone-trap discussion: when the reference is competitor-inspired, the mapping step is the legal and taste firewall. The spec deliberately records only structure and density, and every colour and type value is mapped to the team's system before the build starts. Parity scoped that way is honest about what is being borrowed.
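The three mapping outcomes described in these notes can be sketched as a lookup plus a snap-to-scale step. The type scale below and the function names are hypothetical; only the `#1A1A2E` to `text.primary` mapping and the 17px to 16px snap come from the slide. On a tie between two steps, this sketch picks the smaller one, which is an arbitrary choice.

```python
TYPE_SCALE_PX = [12, 14, 16, 20, 24, 32]        # hypothetical team type scale
COLOUR_TOKENS = {"#1a1a2e": "text.primary"}     # mapping taken from the slide


def snap_to_scale(value_px, scale=TYPE_SCALE_PX):
    """Return the nearest step in the scale; ties go to the smaller step."""
    return min(scale, key=lambda step: (abs(step - value_px), step))


def map_reference_value(kind, value):
    """Classify a reference value as 'maps cleanly', 'snaps to scale', or 'needs a decision'."""
    if kind == "colour":
        token = COLOUR_TOKENS.get(str(value).lower())
        if token is not None:
            return {"outcome": "maps cleanly", "reference": value, "mapped": token}
    elif kind == "type_size":
        snapped = snap_to_scale(value)
        outcome = "maps cleanly" if snapped == value else "snaps to scale"
        return {"outcome": outcome, "reference": value, "mapped": snapped}
    # Anything the system cannot express is recorded for a human, never invented.
    return {"outcome": "needs a decision", "reference": value, "mapped": None}


for kind, value in [("colour", "#1A1A2E"), ("type_size", 17), ("shadow", "0 8px 24px rgba(0,0,0,.12)")]:
    print(map_reference_value(kind, value))
```

Every "snaps to scale" row is a candidate for the accepted-differences list, and every "needs a decision" row is a question for whoever owns the design system.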

Narration for this slide

Here is what keeps parity from turning into pixel worship. Every value in the reference goes through a mapping step, and there are only three outcomes. It maps cleanly to an existing token — that is most rows. It snaps to the nearest step in your scale — a 17-pixel body becomes your 16-pixel token, and that difference shows up later as an accepted difference in the report, visible and signed for. Or it needs a decision — the reference grid is 1280, yours is 1200; there is no matching elevation token. Those go to a human, because they are system questions, not build questions. And this mapping is exactly what makes the workflow safe on competitor-inspired references: the structure and density carry over, the brand does not.

Slide 6 of 13 · 16:9

One gap per round: keeping the loop convergent

Broad "fix everything" feedback makes the build oscillate. Naming one layer per round makes it converge.

  • Pass 1 — layout regions, ordering, and responsive structure
  • Pass 2 — spacing rhythm and the type scale
  • Pass 3 — colour roles, borders, shadows, and interaction states
  • After every pass: recapture with the same script, re-run the parity check
  • A screen that does not converge in two or three passes usually has a spec problem, not a build problem

Fixing colour before structure is wasted work — the structure fix will move everything anyway. Order the passes, and hold each round to its layer.

Slide notes

The failure mode this slide prevents is oscillation: the designer reports six problems at once, the agent fixes four, breaks two other things in the process, and the next report has six different problems. Each round looks busy and the build never settles. The fix is to constrain each round to one layer and hold the line on it: structure first, because every later measurement depends on regions being in the right place; then spacing rhythm and type, because they define the texture the colour work sits on; then colour, borders, shadows, and states last, because they are the cheapest to change and the most likely to be invalidated by earlier fixes.

The parity report supports this by carrying severities. P0 task-breaking and P1 hierarchy-or-rhythm items get fixed in their layer's pass; P2 polish waits its turn or gets accepted; P3 judgment calls go to a human rather than into the loop at all. Resist the temptation to let the agent fix a quick colour issue while it is doing the structure pass — that is how like-for-like comparison between rounds gets lost.

The last bullet is the diagnostic: most screens converge in two or three passes. A screen still failing after that usually has a spec problem — an ambiguous measurement, a region the spec never named, a token gap nobody decided — and the right move is to stop patching and fix the spec. That is also the signal that recurring fixes should be encoded upstream, in the spec template or the project's instruction file, so the next screen does not repeat them.
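The "one layer per round" rule can be expressed as a selection function: pick the earliest layer that still has open items, and hand only that layer's items to the patch pass. This is a sketch with hypothetical names; the layer labels, the dictionary shape, and the cut-off of three passes (taken from "two or three passes") are assumptions.

```python
PASS_ORDER = ("structure", "spacing_and_type", "colour_and_states")
MAX_PASSES = 3  # after this many passes, suspect the spec and stop patching


def next_patch_pass(open_items, passes_done=0):
    """Return (layer, items) for the next round, or ('review_spec', []) when patching should stop."""
    if open_items and passes_done >= MAX_PASSES:
        return "review_spec", []
    for layer in PASS_ORDER:
        # P3 judgment calls go to a human, not into the loop.
        batch = [i for i in open_items if i["layer"] == layer and i["severity"] != "P3"]
        if batch:
            return layer, batch
    return None, []


open_items = [
    {"layer": "colour_and_states", "severity": "P2", "what": "card shadow has no token"},
    {"layer": "spacing_and_type", "severity": "P1", "what": "card padding 24px vs 32px spec"},
    {"layer": "spacing_and_type", "severity": "P2", "what": "body 18px vs 16px token"},
    {"layer": "colour_and_states", "severity": "P3", "what": "badge tone"},
]
layer, batch = next_patch_pass(open_items, passes_done=1)
print(layer, [i["what"] for i in batch])
```

The colour item stays in the list but is not handed to the agent this round, which is what keeps rounds comparable.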

Narration for this slide

Now the discipline that makes the loop converge instead of oscillate. The temptation is to report everything at once — spacing, colours, a missing state, a font issue — and let the agent fix it all. What you get is a build that wobbles: four things fixed, two new things broken, and a different list next round. So: one layer per round. Structure and ordering first, because everything else depends on it. Then spacing rhythm and type. Then colour, borders, and states last. Recapture with the same script after every pass and re-run the check. Most screens converge in two or three passes. If yours does not, stop patching — the problem is almost always in the spec, not the build.

Slide 7 of 13 · 16:9

The parity check prompt

The check is itself a brief: compare these two captures against the spec, report observable mismatches, and do not patch anything yet.

Parity check prompt
Compare parity/reference/desktop.png with
parity/build/reference-1440.png against parity-spec.md.

Report only observable mismatches. For each:
- region
- what differs, with measured or estimated values
- severity: P0 task-breaking, P1 hierarchy or rhythm,
  P2 polish, P3 judgment call
- the smallest patch that would resolve it

List separately any differences that are expected results
of token mapping and should be accepted.
Do not patch anything yet.
End with a verdict: PASS, PASS WITH ACCEPTED DIFFERENCES,
or FAIL with the blocking items.

"Do not patch anything yet" is the line that keeps the report honest: comparison and correction are separate steps with a human between them.

Slide notes

Read the prompt as a structure rather than as magic words. "Observable mismatches only" is what keeps the report out of adjective territory: the agent must point at a region and a value, not say "the spacing feels tight". "Measured or estimated values" makes the report checkable, because a reviewer can open the two images and verify the claim. The severity scale is the triage that the convergence loop depends on: P0 means the task breaks, P1 means hierarchy or rhythm is wrong, P2 is polish, P3 is a judgment call that belongs with a human rather than in the loop.

The smallest patch per mismatch matters because it keeps fixes proportional. Without it, agents tend to propose rebuilding a section to fix a padding value. The accepted-differences list is the partner of the token-mapping slide: differences that exist because a value was deliberately snapped to the system are recorded, not hidden, and a human signs them.

The two closing instructions carry the governance. "Do not patch anything yet" separates comparison from correction, so the human sees the full picture before anything moves. The three-valued verdict (pass, pass with accepted differences, fail with blocking items) gives the gate a vocabulary. Without it, the agent's natural ending is an upbeat paragraph saying "the build closely matches the design", which is exactly the assertion this module exists to replace.

Narration for this slide

Here is the prompt that runs the gate. A few lines do all the work. Observable mismatches only — the agent has to point at a region and a value, not offer an impression. Each mismatch gets a severity, from P0 task-breaking down to P3 judgment call, and the smallest patch that would fix it — smallest, so a padding issue does not turn into a rebuilt section. Differences that exist because you deliberately mapped to your own tokens get listed as accepted, not buried. And then the two lines that keep it honest: do not patch anything yet, and end with a verdict — pass, pass with accepted differences, or fail with the blocking items. Comparison first, correction after, with you in between.

Slide 8 of 13 · 16:9

Good vs bad parity output

A weak report says the build is close. A strong one names regions, values, severities, and the smallest patch.

Weak report | Strong report
"The build closely matches the reference" | PASS WITH ACCEPTED DIFFERENCES: 0 P0, 0 P1, 2 accepted token-mapping differences listed
"Spacing is a bit off in the cards" | P1: card padding is 24px; spec requires 32px, which compresses the price block and weakens the emphasised plan
"Fonts look different" | P2: body renders at 18/28; spec maps body to the 16/24 token, breaking rhythm in the FAQ section
"Mobile looks fine" | P1: at 390px the summary panel drops below the fold; spec orders it second

The gate only works if its output is specific enough to act on — and specific enough to disagree with.

Slide notes

The left column is what agents produce by default, and it reads like progress: positive, fluent, and unfalsifiable. None of those statements can be checked, none can be assigned to a patch pass, and none can be disagreed with on evidence. The right column is the same information forced through the prompt from the previous slide: a verdict with counts, then mismatches with a region, a measured value, the spec value, the consequence, and a severity.

Point out what specificity buys beyond accuracy. It makes the review conversation short — the team discusses three named items instead of trading screenshots and adjectives in a thread. It makes the fix assignable — a 24px-versus-32px padding mismatch goes straight into the spacing pass. And it makes disagreement productive: a designer can look at the P2 about body size and decide the 18px rendering is actually fine for this page, accept it, and record that acceptance, which is a design decision rather than an oversight.

If you want a single habit from this slide, it is to reject the weak form when it appears. The first time an agent answers a parity check with "the build closely matches the design", send it back with the report structure. Agents follow the format they are held to; the quality of the gate is set by what you accept through it.
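Rejecting the weak form can be partly mechanical. The sketch below is a rough heuristic, not part of the module's workflow: it accepts a mismatch line only if it starts with a severity tag and contains at least one measured value (a pixel figure or a size/line-height pair such as 18/28). The function name and both regular expressions are assumptions, and a human still reads whatever passes.

```python
import re

SEVERITY_TAG = re.compile(r"^P[0-3]:")
MEASURED_VALUE = re.compile(r"\d+(?:\.\d+)?px\b|\d+/\d+")


def is_actionable(line: str) -> bool:
    """True when a report line carries a severity tag and at least one measured value."""
    text = line.strip().strip('"')
    return bool(SEVERITY_TAG.match(text)) and bool(MEASURED_VALUE.search(text))


lines = [
    "Spacing is a bit off in the cards",
    "P1: card padding is 24px; spec requires 32px",
    "P2: body renders at 18/28; spec maps body to the 16/24 token",
    "P1: fonts look different",
]
for line in lines:
    print("keep  " if is_actionable(line) else "reject", line)
```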

Narration for this slide

Here is the difference between a report you can act on and one you cannot. "The build closely matches the reference": that sentence sounds like progress and tells you nothing you can check, fix, or disagree with. Compare it with: P1, card padding is 24 pixels, the spec says 32, and the consequence is that the price block compresses and the emphasised plan loses its weight. That is checkable, it is assignable to the spacing pass, and you can even disagree with it and accept the difference deliberately, which is a design decision, not an oversight. The gate is only as good as the output you accept through it. The first time you get the weak version, send it back.

Slide 9 of 13 · 16:9

When parity is the wrong goal

Sometimes the right move is to depart from the reference. The discipline is doing it deliberately and recording it, not drifting into it.

  • The reference has a known flaw — contrast, density, or a state it never considered
  • The reference predates the current design system and the system has moved on
  • The reference is competitor-inspired — parity is scoped to structure and density only
  • Improvements are recorded in the spec as decisions, so the gate measures the right target
  • Whether to follow or improve is a human call — the gate cannot make it

Departing from the reference is fine. Departing from it accidentally, and discovering that in stakeholder review, is not.

Slide notes

A module about matching a reference needs this counterweight, otherwise it teaches pixel worship. There are legitimate reasons to depart from the reference: it has a contrast failure the brand colours cannot fix, it was designed before the current token set existed, it never considered an empty state that real data makes common, or it is a competitor-inspired concept where matching the brand would be both a legal and a taste problem. The parity gate cannot make that call — it can only measure against whatever target it is given. Choosing the target is design judgment and stays human.

The mechanism for departing safely is the spec. An improvement is written into the spec as a decision, for example "we are using our 4.5:1 contrast pair here even though the reference does not", and the gate then measures against the amended spec. The difference shows up in the accepted-differences list with a reason attached. That keeps the door open for an honest conversation with whoever approved the original design, because the departures are enumerable rather than discovered.

Contrast that with the failure mode: the build drifts away from the reference through a series of small unexamined choices, nobody records them, and the gap surfaces in a stakeholder review where the designer cannot say which differences were intentional. The cost is credibility, not just rework. Deliberate departure, recorded; accidental drift, prevented. That is the whole position.

Narration for this slide

One caution before the worked example: parity is not always the right goal. Sometimes the reference has a known flaw — a contrast failure, a missing empty state. Sometimes it predates your design system. Sometimes it is competitor-inspired and parity should only ever cover structure and density, never the brand. Departing from the reference in those cases is good design. The discipline is to do it deliberately: write the improvement into the spec as a decision, let the gate measure against the amended spec, and let the difference appear in the accepted list with a reason. What you must not do is drift away from the reference by accident and find out in a stakeholder review which differences nobody can explain. The gate measures the target. Choosing the target is your job.

Slide 10 of 13 · 16:9

Responsive parity: the breakpoints the reference never showed

Most references are one image at one width. Everything below that width is a decision, and undocumented decisions get made by the agent.

  • Record the reference viewport in the spec — parity claims only hold at captured widths
  • Write down the responsive expectations the image cannot show: what stacks, what collapses, what stays visible first
  • Mobile must preserve task order, not just stack blocks in DOM order
  • Capture and check the standard widths every round — desktop, tablet, mobile — not just the reference width
  • Gaps the spec did not cover come back as questions, not inventions

The gate cannot judge widths that were never captured. If 390px is not in the capture script, 390px is not checked.

Slide notes

This is the failure class most likely to survive review, because review usually happens on a desktop monitor at roughly the reference width. The reference is typically a single 1440px frame; everything below it — what stacks, what collapses behind a disclosure, what remains visible without scrolling, which panel comes first — is a decision the image cannot express. If the spec does not make those decisions, the agent will, and its default is to stack content in DOM order, which preserves the visual blocks while quietly losing the task priority.

Two concrete cases from the traced runs are worth telling. A dashboard build passed the check at the reference width and dropped a summary panel below the fold at 1280px; the gate caught it only because 1280px was in the capture script, and the patch reflowed the grid. And a conference-demo build had a headline that wrapped to three lines at the projector's 1280px width — caught the night before because the capture included that width, not because anyone thought to look.

The operational rule is mechanical: the capture script defines what parity means. Desktop at the reference width, a tablet width, and a mobile width on every round, with the spec carrying a short responsive section — even three lines of what stacks and in what order is enough to turn agent inventions into checkable claims. And keep the honest limit visible: a passing gate says nothing about widths and states that were never captured.
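Because the capture script defines what parity means, it can help to list the widths that were required but never captured before reading a verdict. A minimal sketch, assuming capture files are named `<name>-<width>.png` as in `reference-1440.png`; the required widths and the helper names are hypothetical.

```python
import re
from pathlib import Path

WIDTH_IN_NAME = re.compile(r"-(\d+)\.png$")


def captured_widths(paths):
    """Read the viewport width out of each capture file name."""
    widths = set()
    for p in paths:
        match = WIDTH_IN_NAME.search(Path(p).name)
        if match:
            widths.add(int(match.group(1)))
    return widths


def unchecked_widths(required, paths):
    """Widths the team expects to hold but that no capture covers, widest first."""
    return sorted(set(required) - captured_widths(paths), reverse=True)


captures = ["parity/build/reference-1440.png", "parity/build/reference-768.png"]
missing = unchecked_widths([1440, 1280, 768, 390], captures)
print("not checked at:", missing)
```

Any width in that list is outside what a passing gate can claim.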

Narration for this slide

Here is the gap that most often survives review: responsiveness. Your reference is one image at one width — usually 1440. Everything below that width is a decision the image cannot show: what stacks, what collapses, what the user must still see first. If the spec does not make those decisions, the agent will, and its default is to stack things in DOM order — the blocks survive, the task priority does not. So the spec gets a short responsive section, and the capture script includes desktop, tablet, and mobile on every round. In one traced run, the gate caught a summary panel dropping below the fold at 1280 only because 1280 was in the script. The rule is blunt: if a width is not captured, it is not checked.

Slide 11 of 13 · 16:9

Worked example: a marketing page to parity in four rounds

An approved 1440px pricing-page frame, an existing component library, and a deadline. Times are from the traced run, not a benchmark.

Round | What happened | Outcome
1. Spec | Agent extracted the spec; ten-minute human review caught that the middle plan's emphasis was border plus background, not just a badge | Spec corrected before any code
2. Build + first check | Build passed structure; gate returned FAIL with 9 mismatches, including 18px body vs 16px token, 24px card padding vs 32px spec, and a generic accordion instead of the system one | Patch list with severities
3. Spacing and type pass | Padding, body size, and rhythm fixed; recapture; accordion swapped to the design-system component | 2 mismatches remaining
4. Final check | Gate returned PASS WITH ACCEPTED DIFFERENCES: one shadow mapped to the nearest elevation token | Human sign-off; report kept with the build

About two hours end to end — and the review conversation was three named decisions instead of a thread of screenshots and adjectives.

Slide notes

This run is drawn from the school's published parity workflow, and the numbers are from that traced run rather than a controlled benchmark — say so. The team had an approved 1440px pricing frame and a deadline, which is the typical context for this workflow: the design conversation was over, the implementation conversation had not started.

The round worth dwelling on is the first one. The spec extraction took the agent one pass, and the human review took about ten minutes. That review caught the most expensive potential miss of the run: the middle plan's emphasis was achieved by a border plus a background shift, not just the badge. An agent building from an impression of the image would likely have shipped the badge alone, and the mismatch would have surfaced as a vague "the middle card does not pop" comment late in review. One spec comment versus a late aesthetic argument is the economics of the whole module.

The first gate run returning nine mismatches is normal, not a failure: the point of the gate is to find them while they are cheap. Note also what the failures were — body size off the token, padding compressed, a generic accordion instead of the system one — exactly the failure classes from earlier in the module. Two patch passes later the gate returned pass with accepted differences, the only accepted item being a shadow mapped to the nearest elevation token, signed by the designer. The deliverable at the end is a pair: the built screen and a parity report someone can sign. That pair is what makes the handoff in Module 6 honest.
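The gate in this run can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the school's tooling: the `Mismatch` fields, the `verdict` function, and the rule that any mismatch not yet accepted by a human fails the gate are all hypothetical. The severities on the body-size, accordion, and shadow items are also assumed; only the P1 on card padding appears in the module.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0  # task-breaking
    P1 = 1
    P2 = 2
    P3 = 3  # judgment call


@dataclass(frozen=True)
class Mismatch:
    region: str
    observed: str
    expected: str
    severity: Severity
    accepted: bool = False  # set only by a human sign-off


def verdict(mismatches):
    """Return (verdict, blocking items) for one round.

    Assumed rule: every mismatch must be fixed or explicitly accepted,
    so any unaccepted item blocks the gate.
    """
    blocking = sorted(
        (m for m in mismatches if not m.accepted),
        key=lambda m: m.severity,
    )
    if blocking:
        return "FAIL", blocking
    if mismatches:
        return "PASS WITH ACCEPTED DIFFERENCES", []
    return "PASS", []


# Three of the nine round-two findings named in the worked example.
round_2 = [
    Mismatch("body text", "18px", "16px token", Severity.P2),  # severity assumed
    Mismatch("cards", "24px padding", "32px padding", Severity.P1),
    Mismatch("accordion", "generic accordion", "design-system accordion",
             Severity.P2),  # severity assumed
]

# The single item left at the final check, signed off by the designer.
round_4 = [
    Mismatch("shadow", "nearest elevation token", "reference shadow",
             Severity.P3, accepted=True),  # severity assumed
]
```

Under these assumptions `verdict(round_2)` fails with the card padding listed first, and `verdict(round_4)` passes with accepted differences because the only remaining item carries a sign-off.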

Narration for this slide

Let's trace a real run. An approved 1440-pixel pricing page, an existing component library, a deadline. Round one: the agent extracts the spec, and a ten-minute human review catches that the middle plan's emphasis is a border plus a background shift, not just a badge — one comment that would have been an expensive argument later. Round two: the build passes structure but the gate fails it with nine mismatches — 18-pixel body against the 16-pixel token, card padding compressed, a generic accordion instead of the system one. Round three fixes spacing and type and swaps the component. Round four: pass with accepted differences — one shadow mapped to the nearest elevation token. About two hours, and the review was three named decisions, not a thread of adjectives.

Slide 12 of 13 · 16:9

Exercise: run one parity round on an existing implementation

Take a screen you or your team has already built from a design, and run a single round of the loop on it. One round, not the whole convergence.

  • Pick a built screen that has a reference: a Figma frame, a mockup, or the old page it replaced
  • Write a minimal spec from the reference: regions, type scale, spacing rhythm, colour roles — measurements where you can
  • Capture the implementation at the reference width, and one mobile width
  • Run the parity check prompt and require severities, values, and a verdict
  • Sort the result: which mismatches are real, which are acceptable token mapping, and which the spec should have decided

Keep the report. Module 5 turns this single check into a repeatable QA sweep across routes, states, and breakpoints.

Slide notes

The exercise deliberately starts from an existing implementation rather than a fresh build, because that removes the build time and isolates the skill being practised: extracting a spec, running the check, and reading the report. Most participants are surprised by what one round surfaces on a screen the team considered finished — typically two or three rhythm or type mismatches and at least one missing state.

Steer people towards a bounded screen with a real reference: a settings page, a pricing page, a dashboard panel. The minimal spec does not need to be exhaustive — regions, type scale, spacing rhythm, and colour roles are enough for one round — but push for measurements over adjectives, because the quality of the report tracks the precision of the spec. The mobile capture is non-negotiable even in the minimal version; it is where the most instructive findings usually are.
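For the capture step, the bookkeeping matters more than the screenshot tool: the same captures every round, and a visible warning when a width the spec names was never captured. The sketch below covers only that bookkeeping and takes no screenshots. The function names, the file-naming scheme, and the 390px mobile width are assumptions for illustration; 1440 is the reference width from the worked example.

```python
def capture_plan(route, widths):
    """Map each viewport width to a stable screenshot file name."""
    slug = route.strip("/").replace("/", "-") or "home"
    return {width: f"{slug}-{width}.png" for width in widths}


def unchecked_widths(required, captured):
    """Widths the spec names that no capture covers."""
    return sorted(set(required) - set(captured))


# Hypothetical exercise setup: reference width plus one mobile width.
plan = capture_plan("/pricing", [1440, 390])

# If only the desktop capture was taken, the mobile width is unchecked.
missing = unchecked_widths(required=[1440, 390], captured=[1440])
```

Keeping the file names stable across rounds is what makes each recapture comparable with the previous one.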

The sorting step is the actual learning objective. A raw list of mismatches is not the output; the output is the triage — real defects to fix, acceptable differences from deliberate token mapping, and gaps that exist because the spec never made a decision. That last category is the one to discuss if running this live, because it shows participants that most parity arguments are missing constraints rather than agent mistakes. Keep the report: Module 5 builds the QA matrix on top of exactly this kind of evidence.
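If participants want to record the triage rather than only discuss it, a minimal sketch might look like the following. The bucket names paraphrase the three categories above, the `triage` function is hypothetical, and the bucket assigned to each finding is a human judgment that the code only groups. The sample findings are illustrative, borrowed from examples elsewhere in the module.

```python
BUCKETS = ("real defect", "accepted token mapping", "spec gap")


def triage(findings):
    """Group (description, bucket) pairs by bucket, rejecting unknown buckets."""
    grouped = {bucket: [] for bucket in BUCKETS}
    for description, bucket in findings:
        if bucket not in grouped:
            raise ValueError(f"unknown bucket: {bucket!r}")
        grouped[bucket].append(description)
    return grouped


# Illustrative findings only; a real report supplies its own.
report = triage([
    ("card padding 24px, spec says 32px", "real defect"),
    ("17px reference body snapped to 16px token", "accepted token mapping"),
    ("mobile stacking order never specified", "spec gap"),
])
```

The "spec gap" list is the one to bring to the discussion, since each entry is a decision the spec still owes.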

Narration for this slide

Your turn. Take a screen that has already been built from a design — something with a real reference, a Figma frame or the old page it replaced. Write a minimal spec from that reference: regions, type scale, spacing rhythm, colour roles, with measurements where you can get them. Capture the implementation at the reference width and at one mobile width. Run the parity check prompt, and insist on severities, values, and a verdict. Then do the part that matters: sort the findings into real mismatches, acceptable token-mapping differences, and gaps that exist because the spec never decided. One round is enough. Keep the report — Module 5 turns this into a repeatable sweep.

Slide 13 of 13 · 16:9

Summary, and the bridge to visual QA loops

  • Parity is measured: reference, spec, scripted capture, and a diff with severities, not an asserted "close enough"
  • The spec before the build and the gate after it are the two structural fixes; the spec review is the cheapest correction
  • Agents drift in predictable classes — spacing, type, states, hierarchy, system, responsive — so check for exactly those
  • Map reference values to your tokens; record gaps and improvements as human decisions, not inventions
  • Converge with one layer per round, and exit only through a verdict a human signs

Module 5 widens this from one screen against one reference to a QA matrix: routes, states, and breakpoints swept with screenshot evidence on every change.

Slide notes

Recap by connecting the bullets back to the loop diagram rather than repeating them flatly. The setup — reference, spec, capture, diff — exists because drift happens when the agent builds from an impression. The failure classes are predictable, which is what makes the parity check a checklist rather than an open-ended judgment. The token mapping keeps parity honest about whose system the screen belongs to, and the one-layer-per-round discipline is what makes the loop converge in two or three passes instead of oscillating indefinitely.
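When recapping the token mapping, it can help to show its three outcomes as a small function: a value maps cleanly, snaps to the nearest step, or goes to a human for a decision. The sketch below assumes a numeric scale and a 2px snapping tolerance. The module defines neither, so the scale values, the tolerance, and the `map_to_token` name are all hypothetical.

```python
def map_to_token(value, scale, tolerance=2):
    """Classify a reference value against a token scale.

    Returns (outcome, token), where outcome is 'maps', 'snaps', or
    'decision'. The tolerance is an assumed threshold, not a rule
    taken from the module.
    """
    if not scale:
        return ("decision", None)  # no tokens of this kind exist
    nearest = min(scale, key=lambda token: abs(token - value))
    if nearest == value:
        return ("maps", nearest)
    if abs(nearest - value) <= tolerance:
        return ("snaps", nearest)  # report later as an accepted difference
    return ("decision", None)  # a system question for a human


# Hypothetical type scale in pixels.
type_scale = [12, 14, 16, 20, 24]

body = map_to_token(17, type_scale)   # the 17px body from the module
grid = map_to_token(1280, [1200])     # reference grid vs system grid
```

A "snaps" result is the kind of row that should reappear in the parity report as an accepted difference, and a "decision" result should never be resolved by the agent on its own.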

Restate the limits so the module does not overclaim. The gate proves the build matches the reference in observable ways at the captured widths. It cannot prove the design works for users, cannot validate hover, motion, or loading behaviour the reference never showed, cannot prove accessibility, and cannot decide whether parity was even the right goal. Those stay with the humans who own the spec and the sign-off.

Preview Module 5 concretely: this module checked one screen against one reference at the moment of building it. Visual QA loops generalise that into a matrix — routes, states, breakpoints, and themes — captured and compared against baselines so regressions are caught on every change, not just on the day the screen was built. The parity report from this module's exercise is the seed of that baseline.

Narration for this slide

Let's close the module. Parity is measured, not asserted: the reference, a spec extracted from it, scripted captures of the build, and a diff with severities and a verdict. The spec review before the build is the cheapest correction you will make; the gate after it is the only honest exit. The drift is predictable — spacing, type, states, hierarchy, system, responsive — so the check looks for exactly that. Reference values map to your tokens, and departures are recorded decisions, not accidents. And the loop converges because each round fixes one layer. The limits are real too: the gate proves observable parity at captured widths, nothing more. In Module 5 we widen the lens — from one screen at build time to a visual QA matrix that sweeps routes, states, and breakpoints on every change. See you there.

Module transcript
Module 4, narrated slide by slide

Slide 1 · Screenshot to Implementation Parity

Welcome to Module 4. The last two modules were about exploring — turning references into briefs, running several directions at once. This module is about the opposite moment: the design already exists, it has been approved, and the implementation has to match it. The trap here is that agents are very good at producing something that looks roughly right and then declaring success. So the theme of this module is simple: parity is measured, not asserted. We will set up a workflow where the reference, the implementation, and the differences between them sit side by side as evidence, and the loop only ends when a gate passes. Let's start with the setup.

Slide 2 · The parity setup: reference, spec, capture, diff

Here is the setup that stops drift. Four artifacts. The reference image, at full resolution, with the viewport it was designed at written down. A spec the agent extracts from that image before writing any code: regions, grid, type scale, spacing rhythm, colour roles. You review the spec — that is a ten-minute read of a text file, and it is where misinterpretations get caught cheaply. Then a scripted capture of the build, taken the same way every round so comparisons are like for like. And finally the diff: a parity report that names observable mismatches, instead of an overall impression. Spec before the build, gate after it. Those two fixes carry the whole workflow.

Slide 3 · The parity check loop

This is the loop the rest of the module fills in. On the left, the two inputs: the reference with its spec — that is your contract — and a scripted capture of the build, taken the same way every round. Both feed the parity check, where the agent compares them side by side and logs observable mismatches: region, value, severity, and the smallest patch that would fix it. Then the verdict gate. Pass, or pass with accepted differences, goes to your sign-off, and the report stays with the build. Fail goes to a patch pass — one layer per round — a recapture, and another check. The agent runs the comparison and the fixes. You own the spec, the accepted differences, and the signature.

Slide 4 · What agents miss: the predictable failure classes

Agents fail at screenshot builds in predictable ways, and predictable is good — it means you can check for them. Spacing drift: padding relaxes, gaps loosen, and the density that made the design work evaporates. Type drift: sizes that are off the scale, fallback fonts, weights that flatten the hierarchy. State gaps: the screenshot shows one moment of one state, so hover, focus, loading, empty, and error either go missing or get invented. Hierarchy drift: secondary things get louder. System drift: one-off components and made-up values instead of your tokens. And responsive drift: mobile stacks in DOM order instead of task order. None of this is bad luck. It is the pull towards generic, and the spec and the gate are how you resist it.

Slide 5 · Tokens and components are the parity backbone

Here is what keeps parity from turning into pixel worship. Every value in the reference goes through a mapping step, and there are only three outcomes. It maps cleanly to an existing token — that is most rows. It snaps to the nearest step in your scale — a 17-pixel body becomes your 16-pixel token, and that difference shows up later as an accepted difference in the report, visible and signed for. Or it needs a decision — the reference grid is 1280, yours is 1200; there is no matching elevation token. Those go to a human, because they are system questions, not build questions. And this mapping is exactly what makes the workflow safe on competitor-inspired references: the structure and density carry over, the brand does not.

Slide 6 · One gap per round: keeping the loop convergent

Now the discipline that makes the loop converge instead of oscillate. The temptation is to report everything at once — spacing, colours, a missing state, a font issue — and let the agent fix it all. What you get is a build that wobbles: four things fixed, two new things broken, and a different list next round. So: one layer per round. Structure and ordering first, because everything else depends on it. Then spacing rhythm and type. Then colour, borders, and states last. Recapture with the same script after every pass and re-run the check. Most screens converge in two or three passes. If yours does not, stop patching — the problem is almost always in the spec, not the build.

Slide 7 · The parity check prompt

Here is the prompt that runs the gate. A few lines do all the work. Observable mismatches only — the agent has to point at a region and a value, not offer an impression. Each mismatch gets a severity, from P0 task-breaking down to P3 judgment call, and the smallest patch that would fix it — smallest, so a padding issue does not turn into a rebuilt section. Differences that exist because you deliberately mapped to your own tokens get listed as accepted, not buried. And then the two lines that keep it honest: do not patch anything yet, and end with a verdict — pass, pass with accepted differences, or fail with the blocking items. Comparison first, correction after, with you in between.

Slide 8 · Good vs bad parity output

Here is the difference between a report you can act on and one you cannot. "The build closely matches the reference": that sentence sounds like progress and tells you nothing you can check, fix, or disagree with. Compare it with: P1, card padding is 24 pixels, the spec says 32, and the consequence is that the price block compresses and the emphasised plan loses its weight. That is checkable, it is assignable to the spacing pass, and you can even disagree with it and accept the difference deliberately, which is a design decision, not an oversight. The gate is only as good as the output you accept through it. The first time you get the weak version, send it back.

Slide 9 · When parity is the wrong goal

One caution before the worked example: parity is not always the right goal. Sometimes the reference has a known flaw — a contrast failure, a missing empty state. Sometimes it predates your design system. Sometimes it is competitor-inspired and parity should only ever cover structure and density, never the brand. Departing from the reference in those cases is good design. The discipline is to do it deliberately: write the improvement into the spec as a decision, let the gate measure against the amended spec, and let the difference appear in the accepted list with a reason. What you must not do is drift away from the reference by accident and find out in a stakeholder review which differences nobody can explain. The gate measures the target. Choosing the target is your job.

Slide 10 · Responsive parity: the breakpoints the reference never showed

Here is the gap that most often survives review: responsiveness. Your reference is one image at one width — usually 1440. Everything below that width is a decision the image cannot show: what stacks, what collapses, what the user must still see first. If the spec does not make those decisions, the agent will, and its default is to stack things in DOM order — the blocks survive, the task priority does not. So the spec gets a short responsive section, and the capture script includes desktop, tablet, and mobile on every round. In one traced run, the gate caught a summary panel dropping below the fold at 1280 only because 1280 was in the script. The rule is blunt: if a width is not captured, it is not checked.

Slide 11 · Worked example: a marketing page to parity in four rounds

Let's trace a real run. An approved 1440-pixel pricing page, an existing component library, a deadline. Round one: the agent extracts the spec, and a ten-minute human review catches that the middle plan's emphasis is a border plus a background shift, not just a badge — one comment that would have been an expensive argument later. Round two: the build passes structure but the gate fails it with nine mismatches — 18-pixel body against the 16-pixel token, card padding compressed, a generic accordion instead of the system one. Round three fixes spacing and type and swaps the component. Round four: pass with accepted differences — one shadow mapped to the nearest elevation token. About two hours, and the review was three named decisions, not a thread of adjectives.

Slide 12 · Exercise: run one parity round on an existing implementation

Your turn. Take a screen that has already been built from a design — something with a real reference, a Figma frame or the old page it replaced. Write a minimal spec from that reference: regions, type scale, spacing rhythm, colour roles, with measurements where you can get them. Capture the implementation at the reference width and at one mobile width. Run the parity check prompt, and insist on severities, values, and a verdict. Then do the part that matters: sort the findings into real mismatches, acceptable token-mapping differences, and gaps that exist because the spec never decided. One round is enough. Keep the report — Module 5 turns this into a repeatable sweep.

Slide 13 · Summary, and the bridge to visual QA loops

Let's close the module. Parity is measured, not asserted: the reference, a spec extracted from it, scripted captures of the build, and a diff with severities and a verdict. The spec review before the build is the cheapest correction you will make; the gate after it is the only honest exit. The drift is predictable — spacing, type, states, hierarchy, system, responsive — so the check looks for exactly that. Reference values map to your tokens, and departures are recorded decisions, not accidents. And the loop converges because each round fixes one layer. The limits are real too: the gate proves observable parity at captured widths, nothing more. In Module 5 we widen the lens — from one screen at build time to a visual QA matrix that sweeps routes, states, and breakpoints on every change. See you there.