Agentic Design School
Module 5 of 6
40–50 minutes

Agentic Prototyping

Visual QA Loops

Making visual quality checkable: screenshot sweeps across states and breakpoints, regression comparisons against a baseline, and findings that arrive with evidence a designer can act on in minutes.

Duration: 40–50 minutes

Slides: 14 slides with notes and narration

Learning objectives

  • Define a screenshot matrix of routes, states, and breakpoints worth checking.
  • Run agent-driven visual QA sweeps and read the results efficiently.
  • Establish baselines so regressions are caught against something real.
Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 14

Visual QA Loops

Agentic Prototyping · Module 5 of 6

  • If it is not screenshotted, it is not checked
  • The QA matrix: routes, states, breakpoints, themes
  • Baselines, sweeps, and findings that name their evidence
  • The human gate on what counts as done

Module 4 measured one page against one reference. This module makes visual quality checkable across the whole prototype, repeatedly.

Slide notes

Position this module against the previous one. Module 4 was about parity: one page, one reference, one convergent loop. Visual QA is the wider net — every route the prototype has, every state that matters, every breakpoint, checked the same way every time. The skills overlap heavily; what changes is scale and repeatability.

The core claim of the module is in the first bullet: a generated interface can compile, load, and pass every functional test while still failing the design. Spacing rhythm, density, contrast, the empty state nobody designed, the layout that holds at 1440 pixels and collapses at 390 — none of that shows up in a type check, and most of it slides past code review because the diff looks reasonable. The only honest check is looking at what a user would actually see, and the only scalable way to look is screenshots an agent captures and compares on your behalf.

Flag the boundary early because it shapes everything else: visual QA with an agent is a review activity, not a license to change the UI. The agent captures, compares, and reports. A human decides what counts as a defect, what counts as taste, and what counts as done. That gate is not overhead; it is the difference between a QA loop and an agent quietly redesigning the prototype one confident finding at a time.

Narration for this slide

Welcome to module five. By now you can get a prototype built and hold one page to parity with its reference. This module is about everything else — the routes you did not stare at, the states you forgot existed, the phone width you never opened. The principle is blunt: if it is not screenshotted, it is not checked. We will define a QA matrix of routes, states, and breakpoints, set up baselines so regressions are caught against something real, run agent-driven sweeps that return findings with evidence, and keep one rule fixed throughout: the agent reports, and you decide what counts as done.

Slide 2 of 14

Why visual QA is its own loop

A prototype can compile, load, and pass functional tests while still failing the design.

  • Visual failures hide from code review: the diff looks reasonable, the page does not
  • Most bugs appear between states and widths, not on the desktop default view
  • Evidence beats opinion: screenshots, accessibility snapshots, and check output the agent can read
  • Repetition is the agent's advantage — the fortieth capture is identical to the first

Visual QA gives the agent evidence to compare against intent, instead of asking it to judge its own work from code alone.

Slide notes

The failure modes worth naming concretely: a page that uses the right component library but misses the intended hierarchy; copy that matches the brief while the rhythm is broken; a layout that holds at 1440 pixels and falls apart at 390; an empty state that was never designed so the agent invented one. None of these surface in a type check or a unit test, and they rarely surface in code review either, because the code is plausible. They surface in front of stakeholders, which is the most expensive place to find them.

The second point is where most teams under-invest. One desktop screenshot of the happy path is the default evidence, and it is nowhere near enough. The bugs live in the combinations — the error state at phone width, the dense list at tablet width, the dark theme nobody re-checked after the token change.

The last bullet explains why this pairs so well with agents specifically. Capturing the same routes at the same three widths, re-running the same accessibility scan, and re-reading the same checklist after every fix pass is exactly the kind of repetitive, evidence-heavy work humans do badly on the fourth iteration and agents do identically on the fortieth. The human contribution is not the repetition; it is deciding what the evidence means.

Narration for this slide

Here is why visual QA needs its own loop. A prototype can compile, load, and pass every functional test and still fail the design. The hierarchy is wrong, the rhythm is broken, the empty state was invented, the phone layout collapses — and none of that shows up in a type check or a code review, because the diff looks fine. Visual bugs live in the combinations you did not look at: states crossed with widths crossed with themes. So we give the agent evidence — screenshots, accessibility snapshots, check output — and a procedure. The agent is good at the boring part: capturing the same things the same way, every single run. You stay good at the part that matters: judging what the evidence means.

Slide 3 of 14

The QA matrix: routes × states × breakpoints

The sweep is only as good as its list. Write the matrix down and keep it in the repo, reviewed like code.

  • Routes: every page the prototype claims to have, plus the one you keep forgetting
  • States: empty, loading, error, selected, expanded, dense — the ones nobody designs
  • Breakpoints: fix a small set and never vary them; 390, 768, and 1440 cover most products
  • Themes and density modes if the prototype has them
  • One intent line per route: what the page is for, so differences can be judged
sweep-manifest.json (excerpt)
{
  "viewports": [
    { "name": "mobile", "width": 390 },
    { "name": "tablet", "width": 768 },
    { "name": "desktop", "width": 1440 }
  ],
  "pages": [
    { "id": "dashboard", "route": "/dashboard",
      "intent": "Dense triage view; queue stays above summaries on mobile.",
      "states": ["default", "empty", "error"] },
    { "id": "billing", "route": "/settings/billing",
      "intent": "Plan comparison; pricing figures must stay aligned.",
      "states": ["default", "past-due"] }
  ]
}

The intent line is the cheapest item in the matrix and the one most teams forget. Without it, every difference looks equally important.

Slide notes

The matrix is a design artifact, not test infrastructure. Writing it forces the questions a prototype review needs answered anyway: which routes actually matter, which states exist, and what each page is for. Keep it in the repository as a manifest file so it is reviewed like code and grows with the prototype — a page that is not in the manifest is a page that will never be checked.

On breakpoints: the specific numbers matter less than fixing them and never varying them. This school's own captures use 390, 768, and 1440 pixels; add 1024 if tablets matter to your audience. If every run uses slightly different widths, findings stop being comparable across runs and the whole baseline idea collapses.

States are where prototypes are weakest, because prototypes are usually built against the happy path with friendly demo data. Empty, loading, error, and dense are the four to insist on; selected, expanded, and disabled follow for interactive surfaces. The intent line deserves its own emphasis: one sentence per route describing the user job. It is what lets a compare agent — or a human — distinguish a difference that breaks the page's purpose from a difference that is merely a difference.
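One way to make the matrix concrete is to expand every page, state, and viewport into a single capture job with a stable file name. The sketch below assumes the manifest shape shown in the excerpt on this slide. The `expand_manifest` helper and the `id--state--width.png` naming scheme are illustrative assumptions, not part of any published workflow.

```python
import json

# Same shape as the sweep-manifest.json excerpt on this slide.
MANIFEST = json.loads("""
{
  "viewports": [
    {"name": "mobile", "width": 390},
    {"name": "tablet", "width": 768},
    {"name": "desktop", "width": 1440}
  ],
  "pages": [
    {"id": "dashboard", "route": "/dashboard",
     "intent": "Dense triage view; queue stays above summaries on mobile.",
     "states": ["default", "empty", "error"]},
    {"id": "billing", "route": "/settings/billing",
     "intent": "Plan comparison; pricing figures must stay aligned.",
     "states": ["default", "past-due"]}
  ]
}
""")


def expand_manifest(manifest):
    """Return one capture job per page x state x viewport, in manifest order."""
    jobs = []
    for page in manifest["pages"]:
        for state in page["states"]:
            for viewport in manifest["viewports"]:
                jobs.append({
                    "page_id": page["id"],
                    "route": page["route"],
                    "state": state,
                    "width": viewport["width"],
                    "intent": page["intent"],
                    # Hypothetical naming scheme: stable across runs so that
                    # baseline and current captures can be paired by name.
                    "filename": f"{page['id']}--{state}--{viewport['width']}.png",
                })
    return jobs


jobs = expand_manifest(MANIFEST)
print(len(jobs))            # 15: (3 + 2) states x 3 viewports
print(jobs[0]["filename"])  # dashboard--default--390.png
```

Because each job carries its route's intent line, whatever compares the captures later can judge a difference against the page's purpose rather than against pixels alone.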

Narration for this slide

The sweep is only as good as its list, so the first artifact is a matrix: routes, crossed with states, crossed with breakpoints. Routes — every page the prototype claims to have, including the one you keep forgetting. States — empty, loading, error, dense; the ones nobody designs and agents invent. Breakpoints — pick a small set, like 390, 768, and 1440, and never vary them, because comparability is the whole point. And for every route, write one intent line: what this page is for. That sentence is the cheapest thing in the matrix, and it is what turns a pile of differences into findings someone can judge. Keep the manifest in the repo and review it like code.

Slide 4 of 14

Baselines: what the comparison is against

A regression is only meaningful relative to something. Decide what that something is before the first sweep.

  • Reference baseline: design exports or the parity reference from Module 4
  • Approved baseline: captures from the last build a human signed off
  • Intent baseline: DESIGN.md rules, tokens, and the route's intent line
  • Keep baselines on disk with a manifest — not in anyone's memory
  • Promote new captures to baseline only at the human gate, never automatically

Without a baseline, every sweep is a one-time subjective review nobody can reproduce three weeks later.

Slide notes

Three kinds of baseline cover most situations, and they answer different questions. A reference baseline — design exports or the screenshots you held parity against in Module 4 — answers does the build still match the design. An approved baseline — the captures from the last build a human signed off — answers did anything change that nobody intended; this is the regression question, and it is the one that matters most once the prototype is moving fast. The intent baseline — the design-system rules, the tokens, and the per-route intent line — answers does the page still do its job, and it catches the failures the other two cannot, because it is not pinned to any particular set of pixels.

The mechanics matter as much as the concept. Baselines live on disk, next to a manifest recording route, viewport, and capture date, so the evidence survives the session and the next review compares against the same thing rather than a memory of it. Saving captures to disk also avoids the token cost of pasting full-page screenshots into a conversation, which practitioner reports have measured in the six figures for a single image.

The last bullet is the discipline point: the baseline only moves when a human promotes it at the gate. If the agent updates the baseline whenever a sweep passes, regressions get silently absorbed into the new normal and the baseline stops meaning anything.
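The promotion rule can be enforced in code rather than by convention. The following is a minimal sketch, assuming a baseline manifest that records route, viewport, and capture date as described above. The `Capture` record, the `promote_to_baseline` function, and the example dates are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Capture:
    route: str
    viewport: int
    captured_on: date
    filename: str


class PromotionRefused(Exception):
    """Raised when a promotion is attempted without a named human approver."""


def promote_to_baseline(baseline, current, approved_by: Optional[str]):
    """Return a new baseline manifest in which `current` replaces matching entries.

    `baseline` and `current` are lists of Capture, matched on (route, viewport).
    The function refuses to run unless a person is named as approver.
    """
    if not approved_by or not approved_by.strip():
        raise PromotionRefused("baseline moves only at the human gate")
    merged = {(c.route, c.viewport): c for c in baseline}
    merged.update({(c.route, c.viewport): c for c in current})
    return {
        "approved_by": approved_by,
        "captures": sorted(merged.values(), key=lambda c: (c.route, c.viewport)),
    }


old = [Capture("/dashboard", 390, date(2026, 6, 1), "dashboard--default--390.png")]
new = [Capture("/dashboard", 390, date(2026, 6, 8), "dashboard--default--390.png")]

try:
    # A passing sweep is not an approval, so this call is refused.
    promote_to_baseline(old, new, approved_by=None)
except PromotionRefused as err:
    print(err)

result = promote_to_baseline(old, new, approved_by="design lead")
print(result["captures"][0].captured_on)
```

Recording who approved the promotion next to the captures keeps the decision reviewable three weeks later, alongside the evidence it was based on.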

Narration for this slide

A regression is only a regression relative to something, so decide what that something is before the first sweep. There are three useful baselines. The reference baseline — your design exports or the parity reference from last module — asks whether the build still matches the design. The approved baseline — captures from the last version a human signed off — asks whether anything changed that nobody intended. And the intent baseline — your design system rules and that one-line statement of what each page is for — catches failures the pixels cannot. Whichever you use, keep it on disk with a manifest, and only promote new captures to baseline at the human gate. If the baseline moves automatically, regressions just become the new normal.

Slide 5 of 14

The visual QA loop

Capture, compare, findings, gate, fix, recapture. The agent runs the evidence steps; the human decides what counts as done.

Loop diagram of an agent-driven visual QA cycle. The agent captures screenshots and accessibility checks across routes, states, and breakpoints, compares them against the stated intent and design tokens, and files prioritized findings with named evidence. A human approval gate decides what counts as done and approves the fix plan. The agent applies one scoped fix pass, recaptures with the same widths and commands, and a dashed line loops back to the compare step until no P0 or P1 finding remains. Capture, compare, findings, fix, and recapture are marked agent-run; the approval gate and the final done decision are marked human-led.
Capture across breakpoints and states, compare against intent and tokens, file findings with evidence, pass the human gate, fix in scoped passes, recapture with the same commands. Done is a human decision: no P0 or P1 remains and the open judgment calls have an owner.

Evidence comes before opinion, and approval comes before changes. Skip either order and the loop stops protecting you.

Slide notes

Walk the diagram in order and name who owns each step. Capture is agent-run: screenshots at the fixed widths, the states from the manifest, an accessibility snapshot, and automated check output, all saved to disk with a manifest. Compare is agent-run but bounded — observable differences only, judged against the baseline, the tokens, and the intent line, never against the agent's own taste. Findings is the report: every finding carries a severity, the evidence it rests on, the likely cause, and a proposed fix, with verified and candidate findings clearly separated. The approval gate is human and is where this module's title earns its keep — the human decides which findings are defects, which are taste, and what done means for this prototype. The fix pass is agent-run and scoped to one concern at a time, and recapture re-runs the same commands at the same widths so the team knows the fix did what was asked and nothing else.

Two failure modes the order prevents. Teams that skip the evidence step ask the agent to critique from code, and it obliges with confident guesses about rendering it never saw. Teams that skip the gate let the agent fix its own findings immediately, and taste decisions get made silently inside a diff.

Point at the dashed line: the loop runs until no P0 or P1 remains, and the new captures become the baseline only when the human at the gate says so.
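The ordering in the diagram can be written down as a small decision function, which is a useful check that the loop never lets a fix run ahead of the gate. This is a sketch under assumptions: the `next_step` function, the step names it returns, and the finding fields are all hypothetical.

```python
BLOCKING = {"P0", "P1"}


def next_step(findings, approved_ids):
    """Decide what the loop may do next.

    findings: list of dicts with "id", "severity", and "status"
              ("open" or "fixed").
    approved_ids: ids a human approved for fixing at the gate.
    """
    open_blocking = [f for f in findings
                     if f["status"] == "open" and f["severity"] in BLOCKING]
    if not open_blocking:
        # The agent can only report this; calling it done is a human decision.
        return ("ask-human-to-close", [])
    approved = [f["id"] for f in open_blocking if f["id"] in approved_ids]
    if not approved:
        # Approval comes before changes: nothing is fixed until the gate acts.
        return ("wait-at-gate", [f["id"] for f in open_blocking])
    # One scoped fix pass: take a single approved finding, then recapture.
    return ("fix-then-recapture", approved[:1])


loop_findings = [
    {"id": "F1", "severity": "P0", "status": "open"},
    {"id": "F2", "severity": "P2", "status": "open"},
]
print(next_step(loop_findings, approved_ids=set()))   # ('wait-at-gate', ['F1'])
print(next_step(loop_findings, approved_ids={"F1"}))  # ('fix-then-recapture', ['F1'])
```

Note that an open P2 does not hold the loop open here; it stays in the report with an owner, which matches the exit condition on the slide.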

Narration for this slide

Here is the loop the rest of this module fills in. The agent captures — screenshots across your breakpoints and states, plus accessibility output — and saves it all to disk. It compares that evidence against the intent and the tokens, reporting observable differences only. It files findings, each with a severity, the evidence it rests on, and a proposed fix. Then the loop stops at the gate: a human reads the report, decides what is a defect and what is taste, and approves a fix plan. The agent fixes one concern at a time, recaptures with the same commands, and the loop runs again until nothing serious remains. Notice the two orderings: evidence before opinion, approval before changes. Both gates are yours.

Slide 6 of 14

Agent-run sweeps: capture and compare at scale

A single-page review answers one question. A sweep answers a harder one: does the whole prototype still hold, after a change touched many files at once?

  • A capture script walks the manifest and produces baseline and current folders — stable names, stable widths
  • Playwright MCP and Chrome DevTools MCP let the agent drive a real browser: resize, navigate, screenshot, snapshot
  • Fan out one compare agent per page so each review stays grounded in its own evidence
  • Merge findings into one ranked report; repetition across pages points to a shared cause
  • Save the working sweep as a reusable workflow the team runs before every review

The human sees one merged, ranked report instead of forty separate conversations.

Slide notes

The capture layer is no longer something you invent. Microsoft's Playwright MCP server gives an agent browser navigation, viewport resizing, screenshots, and — often more usefully — a structured accessibility snapshot of the page, which answers layout-order and labelling questions at a fraction of the token cost of an image. Google's Chrome DevTools MCP server adds console messages, network requests, and performance traces for the cases where a page looks wrong for non-visual reasons, such as a font that never loaded. A small capture script in the repository wins when you want the identical evidence set every run; driving the MCP tools interactively wins when the agent needs to poke at one state. As of June 2026, both servers are the standard way agents reach a real browser.
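For the repository-script option, a capture step might look like the sketch below, written against Playwright's Python sync API. It assumes Playwright and a Chromium build are installed and that the prototype is served locally. The viewport height, the file-naming scheme, and the function names are assumptions, and the sketch only covers a route's default state, since reaching empty or error states depends on how a given prototype exposes them.

```python
from pathlib import Path

VIEWPORT_HEIGHT = 900  # assumption: the manifest only fixes widths


def capture_path(out_dir, page_id, state, width):
    """Stable file name so baseline and current captures pair up by name."""
    return Path(out_dir) / f"{page_id}--{state}--{width}.png"


def capture_route(base_url, route, page_id, widths, out_dir, state="default"):
    """Capture one route at each fixed width and return the written paths."""
    # Imported here so capture_path works without Playwright installed.
    from playwright.sync_api import sync_playwright

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for width in widths:
            page.set_viewport_size({"width": width, "height": VIEWPORT_HEIGHT})
            page.goto(base_url + route, wait_until="networkidle")
            # Wait for web fonts so the shot is not taken before they swap in.
            page.evaluate("() => document.fonts.ready.then(() => true)")
            target = capture_path(out_dir, page_id, state, width)
            # Viewport-sized shot; pass full_page=True for a full-page capture.
            page.screenshot(path=str(target), full_page=False)
            written.append(target)
        browser.close()
    return written


if __name__ == "__main__":
    # Hypothetical local dev server; widths match the fixed breakpoints.
    for path in capture_route("http://localhost:3000", "/dashboard",
                              "dashboard", [390, 768, 1440], "current"):
        print(path)
```

Running the same script into a `baseline` folder and a `current` folder gives the stable names and stable widths that a later compare step relies on.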

The sweep pattern from this school's regression-sweep workflow scales the single review by separating orchestration from judgment: a script walks the manifest, captures everything, then fans out one narrow compare agent per page. Each compare agent sees only its own pair of captures and its route's intent line, which keeps its findings grounded rather than averaged across the whole product. The orchestration merges and ranks the results, so the human reads one report.

The most useful signal a sweep produces is repetition. If eleven pages report the same loosened card padding, the cause is almost certainly one shared token or component, and one fix clears eleven findings. The merged report should group repeated findings by likely shared cause for exactly that reason.
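Grouping by shared cause can be approximated by giving each finding a signature and counting the pages that report it. The sketch below is one possible approach: the `group_repeats` function, the finding fields, and the choice of signature are assumptions, and the page names are placeholders.

```python
from collections import defaultdict


def group_repeats(findings, min_pages=2):
    """Group findings that share a signature across pages.

    Each finding is a dict with "page", "property", "before", and "after".
    The signature (property, before, after) is an assumption about what
    "the same finding" means; adjust it to match your report format.
    """
    by_signature = defaultdict(set)
    for f in findings:
        by_signature[(f["property"], f["before"], f["after"])].add(f["page"])

    systemic, isolated = [], []
    for signature, pages in by_signature.items():
        entry = {"signature": signature, "pages": sorted(pages)}
        (systemic if len(pages) >= min_pages else isolated).append(entry)
    # Most widespread first: these are the likeliest shared token or component.
    systemic.sort(key=lambda e: len(e["pages"]), reverse=True)
    return systemic, isolated


sweep_findings = [
    {"page": f"page-{n}", "property": "card padding", "before": "16px", "after": "20px"}
    for n in range(1, 12)
] + [
    {"page": "billing", "property": "price column align", "before": "right", "after": "left"},
]

systemic, isolated = group_repeats(sweep_findings)
print(len(systemic[0]["pages"]))  # 11 pages, one likely cause
print(len(isolated))              # 1
```

The grouping only says that the pages changed in the same way. Tracing that change to a specific token or component is still a separate step, which is why the report calls it a likely cause.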

Narration for this slide

Reviewing one page in chat is fine. Reviewing forty pages in chat is not, and that is what a sweep is for. The shape is simple. A capture script walks your manifest and screenshots every route, state, and width, with stable file names — Playwright MCP or Chrome DevTools MCP if you want the agent driving the browser directly. Then you fan out one compare agent per page, each looking only at its own captures and its own intent line, so its judgment stays grounded. The findings get merged into a single ranked report. And watch for repetition — if eleven pages show the same padding change, that is one token, not eleven bugs. Save the sweep once it works, and it becomes something the team runs before every review.

Slide 7 of 14

Reading findings: severity by user impact

Severity is assigned by what it costs the user, never by how easy the fix looks.

Level | Meaning | Response
P0 | Blocks the main task or fails accessibility seriously | Stop; fix before anything else ships
P1 | Breaks hierarchy, hides content, or breaks a key responsive state | Fix in this pass; recapture before closing
P2 | Weakens polish, consistency, or programmatic state | Log it, give it an owner, batch into a later pass
P3 | Subjective refinement the evidence cannot decide | Route to a human; the agent does not fix taste

The rubric gives the approval gate something to act on: P0 stops the work, P1 gets fixed now, P2 gets scheduled, P3 gets decided by a person.

Slide notes

The rubric exists to stop two failure modes: polishing details while the product task is still broken, and treating every difference as equally urgent so nothing gets prioritised at all. Assigning severity by user impact rather than fix effort is the discipline that keeps it honest — a one-line layout fix can be a P1 and a finding that needs two new assets can be a P2, and conflating effort with severity is how trivial-but-cosmetic fixes crowd out cheap-but-important ones.

Give the levels concrete faces from real reviews. A P0 is a save button that falls below an overflowing panel at 390 pixels, or a chart legend rendering white on white after a theme variable rename — both real findings from a sweep on a fourteen-page product, both invisible to functional tests. A P1 is a sticky header whose stacked navigation pushes the page title toward the fold on phones. A P2 is navigation that exposes no current-page state. A P3 is whether a display type scale earns its cost on small screens when the design system explicitly specifies it.

Reading findings also means separating real regressions from acceptable change. Some differences are the design behaving exactly as specified, and some are deliberate improvements that simply differ from the baseline. The report should mark each finding as verified or candidate, and as objective mismatch or design judgment — a report containing only confirmable defects has usually been filtered to look objective, and a report that is mostly taste has not done the comparison work.
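A merged report can be ordered mechanically once each finding carries a severity and a verified or candidate label. The following sketch assumes a simple dictionary per finding. The field names, the ids, and the page names are illustrative, and the tie-breaking order (verified before candidate, then wider reach first) is an assumption rather than a rule from the rubric.

```python
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}

# Responses taken from the rubric on this slide.
RESPONSE = {
    "P0": "stop; fix before anything else ships",
    "P1": "fix in this pass; recapture before closing",
    "P2": "log, assign an owner, batch into a later pass",
    "P3": "route to a human; the agent does not fix taste",
}


def rank_report(findings):
    """Sort by severity, then verified before candidate, then by page reach."""
    return sorted(
        findings,
        key=lambda f: (
            SEVERITY_ORDER[f["severity"]],
            0 if f["confidence"] == "verified" else 1,
            -len(f["pages"]),
        ),
    )


report = rank_report([
    {"id": "nav-current", "severity": "P2", "confidence": "verified", "pages": ["article"]},
    {"id": "header-390", "severity": "P1", "confidence": "candidate", "pages": ["article"]},
    {"id": "legend-contrast", "severity": "P0", "confidence": "verified", "pages": ["dashboard"]},
    {"id": "type-scale", "severity": "P3", "confidence": "candidate", "pages": ["article"]},
])
for f in report:
    print(f["severity"], f["id"], "->", RESPONSE[f["severity"]])
```

Fix effort does not appear anywhere in the sort key, which is the point: the ordering reflects user impact and evidence strength only.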

Narration for this slide

When the findings come back, severity is what makes them actionable. P0 means the user cannot complete the task, or there is a serious accessibility failure — that stops everything. P1 means the hierarchy or a key responsive state is broken — fix it in this pass and recapture. P2 weakens polish or consistency — log it, give it an owner, batch it. P3 is taste — the evidence cannot decide it, so a human does. Two rules keep this honest. Severity follows user impact, never fix effort. And not every difference is a regression — some changes are the design working as intended, and the report has to say which is which.

Slide 8 of 14

Findings that name their evidence

A weak finding is another vague to-do. A strong finding is fix-ready: evidence, impact, severity, scope.

Weak finding | Strong finding
The mobile header feels heavy | P1 (candidate): at 390px the stacked logo row plus nine wrapped nav links push the title toward the fold (evidence: article-390.png, site-shell.tsx)
Navigation could be more accessible | P2 (verified): no nav link sets aria-current or an active style, so the current section is not indicated (evidence: site-shell.tsx)
Many pages have spacing changes | Systemic: card padding grew from 16px to 20px on 11 pages; likely cause is the space-4 token remap
Found 212 visual differences | 2 blockers, 4 systemic findings traced to one token, 16 polish items with owners

Require every finding to name its evidence — file, viewport, axe rule. Findings that cannot are either judgments (label them) or guesses (send them back).

Slide notes

This quality bar matters most when an agent will implement the fixes. If the report says spacing feels off, the next agent may redesign the whole page; if it says the header consumes a third of the 390-pixel viewport because nine links wrap into four rows, the fix can stay narrow and the recapture can prove it worked. Precision in the finding is what keeps the fix pass scoped.

The evidence rule is the single fastest improvement you can make to report quality: every finding names the file, the viewport, and where relevant the axe rule it rests on. Findings that cannot name evidence split cleanly into two groups — design judgments, which are welcome as long as they are labelled as judgments and routed to a human, and guesses, which go back to the agent with a request to verify or drop.
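The evidence rule is simple enough to check automatically before a report reaches the gate. Here is a minimal sketch, assuming findings are dictionaries. The `triage` function and the field names `file`, `viewport`, `axe_rule`, and `kind` are hypothetical.

```python
def triage(finding):
    """Classify a finding as fix-ready, a labelled judgment, or a guess.

    A finding counts as evidenced when it names at least one of: a file,
    a viewport, or an axe rule. The field names are illustrative.
    """
    evidence = [finding.get(key) for key in ("file", "viewport", "axe_rule")]
    if any(evidence):
        return "fix-ready"
    if finding.get("kind") == "judgment":
        # Welcome, as long as it is labelled and a human decides it.
        return "route-to-human"
    # No evidence and no label: back to the agent to verify or drop.
    return "send-back"


print(triage({"summary": "Nav sets no aria-current", "file": "site-shell.tsx"}))
# fix-ready
print(triage({"summary": "Display type scale feels large on phones", "kind": "judgment"}))
# route-to-human
print(triage({"summary": "Spacing feels off"}))
# send-back
```

A stricter team could require both a file and a viewport for visual findings. The looser rule here matches the table above, where one strong finding rests on a source file alone.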

The fourth row is about the report as a whole rather than individual findings. A sweep that returns two hundred undifferentiated pixel notes has produced a diff log, not a decision artifact. The strong version leads with blockers, groups repeated findings by likely shared cause, and gives the polish items owners. The examples in this table are drawn from this school's own published review of its article page and from the regression-sweep workflow's case studies; they are real shapes, not invented ones.

Narration for this slide

Here is the difference between findings you can act on and findings that waste a review. The mobile header feels heavy — what do you do with that? Compare: at 390 pixels, the stacked logo row plus nine wrapped navigation links push the title toward the fold, evidence in this file at this width. One of those produces a scoped fix and a recapture that proves it worked; the other produces a redesign you did not ask for. The rule is simple: every finding names its evidence — which file, which viewport, which check. If it cannot, it is either a judgment, which is fine as long as it is labelled, or a guess, which goes back. And the report as a whole should lead with blockers and shared causes, not a count of pixel differences.

Slide 9 of 14

Accessibility checks ride along with the sweep

Many accessibility failures are visual and structural. The same evidence pass that catches layout problems should catch them.

  • axe-core against the same routes: contrast, labels, landmarks, heading order
  • Run it from the CLI or inside Playwright so checks hit real rendered states
  • Accessibility snapshots answer structure questions cheaper than pixels
  • Lighthouse CI and pa11y-ci can gate a branch on score regressions
  • A clean automated report is a floor, not a ceiling — keyboard and screen-reader passes stay human
Same routes, same sweep
# axe against the local route the sweep just captured
npx @axe-core/cli http://localhost:3000/dashboard

# or inside Playwright, against real rendered states:
#   const results = await new AxeBuilder({ page }).analyze()

# findings feed the same P0–P3 report as the visual review

Do not run accessibility as a separate cleanup stage after visual polish. Same routes, same sweep, same report.

Slide notes

The argument for folding accessibility into the visual sweep rather than treating it as a later stage: most of the failures automated tools can catch are visual and structural anyway — low contrast, missing focus states, unlabelled controls, colour used as the only signal, heading levels that skip. The evidence is already being captured for the visual review; running axe against the same routes costs one extra command and the findings translate directly into the same P0 to P3 format, so the gate sees one report instead of two.

Tooling, briefly and as of June 2026: axe-core is the standard rules engine, runnable from the command line with @axe-core/cli or inside a Playwright spec with @axe-core/playwright, which matters because the Playwright route can test states behind interactions, not just initial loads. Lighthouse CI wraps Lighthouse runs in assertions and budgets so a score regression can fail a branch the way a broken test does, and pa11y-ci is a lighter URL-sweep runner for CI. All of them emit JSON an agent can read and prioritise.

The caveat has to be stated as firmly as the value: automated checks catch a meaningful share of issues and prove nothing on their own. They cannot tell you whether the focus order makes sense, whether the alt text is useful rather than merely present, or whether the page works with a screen reader in practice. A clean axe report is an entry condition for review, not a result.
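Feeding axe output into the same report mostly means translating each violation into the finding shape the gate already reads. The sketch below works on an axe results object, which lists violations with an `id`, an `impact` level, a `help` string, and the affected `nodes`. The mapping from impact to P0 through P2 is an assumption offered as a starting point for the human gate to adjust, and the sample data is hand-written, not output from a real run.

```python
# Assumed mapping from axe impact levels to this module's severities.
# P3 is absent on purpose: it is reserved for taste, which axe does not report.
IMPACT_TO_SEVERITY = {
    "critical": "P0",
    "serious": "P1",
    "moderate": "P2",
    "minor": "P2",
}


def axe_to_findings(axe_results, route, viewport):
    """Turn the `violations` list of an axe results object into findings."""
    findings = []
    for violation in axe_results.get("violations", []):
        findings.append({
            "severity": IMPACT_TO_SEVERITY.get(violation.get("impact"), "P2"),
            "confidence": "verified",  # an automated rule fired on the rendered page
            "summary": violation.get("help", violation["id"]),
            "axe_rule": violation["id"],
            "route": route,
            "viewport": viewport,
            "nodes": len(violation.get("nodes", [])),
        })
    return findings


# Hand-written sample in the shape of an axe results object.
sample = {
    "violations": [
        {"id": "color-contrast", "impact": "serious",
         "help": "Elements must meet minimum color contrast ratio thresholds",
         "nodes": [{"target": [".legend"]}, {"target": [".caption"]}]},
        {"id": "image-alt", "impact": "critical",
         "help": "Images must have alternative text",
         "nodes": [{"target": ["img.hero"]}]},
    ]
}

for finding in axe_to_findings(sample, "/dashboard", 390):
    print(finding["severity"], finding["axe_rule"], finding["nodes"])
```

An empty `violations` list produces an empty findings list, which should be read as the entry condition described above and not as proof that the page is accessible.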

Narration for this slide

Accessibility is not a cleanup stage you schedule after the visuals are polished. Most of what automated tools can catch — contrast, missing labels, broken heading order, absent focus states — is visual and structural, and you are already capturing the evidence. So run axe against the same routes the sweep just screenshotted, either from the command line or inside Playwright so it sees real rendered states, and feed the results into the same prioritised report. Lighthouse CI or pa11y can gate a branch on regressions if you want that. One honest caveat: a clean automated report is a floor, not a ceiling. Keyboard walkthroughs and screen-reader passes are still human work, and no sweep replaces them.

Slide 10 of 14

Wire QA into the prototype loop, not after it

A sweep the night before the review finds problems when they are most expensive to fix. Put the loop inside the build, not at the end of it.

  • First capture as soon as the first route renders — that is baseline zero
  • Re-run the sweep after each build milestone, not on a calendar
  • Findings feed the next agent run as critique; recurring fixes go into the harness
  • Keep the packet in the repo so each sweep compares against the last, not against memory
  • Scope it to the prototype's claims — do not QA fidelity the prototype never promised

In a prototype, the QA loop is the critique step from the agentic loop, made repeatable and evidence-based.

Slide notes

The placement question is where prototype teams most often get this wrong. Visual QA borrowed from production habits gets scheduled at the end — a big sweep before the stakeholder review — which means problems are found when the prototype is largest, the deadline is closest, and every fix risks disturbing something else. Inside a prototyping loop the economics invert: captures are cheap, the agent is already iterating, and a small sweep after each milestone catches drift while it is one fix instead of forty.

Connect this to the loop the course has been building since Module 1. The QA sweep is not a new ceremony bolted onto the side; it is the critique step made repeatable and evidence-based. Findings from one sweep become the critique that steers the next agent run, and findings that keep recurring — the same spacing drift, the same forgotten focus state — are a signal that a rule belongs in the harness or the design system rather than in another round of feedback.

The last bullet keeps the loop honest about what a prototype is. Module 1 set fidelity decisions per layer: visual, data, interaction, content. The QA matrix should test the claims the prototype actually makes and not the ones it explicitly faked. Filing P2s against placeholder data the team agreed to fake wastes the gate's attention and trains people to ignore the report.
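Scoping can be applied as a filter before findings reach the report. This sketch assumes the team records its per-layer fidelity decisions somewhere machine-readable and that each finding is tagged with the layer it concerns. Both are assumptions, as are the function name and the example finding ids.

```python
def in_scope(finding, fidelity):
    """Keep a finding only if it targets a layer the prototype claims as real.

    `fidelity` maps each layer (visual, data, interaction, content) to
    "real" or "faked". Tagging findings with a "layer" is an assumption.
    Unknown layers are kept so that nothing is dropped silently.
    """
    return fidelity.get(finding["layer"], "real") == "real"


# Hypothetical fidelity decisions for one prototype.
fidelity = {"visual": "real", "interaction": "real", "data": "faked", "content": "faked"}

candidate_findings = [
    {"id": "save-below-fold", "layer": "visual"},
    {"id": "lorem-in-table", "layer": "content"},
    {"id": "round-revenue-numbers", "layer": "data"},
]

kept = [f["id"] for f in candidate_findings if in_scope(f, fidelity)]
print(kept)  # ['save-below-fold']
```

Filtered findings are better logged than deleted, so the team can revisit them if a faked layer is later promoted to real.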

Narration for this slide

When does the loop run? Not the night before the review — that is when problems are most expensive. Wire it into the build. Capture a baseline as soon as the first route renders. Re-run the sweep after each milestone, while a regression is one fix instead of forty. Feed the findings into the next agent run as critique, and when the same finding keeps coming back, that is a rule that belongs in your harness or your design system, not in another round of feedback. One scoping note: QA the claims the prototype actually makes. If you decided in module one that the data layer is faked, do not file findings against the fake data. The matrix tests the promises, not the placeholders.

Slide 11 of 14

Worked example: the sweep that caught a breakpoint regression

A 14-page SaaS prototype, swept two days before a quarterly release that bundled nine weeks of merged work.

What happened
Capture | 84 screenshots across 3 viewports from the sweep manifest; 14 compare agents ran in under twelve minutes
Findings | 31 total: 2 P0, 7 P1, 16 P2, 6 P3; both P0s had passed functional tests
The breakpoint P0 | At 390px the settings form's save button fell below an overflowing filter panel; users could not complete edits without scrolling past it
The other P0 | A chart legend rendered white on white after a theme variable rename
Resolution | Blockers and 4 systemic P1s fixed in one afternoon; recaptured; shipped on schedule; the P2s became a polish backlog with owners

Both blockers were invisible to functional tests and to every reviewer who only looked at the desktop width.

Slide notes

This trace comes from the regression-sweep workflow published on this school's site; the numbers are from that case study, not a controlled benchmark, and are worth presenting that way. The setup is the situation this module has been describing: many small changes merged over nine weeks, nothing obviously broken, and a release date close enough that a manual page-by-page review was not going to happen.

The two P0s are the instructive part. The save button below the overflowing panel at 390 pixels is a pure breakpoint regression — the desktop layout was fine, the functional tests passed because the button existed and worked, and only a capture at phone width made the problem visible. The white-on-white chart legend is the other classic shape: a theme variable rename that no individual diff made look dangerous. Neither finding required sophisticated judgment to act on once it existed; both required evidence nobody was collecting.

The resolution is as important as the detection. The team did not fix all 31 findings; they fixed the two blockers and the four P1s that traced to shared causes, recaptured the affected pages with the same commands, and shipped. The sixteen P2s became a backlog with owners instead of a vague sense that the release felt rough, and the six P3s went to a designer as decisions. That is what a working gate looks like: the sweep narrowed the decision; humans made it.
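The triage described above can be sketched as a small sorting step. The severity counts come from the slide; the finding shape, the `shared_cause` field, and the bucket names are hypothetical, and the case study does not say what happened to the three P1s without a shared cause, so they are held for the gate here.

```python
from collections import defaultdict


def triage(findings):
    """Sort findings into the buckets the worked example describes."""
    buckets = defaultdict(list)
    for finding in findings:
        severity = finding["severity"]
        if severity == "P0":
            buckets["fix_now"].append(finding)
        elif severity == "P1" and finding["shared_cause"]:
            buckets["fix_now"].append(finding)
        elif severity == "P1":
            # Assumption: the source does not say where these went.
            buckets["p1_for_gate"].append(finding)
        elif severity == "P2":
            buckets["backlog_with_owner"].append(finding)
        elif severity == "P3":
            buckets["designer_decision"].append(finding)
        else:
            raise ValueError(f"unknown severity: {severity}")
    return dict(buckets)


def make(severity, count, shared_cause=None):
    """Build `count` placeholder findings of one severity."""
    return [{"severity": severity, "shared_cause": shared_cause} for _ in range(count)]


# Counts from the slide: 2 P0, 7 P1 (4 systemic), 16 P2, 6 P3.
findings = (
    make("P0", 2)
    + make("P1", 4, shared_cause="hypothetical-shared-cause")
    + make("P1", 3)
    + make("P2", 16)
    + make("P3", 6)
)

plan = triage(findings)
print({name: len(items) for name, items in plan.items()})
```

The function only proposes buckets. Moving a finding between them, or deciding the release ships with the backlog open, stays with the person at the gate.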

Narration for this slide

Let's trace a real sweep. A fourteen-page SaaS product, two days before a quarterly release bundling nine weeks of work. The capture script produced eighty-four screenshots across three viewports, and fourteen compare agents ran in under twelve minutes. They came back with thirty-one findings, including two P0s — and both had passed functional tests. The first: at 390 pixels, the settings form's save button fell below an overflowing filter panel, so users could not finish an edit. The second: a chart legend rendering white on white after a theme variable rename. The team fixed the blockers and the systemic P1s in an afternoon, recaptured, and shipped on schedule. The point is not that the agent was clever. It is that nobody was looking at 390 pixels, and the sweep was.

Slide 12 of 14 · 16:9

What the loop cannot prove, and where it goes wrong

The sweep shows whether the build matches or differs from the baseline in observable ways. Everything past that is still yours.

  • It cannot judge pages missing from the manifest — the matrix needs the same care as the code
  • It cannot prove accessibility from screenshots alone; DOM, keyboard, and assistive-technology checks stay in the loop
  • It cannot decide brand or taste questions; it can only flag them as P3 for a human
  • Full-page captures lie sometimes: sticky headers repeated mid-scroll, fonts caught before they swapped — reconfirm with viewport-sized shots
  • Never let the agent fix findings before a human approves them; that is the whole gate

A sweep narrows the release decision. It does not make it.

Slide notes

The limits split into what the loop cannot see and what it cannot decide. It cannot see pages that are not in the manifest, states that need login or data setup nobody scripted, or anything about whether users will actually understand the flow — visual QA is not usability testing and does not substitute for it. It cannot decide whether the baseline was good in the first place, whether a P3 judgment call becomes a change, or whether the prototype ships with known P2 debt. Those are design and product decisions, and the loop's job is to put them in front of the person who owns them with evidence attached.

The tooling failure modes are worth a minute because a workflow that does not name them ends up debugging its own evidence. Full-page captures of long pages produce artifacts that look like defects: sticky headers repeated mid-scroll, lazy-loaded images that never fired, animations frozen mid-state, fonts captured before they swapped. Wait for network idle, keep the capture deterministic, and treat any finding that only appears in a long stitched capture with suspicion until a viewport-sized shot confirms it. And keep evidence on disk rather than pasted into the conversation — the token cost of inline full-page screenshots is real and it crowds out the context the agent needs for the actual review.
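The "treat it with suspicion" rule can be made mechanical. This is a minimal sketch assuming each finding lists its evidence files with a `kind` of either `full_page` or `viewport`; that schema and the file names are hypothetical, not something the sweep workflow prescribes.

```python
def split_by_confirmation(findings):
    """Separate findings backed by a viewport-sized shot from those that are not."""
    confirmed, suspect = [], []
    for finding in findings:
        kinds = {shot["kind"] for shot in finding["evidence"]}
        if "viewport" in kinds:
            confirmed.append(finding)
        else:
            # Only stitched full-page evidence (or none at all): recapture first.
            suspect.append(finding)
    return confirmed, suspect


# Hypothetical findings and evidence file names.
findings = [
    {
        "id": "header-repeated-mid-page",
        "evidence": [{"file": "pricing__dense__1440__full.png", "kind": "full_page"}],
    },
    {
        "id": "save-button-below-panel",
        "evidence": [
            {"file": "settings__dense__390__full.png", "kind": "full_page"},
            {"file": "settings__dense__390__viewport.png", "kind": "viewport"},
        ],
    },
]

confirmed, suspect = split_by_confirmation(findings)
print([f["id"] for f in confirmed])  # ['save-button-below-panel']
print([f["id"] for f in suspect])    # ['header-repeated-mid-page']
```

Suspect findings are not dismissed; they go back for a viewport-sized recapture before anyone spends review time on them.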

The organisational anti-pattern outranks all the technical ones: letting the agent apply fixes the moment it produces findings. The entire value of the loop is that a human reads the prioritized report before the UI changes.

Narration for this slide

Be clear about what this loop cannot do. It cannot check pages that are not in the manifest, so the matrix needs the same care as the code. It cannot prove accessibility from screenshots — automated checks are a floor, and keyboard and screen-reader work stays human. It cannot decide taste; it can only flag it. The tooling has its own traps: full-page captures of long pages produce ghosts — repeated sticky headers, unloaded images, fonts caught mid-swap — so reconfirm anything suspicious with a viewport-sized shot. And the big one: never let the agent fix findings before a human approves them. The sweep narrows the decision about what is done and what ships. It does not make that decision. You do.

Slide 13 of 14 · 16:9

Exercise: define the QA matrix for your prototype

Take the prototype you have been building through this course and make its visual quality checkable. Aim for a first sweep you could run this week.

  • List every route the prototype has; mark the three that matter most to its core claim
  • For each of those three, name the states worth checking: empty, error, loading, dense, plus any it specifically promises
  • Fix your breakpoints — 390, 768, 1440 unless you have a reason — and write one intent line per route
  • Decide the baseline: reference, last approved capture, or intent rules, and where it will live on disk
  • Write the severity rubric into the packet and name who sits at the approval gate

Keep the manifest. Module 6 runs the sprint, and this matrix is the QA half of its definition of done.

Slide notes

The deliverable is a manifest file and a one-page packet skeleton, not a working sweep — the capture script and the compare agents can come later, and for a small prototype the first run can even be manual. What matters is that the decisions are made and written down: which routes, which states, which widths, what the comparison is against, and who decides what counts as done.

The constraint to enforce is the focus on three routes. Most people's first instinct is to list everything, which produces a matrix nobody will ever run. Three routes crossed with four states and three widths is thirty-six captures — already a meaningful sweep and still small enough to read the results in one sitting. The matrix can grow once the loop has run a few times.
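The arithmetic above is easy to check by expanding a manifest into its capture list. A minimal sketch: the route names and intent lines are placeholders, and the file-naming scheme is one possible way to keep names stable between runs, not a required convention.

```python
from itertools import product

# Hypothetical manifest: three routes, each with its one-line intent.
ROUTES = {
    "/dashboard": "Show this week's status at a glance",
    "/settings": "Let an admin change and save account preferences",
    "/reports": "Compare periods without exporting data",
}
STATES = ["empty", "error", "loading", "dense"]
WIDTHS = [390, 768, 1440]


def capture_plan(routes, states, widths):
    """Expand routes x states x widths into one entry per screenshot."""
    plan = []
    for route, state, width in product(routes, states, widths):
        slug = route.strip("/").replace("/", "-") or "home"
        plan.append(
            {
                "route": route,
                "state": state,
                "width": width,
                # Same inputs always give the same name, so runs stay comparable.
                "file": f"{slug}__{state}__{width}.png",
            }
        )
    return plan


plan = capture_plan(ROUTES, STATES, WIDTHS)
print(len(plan))        # 36
print(plan[0]["file"])  # dashboard__empty__390.png
```

Adding a fourth route or a second theme multiplies the count, which is a quick way to see how fast an unfocused matrix becomes one nobody reads.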

The last bullet is the one participants skip, and it is the one the module's central rule depends on. Naming the person at the gate, even when it is themselves, forces the question of what done means for this prototype: which severities block the demo, which get logged, and who decides the taste calls. If the exercise is run in a group, comparing intent lines is the most useful discussion: two people writing intent lines for the same route usually discover they disagree about what the page is for, which is a finding in itself.
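Writing the rubric down as data makes "what done means" explicit. The sketch below is one possible encoding, not a prescribed one: the mapping of severities to actions is an example a team would set for its own prototype, and the `resolved` and `approved_by` fields are hypothetical.

```python
# Example rubric: which severities block the demo, which are logged,
# and which wait for a human taste decision.
RUBRIC = {"P0": "block", "P1": "block", "P2": "log", "P3": "decide"}


def gate_status(findings, rubric, approved_by=None):
    """Summarise open findings and report readiness.

    A sweep with no blockers is still not "ready" until a named
    person has approved it: the agent reports, the human decides.
    """
    open_findings = [f for f in findings if not f.get("resolved", False)]
    actions = [rubric[f["severity"]] for f in open_findings]
    blockers = actions.count("block")
    return {
        "blockers": blockers,
        "awaiting_decision": actions.count("decide"),
        "logged": actions.count("log"),
        "ready": blockers == 0 and approved_by is not None,
    }


findings = [
    {"severity": "P0", "resolved": True},
    {"severity": "P2"},
    {"severity": "P3"},
]

print(gate_status(findings, RUBRIC))
print(gate_status(findings, RUBRIC, approved_by="gate owner"))
```

The first call reports `ready` as false even though nothing blocks, because nobody has signed off; the second flips it only once an approver is named.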

Narration for this slide

Your turn. Take the prototype you have been carrying through this course and make its visual quality checkable. List every route, then pick the three that matter most to what the prototype claims to prove. For those three, name the states worth checking — empty, error, loading, dense. Fix your breakpoints and write one intent line per route: what is this page for. Decide what the baseline is and where it lives on disk. And write down who sits at the approval gate, even if it is you, because that is the person who decides what done means. Keep the manifest — module six runs the full sprint, and this matrix becomes the QA half of its definition of done.

Slide 14 of 14 · 16:9

Summary, and what comes next

  • If it is not screenshotted, it is not checked — the matrix of routes, states, and breakpoints decides what gets seen
  • Baselines live on disk and only move at the human gate; without them every review is unrepeatable
  • Sweeps capture, compare, and report with evidence; severity follows user impact, and every finding names its file and viewport
  • Accessibility checks ride along with the sweep — and remain a floor, not a ceiling
  • The loop runs inside the build, and a human decides what counts as done

Module 6 puts the whole course together: a timed prototype sprint, tested with users, closed with a handoff that says what was built, what was faked, and what is unknown.

Slide notes

Recap by connecting the bullets back to the loop diagram rather than re-listing them. The matrix decides what evidence exists; the baseline decides what the evidence is compared against; the sweep produces findings that are only as useful as the evidence they name; the rubric turns those findings into decisions; and the gate keeps the decisions human. Each piece on its own is small — a JSON file, a folder of screenshots, a severity table — and together they are what makes visual quality checkable instead of vibes.

It is worth restating the module's one cultural rule because the next module will stress it: the agent reports, the human decides. In the sprint, time pressure is exactly the condition under which teams are tempted to let the agent fix its own findings and to let a clean sweep stand in for a decision about readiness. The habits built here are what hold under that pressure.

Preview Module 6 concretely. It runs the whole course at full speed: planning a prototype sprint as a sequence of bounded agent runs with gates, a mid-sprint critique to catch drift while it is cheap, testing with users or stakeholders, and the honest handoff that distinguishes built, faked, and unknown. The QA matrix from this module's exercise becomes part of the sprint's definition of done, and the sweep report becomes one of the artifacts the handoff is built from.

Narration for this slide

Let's close the module. If it is not screenshotted, it is not checked — so the matrix of routes, states, and breakpoints decides what actually gets seen. Baselines live on disk and only move when a human promotes them. Sweeps capture, compare, and report with evidence; severity follows user impact, and findings that cannot name their evidence go back. Accessibility rides along with the sweep, as a floor, not a ceiling. And the loop runs inside the build, with a human deciding what counts as done. Module six is where it all comes together: a timed prototype sprint, from brief to a tested prototype, closed with a handoff that is honest about what was built, what was faked, and what nobody knows yet. See you there.

Module transcript
Module 5, narrated slide by slide

Slide 1 · Visual QA Loops

Welcome to module five. By now you can get a prototype built and hold one page to parity with its reference. This module is about everything else — the routes you did not stare at, the states you forgot existed, the phone width you never opened. The principle is blunt: if it is not screenshotted, it is not checked. We will define a QA matrix of routes, states, and breakpoints, set up baselines so regressions are caught against something real, run agent-driven sweeps that return findings with evidence, and keep one rule fixed throughout: the agent reports, and you decide what counts as done.

Slide 2 · Why visual QA is its own loop

Here is why visual QA needs its own loop. A prototype can compile, load, and pass every functional test and still fail the design. The hierarchy is wrong, the rhythm is broken, the empty state was invented, the phone layout collapses — and none of that shows up in a type check or a code review, because the diff looks fine. Visual bugs live in the combinations you did not look at: states crossed with widths crossed with themes. So we give the agent evidence — screenshots, accessibility snapshots, check output — and a procedure. The agent is good at the boring part: capturing the same things the same way, every single run. You stay good at the part that matters: judging what the evidence means.

Slide 3 · The QA matrix: routes × states × breakpoints

The sweep is only as good as its list, so the first artifact is a matrix: routes, crossed with states, crossed with breakpoints. Routes — every page the prototype claims to have, including the one you keep forgetting. States — empty, loading, error, dense; the ones nobody designs and agents invent. Breakpoints — pick a small set, like 390, 768, and 1440, and never vary them, because comparability is the whole point. And for every route, write one intent line: what this page is for. That sentence is the cheapest thing in the matrix, and it is what turns a pile of differences into findings someone can judge. Keep the manifest in the repo and review it like code.

Slide 4 · Baselines: what the comparison is against

A regression is only a regression relative to something, so decide what that something is before the first sweep. There are three useful baselines. The reference baseline — your design exports or the parity reference from last module — asks whether the build still matches the design. The approved baseline — captures from the last version a human signed off — asks whether anything changed that nobody intended. And the intent baseline — your design system rules and that one-line statement of what each page is for — catches failures the pixels cannot. Whichever you use, keep it on disk with a manifest, and only promote new captures to baseline at the human gate. If the baseline moves automatically, regressions just become the new normal.

Slide 5 · The visual QA loop

Here is the loop the rest of this module fills in. The agent captures — screenshots across your breakpoints and states, plus accessibility output — and saves it all to disk. It compares that evidence against the intent and the tokens, reporting observable differences only. It files findings, each with a severity, the evidence it rests on, and a proposed fix. Then the loop stops at the gate: a human reads the report, decides what is a defect and what is taste, and approves a fix plan. The agent fixes one concern at a time, recaptures with the same commands, and the loop runs again until nothing serious remains. Notice the two orderings: evidence before opinion, approval before changes. Both gates are yours.

Slide 6 · Agent-run sweeps: capture and compare at scale

Reviewing one page in chat is fine. Reviewing forty pages in chat is not, and that is what a sweep is for. The shape is simple. A capture script walks your manifest and screenshots every route, state, and width, with stable file names — Playwright MCP or Chrome DevTools MCP if you want the agent driving the browser directly. Then you fan out one compare agent per page, each looking only at its own captures and its own intent line, so its judgment stays grounded. The findings get merged into a single ranked report. And watch for repetition — if eleven pages show the same padding change, that is one token, not eleven bugs. Save the sweep once it works, and it becomes something the team runs before every review.

Slide 7 · Reading findings: severity by user impact

When the findings come back, severity is what makes them actionable. P0 means the user cannot complete the task, or there is a serious accessibility failure — that stops everything. P1 means the hierarchy or a key responsive state is broken — fix it in this pass and recapture. P2 weakens polish or consistency — log it, give it an owner, batch it. P3 is taste — the evidence cannot decide it, so a human does. Two rules keep this honest. Severity follows user impact, never fix effort. And not every difference is a regression — some changes are the design working as intended, and the report has to say which is which.

Slide 8 · Findings that name their evidence

Here is the difference between findings you can act on and findings that waste a review. The mobile header feels heavy — what do you do with that? Compare: at 390 pixels, the stacked logo row plus nine wrapped navigation links push the title toward the fold, evidence in this file at this width. One of those produces a scoped fix and a recapture that proves it worked; the other produces a redesign you did not ask for. The rule is simple: every finding names its evidence — which file, which viewport, which check. If it cannot, it is either a judgment, which is fine as long as it is labelled, or a guess, which goes back. And the report as a whole should lead with blockers and shared causes, not a count of pixel differences.

Slide 9 · Accessibility checks ride along with the sweep

Accessibility is not a cleanup stage you schedule after the visuals are polished. Most of what automated tools can catch — contrast, missing labels, broken heading order, absent focus states — is visual and structural, and you are already capturing the evidence. So run axe against the same routes the sweep just screenshotted, either from the command line or inside Playwright so it sees real rendered states, and feed the results into the same prioritised report. Lighthouse CI or pa11y can gate a branch on regressions if you want that. One honest caveat: a clean automated report is a floor, not a ceiling. Keyboard walkthroughs and screen-reader passes are still human work, and no sweep replaces them.

Slide 10 · Wire QA into the prototype loop, not after it

When does the loop run? Not the night before the review — that is when problems are most expensive. Wire it into the build. Capture a baseline as soon as the first route renders. Re-run the sweep after each milestone, while a regression is one fix instead of forty. Feed the findings into the next agent run as critique, and when the same finding keeps coming back, that is a rule that belongs in your harness or your design system, not in another round of feedback. One scoping note: QA the claims the prototype actually makes. If you decided in module one that the data layer is faked, do not file findings against the fake data. The matrix tests the promises, not the placeholders.

Slide 11 · Worked example: the sweep that caught a breakpoint regression

Let's trace a real sweep. A fourteen-page SaaS product, two days before a quarterly release bundling nine weeks of work. The capture script produced eighty-four screenshots across three viewports, and fourteen compare agents ran in under twelve minutes. They came back with thirty-one findings, including two P0s — and both had passed functional tests. The first: at 390 pixels, the settings form's save button fell below an overflowing filter panel, so users could not finish an edit. The second: a chart legend rendering white on white after a theme variable rename. The team fixed the blockers and the systemic P1s in an afternoon, recaptured, and shipped on schedule. The point is not that the agent was clever. It is that nobody was looking at 390 pixels, and the sweep was.

Slide 12 · What the loop cannot prove, and where it goes wrong

Be clear about what this loop cannot do. It cannot check pages that are not in the manifest, so the matrix needs the same care as the code. It cannot prove accessibility from screenshots — automated checks are a floor, and keyboard and screen-reader work stays human. It cannot decide taste; it can only flag it. The tooling has its own traps: full-page captures of long pages produce ghosts — repeated sticky headers, unloaded images, fonts caught mid-swap — so reconfirm anything suspicious with a viewport-sized shot. And the big one: never let the agent fix findings before a human approves them. The sweep narrows the decision about what is done and what ships. It does not make that decision. You do.

Slide 13 · Exercise: define the QA matrix for your prototype

Your turn. Take the prototype you have been carrying through this course and make its visual quality checkable. List every route, then pick the three that matter most to what the prototype claims to prove. For those three, name the states worth checking — empty, error, loading, dense. Fix your breakpoints and write one intent line per route: what is this page for. Decide what the baseline is and where it lives on disk. And write down who sits at the approval gate, even if it is you, because that is the person who decides what done means. Keep the manifest — module six runs the full sprint, and this matrix becomes the QA half of its definition of done.

Slide 14 · Summary, and what comes next

Let's close the module. If it is not screenshotted, it is not checked — so the matrix of routes, states, and breakpoints decides what actually gets seen. Baselines live on disk and only move when a human promotes them. Sweeps capture, compare, and report with evidence; severity follows user impact, and findings that cannot name their evidence go back. Accessibility rides along with the sweep, as a floor, not a ceiling. And the loop runs inside the build, with a human deciding what counts as done. Module six is where it all comes together: a timed prototype sprint, from brief to a tested prototype, closed with a handoff that is honest about what was built, what was faked, and what nobody knows yet. See you there.