Slide 1 — Visual QA Loops
Welcome to module five. By now you can get a prototype built and hold one page to parity with its reference. This module is about everything else — the routes you did not stare at, the states you forgot existed, the phone width you never opened. The principle is blunt: if it is not screenshotted, it is not checked. We will define a QA matrix of routes, states, and breakpoints, set up baselines so regressions are caught against something real, run agent-driven sweeps that return findings with evidence, and keep one rule fixed throughout: the agent reports, and you decide what counts as done.
Slide 2 — Why visual QA is its own loop
Here is why visual QA needs its own loop. A prototype can compile, load, and pass every functional test and still fail the design. The hierarchy is wrong, the rhythm is broken, the empty state was invented, the phone layout collapses — and none of that shows up in a type check or a code review, because the diff looks fine. Visual bugs live in the combinations you did not look at: states crossed with widths crossed with themes. So we give the agent evidence — screenshots, accessibility snapshots, check output — and a procedure. The agent is good at the boring part: capturing the same things the same way, every single run. You stay good at the part that matters: judging what the evidence means.
Slide 3 — The QA matrix: routes × states × breakpoints
The sweep is only as good as its list, so the first artifact is a matrix: routes, crossed with states, crossed with breakpoints. Routes — every page the prototype claims to have, including the one you keep forgetting. States — empty, loading, error, dense; the ones nobody designs and agents invent. Breakpoints — pick a small set, like 390, 768, and 1440, and never vary them, because comparability is the whole point. And for every route, write one intent line: what this page is for. That sentence is the cheapest thing in the matrix, and it is what turns a pile of differences into findings someone can judge. Keep the manifest in the repo and review it like code.
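To make the matrix concrete, here is a minimal sketch of a manifest and the capture list it expands into. The route paths, intent lines, and the file-naming scheme are hypothetical, chosen only for illustration; the breakpoints are the ones named above.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical manifest: the routes, states, and intent lines are illustrative only.
MANIFEST = {
    "breakpoints": [390, 768, 1440],
    "routes": [
        {"path": "/dashboard",
         "intent": "Show this week's key numbers at a glance.",
         "states": ["empty", "loading", "error", "dense"]},
        {"path": "/settings",
         "intent": "Let a user edit and save account details.",
         "states": ["empty", "error"]},
    ],
}


@dataclass(frozen=True)
class Capture:
    route: str
    state: str
    width: int
    intent: str

    @property
    def filename(self) -> str:
        # Stable name: the same route, state, and width always map to the same file.
        slug = self.route.strip("/").replace("/", "-") or "home"
        return f"{slug}__{self.state}__{self.width}.png"


def expand(manifest: dict) -> list[Capture]:
    """Cross every route's states with the fixed breakpoints."""
    captures = []
    for route in manifest["routes"]:
        for state, width in product(route["states"], manifest["breakpoints"]):
            captures.append(Capture(route["path"], state, width, route["intent"]))
    return captures
```

Because the file names are derived only from route, state, and width, two runs of the same manifest produce directly comparable files, which is the comparability the fixed breakpoints are there to protect.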
Slide 4 — Baselines: what the comparison is against
A regression is only a regression relative to something, so decide what that something is before the first sweep. There are three useful baselines. The reference baseline — your design exports or the parity reference from last module — asks whether the build still matches the design. The approved baseline — captures from the last version a human signed off — asks whether anything changed that nobody intended. And the intent baseline — your design system rules and that one-line statement of what each page is for — catches failures the pixels cannot. Whichever you use, keep it on disk with a manifest, and only promote new captures to baseline at the human gate. If the baseline moves automatically, regressions just become the new normal.
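One way to keep the baseline from moving on its own is to make promotion a function that refuses to run without a named approver. This is a sketch under assumptions: the function name `promote`, the directory layout, and the `manifest.json` format are all hypothetical, not a standard tool.

```python
import hashlib
import json
import shutil
from pathlib import Path


def promote(captures: Path, baseline: Path, approved_by: str) -> dict:
    """Copy captures into the baseline only when a human approver is named."""
    if not approved_by.strip():
        # The human gate: no approver, no new baseline.
        raise PermissionError("baseline promotion needs a named human approver")
    baseline.mkdir(parents=True, exist_ok=True)
    files = {}
    for shot in sorted(captures.glob("*.png")):
        shutil.copy2(shot, baseline / shot.name)
        # Record a content hash so later changes to the baseline are detectable.
        files[shot.name] = hashlib.sha256(shot.read_bytes()).hexdigest()
    manifest = {"approved_by": approved_by, "files": files}
    (baseline / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

The agent can call the capture step as often as it likes; only this step changes what future sweeps are compared against, and it records who signed off.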
Slide 5 — The visual QA loop
Here is the loop the rest of this module fills in. The agent captures — screenshots across your breakpoints and states, plus accessibility output — and saves it all to disk. It compares that evidence against the intent and the tokens, reporting observable differences only. It files findings, each with a severity, the evidence it rests on, and a proposed fix. Then the loop stops at the gate: a human reads the report, decides what is a defect and what is taste, and approves a fix plan. The agent fixes one concern at a time, recaptures with the same commands, and the loop runs again until nothing serious remains. Notice the two orderings: evidence before opinion, approval before changes. Both gates are yours.
Slide 6 — Agent-run sweeps: capture and compare at scale
Reviewing one page in chat is fine. Reviewing forty pages in chat is not, and that is what a sweep is for. The shape is simple. A capture script walks your manifest and screenshots every route, state, and width, with stable file names; if you want the agent driving the browser directly, Playwright MCP or Chrome DevTools MCP can do the capturing. Then you fan out one compare agent per page, each looking only at its own captures and its own intent line, so its judgment stays grounded. The findings get merged into a single ranked report. And watch for repetition: if eleven pages show the same padding change, that is one token, not eleven bugs. Save the sweep once it works, and it becomes something the team runs before every review.
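The repetition check can be sketched as a grouping step in the merge. The finding fields used here (`page`, `property`, `before`, `after`) and the threshold of three pages are assumptions made for illustration, not a fixed report format.

```python
from collections import defaultdict


def group_shared_causes(findings, min_pages=3):
    """Split findings into shared causes (same change on many pages) and isolated ones."""
    by_cause = defaultdict(list)
    for f in findings:
        # Two findings share a cause when the same property changed in the same way.
        by_cause[(f["property"], f["before"], f["after"])].append(f["page"])

    shared, isolated = [], []
    for cause, pages in by_cause.items():
        unique_pages = sorted(set(pages))
        if len(unique_pages) >= min_pages:
            shared.append({"cause": cause, "pages": unique_pages})
        else:
            isolated.extend({"cause": cause, "page": p} for p in unique_pages)
    return shared, isolated
```

Fed eleven findings that all say padding went from 24px to 16px, this returns one shared cause listing eleven pages, which is the shape the report should lead with.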
Slide 7 — Reading findings: severity by user impact
When the findings come back, severity is what makes them actionable. P0 means the user cannot complete the task, or there is a serious accessibility failure — that stops everything. P1 means the hierarchy or a key responsive state is broken — fix it in this pass and recapture. P2 weakens polish or consistency — log it, give it an owner, batch it. P3 is taste — the evidence cannot decide it, so a human does. Two rules keep this honest. Severity follows user impact, never fix effort. And not every difference is a regression — some changes are the design working as intended, and the report has to say which is which.
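As a small sketch, the four levels can be encoded as a sort order so the merged report always leads with blockers. The dictionary shape of a finding and the helper names are hypothetical.

```python
# Lower number sorts first, so P0 blockers lead the report.
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def rank(findings):
    """Order findings by severity; ties keep their original order (sorted is stable)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f["severity"]])


def has_blocker(findings):
    """True when any P0 is present, which stops everything else."""
    return any(f["severity"] == "P0" for f in findings)
```

Note that nothing in the sort key refers to fix effort. If a field like that exists on a finding, it is for planning, not for ranking.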
Slide 8 — Findings that name their evidence
Here is the difference between findings you can act on and findings that waste a review. "The mobile header feels heavy": what do you do with that? Compare: "At 390 pixels, the stacked logo row plus nine wrapped navigation links push the title toward the fold; evidence in this file, at this width." One of those produces a scoped fix and a recapture that proves it worked; the other produces a redesign you did not ask for. The rule is simple: every finding names its evidence, meaning which file, which viewport, which check. If it cannot, it is either a judgment, which is fine as long as it is labelled, or a guess, which goes back. And the report as a whole should lead with blockers and shared causes, not a count of pixel differences.
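The evidence rule is simple enough to check mechanically before a human ever reads the report. A minimal sketch, assuming each finding is a dictionary with an `evidence` mapping and an optional `labelled_judgment` flag; both field names are hypothetical.

```python
# The three things every finding must name: which file, which viewport, which check.
REQUIRED_EVIDENCE = ("file", "viewport", "check")


def triage(finding):
    """Classify a finding as evidence-backed, a labelled judgment, or a guess to send back."""
    evidence = finding.get("evidence", {})
    if all(evidence.get(key) for key in REQUIRED_EVIDENCE):
        return "finding"
    if finding.get("labelled_judgment"):
        return "judgment"
    return "send back"
```

A filter like this does not judge whether the evidence is persuasive. It only guarantees that every item reaching the gate either points at something on disk or admits that it is an opinion.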
Slide 9 — Accessibility checks ride along with the sweep
Accessibility is not a cleanup stage you schedule after the visuals are polished. Most of what automated tools can catch — contrast, missing labels, broken heading order, absent focus states — is visual and structural, and you are already capturing the evidence. So run axe against the same routes the sweep just screenshotted, either from the command line or inside Playwright so it sees real rendered states, and feed the results into the same prioritised report. Lighthouse CI or pa11y can gate a branch on regressions if you want that. One honest caveat: a clean automated report is a floor, not a ceiling. Keyboard walkthroughs and screen-reader passes are still human work, and no sweep replaces them.
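Feeding axe output into the same report can look roughly like this. The sketch assumes the JSON results shape axe-core produces (a `violations` list whose entries carry `id`, `impact`, `help`, and `nodes` with `target` selectors). The mapping from axe impact levels to P0 through P2 is an assumption for illustration and should be tuned by whoever sits at the gate.

```python
# Assumed mapping from axe impact levels to this module's severities.
IMPACT_TO_SEVERITY = {
    "critical": "P0",
    "serious": "P0",
    "moderate": "P1",
    "minor": "P2",
}


def axe_to_findings(axe_results, route, viewport):
    """Convert one axe results object into findings that name their evidence."""
    findings = []
    for violation in axe_results.get("violations", []):
        findings.append({
            # Unknown or missing impact falls back to P2 so it is logged, not dropped.
            "severity": IMPACT_TO_SEVERITY.get(violation.get("impact"), "P2"),
            "summary": violation.get("help", violation["id"]),
            "evidence": {
                "check": f"axe:{violation['id']}",
                "route": route,
                "viewport": viewport,
                "nodes": [node["target"] for node in violation.get("nodes", [])],
            },
        })
    return findings
```

An empty list from this function means the automated floor was met for that route and width, nothing more. The keyboard and screen-reader passes still have to happen.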
Slide 10 — Wire QA into the prototype loop, not after it
When does the loop run? Not the night before the review — that is when problems are most expensive. Wire it into the build. Capture a baseline as soon as the first route renders. Re-run the sweep after each milestone, while a regression is one fix instead of forty. Feed the findings into the next agent run as critique, and when the same finding keeps coming back, that is a rule that belongs in your harness or your design system, not in another round of feedback. One scoping note: QA the claims the prototype actually makes. If you decided in module one that the data layer is faked, do not file findings against the fake data. The matrix tests the promises, not the placeholders.
Slide 11 — Worked example: the sweep that caught a breakpoint regression
Let's trace a real sweep. A fourteen-page SaaS product, two days before a quarterly release bundling nine weeks of work. The capture script produced eighty-four screenshots across three viewports, and fourteen compare agents ran in under twelve minutes. They came back with thirty-one findings, including two P0s — and both had passed functional tests. The first: at 390 pixels, the settings form's save button fell below an overflowing filter panel, so users could not finish an edit. The second: a chart legend rendering white on white after a theme variable rename. The team fixed the blockers and the systemic P1s in an afternoon, recaptured, and shipped on schedule. The point is not that the agent was clever. It is that nobody was looking at 390 pixels, and the sweep was.
Slide 12 — What the loop cannot prove, and where it goes wrong
Be clear about what this loop cannot do. It cannot check pages that are not in the manifest, so the matrix needs the same care as the code. It cannot prove accessibility from screenshots — automated checks are a floor, and keyboard and screen-reader work stays human. It cannot decide taste; it can only flag it. The tooling has its own traps: full-page captures of long pages produce ghosts — repeated sticky headers, unloaded images, fonts caught mid-swap — so reconfirm anything suspicious with a viewport-sized shot. And the big one: never let the agent fix findings before a human approves them. The sweep narrows the decision about what is done and what ships. It does not make that decision. You do.
Slide 13 — Exercise: define the QA matrix for your prototype
Your turn. Take the prototype you have been carrying through this course and make its visual quality checkable. List every route, then pick the three that matter most to what the prototype claims to prove. For those three, name the states worth checking — empty, error, loading, dense. Fix your breakpoints and write one intent line per route: what is this page for. Decide what the baseline is and where it lives on disk. And write down who sits at the approval gate, even if it is you, because that is the person who decides what done means. Keep the manifest — module six runs the full sprint, and this matrix becomes the QA half of its definition of done.
Slide 14 — Summary, and what comes next
Let's close the module. If it is not screenshotted, it is not checked — so the matrix of routes, states, and breakpoints decides what actually gets seen. Baselines live on disk and only move when a human promotes them. Sweeps capture, compare, and report with evidence; severity follows user impact, and findings that cannot name their evidence go back. Accessibility rides along with the sweep, as a floor, not a ceiling. And the loop runs inside the build, with a human deciding what counts as done. Module six is where it all comes together: a timed prototype sprint, from brief to a tested prototype, closed with a handoff that is honest about what was built, what was faked, and what nobody knows yet. See you there.