Slide 1 — Visual QA and Regression Evidence
Welcome to Module 3. The first two modules were about reviewing work as it is made — critique loops, heuristic evaluations, walkthroughs. This module is about a different question: has anything that was already approved quietly changed? Most teams answer that from memory. Someone opens a few pages before a release, decides it looks fine, and ships. Memory does not store spacing, breakpoints, or empty states. In this module we replace that with evidence: screenshot baselines, agent-run sweeps that compare every page against them, and a report that tells you what regressed, what changed on purpose, and what is just noise. Let's start with why memory fails.
Slide 2 — Why memory fails as a regression process
Here is why memory fails. Visual regressions are systemic — a token changes and eleven pages get looser padding at once. They are visual, not functional — every test passes, the product works, it just stops matching the design. And they are silent — the cookie banner that covered the primary button at phone width was missed by two days of manual checks and caught by a sweep. Memory-based review is the wrong shape for this: one person, a few pages, desktop width, whenever there is time. The fix is not more diligence. It is evidence the team can compare against, captured the same way every time — which is exactly the kind of repetitive work an agent does well.
Slide 3 — Baselines: choosing surfaces, states, and breakpoints
A baseline is the approved look of your product, stored as files you can compare against. Building one means making three choices. Which surfaces — start with the routes where a regression actually hurts: core tasks, anything involving money, the screens that lean hardest on the design system. Which states — empty, loading, error, expanded — because those are the ones nobody re-checks by hand. And which breakpoints — pick a small set, like 390, 768, and 1440 pixels, and never change them, because comparability is the point. Add a one-line intent for each page, keep the list in the repo, and review it like code. If a surface is not in the set, the sweep cannot protect it.
Slide 4 — The sweep manifest, in practice
Here is what a baseline manifest actually looks like: a small JSON file in the repo. Viewports are declared once at the top, then one entry per page: an id, the route, the states that matter, and a single line of intent. That intent line is doing the most work. An intent such as "Pricing figures must stay aligned" tells the compare agent which differences matter on this page; without it, every difference is just pixels and the report becomes a log nobody reads. Be honest about states: some need seeded data or flags to reach. Start with the cheap ones and grow the manifest as the sweep earns its keep.
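To make the shape concrete, here is a minimal sketch of such a manifest with a small loader that rejects entries the sweep could not use. The field names follow the narration (viewports, id, route, states, intent); the specific pages, routes, and intent wording are hypothetical, and the loader's checks are one reasonable set of assumptions rather than a fixed schema.

```python
import json

# Hypothetical manifest: ids, routes, and intent lines are illustrative only.
MANIFEST_JSON = """
{
  "viewports": [390, 768, 1440],
  "pages": [
    {
      "id": "pricing",
      "route": "/pricing",
      "states": ["default"],
      "intent": "Pricing figures must stay aligned."
    },
    {
      "id": "inbox",
      "route": "/inbox",
      "states": ["default", "empty"],
      "intent": "The queue stays above the summaries at every width."
    }
  ]
}
"""

REQUIRED_KEYS = {"id", "route", "states", "intent"}


def load_manifest(text: str) -> dict:
    """Parse the manifest and fail early on entries the sweep cannot use."""
    manifest = json.loads(text)
    if not manifest.get("viewports"):
        raise ValueError("manifest needs at least one viewport width")
    seen_ids = set()
    for page in manifest.get("pages", []):
        missing = REQUIRED_KEYS - page.keys()
        if missing:
            raise ValueError(f"{page.get('id', '?')}: missing {sorted(missing)}")
        if not page["states"]:
            raise ValueError(f"{page['id']}: list at least one state")
        if not page["intent"].strip():
            raise ValueError(f"{page['id']}: intent line is empty")
        if page["id"] in seen_ids:
            raise ValueError(f"duplicate page id: {page['id']}")
        seen_ids.add(page["id"])
    return manifest


manifest = load_manifest(MANIFEST_JSON)
print(len(manifest["pages"]), "pages at widths", manifest["viewports"])
```

Treating a missing intent line as an error, not a warning, is a deliberate choice in this sketch: the narration's point is that a page without intent gives the compare agent nothing to judge against.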
Slide 5 — Capture sweeps: how the agent walks the product
Capture is the boring part, and it should stay boring. A short Playwright script walks the manifest: set the viewport, load the route, wait for the network to go idle, save a full-page screenshot with a predictable name. Run it twice — once against the last approved release to get the baseline, once against the branch under review. The discipline is in keeping everything stable: same widths, same names, same wait rules, every run. Disable animations and watch out for lazy images and font swaps, because they fake regressions. And keep the screenshots on disk — pasting full-page captures into a chat burns enormous token budgets and buys you nothing.
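The capture walk described above can be sketched in a few lines of Python with Playwright's sync API. This is an illustration under stated assumptions, not the course's actual script: the page list, the file-naming scheme, the viewport height, and the example URLs are all hypothetical, and per-state setup (seeded data, flags) is left out. Playwright is imported inside the function so the naming plan can be inspected without it installed.

```python
from pathlib import Path

VIEWPORTS = [390, 768, 1440]      # fixed widths; never change between runs
VIEWPORT_HEIGHT = 900             # assumed height; full-page capture extends past it
PAGES = [                         # hypothetical entries for illustration
    {"id": "pricing", "route": "/pricing"},
    {"id": "inbox", "route": "/inbox"},
]


def shot_path(out_dir: str, page_id: str, width: int) -> Path:
    """Predictable name, so baseline and current files pair up by filename."""
    return Path(out_dir) / f"{page_id}__{width}.png"


def capture_plan(base_url: str, out_dir: str) -> list:
    """Every (url, width, path) the sweep will capture, in a stable order."""
    root = base_url.rstrip("/")
    return [
        (root + page["route"], width, shot_path(out_dir, page["id"], width))
        for page in PAGES
        for width in VIEWPORTS
    ]


def run_capture(base_url: str, out_dir: str) -> None:
    """Walk the plan and save full-page screenshots to disk."""
    from playwright.sync_api import sync_playwright  # needs `pip install playwright`

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for url, width, path in capture_plan(base_url, out_dir):
            page.set_viewport_size({"width": width, "height": VIEWPORT_HEIGHT})
            page.goto(url, wait_until="networkidle")   # same wait rule every run
            page.screenshot(path=str(path), full_page=True, animations="disabled")
        browser.close()


# Hypothetical usage: once per side of the comparison.
# run_capture("https://approved.example.test", "shots/baseline")
# run_capture("https://branch.example.test", "shots/current")
```

Passing `animations="disabled"` to the screenshot call handles CSS animations and transitions; lazy images and font swaps still need their own handling, for example waiting for specific elements before the capture.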
Slide 6 — Pixel diffs, structural diffs, and agent comparison
There are three ways to compare current against baseline, and they catch different things. Pixel diffs overlay the images and flag any pixel that changed. They are brutally effective for spacing and colour drift, but they also flag every anti-aliasing quirk, and teams drown in false positives. Structural diffs compare the accessibility tree (headings, labels, order). They are cheap and precise, but blind to purely visual change. The agent layer reads both captures plus the page's intent line and reports differences in design language, for example: "the queue now sits below the summaries on mobile, which the intent says must not happen." Use the mechanical diffs to detect change cheaply, and spend the agent's attention on saying whether it matters.
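The two mechanical diffs are simple enough to sketch without any imaging library. The version below is a toy under stated assumptions: screenshots are stood in for by small grids of grey values, the accessibility tree is flattened to an ordered list of "role:name" strings, and the tolerance of 8 is an arbitrary illustrative threshold. A real sweep would decode PNG files and use a proper tree snapshot.

```python
def pixel_diff_ratio(baseline, current, tolerance=8):
    """Share of pixels whose grey value moved by more than `tolerance`."""
    same_shape = len(baseline) == len(current) and all(
        len(a) == len(b) for a, b in zip(baseline, current)
    )
    if not same_shape:
        raise ValueError("captures must have identical dimensions")
    total = sum(len(row) for row in baseline)
    if total == 0:
        raise ValueError("captures are empty")
    changed = sum(
        1
        for row_a, row_b in zip(baseline, current)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return changed / total


def structural_diff(baseline_outline, current_outline):
    """Compare ordered outlines: what was removed, added, or reordered."""
    base_set, cur_set = set(baseline_outline), set(current_outline)
    shared_in_base_order = [n for n in baseline_outline if n in cur_set]
    shared_in_cur_order = [n for n in current_outline if n in base_set]
    return {
        "removed": [n for n in baseline_outline if n not in cur_set],
        "added": [n for n in current_outline if n not in base_set],
        "reordered": shared_in_base_order != shared_in_cur_order,
    }


# One pixel shifts by 3 (anti-aliasing-sized, ignored); one shifts by 100 (counted).
base_px = [[255, 255, 255, 255], [0, 0, 0, 0]]
curr_px = [[255, 252, 255, 255], [0, 0, 100, 0]]
print(pixel_diff_ratio(base_px, curr_px))  # 0.125

# Same nodes, different order: invisible to a set comparison, caught by order.
base_tree = ["heading:Inbox", "region:Queue", "region:Summaries"]
curr_tree = ["heading:Inbox", "region:Summaries", "region:Queue"]
print(structural_diff(base_tree, curr_tree))
```

The tolerance is the knob that trades false positives against missed drift. Neither function says whether a change matters; that judgement is what the intent line and the agent layer are for.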
Slide 7 — The regression evidence board
Here is the whole sweep on one board. On the left, the agent captures baseline and current screenshots — same routes, same states, same three widths, every run. Then one compare agent per page reports observable differences, each finding naming its evidence file. Third column: every difference gets classified — regression, intended change, or noise. And on the right, the human review meeting, which only ever sees what deserves a decision: ranked blockers, systemic causes grouped to one likely fix, baseline updates to approve, and judgment calls. Notice what the board really is — a filter. Hundreds of raw differences go in on the left. A short list of decisions, with evidence attached, comes out on the right. Noise never reaches the room.
Slide 8 — Reading the report: regression, intended change, or noise
Every difference the sweep finds lands in one of three buckets. Regressions — the build no longer matches the approved intent — get fixed and then recaptured to prove the fix. Intended changes — the design moved on purpose — get approved, and the baseline gets updated. And noise — animation timing, a lazy image, a font swap — means you fix the capture, never the UI. Within the regressions, the familiar severity rubric ranks the work: P0 blocks the release, P1 gets fixed now, P2 is logged, P3 goes to a human. And watch for repetition. The same finding on eleven pages almost always means one shared token, and one fix clears eleven findings.
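Bucketing, ranking, and grouping by repetition can be sketched as one small function. Everything here is an illustrative assumption: the `Finding` fields, the idea that each finding carries a suspected cause label, and the sample data. The sketch only shows how noise and intended changes drop out while regressions collapse to one row per likely fix.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    page: str
    bucket: str    # "regression", "intended", or "noise"
    severity: str  # "P0" to "P3"; only used for regressions
    cause: str     # suspected shared cause (hypothetical label)
    evidence: str  # screenshot file backing the finding


def review_list(findings):
    """Group regressions by suspected cause and rank the groups for review."""
    groups = defaultdict(list)
    for f in findings:
        if f.bucket == "regression":   # noise and intended changes are handled elsewhere
            groups[f.cause].append(f)
    rows = []
    for cause, members in groups.items():
        rows.append({
            "cause": cause,
            "severity": min(m.severity for m in members),  # "P0" sorts before "P1"
            "pages": sorted({m.page for m in members}),
            "evidence": sorted(m.evidence for m in members),
        })
    # Worst severity first, then the causes touching the most pages.
    rows.sort(key=lambda r: (r["severity"], -len(r["pages"]), r["cause"]))
    return rows


findings = [
    Finding("billing", "regression", "P1", "token:space-4", "shots/current/billing__390.png"),
    Finding("inbox", "regression", "P1", "token:space-4", "shots/current/inbox__390.png"),
    Finding("settings", "regression", "P0", "panel-overflow", "shots/current/settings__390.png"),
    Finding("inbox", "noise", "P3", "font-swap", "shots/current/inbox__768.png"),
    Finding("pricing", "intended", "P2", "new-plan-card", "shots/current/pricing__1440.png"),
]

for row in review_list(findings):
    print(row["severity"], row["cause"], row["pages"])
```

With this sample data, five raw findings become two rows: the blocker on its own, and one shared spacing token covering two pages.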
Slide 9 — Keeping baselines current without rubber-stamping drift
Baselines age, and how you update them decides whether the practice keeps working. Never update them and every sweep drowns in known differences until nobody reads the report. Update them automatically and you have ratified drift — whatever shipped becomes correct by definition. The discipline is simple: a baseline update is an approval, made by a person, per page or per finding, with a note about which design decision it reflects. Never bulk-approve. And read the update history as a signal — a page that needs its baseline updated every single sweep is not regressing, it is still being designed. Move it back into the critique loop until it settles.
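The approval rule and the "still being designed" signal can both be sketched as checks over an update log. The record fields and function names are hypothetical; the sketch assumes each approved baseline update is logged with the sweep number in which it happened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineUpdate:
    page: str
    sweep: int      # sweep number in which the update was approved
    approver: str   # a named person, never an automated job
    note: str       # which design decision the new baseline reflects


def validate_update(update: BaselineUpdate) -> BaselineUpdate:
    """Refuse updates that are not a real, explained approval."""
    if not update.approver.strip():
        raise ValueError(f"{update.page}: baseline update needs a named approver")
    if not update.note.strip():
        raise ValueError(f"{update.page}: baseline update needs a design note")
    return update


def still_in_design(history, recent_sweeps):
    """Pages whose baseline was updated in every one of the given sweeps."""
    wanted = set(recent_sweeps)
    sweeps_by_page = {}
    for update in history:
        sweeps_by_page.setdefault(update.page, set()).add(update.sweep)
    return sorted(page for page, sweeps in sweeps_by_page.items() if wanted <= sweeps)


history = [
    BaselineUpdate("dashboard", 1, "Ana", "New summary cards approved in design review"),
    BaselineUpdate("dashboard", 2, "Ana", "Card order changed after usability test"),
    BaselineUpdate("dashboard", 3, "Ben", "Chart moved above the table"),
    BaselineUpdate("pricing", 2, "Ben", "Annual plan toggle added"),
]

print(still_in_design(history, [1, 2, 3]))  # ['dashboard']
```

Because each record names a single page, bulk approval has no representation in this log at all, which is one way to make the "never bulk-approve" rule structural instead of a matter of habit.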
Slide 10 — Where visual QA sits in the release rhythm
So when do you run this? Tie the sweep to the moments where visual risk concentrates. Before a release that bundles weeks of merged work. After a design-system or framework upgrade. After a theme or template change. And on a recurring schedule, because drift accumulates even when nothing big ships. The one place not to use it is a single new feature page — there is no baseline to compare against, and what new work needs is the critique and evaluation methods from the first two modules. Once the manifest and the script exist, a full sweep costs thirty to sixty minutes, mostly agent time. The real question is not whether you can afford to run it, but which triggers earn it.
Slide 11 — Worked example: a regression caught two days before release
Let's trace one real run. A fourteen-page SaaS product, two days before a quarterly release bundling nine weeks of work. The capture script produced 84 screenshots, fourteen compare agents ran in under twelve minutes, and the merged report held thirty-one findings, including two release blockers: a save button pushed below an overflowing panel at phone width, and a chart legend rendering white on white after a theme variable rename. Both had passed every functional test. The team fixed the blockers and the four systemic P1s in one afternoon, recaptured to prove it, and shipped on schedule. The sixteen polish findings became a backlog with owners, not a vague sense that the release felt rough.
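The screenshot count is worth a quick check, because it shows how fast a sweep grows. The narration does not break the 84 down; the arithmetic below is an inference that happens to match: fourteen pages, the three fixed widths, one state per page, captured twice (baseline and current).

```python
pages = 14
widths = 3           # e.g. 390, 768, 1440
states_per_page = 1  # assumed; the narration does not say how many states were captured
runs = 2             # baseline and current

screenshots = pages * widths * states_per_page * runs
print(screenshots)  # 84
```

Adding a second state to every page would double that figure, which is one reason to start with the cheap states and grow the manifest gradually.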
Slide 12 — Exercise: define the baseline set for one product area
Time to apply this. Pick one product area you know well and write its sweep manifest on paper. Five to eight routes — only the ones where a regression would genuinely hurt. For each, one line of intent: what the page is for and what must not break. The states that matter, marking which ones need setup to reach. Your fixed viewports. Then test yourself: name one recent change this set would have caught, and one it would have missed. That second answer keeps you honest about the limits. Keep the page — in Module 5 this manifest becomes part of the review that runs on every pull request.
Slide 13 — Summary, and what comes next
Let's close the module. Memory-based visual review cannot keep up with regressions that are systemic and silent, so we replace it with evidence: a baseline manifest of the surfaces, states, and breakpoints that matter, captured the same way every run. The sweep compares current against baseline, one agent per page, and classifies every difference — regression, intended change, or noise. Severity ranks the regressions, repetition points to shared causes, and updating the baseline stays a human decision. Remember the boundary: the sweep protects approved work; new work still needs critique. In Module 4 we take on the two reviews teams defer longest — accessibility and content — and make them routine passes with the same discipline. See you there.