Agentic Design School
Module 3 of 5
40–50 minutes

Design Review and Critique with Agents

Visual QA and Regression Evidence

Replacing "it looked fine last time I checked" with evidence: screenshot baselines, agent-run sweeps across states and breakpoints, and regression reports a designer can act on in minutes rather than re-checking everything by hand.

Duration: 40–50 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Establish screenshot baselines for the surfaces that matter.
  • Run regression sweeps that compare against the baseline with diffs as evidence.
  • Distinguish genuine regressions from intended change without manual re-review of everything.
Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Visual QA and Regression Evidence

Design Review and Critique with Agents · Module 3 of 5

  • Why visual review needs evidence, not memory
  • Baselines: the surfaces, states, and breakpoints that matter
  • Agent-run capture sweeps and what the diffs actually show
  • Reading the report: regression, intended change, or noise
  • Keeping baselines current without rubber-stamping drift

By the end of this module, the question "did anything visually break in this release?" has an answer backed by files, not by whoever looked last.

Slide notes

Modules 1 and 2 covered review of individual artifacts and screens: critique loops with named dimensions, and heuristic evaluations and walkthroughs at scale. This module changes the question. It is no longer "is this screen good?" but "has anything that was already approved quietly changed?" That is a different kind of review, and the only honest way to answer it is comparison against stored evidence.

Name the failure mode this module replaces: "it looked fine last time I checked." Most teams' visual regression process is one designer's memory of how the product looked a few weeks ago, applied unevenly to whichever pages someone happens to open before a release. Memory does not store spacing values, breakpoint behaviour, or empty states, and it degrades exactly when the team is busiest.

Set the boundary early as well: this is a review workflow, not a licence for the agent to change the UI. The agent captures, compares, and classifies; a human reads the ranked report and approves what gets fixed and which baselines get updated. That gate structure is the same one the previous two modules established, applied to a new kind of evidence.

Narration for this slide

Welcome to Module 3. The first two modules were about reviewing work as it is made — critique loops, heuristic evaluations, walkthroughs. This module is about a different question: has anything that was already approved quietly changed? Most teams answer that from memory. Someone opens a few pages before a release, decides it looks fine, and ships. Memory does not store spacing, breakpoints, or empty states. In this module we replace that with evidence: screenshot baselines, agent-run sweeps that compare every page against them, and a report that tells you what regressed, what changed on purpose, and what is just noise. Let's start with why memory fails.

Slide 2 of 13 · 16:9

Why memory fails as a regression process

Visual regressions rarely arrive one page at a time, and they rarely arrive where anyone is looking.

  • A spacing token changes and every card in the product gets two pixels looser
  • A theme update lands and three long-form pages quietly lose their heading rhythm
  • The layout that holds at 1440px collapses at 390px — and nobody opened the phone view
  • Functional tests pass: the product works, it just no longer matches the design
  • The person who would notice was on leave, busy, or looking at a different page

Regressions are systemic, visual, and silent. Memory-based review is individual, occasional, and biased towards the pages someone happens to open.

Slide notes

The examples on this slide are drawn from real sweep case studies in the school's regression-sweep workflow: a design-system version bump that loosened card padding on eleven of eighteen pages, a CMS theme update that swapped a heading font fallback and broke rhythm on three long-form pages, and a cookie banner that overlapped the primary call to action at 390 pixels on every page — found by a sweep after two days of manual checking had missed it. None of these broke a functional test, because nothing stopped working; the product simply stopped matching the design.

The deeper point is about the shape of the problem versus the shape of the usual response. Regressions caused by shared tokens, components, or templates are systemic — they appear on many pages at once, often subtly. Memory-based review is the opposite shape: one person, a handful of pages, at desktop width, on whatever day they had time. The mismatch is structural, not a matter of diligence.

This is also why the work suits agents. Capturing the same routes at the same widths, comparing them against the same baseline, and writing up the differences is repetitive, evidence-heavy work that humans do badly on the fourth iteration and an agent does identically on the fortieth. The judgment about what the differences mean stays human — that comes later in the module.

Narration for this slide

Here is why memory fails. Visual regressions are systemic — a token changes and eleven pages get looser padding at once. They are visual, not functional — every test passes, the product works, it just stops matching the design. And they are silent — the cookie banner that covered the primary button at phone width was missed by two days of manual checks and caught by a sweep. Memory-based review is the wrong shape for this: one person, a few pages, desktop width, whenever there is time. The fix is not more diligence. It is evidence the team can compare against, captured the same way every time — which is exactly the kind of repetitive work an agent does well.

Slide 3 of 13 · 16:9

Baselines: choosing surfaces, states, and breakpoints

A baseline is the approved look of the product, stored as files. The choice of what to baseline is design judgment, not tooling.

  • Surfaces: the routes where regressions would hurt — key tasks, revenue pages, the design system showcase
  • States: empty, loading, error, selected, expanded — the ones nobody re-checks by hand
  • Breakpoints: fix a small set and never vary them — 390, 768, and 1440px cover most products
  • Each entry carries a one-line intent: what this page is for and what must not break
  • The baseline set lives in the repo and is reviewed like code

If a surface is not in the baseline set, the sweep cannot protect it. The manifest deserves the same care as the product.

Slide notes

The baseline is two things at once: a folder of approved screenshots, and the manifest that says which routes, states, and viewports those screenshots represent. The screenshots come from the last approved release or, early on, from design exports. The manifest is the part that needs design judgment, because it encodes a priority call: which surfaces matter enough to protect.

Walk the three dimensions. Surfaces should start with the pages where a regression has real cost — the core task flows, anything involving money or sign-up, and the screens that exercise the design system most heavily. States are where manual review is weakest: nobody re-checks the empty state, the error state, or the expanded panel before every release, which is exactly why they belong in the baseline. Breakpoints must be fixed and reused; this school's own tooling uses 390, 768, and 1440 pixels, and the specific numbers matter less than never changing them, because comparability across runs is the entire point.

The intent line per page is the cheapest and most neglected item. An intent such as "Dense triage view; the queue must stay above summaries on mobile" gives the compare agent something to judge differences against. Without it, every difference is just pixels. And keep the manifest in the repository so additions and removals go through review, because a sweep cannot catch a regression on a page nobody listed.

Narration for this slide

A baseline is the approved look of your product, stored as files you can compare against. Building one means making three choices. Which surfaces — start with the routes where a regression actually hurts: core tasks, anything involving money, the screens that lean hardest on the design system. Which states — empty, loading, error, expanded — because those are the ones nobody re-checks by hand. And which breakpoints — pick a small set, like 390, 768, and 1440 pixels, and never change them, because comparability is the point. Add a one-line intent for each page, keep the list in the repo, and review it like code. If a surface is not in the set, the sweep cannot protect it.

Slide 4 of 13 · 16:9

The sweep manifest, in practice

The manifest is a small JSON file the capture script and the compare agents both read. This excerpt is from the school's regression-sweep workflow.

sweep-manifest.json (excerpt)
{
  "viewports": [
    { "name": "mobile",  "width": 390,  "height": 900 },
    { "name": "desktop", "width": 1440, "height": 1000 }
  ],
  "pages": [
    {
      "id": "dashboard",
      "route": "/dashboard",
      "intent": "Dense triage view; queue must stay above summaries on mobile.",
      "states": ["default", "empty", "error"]
    },
    {
      "id": "billing",
      "route": "/settings/billing",
      "intent": "Plan comparison and invoice history; pricing figures must stay aligned.",
      "states": ["default", "past-due"]
    }
  ]
}

Route, states, viewports, and one line of intent per page — that is the whole contract between the designer and the sweep.

Slide notes

Keep the discussion of this file practical. Each page entry has four parts: an id used in file names, the route, the states that matter, and the intent line. The viewports are declared once at the top so every page is captured at the same widths. The capture script loops over pages and viewports; the compare agents receive the captures plus the intent line for their page and nothing else.

The intent line deserves the most attention because it is what turns a diff into a finding. "Pricing figures must stay aligned" tells the compare agent that a one-pixel shift in the invoice table matters more on this page than it would on a marketing page. Without intent, the agent either reports everything, in which case the report becomes a diff log nobody reads, or applies its own generic judgment, which is exactly what Module 1 warned against.

States need honesty about cost. Some states are reachable from a URL; others need setup — a seeded account in a past-due state, a feature flag, a mocked error. Start with the states you can reach cheaply and add the expensive ones as the sweep proves its value. A manifest that names a state nobody can actually capture just produces a coverage gap that looks covered, which is worse than an honest omission.
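One way to keep a listed state from becoming a gap that looks covered is to compare what the manifest promises against what the capture run actually wrote to disk. The sketch below assumes a hypothetical file-naming convention (page id, state, viewport name, joined with double hyphens); the manifest fields match the excerpt above, while the function names are illustrative only.

```python
from itertools import product


def expected_captures(manifest: dict) -> set[str]:
    """Every file name the manifest promises: one per page, state, and viewport."""
    names = set()
    for page in manifest["pages"]:
        for state, viewport in product(page["states"], manifest["viewports"]):
            # Assumed naming convention: <page id>--<state>--<viewport name>.png
            names.add(f"{page['id']}--{state}--{viewport['name']}.png")
    return names


def coverage_gaps(manifest: dict, captured: set[str]) -> list[str]:
    """File names the manifest lists but the capture run never produced."""
    return sorted(expected_captures(manifest) - captured)


manifest = {
    "viewports": [
        {"name": "mobile", "width": 390, "height": 900},
        {"name": "desktop", "width": 1440, "height": 1000},
    ],
    "pages": [
        {
            "id": "billing",
            "route": "/settings/billing",
            "intent": "Plan comparison and invoice history; pricing figures must stay aligned.",
            "states": ["default", "past-due"],
        }
    ],
}

# Suppose only the default state was reachable in this run.
captured = {"billing--default--mobile.png", "billing--default--desktop.png"}
gaps = coverage_gaps(manifest, captured)
print(gaps)  # the two past-due captures that are listed but missing
```

A non-empty result is a prompt to either build the setup that reaches the state or remove the state from the manifest until it can be captured.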

Narration for this slide

Here is what a baseline manifest actually looks like: a small JSON file in the repo. Viewports declared once at the top, then one entry per page: an id, the route, the states that matter, and a single line of intent. That intent line is doing the most work. "Pricing figures must stay aligned" tells the compare agent which differences matter on this page; without it, every difference is just pixels and the report becomes a log nobody reads. Be honest about states: some need seeded data or flags to reach. Start with the cheap ones and grow the manifest as the sweep earns its keep.

Slide 5 of 13 · 16:9

Capture sweeps: how the agent walks the product

Capture is scripted, not interactive. The agent runs the same script over the manifest, against the baseline build and the branch under review.

  • A small Playwright script loops over pages and viewports and saves full-page PNGs plus a manifest of what was captured
  • Two folders: baseline from the last approved release, current from the branch under review
  • Stable widths, stable file names, stable wait rules — comparability across runs is the point
  • Wait for network idle and disable animations; lazy images and font swaps fake regressions
  • Keep captures on disk and pass file paths — full-page screenshots pasted into chat burn enormous token budgets

The capture script is design infrastructure. When it is stable, every sweep compares like with like; when it drifts, the findings stop meaning anything.

Slide notes

The capture layer is deliberately boring: a short script that launches Chromium through Playwright, walks the manifest, sets each viewport, navigates to each route, waits for network idle, and saves a full-page PNG with the page id and viewport name in the file name. This school keeps such a script in its own repository and uses it to produce the captures referenced in the visual QA article — the pattern is not hypothetical. The agent can also drive a browser interactively over Playwright MCP or Chrome DevTools MCP, which is useful for poking at one state, but the scripted path wins for sweeps because it produces an identical evidence set every run.

Determinism is the recurring battle. Animations frozen mid-state, lazy-loaded images that never fired, sticky headers repeated through a long full-page stitch, and fonts captured before they swapped all produce differences that look like regressions and are not. Wait for network idle, disable animations where the build allows it, and treat any finding that only appears in a long full-page capture with suspicion until a viewport-sized capture confirms it.
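The capture loop described in these notes might look roughly like the sketch below. It is not the school's own script: the function names, the base URL, the output folder, and the file-naming scheme are assumptions, and it captures only the default state of each route, since other states need product-specific setup. It uses Playwright for Python, which must be installed separately, and imports it lazily so the planning step can run without it.

```python
from pathlib import Path


def capture_plan(manifest: dict, base_url: str, out_dir: str) -> list[dict]:
    """One job per page and viewport, each with a stable, predictable file name."""
    jobs = []
    for page in manifest["pages"]:
        for vp in manifest["viewports"]:
            jobs.append({
                "url": base_url.rstrip("/") + page["route"],
                "viewport": {"width": vp["width"], "height": vp["height"]},
                # Assumed naming convention: <page id>--<viewport name>.png
                "path": (Path(out_dir) / f"{page['id']}--{vp['name']}.png").as_posix(),
            })
    return jobs


def run_capture(jobs: list[dict]) -> None:
    """Capture every job with Chromium. Requires the `playwright` package and its browsers."""
    from playwright.sync_api import sync_playwright  # lazy import: only needed here

    with sync_playwright() as p:
        browser = p.chromium.launch()
        for job in jobs:
            page = browser.new_page(viewport=job["viewport"])
            # Wait for network idle so lazy images and late requests settle.
            page.goto(job["url"], wait_until="networkidle")
            Path(job["path"]).parent.mkdir(parents=True, exist_ok=True)
            # Full-page PNG with CSS animations and transitions disabled.
            page.screenshot(path=job["path"], full_page=True, animations="disabled")
            page.close()
        browser.close()


manifest = {
    "viewports": [
        {"name": "mobile", "width": 390, "height": 900},
        {"name": "desktop", "width": 1440, "height": 1000},
    ],
    "pages": [
        {"id": "dashboard", "route": "/dashboard"},
        {"id": "billing", "route": "/settings/billing"},
    ],
}

# Hypothetical local build and output folder.
jobs = capture_plan(manifest, "http://localhost:3000/", "captures/current")
# run_capture(jobs)  # uncomment once Playwright and its browsers are installed
```

Running the same plan against the last approved release with a different output folder, for example `captures/baseline`, would give the second half of each pair under identical names.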

The token caution is practical and current as of mid-2026: practitioner reports have measured a single full-page screenshot returned into a conversation as base64 consuming a six-figure token count. Save captures to disk, pass paths, and let each compare agent read only its own pair of files. That is also what makes the fan-out pattern on the next slides affordable.
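To make the pass-paths-not-pixels advice concrete, the sketch below builds the small task record a compare agent could receive: the page id, the viewport, the intent line, and two file paths. The field names and folder layout are assumptions. The helper `base64_chars` shows only the standard base64 size arithmetic (four characters for every three bytes); how many tokens that becomes depends on the model and is not estimated here.

```python
import json
from math import ceil


def compare_task(page: dict, viewport: str, baseline_dir: str, current_dir: str) -> dict:
    """The whole input for one compare agent: intent plus two paths, no image bytes."""
    name = f"{page['id']}--{viewport}.png"  # assumed naming convention
    return {
        "page": page["id"],
        "viewport": viewport,
        "intent": page["intent"],
        "baseline": f"{baseline_dir}/{name}",
        "current": f"{current_dir}/{name}",
    }


def base64_chars(n_bytes: int) -> int:
    """Length of the padded base64 text for a file of n_bytes."""
    return 4 * ceil(n_bytes / 3)


page = {
    "id": "billing",
    "intent": "Plan comparison and invoice history; pricing figures must stay aligned.",
}
task = compare_task(page, "mobile", "captures/baseline", "captures/current")

print(len(json.dumps(task)))    # size of the task record, in characters
print(base64_chars(3_000_000))  # 4000000 characters for a hypothetical 3 MB PNG
```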

Narration for this slide

Capture is the boring part, and it should stay boring. A short Playwright script walks the manifest: set the viewport, load the route, wait for the network to go idle, save a full-page screenshot with a predictable name. Run it twice — once against the last approved release to get the baseline, once against the branch under review. The discipline is in keeping everything stable: same widths, same names, same wait rules, every run. Disable animations and watch out for lazy images and font swaps, because they fake regressions. And keep the screenshots on disk — pasting full-page captures into a chat burns enormous token budgets and buys you nothing.

Slide 6 of 13 · 16:9

Pixel diffs, structural diffs, and agent comparison

Three ways to compare the current build against the baseline. Mature sweeps use more than one, because each catches what the others miss.

Aspect | Pixel diff | Structural / accessibility-tree diff | Agent reading both captures
What it compares | Rendered pixels, image against image | Headings, landmarks, labels, element order | The two images plus the page's intent line
Catches well | Spacing, colour, and layout shifts, even one pixel | Reordered content, missing labels, lost states | Hierarchy and density changes that matter to the user
Misses or over-reports | Flags every anti-aliasing and font-render change | Blind to purely visual drift like colour and spacing | Can miss tiny shifts a pixel diff would flag
Cost per page | Cheap, fully mechanical | Cheap, fully mechanical | An agent call per page; still minutes per sweep

Mechanical diffs detect that something changed. The agent's job is to say what changed, whether it matters, and against which intent.

Slide notes

Pixel diffing is the oldest tool in this space and it remains useful precisely because it is ruthless: overlay two images and any differing pixel is flagged. Its weakness is the same ruthlessness — anti-aliasing, font rendering differences between environments, and one-pixel sub-layout shifts all light up the diff, and teams that rely on pixel diffs alone spend their time approving false positives until they stop looking at the output. Structural comparison works on the accessibility tree or DOM snapshot instead of pixels: it sees that a heading level changed, a label disappeared, or content reordered, and it is far cheaper in tokens than images, but it is blind to anything purely visual.
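The two mechanical comparisons can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not a production differ: images are represented as nested lists of RGB tuples (a real sweep would decode the PNGs with an imaging library), and the accessibility tree is assumed to be flattened into hypothetical `role:name` strings. The `tolerance` parameter is one common way to quieten anti-aliasing noise.

```python
from difflib import SequenceMatcher

Pixel = tuple[int, int, int]


def pixel_diff(a: list[list[Pixel]], b: list[list[Pixel]], tolerance: int = 0) -> dict:
    """Count pixels whose largest per-channel difference exceeds `tolerance`."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        # Different dimensions: report it rather than guessing at an alignment.
        return {"same_size": False, "changed": None, "ratio": None}
    changed = sum(
        1
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
        if max(abs(ca - cb) for ca, cb in zip(pa, pb)) > tolerance
    )
    total = sum(len(row) for row in a)
    return {"same_size": True, "changed": changed, "ratio": changed / total if total else 0.0}


def structural_diff(before: list[str], after: list[str]) -> list[tuple]:
    """Report inserted, deleted, or replaced nodes between two flattened outlines."""
    ops = SequenceMatcher(a=before, b=after, autojunk=False).get_opcodes()
    return [(tag, before[i1:i2], after[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


white = (255, 255, 255)
baseline = [[white, white], [white, white]]
current = [[white, (254, 255, 255)], [white, (0, 0, 0)]]  # one render wobble, one real change

print(pixel_diff(baseline, current)["changed"])               # 2
print(pixel_diff(baseline, current, tolerance=2)["changed"])  # 1

outline_before = ["heading:Invoices", "label:Card number", "button:Save"]
outline_after = ["heading:Invoices", "button:Save"]
print(structural_diff(outline_before, outline_after))
# [('delete', ['label:Card number'], [])]
```

The pixel result says only that something changed and how much; the structural result says a label disappeared. Neither says whether it matters, which is the gap the agent layer is meant to fill.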

The agent comparison layer sits on top of both. A compare agent that reads the baseline capture, the current capture, and the page's intent line can report differences in design language ("the queue now sits below the summary cards on mobile, which contradicts the stated intent") rather than as coordinates of changed pixels. That is what makes the report actionable for a designer.

Be honest about the trade-off. The agent layer costs an inference call per page, and it can miss small shifts a pixel diff catches mechanically. The practical pattern is to let the cheap mechanical diffs run on everything as a first filter, and spend the agent's attention on the pages where something changed or where the intent is most sensitive. As of June 2026, dedicated visual regression services exist that handle the pixel layer well; the agent layer is what they generally lack.
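The first-filter pattern in the paragraph above can be expressed as a small selection step. In this sketch the changed-pixel ratios, the `sensitive` set of page ids, and the threshold are all hypothetical inputs; the manifest in this module does not carry a sensitivity flag, so that part is an assumption about how a team might mark intent-critical pages.

```python
def pages_for_agent(
    diff_ratios: dict[str, float],
    sensitive: set[str],
    threshold: float = 0.0,
) -> list[str]:
    """Pages that earn an agent call: mechanically changed, or intent-sensitive."""
    return sorted(
        page
        for page, ratio in diff_ratios.items()
        if ratio > threshold or page in sensitive
    )


# Hypothetical output of the mechanical pass: share of pixels that changed per page.
diff_ratios = {"dashboard": 0.02, "billing": 0.0, "help": 0.0}
selected = pages_for_agent(diff_ratios, sensitive={"billing"})
print(selected)  # ['billing', 'dashboard']
```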

Narration for this slide

There are three ways to compare current against baseline, and they catch different things. Pixel diffs overlay the images and flag any pixel that changed — brutally effective for spacing and colour drift, but they also flag every anti-aliasing quirk, and teams drown in false positives. Structural diffs compare the accessibility tree — headings, labels, order — cheap and precise, but blind to purely visual change. The agent layer reads both captures plus the page's intent line and reports differences in design language: the queue now sits below the summaries on mobile, which the intent says must not happen. Use the mechanical diffs to detect change cheaply, and spend the agent's attention on saying whether it matters.

Slide 7 of 13 · 16:9

The regression evidence board

The whole sweep on one board: capture, per-page comparison, classification, and the human review that only sees what deserves a decision.

Four-column board. Column one: the agent captures baseline and current screenshots for the same routes, states, and widths — 1440, 768, and 390 pixels — shown as paired thumbnails. Column two: one compare agent per page reports observable differences only, each with page, viewport, and evidence file. Column three: every diff is classified as a regression to fix, an intended change that updates the baseline, or capture noise that fixes the capture rather than the UI. Column four: the human review meeting receives ranked blockers with evidence attached, systemic causes grouped to one likely fix, baseline updates to approve, and judgment calls — noise never reaches the room.
Capture, per-page comparison, and classification are agent-run; the review meeting is human-led. Every finding that reaches the meeting carries its evidence file, and noise never reaches the room.

The board is a filter. Hundreds of raw differences go in on the left; a short list of decisions comes out on the right.

Slide notes

Walk the board left to right and keep returning to who does what. Column one is capture: the same script, the same routes and states from the manifest, the same three widths, producing a baseline set and a current set. Column two is the fan-out: one compare agent per page, reading only its own pair of captures and its intent line, reporting observable differences with the evidence file named on every finding. Keeping each agent narrow keeps it honest — an agent looking at one page does not average its judgment across the whole product, and forty pages of raw findings never flood one context window.

Column three is the classification this module turns on: every difference is proposed as a regression, an intended change, or noise, with evidence attached. The agent proposes the classification; it does not get to act on it. Column four is the human review — and the design goal of the entire pipeline is visible in what reaches that column: ranked blockers with their evidence, systemic findings grouped to one likely cause, baseline updates awaiting approval, and judgment calls. Noise never reaches the room.
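The hand-off from column three to column four can be sketched as a record type plus a filter. The type and field names below are illustrative assumptions, not the school's schema; the rules they encode come from the board: every finding names its evidence file, the agent's bucket is only a proposal, and noise is routed to capture fixes instead of the meeting.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    REGRESSION = "regression"
    INTENDED_CHANGE = "intended change"
    NOISE = "noise"


@dataclass(frozen=True)
class Finding:
    page: str
    viewport: str
    summary: str
    evidence: str     # path to the capture pair or diff image
    proposed: Bucket  # the agent's proposal; a human confirms it


def split_for_review(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Return (items for the review meeting, capture fixes). Noise never reaches the room."""
    for f in findings:
        if not f.evidence:
            raise ValueError(f"finding on {f.page}/{f.viewport} has no evidence file")
    meeting = [f for f in findings if f.proposed is not Bucket.NOISE]
    capture_fixes = [f for f in findings if f.proposed is Bucket.NOISE]
    return meeting, capture_fixes


findings = [
    Finding("dashboard", "mobile", "Queue sits below summary cards",
            "diffs/dashboard--mobile.png", Bucket.REGRESSION),
    Finding("billing", "desktop", "New plan card layout",
            "diffs/billing--desktop.png", Bucket.INTENDED_CHANGE),
    Finding("dashboard", "desktop", "Hero image not loaded at capture time",
            "diffs/dashboard--desktop.png", Bucket.NOISE),
]
meeting, capture_fixes = split_for_review(findings)
print(len(meeting), len(capture_fixes))  # 2 1
```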

If participants take one thing from the diagram, it should be the filtering. The value of the sweep is not that it finds more differences than a human would — it is that the human only spends attention on the differences that need a decision, each one arriving with its evidence already attached.

Narration for this slide

Here is the whole sweep on one board. On the left, the agent captures baseline and current screenshots — same routes, same states, same three widths, every run. Then one compare agent per page reports observable differences, each finding naming its evidence file. Third column: every difference gets classified — regression, intended change, or noise. And on the right, the human review meeting, which only ever sees what deserves a decision: ranked blockers, systemic causes grouped to one likely fix, baseline updates to approve, and judgment calls. Notice what the board really is — a filter. Hundreds of raw differences go in on the left. A short list of decisions, with evidence attached, comes out on the right. Noise never reaches the room.

Slide 8 of 13 · 16:9

Reading the report: regression, intended change, or noise

Every difference the sweep finds lands in one of three buckets, and each bucket gets a different response.

  • Regression — the build no longer matches approved intent: fix it, then recapture to prove the fix
  • Intended change — the design moved on purpose: approve it and update the baseline
  • Noise — animation timing, lazy images, font swaps: fix the capture, never the UI
  • Repetition is the strongest signal: the same finding on eleven pages usually means one shared token or component
  • The agent proposes the classification with evidence; the human confirms it

Severity still ranks the regressions — P0 blocks the release, P1 gets fixed in this pass, P2 is logged, P3 goes to a human for judgment.

Slide notes

This classification is what separates a regression sweep from a diff dump. A raw comparison tool reports that 212 things changed; the report a designer can act on says which of those are regressions to fix, which are the design moving forward and therefore baseline updates, and which are artifacts of the capture itself. Each bucket has a different response, and mixing them up has a different cost in each direction: treating an intended change as a regression wastes a fix cycle reverting good work, treating a regression as intended quietly ratchets the product away from its design, and chasing noise erodes the team's trust in the whole report.

Within the regression bucket, the same P0 to P3 severity rubric from the earlier modules applies, assigned by user impact rather than by how easy the fix looks. P0 blocks the release — a save action below the fold at phone width. P1 changes hierarchy or breaks a key responsive state. P2 weakens polish or consistency and gets logged and batched. P3 is a judgment call and goes to a human, never to an automatic fix.

The most valuable signal in a merged report is repetition. When eleven pages report the same loosened card padding, the cause is almost certainly one shared token or component, and one fix clears eleven findings. A good report groups repeated findings by likely shared cause rather than listing them per page — that is the difference between a report that reads in five minutes and one that reads like a log.
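Grouping by likely shared cause can be approximated by merging findings that share a signature and then ranking the groups. In this sketch the `signature` field is a hypothetical normalised description that the compare agents would need to emit consistently; the P0 to P3 ordering follows the rubric above, and ties are broken by how many pages a group touches.

```python
from collections import defaultdict

SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def group_by_cause(findings: list[dict]) -> list[dict]:
    """Merge findings with the same signature; rank by worst severity, then by reach."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for f in findings:
        groups[f["signature"]].append(f)

    merged = []
    for signature, items in groups.items():
        worst = min((f["severity"] for f in items), key=SEVERITY_ORDER.__getitem__)
        merged.append({
            "signature": signature,
            "severity": worst,
            "pages": sorted({f["page"] for f in items}),
        })
    merged.sort(key=lambda g: (SEVERITY_ORDER[g["severity"]], -len(g["pages"]), g["signature"]))
    return merged


# Eleven pages reporting the same loosened padding, plus one blocker on a single page.
findings = [
    {"page": f"page-{n:02d}", "severity": "P2", "signature": "card padding looser"}
    for n in range(1, 12)
]
findings.append({"page": "settings", "severity": "P0", "signature": "save button below fold at 390px"})

report = group_by_cause(findings)
for group in report:
    print(group["severity"], group["signature"], len(group["pages"]))
```

Twelve raw findings collapse to two report lines: the blocker first, then one padding entry covering eleven pages.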

Narration for this slide

Every difference the sweep finds lands in one of three buckets. Regressions — the build no longer matches the approved intent — get fixed and then recaptured to prove the fix. Intended changes — the design moved on purpose — get approved, and the baseline gets updated. And noise — animation timing, a lazy image, a font swap — means you fix the capture, never the UI. Within the regressions, the familiar severity rubric ranks the work: P0 blocks the release, P1 gets fixed now, P2 is logged, P3 goes to a human. And watch for repetition. The same finding on eleven pages almost always means one shared token, and one fix clears eleven findings.

Slide 9 of 13 · 16:9

Keeping baselines current without rubber-stamping drift

A baseline that never updates buries real findings under known differences. A baseline that updates automatically defines whatever shipped as correct.

  • Baseline updates are deliberate approvals, not a side effect of the sweep passing
  • The person who approves the update is accountable for the new approved look
  • Update per page or per finding, never wholesale — bulk approval is how drift gets ratified
  • Record the why next to the update: which decision or design change it reflects
  • Stale entries are a smell: if a page's baseline keeps needing updates, its design is not actually settled

The baseline is a record of decisions. Updating it is itself a decision, made by a person, one finding at a time.

Slide notes

This is the governance problem every visual regression practice eventually hits, regardless of tooling. If the baseline never updates, every sweep re-reports the same known differences, the report gets longer and less read, and the one genuine regression hides in the pile. If the baseline updates automatically whenever a sweep completes — or whenever a build merges — the tool stops measuring drift and starts ratifying it: whatever shipped becomes, by definition, correct.

The resolution is to treat baseline updates exactly like the intended-change bucket from the previous slide: each one is an explicit approval, made by someone accountable for the product's look, ideally with a one-line note about which design decision it reflects. Per-page or per-finding approval matters because a bulk "approve all" is where discipline goes to die: it feels efficient, and it converts every unnoticed regression into the new standard in one click.

There is also a useful diagnostic in the update history. A page whose baseline needs updating every sweep is telling you its design is still in motion; that page belongs in the critique loop from Module 1, not in the regression set, until it settles. And when an intended change is systemic — a new spacing scale, a typography refresh — update the affected baselines in one reviewed batch tied to that decision, rather than letting them trickle through as individual approvals over weeks.
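Both ideas, approval as a recorded decision and update history as a diagnostic, fit in a small sketch. The record fields, the sweep identifiers, and the approver name are hypothetical; the rules encoded are the ones from this slide: one page per update, a named approver, a stated reason, and a flag for pages updated in every sweep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineUpdate:
    sweep: str     # sweep identifier, such as a date or release tag
    page: str      # one manifest page id, never a wildcard
    viewport: str
    approver: str  # the person accountable for the new approved look
    reason: str    # which design decision the update reflects


def check_update(update: BaselineUpdate, known_pages: set[str]) -> None:
    """Reject updates that are wholesale, anonymous, or unexplained."""
    if update.page not in known_pages:
        raise ValueError(f"'{update.page}' is not a single manifest page")
    if not update.approver.strip():
        raise ValueError("a baseline update needs a named approver")
    if not update.reason.strip():
        raise ValueError("record why the baseline changed")


def unsettled_pages(history: list[BaselineUpdate], sweeps: list[str]) -> list[str]:
    """Pages whose baseline was updated in every one of the listed sweeps."""
    updated_in: dict[str, set[str]] = {}
    for u in history:
        updated_in.setdefault(u.page, set()).add(u.sweep)
    return sorted(p for p, seen in updated_in.items() if sweeps and set(sweeps) <= seen)


history = [
    BaselineUpdate("s1", "dashboard", "mobile", "dana", "New queue density"),
    BaselineUpdate("s2", "dashboard", "mobile", "dana", "Queue header redesign"),
    BaselineUpdate("s2", "billing", "desktop", "dana", "Plan card refresh"),
    BaselineUpdate("s3", "dashboard", "desktop", "dana", "Summary cards reordered"),
]
print(unsettled_pages(history, ["s1", "s2", "s3"]))  # ['dashboard']
```

A page that shows up in that list is a candidate for moving back into the critique loop until its design settles.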

Narration for this slide

Baselines age, and how you update them decides whether the practice keeps working. Never update them and every sweep drowns in known differences until nobody reads the report. Update them automatically and you have ratified drift — whatever shipped becomes correct by definition. The discipline is simple: a baseline update is an approval, made by a person, per page or per finding, with a note about which design decision it reflects. Never bulk-approve. And read the update history as a signal — a page that needs its baseline updated every single sweep is not regressing, it is still being designed. Move it back into the critique loop until it settles.

Slide 10 of 13 · 16:9

Where visual QA sits in the release rhythm

The sweep is not a daily ritual and not a quarterly heroic effort. Tie it to the moments where visual risk actually concentrates.

  • Before any release that bundles many merged changes — the classic two-days-before-release sweep
  • After a design-system, component-library, or CSS-framework upgrade
  • After a CMS theme or shared-template change on a content site
  • On a schedule as a recurring design health check, even when nothing big shipped
  • Not for a single new feature page — new pages have no baseline; use the single-page review from Module 1 instead

A sweep takes thirty to sixty minutes of mostly agent time. The cost question is not whether to run it, but which triggers earn it.

Slide notes

The sweep trades depth per page for coverage across the product, so it earns its keep when the change is broad and the risk is diffuse: release branches that bundle weeks of merged work, design-system version bumps, framework migrations, theme updates, and refactors that touch shared layout components. Those are the moments when a single underlying change can move dozens of pages at once and no individual code review sees the whole effect.

The inverse matters just as much. A single new feature page does not need a sweep — it has no baseline yet, and what it needs is the deeper single-screen review from Module 1: critique against named dimensions, with the heuristic and walkthrough methods from Module 2 where the flow is task-critical. The sweep protects what is already approved; the earlier methods judge what is new. Teams that confuse the two either run sweeps that report nothing useful on new work, or skip critique because the sweep passed.

On cadence: a full sweep is in the range of thirty to sixty minutes, most of it agent time, once the manifest and capture script exist. That is cheap enough to run before every release and on a recurring schedule as a health check. The recurring run matters because drift accumulates between releases too — dependency bumps, content changes, and small fixes all move pixels nobody is watching. Where a per-pull-request check fits into this picture is the subject of Module 5.

Narration for this slide

So when do you run this? Tie the sweep to the moments where visual risk concentrates. Before a release that bundles weeks of merged work. After a design-system or framework upgrade. After a theme or template change. And on a recurring schedule, because drift accumulates even when nothing big ships. The one place not to use it is a single new feature page — there is no baseline to compare against, and what new work needs is the critique and evaluation methods from the first two modules. Once the manifest and the script exist, a full sweep costs thirty to sixty minutes, mostly agent time. The real question is not whether you can afford to run it, but which triggers earn it.

Slide 11 of 13 · 16:9

Worked example: a regression caught two days before release

A 14-page SaaS product, two days before a quarterly release bundling nine weeks of merged work. From the school's regression-sweep workflow.

Stage | What happened
Capture | 84 screenshots across three viewports from the manifest; 14 compare agents ran in under twelve minutes
Findings | 31 in total: 2 P0, 7 P1, 16 P2, 6 P3; every one of them had passed functional tests
The P0s | Save button below an overflowing panel at 390px; chart legend rendering white on white after a theme variable rename
Systemic | Four P1s traced to one shared cause and cleared together
Outcome | Blockers and systemic fixes done in one afternoon, recapture confirmed, shipped on schedule; the P2s became a polish backlog with owners

Both P0s were invisible to functional tests and to memory-based review. The evidence found them; the humans decided what to do about them.

Slide notes

Walk the table as a story rather than a set of numbers, and be clear about the source: this case study comes from the school's published regression-sweep workflow, describing a product team's run, not a controlled benchmark. The setting is the worst case for memory-based review — nine weeks of merged work, fourteen pages, two days of runway — exactly the situation where someone clicks through a few pages, sees nothing alarming at desktop width, and signs off.

The two P0s repay attention because of why they were invisible. The save button below an overflowing panel at 390 pixels is a responsive-state failure: it only exists at phone width, on one settings page, in one panel state. The white-on-white chart legend came from a theme variable rename — the chart still rendered, the data was still correct, every functional test passed, and the legend was simply unreadable. Neither is the kind of thing a code reviewer sees in a diff.

The outcome is the part to emphasise for sceptics. The team did not slip the release. Blockers and the four systemic P1s were fixed in one afternoon, the affected pages were recaptured to prove the fixes, and the release shipped on schedule. The sixteen P2s did not get fixed that week — they became a polish backlog with named owners, which is a far better fate than a vague feeling that the release looked rough. The sweep's output was a set of decisions, not just a longer to-do list.
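The triage in this example (rank by severity, then look for repetition that points to one shared cause) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the school's tooling: the `Finding` fields, the page ids, the evidence paths, and the `spacing-token` cause label are all hypothetical, and the sample list is a small made-up subset, not the 31 findings from the case study.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    page: str                    # hypothetical page id from the manifest
    severity: str                # "P0" (blocks release) through "P3"
    summary: str
    evidence: str                # path to the screenshot that shows it
    cause: Optional[str] = None  # suspected shared cause, if one was identified


# Illustrative sample only: names, paths, and causes are assumptions.
FINDINGS = [
    Finding("profile", "P2", "Avatar sits lower than baseline",
            "current/profile__default__768.png"),
    Finding("dashboard", "P1", "Card padding looser than baseline",
            "current/dashboard__default__1440.png", cause="spacing-token"),
    Finding("settings", "P0", "Save button below overflowing panel at 390px",
            "current/settings__default__390.png"),
    Finding("inbox", "P1", "Row padding looser than baseline",
            "current/inbox__default__1440.png", cause="spacing-token"),
    Finding("reports", "P0", "Chart legend renders white on white",
            "current/reports__default__1440.png", cause="theme-variable-rename"),
]


def ranked(findings):
    """Order findings so release blockers come first ("P0" sorts before "P1")."""
    return sorted(findings, key=lambda f: f.severity)


def severity_counts(findings):
    """Count findings per severity level for the report header."""
    return Counter(f.severity for f in findings)


def shared_causes(findings, minimum=2):
    """Group findings by suspected cause; keep causes seen at least `minimum` times."""
    groups = defaultdict(list)
    for f in findings:
        if f.cause is not None:
            groups[f.cause].append(f)
    return {cause: fs for cause, fs in groups.items() if len(fs) >= minimum}


if __name__ == "__main__":
    for f in ranked(FINDINGS):
        print(f.severity, f.page, f.summary, f.evidence)
    for cause, fs in shared_causes(FINDINGS).items():
        print(f"{cause}: one likely fix clears {len(fs)} findings")
```

Grouping by cause is what turns a long list into a short one: findings that share a cause are reviewed as a single decision, while every finding still carries its own evidence file.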

Narration for this slide

Let's trace one real run. A fourteen-page SaaS product, two days before a quarterly release bundling nine weeks of work. The capture script produced 84 screenshots, fourteen compare agents ran in under twelve minutes, and the merged report held thirty-one findings — including two release blockers. A save button pushed below an overflowing panel at phone width, and a chart legend rendering white on white after a theme variable rename. Both had passed every functional test. The team fixed the blockers and the four systemic P1s in one afternoon, recaptured to prove it, and shipped on schedule. The sixteen polish findings became a backlog with owners — not a vague sense that the release felt rough.

Slide 12 of 13 · 16:9

Exercise: define the baseline set for one product area

Pick one product area you know well and write its sweep manifest on paper. Do not capture anything yet — the judgment is in the choosing.

  • List five to eight routes where a visual regression would genuinely hurt, and skip the ones that would not
  • For each route, write the one-line intent: what it is for and what must not break
  • Name the states that matter per route — and mark which ones need setup to reach
  • Fix your viewports: the standard three, plus any your audience demands
  • Note one recent change that this baseline set would have caught, and one it would have missed

Keep the page. It becomes the manifest you run in the Module 5 exercise, when design review moves onto every pull request.

Slide notes

The exercise is deliberately paper-only, and the constraint to five to eight routes is the point: an unconstrained list grows until it covers everything and protects nothing in particular, because nobody maintains it. Choosing what to leave out is the design judgment the sweep cannot make for you.

The intent lines are where most participants slow down, and that slowdown is the lesson. Writing "dense triage view; the queue must stay above summaries on mobile" forces you to articulate what the page is actually for. That is the same articulation a critique brief needs, and the reason this exercise feeds forward rather than being busywork. The state column usually surfaces an honest gap: the states that matter most, like error and past-due, are often the hardest to reach without seeded data, and a manifest should record that cost rather than pretend it away.

The last bullet is the self-test. Asking which recent change this set would have caught makes the value concrete; asking which it would have missed keeps the limits honest — a sweep cannot protect a page that is not listed, cannot judge a brand-new page with no baseline, and cannot prove accessibility from screenshots alone. If running this live, compare manifests in pairs: two people covering the same product area almost never choose the same routes, and the differences are a fast way to surface what each person believes matters most.
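Once the paper version exists, it translates almost directly into data. The sketch below is one possible shape, assuming a Python representation rather than the JSON file the module describes. The page ids, routes, and the `needs_setup` field are hypothetical; the viewport widths and intent lines are the ones used as examples in this module.

```python
from __future__ import annotations

from dataclasses import dataclass

# Fixed viewport widths in pixels; the module's example set.
VIEWPORTS = (390, 768, 1440)


@dataclass(frozen=True)
class Page:
    id: str                      # hypothetical short name, used in file names
    route: str                   # hypothetical route
    states: tuple[str, ...]      # the states that matter on this page
    intent: str                  # one line: what the page is for, what must not break
    needs_setup: frozenset[str] = frozenset()  # states that need seeded data or flags


# Illustrative manifest: two pages instead of the five to eight in the exercise.
MANIFEST = (
    Page("billing", "/settings/billing", ("default", "past-due"),
         "Pricing figures must stay aligned",
         needs_setup=frozenset({"past-due"})),
    Page("triage", "/triage", ("default", "empty"),
         "Dense triage view; the queue must stay above summaries on mobile"),
)


def capture_names(pages, viewports=VIEWPORTS):
    """Expand the manifest into one predictable file name per page, state, and width."""
    return [
        f"{page.id}__{state}__{width}.png"
        for page in pages
        for state in page.states
        for width in viewports
    ]


def states_needing_setup(pages):
    """List the (page, state) pairs whose capture cost should be recorded, not hidden."""
    return [
        (page.id, state)
        for page in pages
        for state in page.states
        if state in page.needs_setup
    ]


if __name__ == "__main__":
    names = capture_names(MANIFEST)
    print(len(names), "captures per run, for example", names[0])
    print("needs setup:", states_needing_setup(MANIFEST))
```

Expanding the manifest before capturing anything also gives a quick size check: every added route or state multiplies by the number of viewports, which is one more reason to keep the list short.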

Narration for this slide

Time to apply this. Pick one product area you know well and write its sweep manifest on paper. Five to eight routes — only the ones where a regression would genuinely hurt. For each, one line of intent: what the page is for and what must not break. The states that matter, marking which ones need setup to reach. Your fixed viewports. Then test yourself: name one recent change this set would have caught, and one it would have missed. That second answer keeps you honest about the limits. Keep the page — in Module 5 this manifest becomes part of the review that runs on every pull request.

Slide 13 of 13 · 16:9

Summary, and what comes next

  • Memory-based visual review is the wrong shape for systemic, silent regressions; stored evidence is the replacement
  • A baseline is a manifest plus approved captures: chosen surfaces, states, and fixed breakpoints, kept in the repo
  • The sweep captures both sets, fans out one compare agent per page, and classifies every diff: regression, intended change, or noise
  • Severity ranks the regressions; repetition points to shared causes; baseline updates are deliberate human approvals
  • The sweep protects approved work — new work still needs the critique and evaluation methods from Modules 1 and 2

Module 4 turns to the two reviews teams defer longest — accessibility and content — and makes them routine passes with the same evidence discipline.

Slide notes

Recap by connecting the bullets back to the single idea: the question "has anything approved quietly changed?" can only be answered by comparison against stored evidence, and agents make that comparison cheap enough to run at every release rather than as an occasional heroic effort. The human contribution is concentrated at three points: choosing what the baseline protects, classifying what the differences mean, and approving what changes, including the baseline itself.
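The rule that a baseline update is a deliberate human approval can be made concrete with a small guard. This is a minimal sketch under assumptions, not a prescribed implementation: the function name, the `BaselineUpdate` record, and the example values are hypothetical, and the three classification labels mirror the buckets used in this module.

```python
from dataclasses import dataclass

# What each classification leads to, following the three buckets in this module.
NEXT_STEP = {
    "regression": "fix the UI, then recapture to prove the fix",
    "intended": "approve the change and update the baseline",
    "noise": "fix the capture, never the UI",
}


@dataclass(frozen=True)
class BaselineUpdate:
    page: str
    approver: str  # a named person, never an automated job
    note: str      # which design decision this update reflects


def approve_baseline_update(page, classification, approver, note):
    """Record a baseline update only for an intended change with a named approver and note."""
    if classification not in NEXT_STEP:
        raise ValueError(f"unknown classification: {classification!r}")
    if classification != "intended":
        raise ValueError(
            f"{page}: a {classification} finding must not update the baseline; "
            f"next step is to {NEXT_STEP[classification]}"
        )
    if not approver.strip() or not note.strip():
        raise ValueError(f"{page}: a baseline update needs a named approver and a note")
    return BaselineUpdate(page=page, approver=approver.strip(), note=note.strip())


if __name__ == "__main__":
    # Hypothetical example values.
    update = approve_baseline_update(
        "pricing", "intended", "Dana", "New card layout approved in design review"
    )
    print(update)
```

Because the function takes one page at a time and refuses an empty note, bulk approval and silent ratification of drift both become awkward by construction.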

Restate the boundaries so the module does not oversell. The sweep cannot judge new pages with no baseline, cannot prove accessibility from screenshots alone, cannot detect issues on pages missing from the manifest, and cannot decide taste questions — it can only put them in front of a person with evidence attached. Those limits are also the bridge points: critique and evaluation handle the new work, and the next module handles the checks screenshots cannot carry.

Preview Module 4 concretely. It covers the two reviews teams defer longest: accessibility review that goes beyond automated contrast checks into structure, keyboard flow, and screen-reader sense, and content review that holds interface copy to a written voice standard. Both reuse this module's discipline — evidence per finding, severity by user impact, and a clear split between what an agent can fix and what needs design or legal judgment. Participants should bring their exercise manifest forward; the same routes are a natural starting set for the accessibility pass.

Narration for this slide

Let's close the module. Memory-based visual review cannot keep up with regressions that are systemic and silent, so we replace it with evidence: a baseline manifest of the surfaces, states, and breakpoints that matter, captured the same way every run. The sweep compares current against baseline, one agent per page, and classifies every difference — regression, intended change, or noise. Severity ranks the regressions, repetition points to shared causes, and updating the baseline stays a human decision. Remember the boundary: the sweep protects approved work; new work still needs critique. In Module 4 we take on the two reviews teams defer longest — accessibility and content — and make them routine passes with the same discipline. See you there.

Module transcript
Module 3, narrated slide by slide

Slide 1: Visual QA and Regression Evidence

Welcome to Module 3. The first two modules were about reviewing work as it is made — critique loops, heuristic evaluations, walkthroughs. This module is about a different question: has anything that was already approved quietly changed? Most teams answer that from memory. Someone opens a few pages before a release, decides it looks fine, and ships. Memory does not store spacing, breakpoints, or empty states. In this module we replace that with evidence: screenshot baselines, agent-run sweeps that compare every page against them, and a report that tells you what regressed, what changed on purpose, and what is just noise. Let's start with why memory fails.

Slide 2: Why memory fails as a regression process

Here is why memory fails. Visual regressions are systemic — a token changes and eleven pages get looser padding at once. They are visual, not functional — every test passes, the product works, it just stops matching the design. And they are silent — the cookie banner that covered the primary button at phone width was missed by two days of manual checks and caught by a sweep. Memory-based review is the wrong shape for this: one person, a few pages, desktop width, whenever there is time. The fix is not more diligence. It is evidence the team can compare against, captured the same way every time — which is exactly the kind of repetitive work an agent does well.

Slide 3: Baselines: choosing surfaces, states, and breakpoints

A baseline is the approved look of your product, stored as files you can compare against. Building one means making three choices. Which surfaces — start with the routes where a regression actually hurts: core tasks, anything involving money, the screens that lean hardest on the design system. Which states — empty, loading, error, expanded — because those are the ones nobody re-checks by hand. And which breakpoints — pick a small set, like 390, 768, and 1440 pixels, and never change them, because comparability is the point. Add a one-line intent for each page, keep the list in the repo, and review it like code. If a surface is not in the set, the sweep cannot protect it.

Slide 4: The sweep manifest, in practice

Here is what a baseline manifest actually looks like: a small JSON file in the repo. Viewports declared once at the top, then one entry per page: an id, the route, the states that matter, and a single line of intent. That intent line is doing the most work. "Pricing figures must stay aligned" tells the compare agent which differences matter on this page; without it, every difference is just pixels and the report becomes a log nobody reads. Be honest about states: some need seeded data or flags to reach. Start with the cheap ones and grow the manifest as the sweep earns its keep.

Slide 5: Capture sweeps: how the agent walks the product

Capture is the boring part, and it should stay boring. A short Playwright script walks the manifest: set the viewport, load the route, wait for the network to go idle, save a full-page screenshot with a predictable name. Run it twice — once against the last approved release to get the baseline, once against the branch under review. The discipline is in keeping everything stable: same widths, same names, same wait rules, every run. Disable animations and watch out for lazy images and font swaps, because they fake regressions. And keep the screenshots on disk — pasting full-page captures into a chat burns enormous token budgets and buys you nothing.

Slide 6: Pixel diffs, structural diffs, and agent comparison

There are three ways to compare current against baseline, and they catch different things. Pixel diffs overlay the images and flag any pixel that changed — brutally effective for spacing and colour drift, but they also flag every anti-aliasing quirk, and teams drown in false positives. Structural diffs compare the accessibility tree — headings, labels, order — cheap and precise, but blind to purely visual change. The agent layer reads both captures plus the page's intent line and reports differences in design language: the queue now sits below the summaries on mobile, which the intent says must not happen. Use the mechanical diffs to detect change cheaply, and spend the agent's attention on saying whether it matters.

Slide 7: The regression evidence board

Here is the whole sweep on one board. On the left, the agent captures baseline and current screenshots — same routes, same states, same three widths, every run. Then one compare agent per page reports observable differences, each finding naming its evidence file. Third column: every difference gets classified — regression, intended change, or noise. And on the right, the human review meeting, which only ever sees what deserves a decision: ranked blockers, systemic causes grouped to one likely fix, baseline updates to approve, and judgment calls. Notice what the board really is — a filter. Hundreds of raw differences go in on the left. A short list of decisions, with evidence attached, comes out on the right. Noise never reaches the room.

Slide 8: Reading the report: regression, intended change, or noise

Every difference the sweep finds lands in one of three buckets. Regressions — the build no longer matches the approved intent — get fixed and then recaptured to prove the fix. Intended changes — the design moved on purpose — get approved, and the baseline gets updated. And noise — animation timing, a lazy image, a font swap — means you fix the capture, never the UI. Within the regressions, the familiar severity rubric ranks the work: P0 blocks the release, P1 gets fixed now, P2 is logged, P3 goes to a human. And watch for repetition. The same finding on eleven pages almost always means one shared token, and one fix clears eleven findings.

Slide 9: Keeping baselines current without rubber-stamping drift

Baselines age, and how you update them decides whether the practice keeps working. Never update them and every sweep drowns in known differences until nobody reads the report. Update them automatically and you have ratified drift — whatever shipped becomes correct by definition. The discipline is simple: a baseline update is an approval, made by a person, per page or per finding, with a note about which design decision it reflects. Never bulk-approve. And read the update history as a signal — a page that needs its baseline updated every single sweep is not regressing, it is still being designed. Move it back into the critique loop until it settles.

Slide 10: Where visual QA sits in the release rhythm

So when do you run this? Tie the sweep to the moments where visual risk concentrates. Before a release that bundles weeks of merged work. After a design-system or framework upgrade. After a theme or template change. And on a recurring schedule, because drift accumulates even when nothing big ships. The one place not to use it is a single new feature page — there is no baseline to compare against, and what new work needs is the critique and evaluation methods from the first two modules. Once the manifest and the script exist, a full sweep costs thirty to sixty minutes, mostly agent time. The real question is not whether you can afford to run it, but which triggers earn it.

Slide 11: Worked example: a regression caught two days before release

Let's trace one real run. A fourteen-page SaaS product, two days before a quarterly release bundling nine weeks of work. The capture script produced 84 screenshots, fourteen compare agents ran in under twelve minutes, and the merged report held thirty-one findings — including two release blockers. A save button pushed below an overflowing panel at phone width, and a chart legend rendering white on white after a theme variable rename. Both had passed every functional test. The team fixed the blockers and the four systemic P1s in one afternoon, recaptured to prove it, and shipped on schedule. The sixteen polish findings became a backlog with owners — not a vague sense that the release felt rough.

Slide 12: Exercise: define the baseline set for one product area

Time to apply this. Pick one product area you know well and write its sweep manifest on paper. Five to eight routes — only the ones where a regression would genuinely hurt. For each, one line of intent: what the page is for and what must not break. The states that matter, marking which ones need setup to reach. Your fixed viewports. Then test yourself: name one recent change this set would have caught, and one it would have missed. That second answer keeps you honest about the limits. Keep the page — in Module 5 this manifest becomes part of the review that runs on every pull request.

Slide 13: Summary, and what comes next

Let's close the module. Memory-based visual review cannot keep up with regressions that are systemic and silent, so we replace it with evidence: a baseline manifest of the surfaces, states, and breakpoints that matter, captured the same way every run. The sweep compares current against baseline, one agent per page, and classifies every difference — regression, intended change, or noise. Severity ranks the regressions, repetition points to shared causes, and updating the baseline stays a human decision. Remember the boundary: the sweep protects approved work; new work still needs critique. In Module 4 we take on the two reviews teams defer longest — accessibility and content — and make them routine passes with the same discipline. See you there.