Agentic Design School

Section 01

Why a sweep, not a single review

A single-page visual review answers one question: does this page still match the approved design? A regression sweep answers a harder one: does the whole product still match it, after a release branch, a dependency bump, or a design-system upgrade touched dozens of files at once?

Sweeps exist because visual regressions rarely arrive one page at a time. A spacing token changes and every card in the product gets two pixels looser. A theme update lands and three marketing pages quietly lose their heading rhythm. Reviewing pages one by one in chat does not scale to that.

This workflow scales the single-page review: the agent builds the evidence first, then fans out one focused compare agent per page. The human sees a merged, ranked report instead of forty separate conversations.

Projects to inspect

Visual QA With Agents: The single-page review workflow this sweep is built on. Read it first if you have never run an agent-driven visual review.

Section 02

When to reach for it

Run a sweep when the change is broad and the risk is diffuse. Quarterly releases, design-system version bumps, framework migrations, CMS theme updates, and any refactor that touches shared layout components are all good triggers.

Do not run a sweep for a single new feature page. The single-page review workflow is faster and gives you more depth on one screen. The sweep trades depth per page for coverage across the product.

Before a release that bundles many small UI changes.
After upgrading a design system, component library, or CSS framework.
After a CMS theme or template change on a content site.
When responsive behavior may have drifted across many pages.
On a schedule, as a recurring design health check.

Section 03

The orchestration pattern: a dynamic workflow

This is a dynamic workflow, not a single chat session. When you include the word "workflow" in the prompt, Claude Code writes a small JavaScript orchestration script and runs it in the background. The script calls subagents, holds their intermediate results in script variables, and surfaces only the merged report back into the main conversation.

That separation is the point. Forty pages of raw findings would flood the main context and degrade the review. In a dynamic workflow, each compare agent reads only its own pair of screenshots, returns a structured finding list, and the script aggregates. Up to 16 agents run concurrently and a single run can use up to 1,000 agents, which is more than enough for any product surface.
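The concurrency cap is applied by the workflow runtime, so the orchestration script does not need its own pool. Purely to illustrate what a cap means for a forty-page sweep, here is a minimal sketch of a bounded pool in plain JavaScript. `runWithLimit` is a hypothetical helper for this illustration, not part of Claude Code.

```javascript
// Hypothetical helper: run async tasks with at most `limit` in flight.
// Illustration only; the workflow runtime applies its own concurrency cap.
async function runWithLimit(items, limit, task) {
  const results = new Array(items.length)
  let next = 0

  async function worker() {
    // Each worker claims the next unclaimed index until the list is exhausted.
    while (next < items.length) {
      const index = next++
      results[index] = await task(items[index], index)
    }
  }

  const workerCount = Math.min(limit, items.length)
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}
```

Results come back in input order regardless of which task finishes first, which is what lets a merge step line replies up with the pages that produced them.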

Workflows are resumable, so a sweep interrupted halfway does not lose the captures already reviewed. Once the script works, save it to .claude/workflows/ and it becomes a reusable slash command the team can run before every release. You can also trigger the heavier orchestration path explicitly with /effort ultracode.

Diagram: Regression sweep loop

1. Define page list
2. Capture all pages
3. Fan out compare agents
4. Merge and rank findings
5. Fix in passes
6. Recapture and verify
7. Sign off (feeds next cycle)

The sweep is a cycle: capture, fan out, merge, fix, recapture, sign off.

Section 04

Step 1: define the sweep manifest

The sweep is only as good as its page list. Write a manifest that names every page, the states that matter on it, and the viewports to capture. Keep it in the repo so the list is reviewed like code and grows with the product.

For each page, record the route, any setup needed to reach the state, and a one-line statement of what the page is for. That intent line is what the compare agents use to judge whether a difference matters.

sweep-manifest.json

{
  "viewports": [
    { "name": "mobile", "width": 390, "height": 900 },
    { "name": "tablet", "width": 768, "height": 1024 },
    { "name": "desktop", "width": 1440, "height": 1000 }
  ],
  "pages": [
    {
      "id": "dashboard",
      "route": "/dashboard",
      "intent": "Dense triage view; queue must stay above summaries on mobile.",
      "states": ["default", "empty", "error"]
    },
    {
      "id": "billing",
      "route": "/settings/billing",
      "intent": "Plan comparison and invoice history; pricing figures must stay aligned.",
      "states": ["default", "past-due"]
    },
    {
      "id": "reports",
      "route": "/reports",
      "intent": "Chart-heavy page; legends and axis labels must remain readable at 390px.",
      "states": ["default", "loading"]
    }
  ]
}
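Because the manifest is reviewed like code, it can also be checked like code. The following is a minimal sketch of a validator for the manifest shape shown above; `validateManifest` and its messages are hypothetical, and the rules (non-empty intent, at least one state, routes starting with "/") are assumptions drawn from this section rather than a fixed schema.

```javascript
// Hypothetical validator for sweep-manifest.json; returns a list of problems.
function validateManifest(manifest) {
  const problems = []
  const viewports = Array.isArray(manifest.viewports) ? manifest.viewports : []
  const pages = Array.isArray(manifest.pages) ? manifest.pages : []

  if (viewports.length === 0) problems.push("manifest has no viewports")
  if (pages.length === 0) problems.push("manifest has no pages")

  for (const viewport of viewports) {
    if (!viewport.name) problems.push("viewport is missing a name")
    if (!(viewport.width > 0 && viewport.height > 0)) {
      problems.push(`viewport ${viewport.name ?? "?"} needs a positive width and height`)
    }
  }

  const seenIds = new Set()
  for (const page of pages) {
    const label = page.id ?? "(no id)"
    if (!page.id) {
      problems.push("page is missing an id")
    } else if (seenIds.has(page.id)) {
      // Duplicate ids would make two pages overwrite each other's captures.
      problems.push(`duplicate page id: ${page.id}`)
    }
    seenIds.add(page.id)
    if (typeof page.route !== "string" || !page.route.startsWith("/")) {
      problems.push(`${label}: route must start with "/"`)
    }
    if (typeof page.intent !== "string" || page.intent.trim() === "") {
      problems.push(`${label}: intent line is empty`)
    }
    if (!Array.isArray(page.states) || page.states.length === 0) {
      problems.push(`${label}: list at least one state`)
    }
  }
  return problems
}
```

An empty array means the manifest passed. Running a check like this before capture keeps a typo in the page list from silently shrinking the sweep.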

Section 05

Step 2: capture evidence with a script, not by hand

The capture script walks the manifest and produces a baseline folder and a current folder of screenshots. The baseline comes from the last approved release or from design exports. The current set comes from the branch under review.

Stable viewports and stable file naming are what make the sweep repeatable. If every run captures slightly different widths or names files differently, the compare agents waste effort and the findings stop being comparable across sweeps.

capture-screens.mjs

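import { chromium } from "playwright"
import { mkdir, readFile } from "node:fs/promises"

// Usage: node capture-screens.mjs baseline|current
const manifest = JSON.parse(await readFile("sweep-manifest.json", "utf8"))
const target = process.argv[2] ?? "current" // "baseline" or "current"
const baseUrl = process.env.SWEEP_BASE_URL ?? "http://localhost:3000"
const outDir = `sweeps/${target}`

await mkdir(outDir, { recursive: true })
const browser = await chromium.launch()

try {
  const page = await browser.newPage()
  // Note: this loop captures each route as it loads (the "default" state).
  // The other states listed in the manifest need project-specific setup
  // before the screenshot, which this script does not attempt.
  for (const entry of manifest.pages) {
    for (const viewport of manifest.viewports) {
      await page.setViewportSize({ width: viewport.width, height: viewport.height })
      await page.goto(baseUrl + entry.route, { waitUntil: "networkidle" })
      await page.screenshot({
        // Stable name: <page id>--<viewport name>.png
        path: `${outDir}/${entry.id}--${viewport.name}.png`,
        fullPage: true,
      })
    }
  }
} finally {
  // Close the browser even if a capture throws.
  await browser.close()
}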
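Stable naming also makes coverage checkable: the manifest alone determines which files a run should have produced. The sketch below derives that list using the same `<page id>--<viewport name>.png` pattern as the capture script and reports what is missing. Both function names are hypothetical, and the list of existing paths is passed in as a plain array (in practice it might come from a directory listing).

```javascript
// Builds the capture paths the capture script would write for one target folder.
function expectedCaptures(manifest, target) {
  const paths = []
  for (const entry of manifest.pages) {
    for (const viewport of manifest.viewports) {
      paths.push(`sweeps/${target}/${entry.id}--${viewport.name}.png`)
    }
  }
  return paths
}

// Returns the expected paths that do not appear in `existingPaths`.
function missingCaptures(manifest, target, existingPaths) {
  const present = new Set(existingPaths)
  return expectedCaptures(manifest, target).filter((path) => !present.has(path))
}
```

Anything this returns is a candidate for the report's coverage note rather than a silent gap.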

Section 06

Step 3: fan out one compare agent per page

Each compare agent receives one page: its baseline captures, its current captures, and the intent line from the manifest. It reports observable differences only, grouped by layout, typography, spacing, color, content, states, responsiveness, and accessibility, each with a severity and a proposed fix.

Keeping the agents narrow keeps them honest. An agent looking at one page does not average its judgment across the whole product, and its findings stay grounded in the two images in front of it.

Dynamic workflow sketch (the orchestration script Claude writes)

// Sketch of the orchestration script Claude Code generates for this workflow.
// agent(prompt, options) runs a subagent and resolves with its final answer.
import { readFile, writeFile } from "node:fs/promises"

const manifest = JSON.parse(await readFile("sweep-manifest.json", "utf8"))

const reviews = await Promise.all(
  manifest.pages.map((entry) =>
    agent(
      "Use the visual-compare agent rules. Compare sweeps/baseline and sweeps/current captures for page " +
        entry.id +
        ". Intent: " +
        entry.intent +
        ". Report only observable differences as JSON findings with severity P0-P3, page, viewport, observation, impact, and fix.",
      { model: "sonnet" }
    )
  )
)

const findings = reviews.flatMap((review) => JSON.parse(review).findings)
findings.sort((a, b) => a.severity.localeCompare(b.severity))
await writeFile("sweeps/findings.json", JSON.stringify(findings, null, 2))

const report = await agent(
  "Summarize sweeps/findings.json into a release-readiness report grouped by severity, then by page. Flag any P0 or P1 as release blockers and group repeated findings by likely shared cause.",
  { model: "sonnet" }
)
await writeFile("sweeps/report.md", report)
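The sketch above calls `JSON.parse` on every reply directly, so one agent that wraps its JSON in prose would throw and sink the whole merge. A more defensive variant, sketched here under the assumption that replies arrive in the same order as `manifest.pages`, collects the parseable findings and records which pages need a rerun. `collectFindings` is a hypothetical helper, not something the workflow runtime provides.

```javascript
// Hypothetical helper: turn raw agent replies into findings plus a list of
// page ids whose reply could not be parsed as { "findings": [...] }.
function collectFindings(pages, replies) {
  const findings = []
  const unparsed = []
  replies.forEach((reply, index) => {
    const pageId = pages[index].id
    try {
      const parsed = JSON.parse(reply)
      if (!Array.isArray(parsed.findings)) throw new Error("no findings array")
      findings.push(...parsed.findings)
    } catch {
      // Keep going; the page is reported instead of failing the sweep.
      unparsed.push(pageId)
    }
  })
  return { findings, unparsed }
}
```

The `unparsed` list maps naturally onto a coverage note in the report. Separately, `Promise.all` rejects as soon as any one agent call rejects; `Promise.allSettled` is the standard alternative when partial results are acceptable.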

Section 07

Define the compare agent once

The compare behavior should live in a subagent definition rather than being re-typed into every prompt. A short markdown file in .claude/agents/ gives the role a name, a description, and the tools it is allowed to use.

This keeps every page review consistent. The sweep script just addresses the agent by name and supplies the page-specific details.

.claude/agents/visual-compare.md

---
name: visual-compare
description: Compares baseline and current screenshots for a single page and reports observable visual differences with severity and fixes. Use during regression sweeps.
tools: Read, Glob, Bash
---

You compare two sets of screenshots for one page.

Rules:
- Report observable differences only. Never report taste as a finding.
- Group findings by layout, typography, spacing, color, content, states, responsiveness, accessibility.
- Severity: P0 blocks the task; P1 changes hierarchy or breaks a viewport; P2 weakens polish; P3 is subjective and needs human judgment.
- Every finding includes: page, viewport, observation, user impact, likely cause, concrete fix.
- Output valid JSON: { "findings": [...] }.
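The rules in this definition can double as a check on what the agents return. The following is a minimal sketch of such a check. The key names `category`, `impact`, `cause`, and `fix` are assumptions about how the fields listed above would be spelled in JSON, and `findingProblems` is a hypothetical helper.

```javascript
const SEVERITIES = ["P0", "P1", "P2", "P3"]
const CATEGORIES = [
  "layout", "typography", "spacing", "color",
  "content", "states", "responsiveness", "accessibility",
]
// Assumed JSON keys for: page, viewport, observation, user impact,
// likely cause, concrete fix.
const REQUIRED_TEXT_FIELDS = ["page", "viewport", "observation", "impact", "cause", "fix"]

// Returns a list of rule violations for one finding (empty when it conforms).
function findingProblems(finding) {
  const problems = []
  if (!SEVERITIES.includes(finding.severity)) problems.push("severity must be P0-P3")
  if (!CATEGORIES.includes(finding.category)) problems.push("unknown category")
  for (const field of REQUIRED_TEXT_FIELDS) {
    if (typeof finding[field] !== "string" || finding[field].trim() === "") {
      problems.push(`missing ${field}`)
    }
  }
  return problems
}
```

Findings that fail the check are better sent back to the agent for a retry than merged into the report half-filled.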

Section 08

Step 4: merge and rank

The orchestration script merges every page's findings into one list, sorts by severity, and asks one final agent to write the release-readiness report. The report leads with blockers, groups the rest by page, and notes which findings repeat across many pages.

Repetition is the most useful signal a sweep produces. If eleven pages report the same loosened card padding, the cause is almost certainly one shared token or component, and one fix clears eleven findings.
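In the workflow above, the final report agent does this grouping. To make the idea concrete, here is a deliberately naive sketch that groups findings by category and normalized observation text and keeps the groups that span several pages. It assumes agents phrase the same difference identically, which real output will not always do, and both `groupRepeated` and the `minPages` threshold of 3 are hypothetical.

```javascript
// Naive sketch: findings with the same category and observation text on at
// least `minPages` distinct pages are treated as one systemic group.
function groupRepeated(findings, minPages = 3) {
  const groups = new Map()
  for (const finding of findings) {
    const key = `${finding.category}|${finding.observation.trim().toLowerCase()}`
    if (!groups.has(key)) {
      groups.set(key, {
        category: finding.category,
        observation: finding.observation.trim(),
        pages: new Set(),
      })
    }
    groups.get(key).pages.add(finding.page)
  }
  return [...groups.values()]
    .filter((group) => group.pages.size >= minPages)
    .map((group) => ({ ...group, pages: [...group.pages].sort() }))
    .sort((a, b) => b.pages.length - a.pages.length) // widest spread first
}
```

Exact-text matching is the weak point. It is why the source workflow leaves grouping "by likely shared cause" to an agent that can read for meaning.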

Table: What the merged report contains

1. Blockers: P0 and P1 findings that should stop the release until fixed
2. Systemic findings: Differences that repeat across pages and point to a shared cause
3. Per-page findings: P2 polish issues grouped by page for later passes
4. Human questions: P3 judgment calls that need a designer, not a fix
5. Coverage note: Pages or states that could not be captured this run

The report is a decision artifact for the release, not a raw diff dump.

Section 09

Step 5: fix in passes, then recapture

Fix blockers and systemic causes first, polish later. After each fix pass, rerun the capture script for the affected pages and rerun the compare agents on just those pages. The workflow is resumable, so a partial re-check does not require repeating the whole sweep.
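Recapturing only the affected pages amounts to filtering the manifest down to the pages touched by the pass. A minimal sketch, with hypothetical helper names, assuming each finding carries the `page` id it was reported on:

```javascript
// Distinct page ids touched by the findings fixed in this pass.
function affectedPageIds(fixedFindings) {
  return [...new Set(fixedFindings.map((finding) => finding.page))].sort()
}

// A copy of the manifest restricted to the given pages; viewports are kept
// unchanged so the new captures stay comparable with the baseline.
function manifestForPages(manifest, pageIds) {
  const wanted = new Set(pageIds)
  return { ...manifest, pages: manifest.pages.filter((entry) => wanted.has(entry.id)) }
}
```

How the reduced manifest reaches the capture script is left open here. One option would be writing it to a separate file and reading that instead of sweep-manifest.json.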

The sweep is done when a full recapture produces no P0 or P1 findings and every remaining P2 and P3 item has an owner or a decision.

Pass 1: P0 blockers and anything that breaks a task on mobile.
Pass 2: systemic causes such as changed tokens or shared components.
Pass 3: per-page P2 polish, batched by area.
Pass 4: full recapture and a final compare across every page.

Diagram: Fix and verify loop

1. Pick pass scope
2. Apply fixes
3. Recapture affected pages
4. Re-run compare agents
5. Update report
6. Decide next pass (feeds next cycle)

Each fix pass ends with recapture and a narrow re-review before the next pass starts.

Section 10

Case study: 14-page SaaS app before a quarterly release

A product team ran the sweep on a 14-page SaaS app two days before a quarterly release that bundled nine weeks of merged work. The capture script produced 84 screenshots (14 pages at three viewports, for both the baseline and the current build), and 14 compare agents ran in under twelve minutes.

The merged report contained 31 findings: 2 P0, 7 P1, 16 P2, and 6 P3. The P0s were a settings form whose save button fell below an overflowing panel at 390px, and a reports page where a chart legend rendered white on white after a theme variable rename. Both had passed functional tests.

The team fixed the blockers and the four systemic P1s in one afternoon, recaptured, and shipped on schedule. The P2 list became a polish backlog with owners instead of a vague sense that the release felt rough.

Section 11

Case study: design-system version bump

A design-system upgrade moved spacing from a 4px base scale to a refined token set. Nothing looked obviously broken, which is exactly the kind of change a sweep is for. The compare agents reported the same finding pattern on 11 of 18 pages: card internal padding grew from 16px to 20px and list row height grew by 4px, dropping one to two rows below the fold on dense views.

Because the findings repeated, the report named a single likely cause: the mapping from the old space-4 token to the new scale. One mapping change plus a recapture cleared 23 of the 29 findings. The remaining six were genuine per-page issues that the bump had merely exposed.

Section 12

Case study: marketing site after a CMS theme update

A marketing team updated their CMS theme and assumed only colors had changed. The sweep across nine pages found that the theme also swapped the heading font fallback and tightened line height on body copy, which broke the rhythm on three long-form pages and pushed a pricing table's footnotes above the fold boundary on tablet.

It also caught one finding no human had noticed in two days of manual checks: the cookie banner now overlapped the primary call to action at 390px on every page. That single P0 justified the sweep on its own.

Section 13

Good vs bad sweep output

A weak sweep report reads like a diff log: hundreds of pixel-level notes with no ranking and no causes. A strong report is short at the top, specific underneath, and tells the team what to do first.

Table: Sweep report quality comparison

Bad: Found 212 visual differences across 14 pages
Good: 2 release blockers, 4 systemic findings traced to one token mapping, 16 polish items with owners

Bad: Dashboard looks slightly different on mobile
Good: P0: Dashboard at 390px places the save action below an overflowing filter panel; users cannot complete edits without scrolling past it

Bad: Many pages have spacing changes
Good: Systemic: card padding grew from 16px to 20px on 11 pages; likely cause is the space-4 token remap in the design-system upgrade

The report should support a release decision, not just describe pixels.

Section 14

Limits: what the sweep cannot prove

The sweep proves that the current build matches or differs from the baseline in observable ways. It does not prove the baseline was good, that the product logic is correct, or that users will succeed with an unchanged but confusing flow.

Humans still decide what counts as a blocker, whether a P3 judgment call becomes a change, and whether the release ships with known P2 debt. The sweep narrows those decisions; it does not make them.

It cannot judge new pages that have no baseline; those need the single-page review instead.
It cannot prove accessibility from screenshots alone; keep DOM, contrast, and keyboard checks in the loop.
It cannot detect issues on pages missing from the manifest, so the manifest needs the same care as the code.
It cannot decide brand or taste questions; it can only flag them as P3 for a human.
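The manifest gap in particular can be narrowed mechanically. As a sketch, if the project can produce a list of its known routes (from a router config or a sitemap, for instance), comparing that list against the manifest shows which pages the sweep will never see. `uncoveredRoutes` and the `knownRoutes` input are hypothetical; where the route list comes from is project-specific.

```javascript
// Returns the known routes that no manifest page covers.
function uncoveredRoutes(manifest, knownRoutes) {
  const covered = new Set(manifest.pages.map((entry) => entry.route))
  return knownRoutes.filter((route) => !covered.has(route))
}
```

A non-empty result is a prompt for a human decision, not an error: some routes are left out of a sweep on purpose.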

Section 15

Reusable sweep workflow

Save the workflow to .claude/workflows/ once it works for your product, and it becomes a slash command the team can run before every release. The output should always be the same artifact: a ranked, evidence-backed release-readiness report.

Visual QA regression sweep workflow

1. Maintain a sweep manifest of pages, states, intents, and viewports.
2. Capture baseline and current screenshots with the capture script.
3. Run this as a dynamic workflow: fan out one visual-compare agent per page.
4. Merge findings, rank by severity, and group repeated findings by likely shared cause.
5. Approve the fix plan: blockers first, systemic causes second, polish later.
6. Apply fixes in passes, recapturing only the affected pages between passes.
7. Run a full recapture and final compare across every page.
8. Sign off when no P0 or P1 remains and every P2 and P3 has an owner or a decision.
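Step 8 is mechanical enough to express as a check. The sketch below follows its wording: no P0 or P1 remains, and every P2 and P3 has an owner or a decision. The `owner` and `decision` fields are hypothetical additions to each finding, filled in by the team during triage, and `signOffStatus` is an illustrative name.

```javascript
// Returns whether the sweep can be signed off, plus the findings that block it.
function signOffStatus(findings) {
  const blockers = findings.filter(
    (finding) => finding.severity === "P0" || finding.severity === "P1"
  )
  const undecided = findings.filter(
    (finding) =>
      (finding.severity === "P2" || finding.severity === "P3") &&
      !finding.owner &&
      !finding.decision
  )
  return {
    ready: blockers.length === 0 && undecided.length === 0,
    blockers,
    undecided,
  }
}
```

The check only confirms that the bookkeeping is complete. Whether to ship with the remaining P2 debt is still a human call.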
