Agentic Design School
Module 5 of 5
40–50 minutes

Design Review and Critique with Agents

Design Review on Every PR

The end state: every change that touches the interface gets a design review automatically — token and component checks, screenshot evidence, severity levels — with humans reviewing what matters instead of everything or nothing.

Duration: 40–50 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Define what a per-PR design review checks and what it deliberately ignores.
  • Set severity levels and escalation so humans see the findings that need them.
  • Roll the practice out without turning it into a blocking bureaucracy.

Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Design Review on Every PR

Design Review and Critique with Agents · Module 5 of 5

  • Review every change, not every quarter
  • What a PR-level review checks — and what it deliberately ignores
  • Severity, escalation, and the designer's decision
  • Rolling it out without building a bureaucracy

Everything from the previous four modules — critique dimensions, evaluation methods, regression evidence, accessibility and content passes — converges into one standing gate on the pull request.

Slide notes

This is the closing module of the course, and it is where the earlier modules stop being separate practices and become one habit. Critique loops, heuristic findings, screenshot baselines, accessibility passes — all of them are valuable as occasional projects, but design quality mostly degrades between those projects, one merge at a time. The end state this module describes is a review that runs on every pull request that touches the interface, scoped to what the branch actually changed.

Set expectations about scale honestly. The gate described here is not a full design review of the product on every merge; it is a narrow, mechanical review of the diff, with humans pulled in only when the findings need judgment. That narrowness is what makes it sustainable. A gate that tries to check everything blocks everything, and a blocked team routes around the gate within a fortnight.

If participants have not done Modules 3 and 4, the screenshot-evidence and accessibility material here will still make sense, but point them back: the per-PR gate introduces no new techniques; it reuses the capture-and-diff and accessibility techniques from those modules at a smaller scope.

Narration for this slide

Welcome to the final module of the course. Everything so far — critique dimensions, heuristic evaluations, regression evidence, accessibility and content review — has been a practice you run when you decide to. This module is about the end state: a design review that runs automatically on every pull request that touches the interface. Not a full review of the product on every merge — a narrow, mechanical review of exactly what the branch changed, with screenshot evidence, severity levels, and a clear rule for when a human designer must look. By the end you will be able to draft the review rules for your own codebase. Let's start with why the pull request is the right place.

Slide 2 of 13 · 16:9

Review every change, not every quarter

Design quality rarely fails in one decision. It degrades one merge at a time, and each individual diff looks reasonable.

  • Engineers review diffs for logic, not for hierarchy, rhythm, or token discipline
  • A spacing value gets hardcoded under deadline; a button picks up an off-palette hex
  • A new empty state ships without anyone seeing it on a phone
  • A finding raised on the branch is a thirty-second conversation; after release it is a ticket and a triage meeting

The pull request is the moment the change is cheapest to fix and the author's context is freshest. That is where the review belongs.

Slide notes

The argument for putting design review on the pull request is an argument about cost and timing, not about distrust of engineers. Engineers reviewing a UI pull request are reading it for logic, naming, and side effects — that is their job. Nobody in that review is checking whether the new badge colour exists in the token set, whether the gap below 768 pixels collapsed, or whether the new dialog's close button still has a focus ring. Each of those slips is small and individually defensible, which is exactly why they accumulate.

The timing point matters as much as the coverage point. When a finding is raised while the branch is open, the author still has the whole change in their head, the preview is already built, and the fix is usually a one-line commit. The same finding discovered after release becomes a bug report, a triage discussion, a regression risk assessment, and often a debate about whether it is worth fixing at all. The work is identical; the overhead is not.

The historical blocker has been cost: no design team can manually review every UI branch at three widths with a token check and an accessibility pass. The previous modules established that an agent can run each of those checks reliably when the procedure is written down. This module is about writing the procedure down once and attaching it to the place changes actually happen.

Narration for this slide

Here is the problem this module solves. Design quality almost never fails in one big decision — it degrades one merge at a time. A spacing value gets hardcoded because the deadline was tight. A button picks up a colour that is not in the tokens. A new empty state ships without anyone looking at it on a phone. Each diff looks fine to the engineer reviewing it, because engineers review for logic, not hierarchy and token discipline. Catch the issue while the branch is open and it is a thirty-second conversation. Catch it after release and it is a ticket, a triage meeting, and a regression risk. The pull request is where the fix is cheapest. That is where the review belongs.

Slide 3 of 13 · 16:9

What the gate checks — and what it deliberately ignores

The scope is the diff, not the product. That is what keeps the run inside twenty minutes and the findings relevant to the author.

| Check | Scope on every PR |
| --- | --- |
| Visual diff | Branch vs main captures of changed screens at 390, 768, and 1440 px |
| Token usage | Hardcoded colours, spacing, radii, and font sizes in changed components |
| Accessibility | Automated checks plus a keyboard pass on the changed routes |
| States | Empty, loading, error, and disabled states the diff introduces or alters |
| Out of scope | Copy quality, product logic, performance, screens the branch does not touch |

A gate that checks everything blocks everything. The deliberate exclusions are what make the gate survivable.

Slide notes

Spend as much time on the last row as on the first four. The strongest temptation when setting this up is to keep adding checks — copy review, performance budgets, full-product heuristics — because each one is individually valuable. Resist it. Every check added to the per-PR gate adds run time, adds findings the author has to read, and adds opportunities for false positives. The gate's authority comes from being fast, scoped, and almost always right; that is what earns it the right to run on every branch.

The four in-scope checks are the ones the previous modules made cheap. The visual diff is Module 3's capture-and-compare technique, narrowed to changed screens only. Token usage is the same kind of mechanical rule the design-system maintenance loop relies on: raw values in component code that have a token equivalent. Accessibility is Module 4's automated-plus-keyboard pass, on changed routes only. States matter because the diff is where new states are born, and a state nobody captured is a state nobody designed.

Note what the scoping buys operationally: a branch that touches two screens gets a two-screen review, and the whole run typically lands in the ten-to-twenty-minute range. The full-surface reviews from Modules 2 and 3 still exist — they just run on a release or quarterly cadence, not per branch.

Narration for this slide

So what does a per-PR design review actually check? Four things, all scoped to the screens the branch changed. A visual diff: branch versus main, captured at three widths. Token usage: hardcoded colours, spacing, radii, and font sizes in the changed components. Accessibility: automated checks plus a keyboard pass on the changed routes. And states: any empty, loading, error, or disabled state the diff introduces or alters. Just as important is what it deliberately ignores — copy quality, product logic, performance, and every screen the branch does not touch. That is not laziness. A gate that checks everything blocks everything, and a blocked team routes around the gate. The exclusions are what keep it alive.

Slide 4 of 13 · 16:9

Token, component, and anti-pattern checks as gates

The mechanical checks are defined once, versioned with the code, and applied identically to every branch.

  • Token discipline: raw colours, spacing, radii, and font sizes flagged with the token that should replace them
  • Component rules: one-off variants, bypassed primitives, and props that duplicate existing behaviour
  • Anti-patterns your team has named: gradients on data surfaces, nested scroll areas, off-grid spacing
  • Reviewers live in version control, so every branch is reviewed by the same standard

.claude/agents/token-reviewer.md (excerpt)

```markdown
---
name: token-reviewer
description: Reviews changed component files for design token
  discipline. Flags hardcoded colors, spacing, radii, shadows,
  and font sizes that have a token equivalent. Read-only.
tools: Read, Grep, Glob
---
For every finding report: file and line, the hardcoded value,
the token that should replace it, and severity.
If a value has no reasonable token equivalent, say so instead
of inventing a mapping. Do not report style preferences.
```

Because the reviewer definitions are files in the repository, changing the standard is a reviewed change — not a personal preference applied inconsistently.

Slide notes

The mechanical heart of the gate is a small set of reviewer definitions checked into the repository — in the Claude Code version of this workflow they live in .claude/agents/, and the saved workflow command that orchestrates them lives in .claude/workflows/. The token reviewer shown here is the workhorse: it reads only the changed component files, flags raw values that have a token equivalent, and reports file, line, value, replacement token, and severity. The two constraints in its definition matter most: it must say when no reasonable token exists rather than inventing a mapping, and it must not report style preferences. Both rules exist because they are the two ways automated reviewers lose a team's trust.
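To make the rule concrete, here is a minimal Python sketch of the kind of check the token reviewer performs. It is an illustration under stated assumptions, not the reviewer itself: the token names and values in `TOKENS` and the sample source line are hypothetical, it only looks at six-digit hex colours, and it leaves severity out. The part worth noticing is that a value with no exact token is reported as such rather than mapped to a guess.

```python
import re

# Hypothetical token table: raw value -> token name. A real project
# would load this from its own design-token source.
TOKENS = {
    "#1a73e8": "color.brand.primary",
    "#d93025": "color.status.danger",
}

HEX_COLOUR = re.compile(r"#[0-9a-fA-F]{6}\b")


def review_tokens(path, source):
    """Return one finding per hardcoded hex colour in `source`."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOUR.finditer(line):
            value = match.group(0).lower()
            findings.append({
                "file": path,
                "line": line_no,
                "value": value,
                # None means "no reasonable token equivalent": say so
                # instead of inventing a mapping.
                "token": TOKENS.get(value),
            })
    return findings


sample = 'const badge = { color: "#1A73E8", border: "#abcdef" };'
for finding in review_tokens("Badge.tsx", sample):
    print(finding)
```

The first colour in the sample resolves to a token; the second comes back with `token` set to `None`, which the gate would surface as a note for the design-system owner rather than as a violation.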

Component and anti-pattern checks follow the same shape. Component rules catch the one-off variant built under deadline pressure and the primitive that got bypassed; anti-pattern rules encode the things your team has explicitly decided not to do. The important property is that these are your team's named rules written down, not a generic linter's opinion. If a rule is not written down, the gate cannot check it — which is also a useful forcing function for actually writing the rules down.

The versioning point deserves emphasis with design leads. Because the reviewers are files, tightening or loosening the standard is itself a pull request: visible, discussed, and applied to everyone's next run. That is the difference between a shared quality bar and whichever reviewer happened to be grumpy that week.

Narration for this slide

The mechanical checks are where the gate earns its consistency. A token reviewer reads the changed component files and flags raw colours, spacing, radii, and font sizes that have a token equivalent — citing the file, the line, and the token that should replace it. Component rules catch one-off variants and bypassed primitives. Anti-pattern rules encode the things your team has explicitly named and banned. Here is the part that makes it a team practice rather than a personal trick: these reviewers are files in the repository, versioned next to the code they review. Every branch gets the same reviewers with the same instructions, and changing the standard is itself a reviewed change.

Slide 5 of 13 · 16:9

Screenshot evidence attached to the PR automatically

Findings without evidence start debates. Findings with capture paths start commits.

  • Branch and main captured at the same widths, for the changed screens only
  • Diffs attached per finding, so the author sees the change rather than imagining it
  • States from the routes map captured too: error, empty, loading, disabled
  • The findings comment is paste-ready: severity, evidence, user impact, and a concrete fix per finding

The author should be able to act on the comment without opening the design tool, the staging site, or a meeting.

Slide notes

Module 3 made the case that visual review needs evidence rather than memory; the per-PR gate is where that evidence becomes routine. For each changed screen, the run captures the branch and main at the same widths, diffs them, and attaches the capture paths to the relevant finding. The difference this makes to the author is practical: they see the regression rather than being told about it, and they can judge in seconds whether the difference is the change they intended or a side effect they did not.
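The capture step described above amounts to a small cross product, sketched here in Python. The three widths are the ones this module names; the path layout, the `capture_plan` helper, and the example screens are assumptions for illustration only, and the sketch lists the captures without taking or diffing them.

```python
from itertools import product

WIDTHS = (390, 768, 1440)   # widths named in the module
REFS = ("main", "branch")   # both sides of the comparison


def capture_plan(screens, states=("default",)):
    """List every capture the run needs: ref x screen x state x width."""
    plan = []
    for ref, screen, state, width in product(REFS, screens, states, WIDTHS):
        slug = screen.strip("/").replace("/", "-") or "home"
        # Hypothetical path layout; any stable naming scheme works.
        plan.append(f"captures/{ref}/{slug}/{state}-{width}.png")
    return plan


plan = capture_plan(["/pricing", "/orders"], states=("default", "empty"))
print(len(plan))  # 2 refs x 2 screens x 2 states x 3 widths
print(plan[0])
```

Because main and branch share the same path scheme below the ref, pairing the two captures for a diff is a matter of swapping one path segment.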

The quality bar for the findings comment decides whether the gate gets respected or routed around. A vague comment — the new badge colour looks off-brand — generates a debate. A precise one — Badge.tsx line 24 uses a hex value not in the tokens, the closest token is the warning colour, evidence at this capture path — generates a commit. Every finding should carry severity, evidence, user impact, and a concrete fix. Spot-check the first few runs and tighten the reviewer instructions if findings drift towards generalities.
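One way to hold that bar is to make the four required fields impossible to omit. The Python sketch below is a possible shape, assuming a hypothetical `Finding` record and a plain-text comment format; the example values echo the Badge.tsx case above, and the capture path is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str      # "block", "warn", or "note"
    location: str      # file and line
    evidence: str      # capture or diff path
    user_impact: str
    fix: str


def render_comment(findings):
    """Render findings as a plain-text PR comment, most severe first."""
    order = {"block": 0, "warn": 1, "note": 2}
    lines = []
    for f in sorted(findings, key=lambda item: order[item.severity]):
        lines.append(f"[{f.severity.upper()}] {f.location}")
        lines.append(f"  Evidence: {f.evidence}")
        lines.append(f"  User impact: {f.user_impact}")
        lines.append(f"  Fix: {f.fix}")
    return "\n".join(lines)


example = Finding(
    severity="warn",
    location="Badge.tsx:24",
    evidence="captures/branch/orders/default-390.png",  # hypothetical path
    user_impact="Badge colour drifts from the warning palette",
    fix="Replace the hex value with the warning colour token",
)
print(render_comment([example]))
```

Since every field is required by the constructor, a reviewer that cannot state the user impact or the fix fails loudly instead of posting a vague comment.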

There is also a quiet cultural benefit worth naming: the screenshot evidence gives engineers a way to see what the designer would have seen. Over a few weeks of reading these comments, authors start catching the issues before the gate does, which is the real win — the gate is a teaching mechanism as much as a catching mechanism.

Narration for this slide

Every finding in the review comes with evidence attached. The run captures the changed screens on the branch and on main, at the same widths, including the states the routes map lists — error, empty, loading, disabled — and attaches the diffs to the findings. That changes the conversation. A comment that says the badge colour looks off-brand starts a debate. A comment that says this file, this line, this hex value, here is the token it should be, and here is the capture — that starts a commit. The standard to hold is simple: the author should be able to act on the comment without opening the design tool, the staging site, or a meeting.

Slide 6 of 13 · 16:9

The PR design review pipeline

Seven stations from an opened pull request to a merge, with one human gate and a feedback path into the harness.

Figure: Pipeline diagram of design review on every pull request. A pull request that touches the interface is scoped to its diff using a routes map, then passes through agent-run automated checks for tokens, components, anti-patterns, and accessibility, and an agent capture-and-critique step that screenshots the branch and main across widths and states. The findings merge into a posted review summary with severity levels, evidence, and a fix per finding. A designer decision gate reads the summary, distinguishes intentional deviation from regression, and decides to block, request fixes, or approve. The author fixes the branch and the same command re-verifies before merge. A dashed feedback line returns recurring findings from the decision gate into the harness rules and reviewer agents.

Scoping, checks, capture, and the posted summary are agent-run; the decision to block, fix, or approve is human, and the author closes the loop on the branch. The dashed line is the long-term payoff: recurring findings become rules in the harness and reviewer agents, so the same issue stops appearing.

Humans appear in exactly two places: the decision gate, and the dashed line that turns recurring findings into rules.

Slide notes

Walk the diagram left to right along the top, then right to left along the bottom. The pull request opens and the previews build — that is the author's only setup cost. The agent scopes the run by mapping the diff to screens and states using a routes map kept in the repository; when the map misses a component, the gate reports the gap rather than silently skipping it, which is how the map improves over time. The automated checks and the capture-and-critique pass then run in parallel — same reviewers, same widths, every branch.
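The scoping step can be pictured as a lookup with an explicit miss list. This Python sketch assumes a hypothetical routes map keyed by component file; the file names and routes are invented, and a real map would live in the repository in whatever format the team chooses.

```python
# Hypothetical routes map: component file -> screens that render it.
ROUTES_MAP = {
    "src/components/Badge.tsx": ["/orders", "/orders/:id"],
    "src/components/PricingTable.tsx": ["/pricing"],
}


def scope_run(changed_files):
    """Split a diff into screens to review and files the map cannot place."""
    screens, unmapped = set(), []
    for path in changed_files:
        if path in ROUTES_MAP:
            screens.update(ROUTES_MAP[path])
        else:
            # Report the gap instead of silently skipping the file.
            unmapped.append(path)
    return sorted(screens), unmapped


screens, gaps = scope_run([
    "src/components/Badge.tsx",
    "src/components/NewBanner.tsx",
])
print("review:", screens)
print("map gaps:", gaps)
```

Returning the unmapped files alongside the screens is the design choice that matters: the gap becomes a finding someone can act on rather than a silent blind spot.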

The bottom row is where judgment lives. The merged summary is posted on the pull request with severity, evidence, and a fix per finding. The designer decision gate is deliberately narrow: most pull requests never reach it, because nothing block-level was found and the author simply fixes the warn-level items. When it is reached, the questions are the ones only a person can answer — is this deviation intentional, does this block the merge, is the severity right. Then the author fixes the branch and the same command re-verifies before merge.

Point at the dashed line last, because it is the part teams forget. A finding that appears on every third pull request is not a review problem; it is a missing rule. Encoding it — in the instruction file, the reviewer agent, or the component itself — is what makes the gate get quieter over time instead of noisier. A gate whose finding count never drops is a gate nobody is learning from.

Narration for this slide

Here is the whole pipeline in one picture. A pull request opens and the previews build. The agent scopes the run to the diff using a routes map, then runs the automated checks — tokens, components, anti-patterns, accessibility — and captures the changed screens on both branches as evidence. Everything merges into a review summary posted on the pull request: severity, evidence, and a fix per finding. The designer decision gate sits after that, and most pull requests never need it. When they do, the questions are human ones — intentional deviation or regression, block or approve. The author fixes the branch, the same command re-verifies, and the change merges. And the dashed line matters most: recurring findings become rules in the harness, so the gate gets quieter over time, not louder.

Slide 7 of 13 · 16:9

Severity levels: block, warn, note

Three levels are enough. More than three and the team stops agreeing on what they mean.

| Level | What qualifies | Who acts |
| --- | --- | --- |
| Block | Broken accessibility on an interactive element, a regression on a primary flow, brand colour or core spacing violated on a primary surface | Author must fix before merge; designer confirms or waives with a stated reason |
| Warn | Token violations off the primary path, layout shifts at one width, missing states the diff introduced | Author fixes on the branch or files a follow-up with an owner |
| Note | No exact token exists and a new one may be needed, map gaps, observations for the design system backlog | Logged for the system owner; never blocks the merge |

Severity is a routing mechanism, not a grade. Its only job is deciding who has to look and when.

Slide notes

Severity exists to answer one question: who has to look at this finding, and does it stop the merge. Three levels are enough to answer it, and they map cleanly onto the P0-to-P3 scales most teams already use for bugs — block covers P0 and P1, warn covers P2, note covers P3. Adding more levels feels rigorous and in practice just moves the argument from whether something blocks to which of five labels it deserves.

The definitions need to be written for your product, not copied from a slide. What counts as a primary surface, which flows are revenue-critical, which token violations are brand-level rather than cosmetic — those are decisions the design lead makes once, writes into the reviewer instructions, and adjusts as the first weeks of findings show where the lines actually fall. The waive option on block-level findings is not a loophole; it is what keeps the gate honest. There are legitimate reasons to ship with a known issue, and recording the waiver with a reason is far healthier than pressuring the gate to stay quiet.

The note level earns its place over time. Notes are where the gate reports the things that are not the author's fault — a value with no token equivalent, a screen missing from the routes map — and routing those to the design-system owner is how the gate feeds the system instead of just policing it.
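Read as a routing mechanism, the three levels reduce to a short rule. The Python sketch below is one possible encoding, assuming findings are plain dictionaries and that a waiver is recorded as a non-empty `waiver_reason`; both are illustrative choices rather than part of the workflow.

```python
# Who acts on each level, paraphrased from the severity table.
ROUTES = {
    "block": "author fixes before merge; designer confirms or waives",
    "warn": "author fixes on the branch or files a follow-up with an owner",
    "note": "logged for the design-system owner",
}


def merge_allowed(findings):
    """Return True unless an unwaived block-level finding remains.

    Each finding is a dict with "severity" and, for waived blocks,
    a non-empty "waiver_reason" recorded by a designer (assumed shape).
    """
    for finding in findings:
        if finding["severity"] != "block":
            continue  # warn and note never stop the merge on their own
        if not finding.get("waiver_reason"):
            return False
    return True


print(merge_allowed([{"severity": "warn"}, {"severity": "note"}]))
print(merge_allowed([{"severity": "block"}]))
print(merge_allowed([
    {"severity": "block", "waiver_reason": "Fix scheduled with the redesign"},
]))
```

Requiring a reason string for the waiver keeps the exception on the record, which is the property the waive option depends on.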

Narration for this slide

Every finding gets one of three severity levels, and three is the right number. Block means the merge stops: broken accessibility on an interactive element, a regression on a primary flow, a brand colour violated on a primary surface. The author fixes it, or a designer explicitly waives it with a reason on the record. Warn means fix it on the branch or file a follow-up with a named owner — token violations off the primary path, a layout shift at one width. Note never blocks anything: it is the gate telling the design-system owner that a token is missing or the routes map has a gap. Severity is not a grade. It is a routing mechanism — it decides who has to look, and when.

Slide 8 of 13 · 16:9

Escalation: when a human designer must look

The gate's job is to make most pull requests not need a designer — and to be unambiguous about the ones that do.

  • Any block-level finding the author disputes or wants waived
  • Any visual diff the author marks as intentional — someone with design authority confirms it is an improvement
  • New screens, new components, or new states with no baseline to compare against
  • Anything the gate reports as outside its map: unmapped routes, components it could not check
  • Severity disputes — the conversation the findings comment makes short

Escalation is a named person and a response-time expectation, not a channel where findings go to be ignored.

Slide notes

The escalation rules are what make "every PR" an honest claim rather than an aspirational one. Without them, one of two things happens: the designer tries to read every findings comment and burns out within a month, or nobody reads them and the gate becomes decoration. The list on the slide is deliberately short: block disputes and waivers, intentional visual changes, genuinely new surfaces, gaps the gate itself reports, and severity arguments. Everything else is between the author and the gate.

The intentional-change case is worth dwelling on, because it is the most common escalation in practice and the one the gate genuinely cannot decide. A visual diff only shows that something changed; whether the change is an improvement is a design judgment. The protocol is simple: the author marks the diff as intended, and someone with design authority looks at the capture and agrees or pushes back. That is a two-minute review with the evidence already attached, which is exactly the kind of review a designer can do twenty times a week without it consuming the week.

Make the escalation path concrete: a named person or rotation, an expected response time, and a rule for what happens when nobody responds — usually that warn-level changes proceed and block-level changes wait. An escalation path that is just a channel where comments accumulate is how the gate quietly dies.
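Written down, the escalation rules and the no-response fallback fit in a few lines. This Python sketch assumes hypothetical flag names set by the gate or the author; the fallback follows the usual rule stated above, where warn-level changes proceed and block-level changes wait.

```python
# Hypothetical flag names mapped to the escalation cases on the slide.
ESCALATION_RULES = {
    "block_disputed": "block-level finding disputed or waiver requested",
    "diff_marked_intentional": "visual diff marked as intentional",
    "no_baseline": "new screen, component, or state with no baseline",
    "map_gap": "route or component outside the gate's map",
    "severity_disputed": "severity disputed",
}


def escalation_reasons(flags):
    """Return the reasons a designer must look, in rule order."""
    return [reason for key, reason in ESCALATION_RULES.items() if flags.get(key)]


def on_no_response(highest_severity):
    """Fallback when the response window passes with no designer reply."""
    return "wait" if highest_severity == "block" else "proceed"


print(escalation_reasons({"diff_marked_intentional": True, "map_gap": True}))
print(on_no_response("warn"))
print(on_no_response("block"))
```

An empty list from `escalation_reasons` is the common case: the pull request stays between the author and the gate.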

Narration for this slide

The point of the gate is that most pull requests do not need a designer at all — the author reads the findings, fixes them, and merges. So the escalation rules need to be unambiguous about the exceptions. A designer must look when a block-level finding is disputed or needs waiving. When the author says a visual change is intentional — because the gate can show the change but cannot judge whether it is an improvement. When the branch adds genuinely new screens or components with no baseline. When the gate itself reports gaps it could not check. And when severity is contested. Make it a named person with a response-time expectation, not a channel where findings go to be ignored. That is the difference between a working gate and a decorative one.

Slide 9 of 13 · 16:9

Avoiding bureaucracy: budgets for noise and false positives

The gate is a service to the team. The moment it costs more than it catches, the team will route around it — and they will be right to.

  • Set a noise budget: if more than roughly one finding in five is a false positive or a triviality, fix the reviewers before adding anything new
  • Keep the run inside twenty minutes; a slow gate becomes an end-of-day gate, then an optional one
  • Start as a soft requirement on UI-labelled PRs, not a hard CI failure on day one
  • Let authors mark intended changes; do not make people argue with a robot about a deliberate redesign
  • Every recurring finding becomes a rule in the harness or a fix in the system — the gate should get quieter, not louder
  • Review the gate itself monthly: findings raised, fix rate, false-positive rate, and what got waived

A design gate fails socially before it fails technically. Noise, latency, and pedantry are what kill it.

Slide notes

Most per-PR review efforts that die do not die because the checks were wrong; they die because the experience of being checked was annoying. The failure sequence is predictable: the gate is noisy, authors learn that most findings are ignorable, they stop reading the comments, then a real block-level finding gets ignored along with the noise, and the team concludes the gate does not work. The defences are operational, not technical, which is why this slide is a checklist rather than a concept.

The noise budget is the most important line. Track the false-positive rate from the first week, and treat exceeding the budget as a defect in the reviewers — tighten the instructions, narrow the rules, or remove the check — rather than as a tax authors should tolerate. The twenty-minute ceiling is the second one: a gate that takes an hour gets run overnight, then gets run sometimes, then stops being a gate.
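Both limits are simple enough to check mechanically in the monthly review. The Python sketch below uses the two thresholds from this slide, one finding in five and twenty minutes; the `gate_health` helper and the sample counts are assumptions for illustration.

```python
NOISE_BUDGET = 1 / 5      # roughly one finding in five
MAX_RUN_MINUTES = 20      # the run-time ceiling


def gate_health(total_findings, noisy_findings, run_minutes):
    """Flag the two operational limits: noise and latency.

    `noisy_findings` counts false positives and trivialities together.
    """
    rate = noisy_findings / total_findings if total_findings else 0.0
    return {
        "noise_rate": rate,
        "over_noise_budget": rate > NOISE_BUDGET,
        "over_time_budget": run_minutes > MAX_RUN_MINUTES,
    }


# Invented sample: 12 of 50 findings judged noisy, 14-minute runs.
print(gate_health(total_findings=50, noisy_findings=12, run_minutes=14))
```

With the sample counts the noise rate is 0.24, which is over budget, so the next step would be tightening the reviewers rather than adding checks.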

The rollout advice is to start soft. A soft requirement — UI-labelled pull requests need the findings comment before merge, but nothing physically prevents merging — builds the habit and surfaces the false positives while the stakes are low. Promote it to a hard requirement only for the block level, only once the team trusts the findings. And keep the monthly review of the gate itself: a standing gate that nobody ever re-examines becomes bureaucracy by default, even if it started well.

Narration for this slide

Now the part that decides whether this survives contact with a real team. A design gate fails socially before it fails technically — noise, latency, and pedantry kill it long before a missed regression does. So set a noise budget: if more than about one finding in five is a false positive or a triviality, fix the reviewers before you add anything new. Keep the run inside twenty minutes. Start as a soft requirement on UI-labelled pull requests, not a hard CI failure on day one. Let authors mark intentional changes instead of arguing with a robot. Turn every recurring finding into a rule, so the gate gets quieter over time. And review the gate itself monthly. It is a service to the team, not a tollbooth.

Slide 10 of 13 · 16:9

Measuring whether the practice is improving anything

If you cannot say what the gate has caught and what it has cost, you cannot defend it when someone busy wants it gone.

  • Escapes: visual and accessibility regressions found after merge — the number the gate exists to reduce
  • Catch profile: findings by severity, and the fix rate on the branch versus deferred
  • Cost: minutes per run, and designer minutes spent on escalations per week
  • Trend: recurring findings should fall as rules move into the harness and the system
  • Waivers: how many block-level findings shipped anyway, and why

The gate is working when escapes fall and the finding count falls too — because the rules moved upstream into the harness.

Slide notes

The honest measure of a per-PR design review is escapes: regressions of the kinds the gate checks for, found after merge by users, support, or a later audit. That is the number that was painful enough to justify the gate, and it is the number to keep reporting. Track it before the rollout if you can, even roughly — a quarter's worth of after-the-fact regressions is a baseline that makes the later comparison meaningful rather than anecdotal.

The supporting numbers guard against the two ways the headline number can mislead. The catch profile and fix rate show whether findings are actually being acted on or just generated. The cost numbers — run minutes, designer minutes on escalations — are what you weigh the catches against, and they are also the early warning for the social failure on the previous slide. Waivers are worth counting because a rising waiver rate means either the severity definitions are miscalibrated or the team is under delivery pressure that no gate will fix.

The trend line is the subtle one: a healthy gate's finding count should fall over months, because recurring findings become harness rules, fixed components, and better defaults — the issues stop being made rather than stop being caught. A gate that catches the same number of the same findings forever is technically working and organisationally failing. Be careful not to turn any of these numbers into individual performance metrics; the moment finding counts attach to people, the incentive becomes avoiding the gate rather than improving the work.

Narration for this slide

How do you know the gate is worth its cost? The headline number is escapes — regressions of the kinds the gate checks for that still reach users after merge. That is what the gate exists to reduce, so measure it before and after. Around it, track the catch profile and fix rate, the cost in run minutes and designer minutes, the waiver count, and the trend in recurring findings. That last one is the subtle signal: a healthy gate gets quieter over time, because recurring findings become harness rules and fixed components. If the gate catches the same issues at the same rate forever, it is catching things, but nobody is learning. And keep these as team numbers — never individual scorecards.

Slide 11 of 13 · 16:9

Worked example: six weeks of PR reviews on one team

A B2B product team — four engineers, one designer — adopted the gate after a quarter in which three visual regressions reached customers.

| Measure | Six-week result |
| --- | --- |
| Pull requests gated | 47, on a routes map covering 31 screens |
| Run time | About 14 minutes per run, posted as a PR comment |
| Block-level findings | 9; largest: a refactor silently dropped the focus ring from every dialog close button |
| Warn-level findings | 41, mostly token violations; most fixed on the branch |
| Escapes after merge | Roughly one per week before; one in the entire six weeks after |
| The one escape | A screen missing from the routes map (the gap the gate reports rather than hides) |

The single escape was on a screen the gate had already said it could not see. The failure mode was known, reported, and fixable.

Slide notes

This case is drawn from the school's published Design QA on Every PR workflow, and it is worth presenting precisely because it is unglamorous. The team's starting condition is the common one: one designer, four engineers, UI pull requests merging weekly or faster, and a quarter in which three visual regressions reached customers — including a pricing table that lost its column alignment on tablets. The designer could not review every branch; the engineers did not know what to look for. The gate was adopted as a soft requirement: any pull request labelled ui needs the findings comment before merge.

Walk the numbers as a profile rather than a benchmark. Forty-seven gated pull requests in six weeks, about fourteen minutes per run. Nine block-level findings, the largest being a refactor that silently dropped the focus ring from every dialog close button — exactly the class of regression that passes engineering review because the code still works. Forty-one warn-level findings, mostly token violations, mostly fixed on the branch within the day. Escapes dropped from roughly one a week to one in the whole period.
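
To make "profile rather than benchmark" concrete, the slide's totals can be turned into per-PR and per-week rates. This is a minimal sketch using only the figures quoted above; the constant names are illustrative, the "before" escape rate is the slide's rough estimate, and the total run time assumes one run per pull request, which understates the cost if branches were re-verified.

```python
# Figures taken from the worked-example slide. The constant names are
# illustrative, and ESCAPES_BEFORE_PER_WEEK is the slide's rough estimate.
WEEKS = 6
PULL_REQUESTS = 47
RUN_MINUTES = 14                 # "about 14 minutes per run"
BLOCK_FINDINGS = 9
WARN_FINDINGS = 41
ESCAPES_BEFORE_PER_WEEK = 1.0    # "roughly one per week"
ESCAPES_AFTER = 1


def profile() -> dict[str, float]:
    """Derive per-PR and per-week rates from the six-week totals."""
    return {
        "prs_per_week": PULL_REQUESTS / WEEKS,
        "block_findings_per_pr": BLOCK_FINDINGS / PULL_REQUESTS,
        "warn_findings_per_pr": WARN_FINDINGS / PULL_REQUESTS,
        # Assumes exactly one run per pull request (no re-verification runs).
        "total_run_hours": PULL_REQUESTS * RUN_MINUTES / 60,
        # What six weeks would have produced at the old escape rate.
        "expected_escapes_at_old_rate": ESCAPES_BEFORE_PER_WEEK * WEEKS,
        "escapes_after": float(ESCAPES_AFTER),
    }


if __name__ == "__main__":
    for name, value in profile().items():
        print(f"{name}: {value:.2f}")
```

Read this way, the gate cost about eleven hours of machine time over the period and produced roughly one block-level finding for every five pull requests.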

The single escape is the most instructive row. It happened on a screen missing from the routes map — and the workflow's design is that map gaps are reported in the findings rather than silently skipped, so the team could see the blind spot and close it. That is the realistic promise of the gate: not zero escapes, but escapes that arrive with an explanation and a one-line fix to the map. These figures are from one team on one product; treat them as a shape to expect, not a guarantee to quote.
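
The "report the gap, do not skip it" behaviour can be sketched in a few lines. This is an illustration only, not the published workflow's implementation: the file paths, the shape of the routes map, and the names `ROUTES_MAP`, `Scope`, and `scope_run` are all hypothetical, and the changed-file list is assumed to be already filtered to interface files.

```python
from dataclasses import dataclass, field

# Hypothetical routes map: component file -> routes that render it.
ROUTES_MAP: dict[str, list[str]] = {
    "src/components/PricingTable.tsx": ["/pricing"],
    "src/components/Dialog.tsx": ["/settings", "/billing"],
}


@dataclass
class Scope:
    routes: list[str] = field(default_factory=list)  # screens the run will capture
    gaps: list[str] = field(default_factory=list)    # changed files the map cannot place


def scope_run(changed_files: list[str], routes_map: dict[str, list[str]]) -> Scope:
    """Map changed files to routes, recording unmapped files as gaps."""
    scope = Scope()
    for path in changed_files:
        routes = routes_map.get(path)
        if routes is None:
            # The key design choice: an unmapped file becomes a reported
            # finding instead of being silently dropped from the run.
            scope.gaps.append(path)
            continue
        for route in routes:
            if route not in scope.routes:
                scope.routes.append(route)
    return scope


if __name__ == "__main__":
    result = scope_run(
        ["src/components/Dialog.tsx", "src/components/InvoiceList.tsx"],
        ROUTES_MAP,
    )
    print("capture:", result.routes)
    print("gaps to report as notes:", result.gaps)
```

Each entry in `gaps` would surface in the findings comment, and closing it is the one-line fix to the map described above.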

Narration for this slide

Let's look at six weeks of this running on a real team: four engineers, one designer, and a previous quarter in which three visual regressions reached customers. They saved the review as a team command, built a routes map covering thirty-one screens, and made the findings comment a soft requirement on UI pull requests. Forty-seven pull requests went through it, about fourteen minutes each. There were nine block-level findings, the biggest a refactor that silently dropped the focus ring from every dialog close button, and forty-one warn-level findings, mostly token violations and mostly fixed the same day. Escapes went from about one a week to one in six weeks. And that one escape was on a screen missing from the routes map, a gap the gate had already reported. Not magic. Just a quieter, better-evidenced default.

Slide 12 of 13 · 16:9

Exercise: draft the PR review rules for your codebase

One page, written for your repository as it is today — not as you wish it were. This page becomes the first version of your gate.

  • List the checks: which token rules, component rules, and anti-patterns the gate enforces, and the three widths it captures
  • Write the severity definitions: what blocks a merge in your product, what warns, what is only noted
  • Name the escalation: who looks at disputes and intentional changes, and within what response time
  • Set the budgets: maximum run time, acceptable false-positive rate, and what triggers a review of the gate itself
  • Sketch the routes map for one product area: routes, the components that render them, and the states worth capturing

Have the rules reviewed like any design artifact — by the engineers who will live with the gate, before the gate exists.

Slide notes

This exercise produces the document that makes the rollout possible: the gate's rules, written before the gate runs. Most of the difficulty is not technical. Deciding what blocks a merge in your product, who answers escalations, and what false-positive rate you will tolerate are team agreements, and writing them down on one page is what turns this module from an idea into a proposal someone can say yes to.
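
Budgets are easier to agree on when they are written as numbers that can be checked. The sketch below is one way to express them, assuming the thresholds this module suggests (runs inside twenty minutes, no more than about one finding in five being noise); the names `Budgets` and `budget_breaches` are hypothetical, and the counts would come from whatever record the team keeps of its runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budgets:
    # Defaults follow the module's suggestions; adjust to the team's agreement.
    max_run_minutes: float = 20.0   # keep the run inside twenty minutes
    max_noise_share: float = 0.20   # about one finding in five


def budget_breaches(
    run_minutes: float,
    findings: int,
    noisy_findings: int,
    budgets: Budgets = Budgets(),
) -> list[str]:
    """Return the budgets that were exceeded; an empty list means none."""
    breaches: list[str] = []
    if run_minutes > budgets.max_run_minutes:
        breaches.append(
            f"run time {run_minutes:.0f} min exceeds {budgets.max_run_minutes:.0f} min"
        )
    # Guard against division by zero when a period produced no findings.
    if findings and noisy_findings / findings > budgets.max_noise_share:
        share = noisy_findings / findings
        breaches.append(
            f"noise share {share:.0%} exceeds {budgets.max_noise_share:.0%}"
        )
    return breaches


if __name__ == "__main__":
    print(budget_breaches(run_minutes=14, findings=50, noisy_findings=5))
    print(budget_breaches(run_minutes=25, findings=50, noisy_findings=20))
```

A non-empty result is a natural trigger for the review of the gate itself that the budgets bullet asks participants to define.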

Steer participants towards specificity. A good severity definition names the surfaces — checkout blocks, the marketing footer warns. A good escalation line names a person or a rotation, not the design team. The routes-map sketch should cover one product area, not the whole product; the map grows by being used, and the gate reporting its own gaps is part of the design. The widths and check list can be lifted directly from this module and adjusted later — those are the easy 20 per cent.
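
A severity definition that names surfaces can be written down almost literally. The sketch below is deliberately simplified: it routes by surface alone, whereas a real definition would also consider the kind of finding. The route prefixes, the default severity for unlisted surfaces, and the names `SURFACE_SEVERITY` and `severity_for` are assumptions for illustration.

```python
from enum import Enum


class Severity(Enum):
    BLOCK = "block"
    WARN = "warn"
    NOTE = "note"


# Hypothetical route prefixes, echoing the example in the notes:
# checkout blocks, the marketing footer warns.
SURFACE_SEVERITY: list[tuple[str, Severity]] = [
    ("/checkout", Severity.BLOCK),
    ("/marketing/footer", Severity.WARN),
]

# Assumption: surfaces the page does not name default to warn.
DEFAULT_SEVERITY = Severity.WARN


def severity_for(route: str) -> Severity:
    """Return the severity for a finding on the given route."""
    for prefix, severity in SURFACE_SEVERITY:
        # Match the surface itself or anything nested under it,
        # but not a sibling such as "/checkout-help".
        if route == prefix or route.startswith(prefix + "/"):
            return severity
    return DEFAULT_SEVERITY


if __name__ == "__main__":
    for route in ("/checkout/payment", "/marketing/footer", "/settings"):
        print(route, "->", severity_for(route).value)
```

Writing the table first and the prose second tends to expose vague definitions quickly: a surface that cannot be given a prefix is usually one the team has not yet agreed on.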

The closing line below the bullets is the one to insist on. The people who will live with this gate daily are the engineers whose pull requests it comments on, and rules they helped write get respected in a way that rules announced to them do not. If participants do one thing after this course, having that one-page review with their engineering counterparts is the highest-leverage option available.

Narration for this slide

Time to draft your own gate. One page, for your real repository. First, the checks: which token rules, component rules, and anti-patterns the gate enforces, and the widths it captures. Second, severity: what blocks a merge in your product, what warns, what is just noted — name the actual surfaces. Third, escalation: who looks at disputes and intentional changes, and how quickly. Fourth, the budgets: maximum run time, the false-positive rate you will tolerate, and what triggers a review of the gate itself. Finally, sketch the routes map for one product area. Then take the page to the engineers who will live with it, and have it reviewed like any other design artifact — before the gate exists, not after.

Slide 13 of 13 · 16:9

Summary, and where to go next

  • Design quality degrades one merge at a time; the pull request is where review is cheapest and context is freshest
  • The gate checks the diff — visual diffs, tokens, accessibility, states — and deliberately ignores everything else
  • Three severities route the work: block stops the merge, warn gets fixed or owned, note feeds the system
  • Humans appear at the decision gate and in the rules; everything mechanical runs on every branch automatically
  • The gate is healthy when escapes fall, recurring findings fall, and the team still trusts the comments

This closes the course. The natural next steps in the curriculum are Design Systems for Agents, for the rules the gate enforces, and Orchestrating Design Agent Teams, for running reviewers like this at portfolio scale.

Slide notes

Close the module by closing the course, because this module is the destination the earlier ones were building towards. Module 1 gave critique a structure — named dimensions, evidence, the split between actionable findings and judgment calls. Modules 2 and 3 made evaluation and visual evidence affordable at scale. Module 4 made accessibility and content review routine. This module wires the per-change subset of all of that into the place changes actually happen, with humans reviewing what matters instead of everything or nothing.

Restate the honest limits one last time. A clean run means the branch did not regress the things the gate checks on the screens the routes map knows about. It does not mean the design is good, the flow makes sense, or the change should ship. Automated accessibility checks cover well under half of the relevant guidance, visual diffs are noisy around animation and live data, and taste never came from the gate in the first place. The gate makes human review cheap and consistent; it is not a substitute for it.

For participants asking what to study next: Design Systems for Agents covers the token and component infrastructure this gate enforces — the better the system, the quieter the gate. Orchestrating Design Agent Teams covers running multiple reviewer agents and workflows like this across a portfolio. And the school's published Design QA on Every PR workflow is the hands-on companion to this module, with the prompts, reviewer definitions, and the saved-command setup ready to adapt.

Narration for this slide

Let's close the module, and with it the course. Design quality degrades one merge at a time, so the review moved to the merge: a gate on every pull request that checks the diff — visual changes, tokens, accessibility, states — and deliberately ignores the rest. Three severities route the work, humans decide only what needs deciding, and recurring findings become rules so the gate gets quieter over time. Remember what a clean run means: the branch did not regress what the gate checks. It does not mean the design is good — that judgment stayed with you through all five modules. From here, Design Systems for Agents deepens the rules this gate enforces, and Orchestrating Design Agent Teams scales the reviewers across a portfolio. Thanks for taking the course.

Module transcript
Module 5, narrated slide by slide

Slide 1: Design Review on Every PR

Welcome to the final module of the course. Everything so far — critique dimensions, heuristic evaluations, regression evidence, accessibility and content review — has been a practice you run when you decide to. This module is about the end state: a design review that runs automatically on every pull request that touches the interface. Not a full review of the product on every merge — a narrow, mechanical review of exactly what the branch changed, with screenshot evidence, severity levels, and a clear rule for when a human designer must look. By the end you will be able to draft the review rules for your own codebase. Let's start with why the pull request is the right place.

Slide 2: Review every change, not every quarter

Here is the problem this module solves. Design quality almost never fails in one big decision — it degrades one merge at a time. A spacing value gets hardcoded because the deadline was tight. A button picks up a colour that is not in the tokens. A new empty state ships without anyone looking at it on a phone. Each diff looks fine to the engineer reviewing it, because engineers review for logic, not hierarchy and token discipline. Catch the issue while the branch is open and it is a thirty-second conversation. Catch it after release and it is a ticket, a triage meeting, and a regression risk. The pull request is where the fix is cheapest. That is where the review belongs.

Slide 3: What the gate checks, and what it deliberately ignores

So what does a per-PR design review actually check? Four things, all scoped to the screens the branch changed. A visual diff: branch versus main, captured at three widths. Token usage: hardcoded colours, spacing, radii, and font sizes in the changed components. Accessibility: automated checks plus a keyboard pass on the changed routes. And states: any empty, loading, error, or disabled state the diff introduces or alters. Just as important is what it deliberately ignores — copy quality, product logic, performance, and every screen the branch does not touch. That is not laziness. A gate that checks everything blocks everything, and a blocked team routes around the gate. The exclusions are what keep it alive.

Slide 4: Token, component, and anti-pattern checks as gates

The mechanical checks are where the gate earns its consistency. A token reviewer reads the changed component files and flags raw colours, spacing, radii, and font sizes that have a token equivalent — citing the file, the line, and the token that should replace it. Component rules catch one-off variants and bypassed primitives. Anti-pattern rules encode the things your team has explicitly named and banned. Here is the part that makes it a team practice rather than a personal trick: these reviewers are files in the repository, versioned next to the code they review. Every branch gets the same reviewers with the same instructions, and changing the standard is itself a reviewed change.

Slide 5: Screenshot evidence attached to the PR automatically

Every finding in the review comes with evidence attached. The run captures the changed screens on the branch and on main, at the same widths, including the states the routes map lists — error, empty, loading, disabled — and attaches the diffs to the findings. That changes the conversation. A comment that says the badge colour looks off-brand starts a debate. A comment that says this file, this line, this hex value, here is the token it should be, and here is the capture — that starts a commit. The standard to hold is simple: the author should be able to act on the comment without opening the design tool, the staging site, or a meeting.

Slide 6: The PR design review pipeline

Here is the whole pipeline in one picture. A pull request opens and the previews build. The agent scopes the run to the diff using a routes map, then runs the automated checks — tokens, components, anti-patterns, accessibility — and captures the changed screens on both branches as evidence. Everything merges into a review summary posted on the pull request: severity, evidence, and a fix per finding. The designer decision gate sits after that, and most pull requests never need it. When they do, the questions are human ones — intentional deviation or regression, block or approve. The author fixes the branch, the same command re-verifies, and the change merges. And the dashed line matters most: recurring findings become rules in the harness, so the gate gets quieter over time, not louder.

Slide 7: Severity levels: block, warn, note

Every finding gets one of three severity levels, and three is the right number. Block means the merge stops: broken accessibility on an interactive element, a regression on a primary flow, a brand colour violated on a primary surface. The author fixes it, or a designer explicitly waives it with a reason on the record. Warn means fix it on the branch or file a follow-up with a named owner — token violations off the primary path, a layout shift at one width. Note never blocks anything: it is the gate telling the design-system owner that a token is missing or the routes map has a gap. Severity is not a grade. It is a routing mechanism — it decides who has to look, and when.

Slide 8: Escalation: when a human designer must look

The point of the gate is that most pull requests do not need a designer at all — the author reads the findings, fixes them, and merges. So the escalation rules need to be unambiguous about the exceptions. A designer must look when a block-level finding is disputed or needs waiving. When the author says a visual change is intentional — because the gate can show the change but cannot judge whether it is an improvement. When the branch adds genuinely new screens or components with no baseline. When the gate itself reports gaps it could not check. And when severity is contested. Make it a named person with a response-time expectation, not a channel where findings go to be ignored. That is the difference between a working gate and a decorative one.

Slide 9: Avoiding bureaucracy: budgets for noise and false positives

Now the part that decides whether this survives contact with a real team. A design gate fails socially before it fails technically — noise, latency, and pedantry kill it long before a missed regression does. So set a noise budget: if more than about one finding in five is a false positive or a triviality, fix the reviewers before you add anything new. Keep the run inside twenty minutes. Start as a soft requirement on UI-labelled pull requests, not a hard CI failure on day one. Let authors mark intentional changes instead of arguing with a robot. Turn every recurring finding into a rule, so the gate gets quieter over time. And review the gate itself monthly. It is a service to the team, not a tollbooth.

Slide 10: Measuring whether the practice is improving anything

How do you know the gate is worth its cost? The headline number is escapes — regressions of the kinds the gate checks for that still reach users after merge. That is what the gate exists to reduce, so measure it before and after. Around it, track the catch profile and fix rate, the cost in run minutes and designer minutes, the waiver count, and the trend in recurring findings. That last one is the subtle signal: a healthy gate gets quieter over time, because recurring findings become harness rules and fixed components. If the gate catches the same issues at the same rate forever, it is catching things, but nobody is learning. And keep these as team numbers — never individual scorecards.

Slide 11: Worked example: six weeks of PR reviews on one team

Let's look at six weeks of this running on a real team: four engineers, one designer, and a previous quarter in which three visual regressions reached customers. They saved the review as a team command, built a routes map covering thirty-one screens, and made the findings comment a soft requirement on UI pull requests. Forty-seven pull requests went through it, about fourteen minutes each. There were nine block-level findings, the biggest a refactor that silently dropped the focus ring from every dialog close button, and forty-one warn-level findings, mostly token violations and mostly fixed the same day. Escapes went from about one a week to one in six weeks. And that one escape was on a screen missing from the routes map, a gap the gate had already reported. Not magic. Just a quieter, better-evidenced default.

Slide 12: Exercise: draft the PR review rules for your codebase

Time to draft your own gate. One page, for your real repository. First, the checks: which token rules, component rules, and anti-patterns the gate enforces, and the widths it captures. Second, severity: what blocks a merge in your product, what warns, what is just noted — name the actual surfaces. Third, escalation: who looks at disputes and intentional changes, and how quickly. Fourth, the budgets: maximum run time, the false-positive rate you will tolerate, and what triggers a review of the gate itself. Finally, sketch the routes map for one product area. Then take the page to the engineers who will live with it, and have it reviewed like any other design artifact — before the gate exists, not after.

Slide 13: Summary, and where to go next

Let's close the module, and with it the course. Design quality degrades one merge at a time, so the review moved to the merge: a gate on every pull request that checks the diff — visual changes, tokens, accessibility, states — and deliberately ignores the rest. Three severities route the work, humans decide only what needs deciding, and recurring findings become rules so the gate gets quieter over time. Remember what a clean run means: the branch did not regress what the gate checks. It does not mean the design is good — that judgment stayed with you through all five modules. From here, Design Systems for Agents deepens the rules this gate enforces, and Orchestrating Design Agent Teams scales the reviewers across a portfolio. Thanks for taking the course.