Agentic Design School
Module 1 of 5
40–50 minutes

Design Review and Critique with Agents

Critique Loops with Agents

The structure of a useful critique loop: named dimensions instead of vibes, the agent as a tireless first reviewer, and the difference between feedback an agent can act on and feedback that needs a human decision.

Duration: 40–50 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Define critique dimensions — hierarchy, density, tone, states, accessibility — for your product.
  • Run an agent critique pass that produces findings with evidence, not opinions.
  • Separate actionable feedback from judgment calls and route each correctly.

Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Critique is a loop, not a meeting

Design Review and Critique with Agents · Module 1 of 5

  • Why critique under-delivers in most teams today
  • Named dimensions: what good means, written down
  • The agent as first reviewer — and the human triage gate
  • Findings with evidence, severity, and the smallest useful fix

Critique decides design quality, and it is the part of the process most teams under-resource. This module turns it into a repeatable loop.

Slide notes

Open by naming the gap this whole course addresses: most teams say critique matters and then run it as an occasional meeting that depends on whoever has spare time and seniority that week. The output is a handful of comments on the screens that happened to be presented, with no record of what was checked or why.

This module is about the structure of a critique loop where an agent does the inspection and the designer makes the calls. The agent is not the critic in the sense of taste; it is the first reviewer in the sense of coverage — every state, every breakpoint, every screen, against criteria the team wrote down. The designer's role shifts from generating all the observations to triaging findings and owning judgment.

Set expectations on scope. This module covers the loop itself: dimensions, evidence, severity, the triage gate, and cadence. Modules 2 to 5 take the same loop into heuristic evaluations at scale, visual regression evidence, accessibility and content review, and per-PR design review. Everything later in the course assumes the vocabulary established here.

Narration for this slide

Welcome to Design Review and Critique with Agents. This first module is about the critique loop itself. In most teams, critique is a meeting — it happens when someone has time, it covers whatever got presented, and the quality of the feedback depends on who showed up. We are going to replace that with a loop: named dimensions that say what good means, an agent that inspects every state and screen against them, findings with evidence and severity, and a human triage gate where you decide what matters. The agent does the inspection. You keep the judgment. Let's start with why critique under-delivers today.

Slide 2 of 13 · 16:9

Why critique under-delivers today

The problem is not that teams do not value critique. It is that critique is expensive in exactly the resources teams are short of.

  • Time — a thorough review of one flow takes hours nobody has scheduled
  • Seniority — the sharpest reviewers are the busiest people in the room
  • Recency bias — feedback covers what was presented, not what shipped quietly
  • Coverage — edge states, error paths, and small screens rarely get looked at
  • No record — what was checked and why is not written down anywhere

Critique quality currently depends on who has spare attention that week. That is a staffing model, not a quality model.

Slide notes

Walk through the bullets as failure modes the audience will recognise. Time is the obvious one: inspecting one flow properly — every state, every breakpoint, the copy, the focus order — takes hours, and design calendars do not have those hours. Seniority is the quieter one: the people whose critique is most valuable are the leads, and leads are the most over-booked people on the team, so their review attention gets rationed to whatever is politically loudest.

Recency bias is structural. Critique sessions cover what someone chose to present, which means the work that gets reviewed is the work someone is proud of or worried about. The screens that ship through small PRs without ever appearing in a review deck never get critiqued at all. Coverage compounds this: even when a screen is reviewed, it is usually reviewed in its happy state on a designer's large monitor, not in its error state at 390 pixels.

The last bullet — no record — is what makes the rest unfixable. Without a written account of what was checked against which criteria, every review starts from zero, repeats the same observations, and cannot be improved. The loop in this module exists to make the criteria, the evidence, and the decisions explicit and reusable.

Narration for this slide

Why does critique under-deliver? Not because teams do not care — because it is expensive in exactly the resources they are short of. A proper review of one flow takes hours. The reviewers whose judgment matters most are the busiest people in the building. Critique covers whatever got presented this week, so the work that ships quietly through small changes never gets reviewed at all. Edge states and small screens rarely get looked at. And none of it is written down, so every session starts from zero. The result is that critique quality depends on who had spare attention that week. That is a staffing problem pretending to be a quality process.

Slide 3 of 13 · 16:9

Named dimensions: what good means, written down

An agent asked for general feedback returns general taste language. A useful critique names the dimensions before it starts.

  • Task clarity — can the user tell what to do next?
  • Hierarchy and scan path — does the screen reveal information in the right order?
  • Interaction states — loading, empty, error, disabled, focus, confirmation
  • Trust and risk — pricing, permissions, destructive actions, commitments
  • System consistency — tokens, components, density, copy patterns
  • Accessibility — keyboard, screen reader, low vision, small screens

Dimensions are not a universal checklist. A checkout needs trust and recovery; a dashboard needs density and scan path. Pick the ones the user job depends on.

Slide notes

This is the most important slide in the module: critique only works when what good means is written down before the inspection starts. Ask an agent to "review this design" and you get the visual lowest common denominator of its training data: "clean, modern, could be more consistent". Ask it to inspect six named dimensions against a stated user job and you get findings you can act on.

The six dimensions on the slide are a strong default set, drawn from the school's critique-loop article, but stress that they are a starting point, not a universal checklist. Different artifacts need different dimensions: a checkout flow needs trust, recovery, and price clarity; a dashboard needs density, filters, and table behaviour; a marketing page needs promise clarity, proof, and conversion path. The act of choosing dimensions for the artifact is itself design work, and it is human work.
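
One way to keep that choice explicit is to record it as data. The sketch below is illustrative only: the dimension names come from the slide and these notes, while the dictionary layout and the `dimensions_for` helper are assumptions, not part of any published harness.

```python
# Dimension names are taken from the slide and notes; the data layout and
# the helper name `dimensions_for` are illustrative assumptions.
DEFAULT_DIMENSIONS = [
    "task clarity",
    "hierarchy and scan path",
    "interaction states",
    "trust and risk",
    "system consistency",
    "accessibility",
]

# A human picks these per artifact type; the agent only applies them.
ARTIFACT_DIMENSIONS = {
    "checkout": ["trust", "recovery", "price clarity"],
    "dashboard": ["density", "filters", "table behaviour"],
    "marketing page": ["promise clarity", "proof", "conversion path"],
}


def dimensions_for(artifact_type: str) -> list[str]:
    """Return the chosen dimensions, falling back to the default set."""
    return list(ARTIFACT_DIMENSIONS.get(artifact_type, DEFAULT_DIMENSIONS))


if __name__ == "__main__":
    print(dimensions_for("checkout"))
    print(dimensions_for("settings page"))  # no entry yet, so the defaults apply
```

The fallback is a convenience, not a recommendation: an artifact type that keeps hitting the default set is a prompt to sit down and choose its dimensions deliberately.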

It is worth noting that published critique skills do the same thing with different labels. Huashu Design's expert critique, for example, scores five dimensions — philosophy coherence, visual hierarchy, execution craft, functionality, and innovation — and outputs a keep/fix/quick-wins list in about three minutes (as of mid-2026). The labels matter less than the principle: named dimensions, declared up front, applied consistently.

Narration for this slide

Here is the foundation of the whole loop: write down what good means before you ask for critique. If you ask an agent for general feedback you get general taste language — clean, modern, a bit inconsistent. If you name the dimensions, you get findings. These six are a strong default: task clarity, hierarchy, interaction states, trust and risk, system consistency, and accessibility. But they are not a universal checklist. A checkout needs trust and error recovery. A dashboard needs density and scan path. Choosing the dimensions that matter for this artifact and this user job is design work — and it stays yours. The agent applies them; it does not pick them.

Slide 4 of 13 · 16:9

The agent as first reviewer

The agent is not a judge handing down taste. It is a reviewer doing the inspection a human never has time for.

  • Thorough — every screen, every state, every breakpoint, the same criteria each time
  • Literal — checks what the dimensions say, not what it assumes you meant
  • Taste-free — it has no point of view, which is a limitation and a feature
  • Read-only — the critique pass produces findings, never edits
  • Tireless — the third pass this week is as careful as the first

The agent owns inspection, comparison, and consistency checking. The designer owns judgment. That division is what keeps critique practical instead of performative.

Slide notes

Position the agent precisely, because both over-claiming and dismissing it cause problems. The agent's strengths are coverage and consistency: it will inspect the error state at 390 pixels with the same care as the hero screen on desktop, it will apply the same six dimensions to the fortieth screen as to the first, and it will do it again next sprint without fatigue or politics. No human reviewer can offer that, and no team can staff it.

Its weaknesses are the mirror image. It is literal — if a dimension is vaguely written it will check it vaguely or invent an interpretation. It has no taste: it cannot tell you which of five competent layouts is right for this brand, and if you ask it to, it will answer confidently anyway. That is why the critique pass must be read-only and findings-only: the moment the agent starts fixing what it found, you have lost the chance to decide which findings were actually right.

The closing framing from the article is worth restating in your own words: the agent should not behave like a judge handing down taste; it should behave like a reviewer helping the designer make the next decision. Inspection scales. Judgment does not, and should not.

Narration for this slide

So what is the agent in this loop? A first reviewer, not a judge. Its strengths are exactly the things human review is short of: it is thorough — every screen, every state, every breakpoint, the same criteria every time. It is literal — it checks what the dimensions say. It is tireless — the third review this week is as careful as the first. And it is taste-free, which cuts both ways: it will never tell you which competent option is right for your brand, and you should not ask it to. That is why the critique pass is read-only — findings only, no edits. The agent owns inspection. You own judgment.

Slide 5 of 13 · 16:9

The critique loop

Seven steps. The agent runs the inspection and revision; the designer holds the triage gate and the ship decision.

Loop diagram of an agent-assisted critique loop. An artifact and its evidence packet feed an agent critique pass run read-only against named dimensions and the brief's review criteria. The critique produces findings, each with severity, evidence, and a smallest recommended fix. A human triage gate accepts, rejects, or marks findings as judgment calls. The agent runs a revision pass on approved findings only, then a re-critique with fresh evidence, and a human makes the final ship decision. A dashed feedback line shows recurring findings being encoded as named dimensions in the harness.
The critique pass and revision pass are agent-run; the triage gate and the ship decision are human. Nothing is fixed before the triage gate, and recurring findings feed back into the harness as named dimensions.

The loop works because the agent critiques against evidence and constraints, not against a generic idea of good design.

Slide notes

Walk the diagram in order and name the owner of each step. Step one is the artifact plus its evidence packet: the screen or flow itself, screenshots covering states and breakpoints, the route and component files, the copy, and the brief with its review criteria. The quality of the critique tracks the quality of this packet — an agent that can only see one happy-path screenshot can only critique one happy-path screenshot. Step two is the critique pass: agent-run, read-only, against the named dimensions. Step three is the findings list, one record per issue, each with severity, evidence, user impact, and the smallest recommended fix.

Step four is the human triage gate, and it is the hinge of the whole loop. The designer accepts, rejects, or defers each finding, and marks the ones that are genuinely judgment calls. Nothing is fixed before this gate. Step five is the revision pass, scoped strictly to the approved list — no new direction, no opportunistic improvements. Step six is the re-critique: same dimensions, fresh screenshots, confirming fixes and checking for regressions. Step seven is the ship decision, which is human, full stop — a clean findings list is evidence for the decision, not the decision itself.
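
The ordering rule can be written down as a small guard. This is a minimal sketch under stated assumptions: the step names and the agent and human owners follow the diagram, but the `Step` type, the `can_run` helper, and the owner label for step one (which the diagram does not assign) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    name: str
    owner: str


# Owners for steps 2 to 7 follow the diagram; the owner of step 1 is not
# stated in the module, so "team" here is an assumption.
LOOP = [
    Step(1, "artifact + evidence packet", "team"),
    Step(2, "critique pass (read-only)", "agent"),
    Step(3, "findings list", "agent"),
    Step(4, "triage gate", "human"),
    Step(5, "revision pass (approved findings only)", "agent"),
    Step(6, "re-critique with fresh evidence", "agent"),
    Step(7, "ship decision", "human"),
]


def can_run(step_number: int, completed: set[int]) -> bool:
    """A step may run only when every earlier step has completed."""
    return all(n in completed for n in range(1, step_number))


if __name__ == "__main__":
    # The revision pass is blocked until the human triage gate has happened.
    print(can_run(5, {1, 2, 3}))     # False
    print(can_run(5, {1, 2, 3, 4}))  # True
```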

Point at the dashed feedback line. When the same finding keeps recurring across artifacts — the same spacing drift, the same missing empty state — that is a signal it belongs in the harness as a named dimension or an executable check, so future critique passes catch it automatically. That feedback line is how the loop gets cheaper over time.
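
Spotting recurrence does not need anything elaborate. The sketch below counts how many distinct artifacts each issue label appears on; the record shape, the `recurring_issues` name, and the threshold of three artifacts are all assumptions chosen for illustration, and the sample records are invented.

```python
from collections import Counter


def recurring_issues(findings: list[dict], min_artifacts: int = 3) -> list[str]:
    """Return issue labels seen on at least `min_artifacts` distinct artifacts."""
    # Deduplicate per artifact so one noisy screen cannot trigger the rule alone.
    seen = {(f["artifact"], f["issue"]) for f in findings}
    counts = Counter(issue for _, issue in seen)
    return sorted(issue for issue, n in counts.items() if n >= min_artifacts)


# Hypothetical findings log, for illustration only.
log = [
    {"artifact": "checkout", "issue": "missing empty state"},
    {"artifact": "dashboard", "issue": "missing empty state"},
    {"artifact": "settings", "issue": "missing empty state"},
    {"artifact": "checkout", "issue": "spacing drift"},
    {"artifact": "checkout", "issue": "spacing drift"},
    {"artifact": "dashboard", "issue": "spacing drift"},
]

if __name__ == "__main__":
    # Candidates to encode as a named dimension or an executable check.
    print(recurring_issues(log))
```

Whether a candidate becomes a dimension or a check is still a human call; the count only says where to look.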

Narration for this slide

Here is the loop the rest of this course builds on. You start with the artifact and its evidence packet — screenshots of states and breakpoints, the actual files, the copy, and the brief. The agent runs a read-only critique pass against your named dimensions and returns findings: severity, evidence, user impact, smallest fix. Then comes the part that stays human — the triage gate. You accept, reject, or defer each finding before anything gets fixed. The agent revises only what you approved, re-critiques with fresh screenshots, and you make the ship call. And notice the dashed line: findings that keep recurring get written into the harness, so the loop gets sharper every time you run it.

Slide 6 of 13 · 16:9

Findings with evidence, not opinions

A good finding is specific enough to act on and restrained enough to preserve design ownership.

  • Evidence: the file, screenshot, or visible detail — and the criterion violated
  • User impact: why it matters, in the user's terms
  • Smallest recommended fix — never a full redesign
  • Owner: design, engineering, product, or legal review

Sample critique finding

Important — Payment context is hidden on mobile

Evidence:
At 390px width, the payment form appears before the plan
summary. Desktop keeps price and plan visible beside the form.

User impact:
A buyer may enter card details without confirming the plan,
price, renewal timing, or included seats.

Recommended fix:
Compact sticky plan summary above the payment form on mobile.

Owner:
Design approval for hierarchy; engineering after approval.

If a finding has no evidence you can point at, it is an opinion. Opinions are allowed — but they go to the judgment pile, not the fix queue.

Slide notes

Spend time on the anatomy of the example, because the format is what makes the loop work at scale. The severity is stated first, so triage can happen by scanning. The evidence names a concrete, checkable fact: at 390 pixels the form precedes the plan summary, and the desktop layout does not. Anyone can verify it in thirty seconds, which means the triage gate is fast. The user impact translates the observation into consequence — a buyer pays without seeing the price — which is what lets a designer or product owner weigh it against everything else competing for the sprint.

The recommended fix is deliberately the smallest useful change. Agents are prone to turning one observation into a redesign proposal; the format constrains them to a single, scoped suggestion the designer can accept, modify, or reject. The owner field matters in team settings: an agent can recommend that a payment summary move higher on mobile, but whether that fits product, legal, and revenue constraints is not its call.

The contrast to draw is with the feedback most critique sessions actually produce: "the payment step feels a bit cramped, can we make the summary more prominent?" That comment might be pointing at the same problem, but it cannot be triaged, verified, or assigned. Evidence is what separates a finding from a vibe.
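
For teams that want findings as structured records rather than prose, the format maps onto a small data type. This is a sketch, not a required schema: the field names mirror the slide, while the `Finding` class and the `pile_for` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str         # "blocker", "important", "polish", or "question"
    title: str
    evidence: str         # the checkable fact and the criterion violated
    user_impact: str      # why it matters, in the user's terms
    recommended_fix: str  # smallest useful change, never a redesign
    owner: str

    def has_evidence(self) -> bool:
        return bool(self.evidence.strip())


def pile_for(finding: Finding) -> str:
    """No evidence means it is an opinion, which goes to the judgment pile."""
    return "triage gate" if finding.has_evidence() else "judgment pile"


sample = Finding(
    severity="important",
    title="Payment context is hidden on mobile",
    evidence=(
        "At 390px width, the payment form appears before the plan summary. "
        "Desktop keeps price and plan visible beside the form."
    ),
    user_impact=(
        "A buyer may enter card details without confirming the plan, "
        "price, renewal timing, or included seats."
    ),
    recommended_fix="Compact sticky plan summary above the payment form on mobile.",
    owner="Design approval for hierarchy; engineering after approval.",
)

vibe = Finding(
    severity="polish",
    title="The payment step feels a bit cramped",
    evidence="",
    user_impact="",
    recommended_fix="",
    owner="design",
)

if __name__ == "__main__":
    print(pile_for(sample))
    print(pile_for(vibe))
```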

Narration for this slide

Here is what a finding should look like. Severity first: this one is important, not a blocker. Then evidence: at 390 pixels, the payment form appears before the plan summary, and desktop does not have this problem. That is a fact anyone can check in thirty seconds. Then user impact: a buyer might enter card details without seeing the final price. Then the smallest recommended fix: a compact plan summary above the form, not a redesign of the checkout. And an owner, because recommending the change and approving it are different jobs. Compare that to "the payment step feels cramped." Same instinct, but one of these can be triaged and acted on. The other is a vibe.

Slide 7 of 13 · 16:9

Severity keeps the findings list usable

Without severity, critique becomes a pile of suggestions the designer has to sort by hand. Use few levels.

Severity | What it means
Blocker | The user cannot complete the job, or may make a serious mistake
Important | The user can continue, but comprehension, trust, or speed is harmed
Polish | Affects consistency or quality but does not block the task
Question | The agent found ambiguity and needs a human decision before recommending a fix

The point is not a perfect taxonomy. It is deciding what gets fixed first, what needs a human decision, and what can wait.

Slide notes

Four levels is deliberate. Teams that invent seven-point severity scales spend their triage time arguing about whether something is a 3 or a 4 instead of deciding whether to fix it. Blocker and important carry the real weight: blockers stop the user or expose them to a serious mistake, important findings let the user continue but cost comprehension, trust, or speed. Polish exists so that consistency issues get recorded without competing with user-facing problems for attention.

The question level is the one most teams forget and the one most specific to working with agents. A literal reviewer will regularly hit genuine ambiguity — the brief says emphasise the annual plan, but the legal copy requires the monthly price to appear first; which wins? A good critique setup tells the agent to surface that as a question rather than guessing, because a guess presented as a finding pollutes the triage gate with invented certainty.

Severity is assigned by the agent in the first pass and corrected by the human at the triage gate. Expect to downgrade some findings and occasionally upgrade one — the agent does not know that the cramped summary is on the screen your biggest customer complained about last quarter. That correction is itself useful signal: if you keep downgrading the same class of finding, the dimension it comes from is written too aggressively.
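
That correction signal is easy to tally if the agent's severity and the human's corrected severity are both logged. A minimal sketch, assuming a log of `(dimension, agent_severity, human_severity)` tuples; the record shape, the `RANK` table, and the function name are illustrative, and the sample log is invented.

```python
from collections import Counter

# Higher rank means more severe. "question" is not on this scale because it
# asks for a decision rather than claiming a level of harm.
RANK = {"blocker": 3, "important": 2, "polish": 1}


def downgrades_by_dimension(records: list[tuple[str, str, str]]) -> Counter:
    """Count findings per dimension where the human lowered the agent's severity."""
    counts: Counter = Counter()
    for dimension, agent_sev, human_sev in records:
        if agent_sev in RANK and human_sev in RANK and RANK[human_sev] < RANK[agent_sev]:
            counts[dimension] += 1
    return counts


# Hypothetical triage log, for illustration only.
triage_log = [
    ("system consistency", "important", "polish"),   # downgrade
    ("system consistency", "important", "polish"),   # downgrade
    ("trust and risk", "important", "blocker"),      # upgrade, not counted
    ("accessibility", "blocker", "blocker"),         # unchanged
    ("task clarity", "question", "important"),       # off the scale, skipped
]

if __name__ == "__main__":
    print(downgrades_by_dimension(triage_log).most_common())
```

A dimension that tops this list sprint after sprint is the one whose wording deserves a second look.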

Narration for this slide

Severity is what stops the findings list from becoming homework. Keep it to four levels. Blockers mean the user cannot finish the job or might make a serious mistake — they get fixed before anything else. Important means the user gets through, but trust, comprehension, or speed takes a hit. Polish is real but can wait. And question is the level teams forget: it is where the agent says, I found an ambiguity and I am not going to guess. You will adjust severities at the triage gate, and that is fine — the point is not a perfect taxonomy, it is a fast, defensible answer to what gets fixed first.

Slide 8 of 13 · 16:9

Feedback the agent can act on vs decisions that stay human

Routing is the triage gate's real job. Some findings are fixes; some are choices.

| Agent can act on it | Stays a human decision
Nature | A criterion was violated and the fix is mechanical | Competent options exist and one must be chosen
Examples | Missing focus state, hardcoded colour, truncated label, absent error state | Which layout direction, what the brand should refuse, what to cut from the screen
Evidence | File, screenshot, token, or check that fails | Context that never appears in the files: politics, history, strategy
After triage | Goes to the scoped revision pass | Goes to the designer, the lead, or the team conversation
Risk if mis-routed | Humans burn hours on mechanical fixes | The agent confidently makes a call that was never its to make

Mis-routing in either direction is expensive: humans doing mechanical fixes wastes the loop, and agents making judgment calls quietly outsources taste.

Slide notes

This distinction is the third learning objective of the module and the one that determines whether the loop helps or quietly erodes design ownership. Actionable findings share a shape: a stated criterion was violated, the evidence is in the files or the screenshots, and the fix is mechanical — add the focus state, replace the hardcoded colour with the token, write the missing error state. These are exactly the findings the revision pass should consume, and routing them to a human reviewer wastes the loop's whole advantage.

Judgment calls share a different shape: more than one competent answer exists, and choosing between them requires context the agent does not have — what the brand stands for, what failed last time, which stakeholder is already nervous about this flow, what the roadmap needs this screen to become. The danger is not that the agent refuses these; it is that it answers them fluently. An agent asked whether the dashboard should lead with the chart or the table will pick one and argue for it. If that answer slides through triage unexamined, taste has been outsourced one plausible-sounding finding at a time.

The practical habit: at the triage gate, explicitly tag each accepted finding as fix or decision. Fixes go into the scoped revision prompt. Decisions go to a person, a conversation, or a follow-up exploration. Findings tagged question by the agent are almost always decisions.
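
As a sketch of that habit, routing is a filter over the triaged list. The record keys (`triage`, `tag`), the `route` function, and the sample titles are assumptions made for illustration; the only rule taken from the module is that accepted fixes go to the revision pass and accepted decisions go to a person.

```python
def route(findings: list[dict]) -> tuple[list[str], list[str]]:
    """Split accepted findings into the revision queue and the human decision list."""
    revision_queue: list[str] = []
    human_decisions: list[str] = []
    for finding in findings:
        if finding["triage"] != "accepted":
            continue  # rejected and deferred findings never reach the agent
        if finding["tag"] == "fix":
            revision_queue.append(finding["title"])
        else:
            human_decisions.append(finding["title"])
    return revision_queue, human_decisions


# Hypothetical triage output, for illustration only.
triaged = [
    {"title": "Missing focus state on pay button", "triage": "accepted", "tag": "fix"},
    {"title": "Hardcoded colour in plan summary", "triage": "accepted", "tag": "fix"},
    {"title": "Lead with chart or table?", "triage": "accepted", "tag": "decision"},
    {"title": "Heading could feel warmer", "triage": "rejected", "tag": "decision"},
]

if __name__ == "__main__":
    fixes, decisions = route(triaged)
    print("Revision prompt input:", fixes)
    print("Needs a person:", decisions)
```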

Narration for this slide

The triage gate is really a routing decision. Some findings the agent can act on: a criterion was violated, the evidence is right there, and the fix is mechanical — a missing focus state, a hardcoded colour, a label that truncates. Send those to the revision pass. Other findings are decisions: more than one competent answer exists, and choosing needs context that never appears in the files — brand, history, politics, strategy. Those stay with you. Mis-route in either direction and you pay for it. Humans doing mechanical fixes wastes the loop. Agents making judgment calls outsources taste, one plausible finding at a time. Tag every accepted finding as a fix or a decision, and route it accordingly.

Slide 9 of 13 · 16:9

The critique contract

The contract tells the agent what kind of feedback is allowed — and what it must not do.

  • Scope: the artifact, the user job, and the dimensions to check
  • Evidence the agent must use: screenshots, files, the design harness
  • Output format: severity, evidence, impact, smallest fix, owner
  • Prohibitions: no redesign, no production code, no invented rules

Critique contract prompt (excerpt)

You are reviewing a checkout flow as a design QA partner.

Do not redesign the page.
Do not write production code.
Do not invent new brand rules.

User job:
- A buyer should understand the plan, price, payment step,
  and recovery path.

Inputs: DESIGN.md, screenshots (desktop, mobile, error),
src/app/checkout/

Return findings: severity, evidence, issue, user impact,
recommended fix, owner.

Without the contract, the agent redesigns the screen, introduces a new direction, or buries the useful findings under polite commentary.

Slide notes

The contract is where everything from the previous slides becomes a reusable prompt. It states the user job, lists the evidence the agent must inspect, names the dimensions, fixes the output format, and — just as important — states what the agent must not do. The prohibitions are not decoration. Left to its defaults, an agent asked to review a screen will frequently propose a new direction, rewrite copy in a voice it invented, or produce a wall of polite commentary in which the two findings that matter are paragraphs five and eleven.

The contract also separates critique from revision structurally, not just by request. The critique prompt ends at findings; the revision prompt is a different prompt that takes only the approved findings as input and carries its own constraints — do not address rejected findings, do not introduce a new visual direction, keep changes scoped to the affected files, report what changed. Keeping them as two prompts rather than one is what makes the triage gate real rather than ceremonial.

For teams running this regularly, the contract belongs in the repository — as a skill, a saved prompt, or a section of the design harness file — so every critique pass starts from the same standard rather than from whatever the requester remembered to type that day. That is also where the dashed feedback line from the loop diagram lands: recurring findings get added to the contract's dimensions.
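
One possible shape for a contract kept in the repository is a small structure that renders both prompts. This is a sketch under assumptions: the wording of the role, prohibitions, user job, inputs, and output fields comes from the excerpt on the slide, and the revision constraints from these notes, but the `CritiqueContract` class, its method names, and the rendered layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CritiqueContract:
    role: str
    prohibitions: list[str]
    user_job: str
    dimensions: list[str]
    inputs: list[str]
    output_fields: list[str]

    def render(self) -> str:
        """Render the read-only critique prompt. It ends at findings."""
        lines = [self.role, ""]
        lines += self.prohibitions
        lines += ["", "User job:", f"- {self.user_job}", ""]
        lines.append("Dimensions: " + ", ".join(self.dimensions))
        lines.append("Inputs: " + ", ".join(self.inputs))
        lines += ["", "Return findings: " + ", ".join(self.output_fields) + "."]
        return "\n".join(lines)


# Constraints for the separate revision prompt, as listed in the notes.
REVISION_CONSTRAINTS = [
    "Do not address rejected findings.",
    "Do not introduce a new visual direction.",
    "Keep changes scoped to the affected files.",
    "Report what changed.",
]


def render_revision_prompt(approved_titles: list[str]) -> str:
    """Render the revision prompt from approved findings only."""
    lines = ["Revise only the approved findings below.", ""]
    lines += [f"- {title}" for title in approved_titles]
    lines += [""] + REVISION_CONSTRAINTS
    return "\n".join(lines)


contract = CritiqueContract(
    role="You are reviewing a checkout flow as a design QA partner.",
    prohibitions=[
        "Do not redesign the page.",
        "Do not write production code.",
        "Do not invent new brand rules.",
    ],
    user_job="A buyer should understand the plan, price, payment step, and recovery path.",
    dimensions=["task clarity", "hierarchy and scan path", "interaction states",
                "trust and risk", "system consistency", "accessibility"],
    inputs=["DESIGN.md", "screenshots (desktop, mobile, error)", "src/app/checkout/"],
    output_fields=["severity", "evidence", "issue", "user impact",
                   "recommended fix", "owner"],
)

if __name__ == "__main__":
    print(contract.render())
```

Because the revision prompt is built by a different function from a different input, there is no way to run a fix without first producing an approved list, which is the structural separation the notes describe.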

Narration for this slide

All of this gets packaged into what the article behind this module calls a critique contract. It tells the agent the user job, the evidence to inspect, the dimensions to check, and the exact format to return findings in. And it says what is off-limits: do not redesign the page, do not write production code, do not invent brand rules. That last part matters more than it looks — without it, agents drift into proposing new directions and burying the two findings you needed under polite commentary. Keep the contract in the repository, not in someone's chat history, and keep critique and revision as separate prompts. That separation is what makes your triage gate real.

Slide 10 of 13 · 16:9

Critique cadence: per artifact, per sprint, per release

Once critique is cheap to run, the question changes from can we afford a review to which rhythm does this surface need.

  • Per artifact — every new screen, flow, or substantial revision gets a loop before it merges
  • Per sprint — a sweep of what changed: consistency, states, and copy across the touched surfaces
  • Per release — the high-stakes flows get a full pass: checkout, onboarding, settings, billing
  • The contract stays the same; the scope and depth change with the cadence
  • Recurring findings at any cadence get encoded into the harness or an executable check

The expensive part of critique used to be the inspection. Now it is the triage. Choose cadences your triage attention can actually keep up with.

Slide notes

Cadence is where teams either make this sustainable or drown themselves. The per-artifact loop is the default: any new screen, flow, or substantial revision goes through critique before it merges, using the standard contract. Because the agent does the inspection, the marginal cost is mostly the designer's triage time — typically minutes per artifact, not hours.

The per-sprint sweep covers the gap that per-artifact review leaves: drift. Individually reviewed changes can still accumulate into inconsistency — three slightly different empty states, two competing button hierarchies. A sprint-level pass critiques the touched surfaces together, looking specifically for consistency and state coverage. The per-release pass is reserved for the flows where mistakes are expensive — checkout, onboarding, billing, anything legal cares about — and goes deeper: every state, every breakpoint, accessibility, and copy.

The warning to give explicitly: the constraint is no longer agent capacity, it is human triage attention. A team that schedules every cadence at maximum depth will generate more findings than anyone reads, and unread findings are worse than no findings because they create the illusion of review. Start with per-artifact loops on new work plus a release pass on one critical flow, and only add cadence when triage is comfortably keeping up. Module 2 deals with triaging large finding sets in detail.
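
A rough capacity check can make that warning concrete before a cadence is added. Everything numeric below is a placeholder: the module gives no figures for findings per pass or minutes per finding, so the parameters and the `triage_load_ok` helper are assumptions to be replaced with a team's own measurements.

```python
# Cadence descriptions paraphrase the slide; the structure is illustrative.
CADENCES = {
    "per artifact": "every new screen, flow, or substantial revision, before it merges",
    "per sprint": "sweep of touched surfaces for consistency, states, and copy",
    "per release": "full pass on high-stakes flows",
}


def triage_load_ok(expected_findings: int,
                   minutes_per_finding: float,
                   minutes_available: float) -> bool:
    """True when the expected triage time fits the attention actually available."""
    return expected_findings * minutes_per_finding <= minutes_available


if __name__ == "__main__":
    # Placeholder numbers only: 9 findings at 3 minutes each against a
    # 30-minute triage slot, then 40 findings against a 60-minute slot.
    print(triage_load_ok(9, 3, 30))
    print(triage_load_ok(40, 3, 60))
```

If the check fails, the module's advice applies: narrow the scope or drop a cadence rather than let findings pile up unread.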

Narration for this slide

Once the inspection is cheap, cadence becomes a real choice. Per artifact: every new screen or substantial revision gets a loop before it merges. Per sprint: a sweep across whatever changed, looking for drift — the inconsistencies that creep in even when each change was reviewed on its own. Per release: the high-stakes flows get the deep pass — checkout, onboarding, billing. Same contract every time; what changes is scope and depth. One warning. The bottleneck has moved. It is no longer the inspection, it is your triage attention. Pick cadences you can actually keep up with, because findings nobody reads are worse than no findings at all.

Slide 11 of 13 · 16:9

Worked example: one screen through a full loop

The checkout payment step from the school's critique-loop article, traced through all seven steps.

Step | What happened
Evidence packet | Desktop, mobile, error, loading and confirmation screenshots; route files; DESIGN.md; the user job
Critique pass | Read-only, six dimensions; findings only, with no edits and no redesign
Findings | 9 findings: 1 blocker, 3 important, 4 polish, 1 question
Triage gate | Blocker and 2 important accepted; 1 important deferred; 2 polish rejected; question answered by the designer
Revision + re-critique | Scoped fix of 3 approved findings; fresh screenshots confirmed fixes, no regressions
Ship decision | Designer shipped, logged the deferred finding, added a states dimension to the harness

The loop produced a trail of design reasoning: what was checked, what failed, what changed, and what still needed a human call.

Slide notes

Walk the table as a narrative. The setup: a three-step purchase path where trial users hesitate on the payment step. The weak version of this review is asking the agent whether the page is clear. The strong version gives it the user job — a buyer should understand what they are paying for, what happens after payment, and how to recover from errors — plus screenshots of every state on desktop and mobile, the route files, and the design harness, and asks for findings only.

The findings split is typical of a first pass on a reasonable screen: one blocker (the error state offered no recovery path, only a generic message), three important (including the mobile payment step hiding the plan summary below the form), four polish, and one question (whether the renewal date must legally appear before the pay button — the agent flagged it rather than guessing). At the triage gate the designer accepted the blocker and two of the important findings, deferred one, rejected two polish comments as taste, and answered the legal question after checking with the product owner.

The revision pass touched only the approved findings, the re-critique recaptured the same viewports and confirmed the fixes without regressions, and the designer shipped. Two artifacts of the loop outlived the screen: the deferred finding went into the backlog with its evidence attached, and the recurring weakness around interaction states became an explicit dimension in the harness, which is the dashed feedback line from the diagram doing its job. Total designer attention across the loop was well under an hour; the inspection itself was agent time.
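
The counts in the table can be checked with a few lines of arithmetic. The numbers below are copied from the table; the dictionary layout is an assumption. The table states a disposition for seven of the nine findings, so the sketch reports the remainder instead of guessing what happened to it.

```python
from collections import Counter

# Counts copied from the worked-example table.
findings = {"blocker": 1, "important": 3, "polish": 4, "question": 1}
triage = {
    "accepted": {"blocker": 1, "important": 2},
    "deferred": {"important": 1},
    "rejected": {"polish": 2},
    "answered": {"question": 1},
}

total = sum(findings.values())
accepted = sum(triage["accepted"].values())

# Add up every stated disposition per severity.
handled: Counter = Counter()
for outcome in triage.values():
    handled.update(outcome)

# Findings whose disposition the table does not state.
unaccounted = {
    severity: count - handled[severity]
    for severity, count in findings.items()
    if count - handled[severity] > 0
}

if __name__ == "__main__":
    print("total findings:", total)
    print("sent to the revision pass:", accepted)
    print("no disposition stated:", unaccounted)
```

A real findings log should leave nothing in that last bucket: every finding gets accepted, rejected, deferred, or answered at the gate.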

Narration for this slide

Let's trace one screen through the whole loop — the checkout payment step from the article behind this module. The agent got the user job, screenshots of every state on desktop and mobile, the route files, and the design harness, and was asked for findings only. It came back with nine: one blocker — the error state had no recovery path — three important, four polish, and one genuine question about where the renewal date had to appear. At the triage gate the designer accepted three findings, deferred one, rejected two as taste, and answered the question. The agent fixed only what was approved, re-critiqued with fresh screenshots, and the designer shipped. Under an hour of human attention, and a written trail of what was checked and why.

Slide 12 of 13 · 16:9

Exercise: write the critique dimensions for your product

No agent needed yet. One page, one artifact, the dimensions and the contract that would review it.

  • Pick one real artifact: a screen, a flow, or a recent PR that touched the interface
  • Write the user job in one or two sentences — what must the user understand or accomplish?
  • Choose five or six dimensions that matter for this artifact; drop the ones that do not
  • For each dimension, write one sentence describing what a violation would look like
  • List the evidence packet: which screenshots, states, files, and docs the agent would need
  • Mark which likely findings would be fixes and which would be judgment calls

Keep the page. In Module 2 you will scale these dimensions into a heuristic evaluation across a whole product surface.

Slide notes

The exercise is deliberately on paper, because the hard part of the loop is not running the agent; it is the design thinking the agent cannot do for you. Most participants find the user job and the violation sentences hardest, and that is the point: a dimension like "good hierarchy" is unusable until you can say what a violation looks like. For example: the primary action competes with a secondary link, the price is below the fold on mobile, the table header disappears on scroll.

Steer people away from copying the six default dimensions wholesale. The exercise works when someone reviewing a data-dense admin table realises that "trust and risk" barely applies but density, scan path, and table behaviour are everything. That realisation is the skill the module is teaching. The evidence-packet step usually exposes a practical gap too: many teams discover they have no easy way to produce screenshots of error and loading states, which is worth knowing before the agent is ever involved.

If running this live, have two or three people read out their violation sentences. Weak ones are restatements of the dimension; strong ones are checkable facts. The fix-versus-judgment marking at the end is a rehearsal for the triage gate, and it sets up Module 2, where the same dimensions get applied across an entire product surface and the triage problem becomes the central one.
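
For teams that want the exercise page in a form they can keep in the repository, the structure can be sketched as data. This is a minimal sketch in Python with hypothetical names (`Dimension`, `ExercisePage`, `gaps`); it only checks that the page is complete, not that the violation sentences are good, which remains a human call. The example page is deliberately half-finished, and its violation sentences are taken from the notes above.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    name: str
    violation: str  # one sentence describing what a violation looks like


@dataclass
class ExercisePage:
    artifact: str
    user_job: str
    dimensions: list[Dimension] = field(default_factory=list)
    evidence_packet: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Return the parts of the page that are still missing."""
        problems = []
        if not self.user_job.strip():
            problems.append("user job is empty")
        if not 5 <= len(self.dimensions) <= 6:
            problems.append(
                f"expected five or six dimensions, got {len(self.dimensions)}"
            )
        for d in self.dimensions:
            if not d.violation.strip():
                problems.append(f"no violation sentence for {d.name!r}")
        if not self.evidence_packet:
            problems.append("evidence packet is empty")
        return problems


# A half-finished page for a data-dense admin table.
page = ExercisePage(
    artifact="data-dense admin table",
    user_job="",  # not written yet
    dimensions=[
        Dimension("hierarchy", "The primary action competes with a secondary link."),
        Dimension("table behaviour", "The table header disappears on scroll."),
        Dimension("density", ""),  # named, but no violation sentence yet
    ],
    evidence_packet=["screenshots of error and loading states"],
)

for problem in page.gaps():
    print("-", problem)
```

The completeness check is the only part a script can do here. Whether a violation sentence is a checkable fact or a restatement of the dimension is exactly what the read-aloud step is for.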

Narration for this slide

Time to make this yours. Pick one real artifact — a screen, a flow, or a recent PR that touched the interface. Write the user job in a sentence or two. Then choose five or six dimensions that actually matter for this artifact, and for each one, write what a violation would look like — a checkable fact, not a restatement of the dimension. List the evidence the agent would need: which screenshots, which states, which files. And mark which likely findings would be mechanical fixes and which would be judgment calls. Don't run it yet. Keep the page — in Module 2, these dimensions become the basis of a heuristic evaluation across your whole product.

Slide 13 of 13 · 16:9

Summary, and the bridge to evaluation at scale

  • Critique under-delivers because it is expensive in time, seniority, and coverage — and nothing is written down
  • Named dimensions, chosen for the artifact and the user job, are what turn agent feedback from vibes into findings
  • The agent is the first reviewer: thorough, literal, read-only — inspection, not taste
  • Every finding carries severity, evidence, user impact, and the smallest fix; the triage gate routes fixes to the agent and decisions to humans
  • Cadence is a choice — per artifact, per sprint, per release — limited by triage attention, not agent capacity

Module 2 takes the same structure to scale: heuristic evaluations and cognitive walkthroughs run across an entire product, and the triage discipline that keeps hundreds of findings workable.

Slide notes

Recap the module by walking the loop one more time, but emphasise the two ideas that carry into the rest of the course. First, the division of labour: the agent owns inspection, comparison, and consistency checking; the designer owns dimensions, triage, judgment, and the ship call. Every later module — heuristic evaluation, visual regression, accessibility, per-PR review — is this same division applied to a different review surface. Second, the feedback line: recurring findings become named dimensions or executable checks in the harness, which is how the loop compounds instead of just repeating.
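
The feedback line can be made concrete with a small tally. The sketch below is an assumption-laden illustration in Python, not part of the course harness: the function name `promotion_candidates`, the idea of tagging each finding with a theme, and the threshold of three loops are all hypothetical choices a team would tune for itself.

```python
from collections import Counter


def promotion_candidates(loops, threshold=3):
    """Return themes that recurred in at least `threshold` separate loops.

    `loops` is a list of lists: one inner list of theme tags per critique
    loop. A theme is counted once per loop, however many findings carried it.
    """
    seen = Counter()
    for themes in loops:
        seen.update(set(themes))  # de-duplicate within a single loop
    return sorted(theme for theme, count in seen.items() if count >= threshold)


# Theme tags from three hypothetical loops.
loops = [
    ["interaction states", "hardcoded colour"],
    ["interaction states", "interaction states"],
    ["interaction states", "truncated label"],
]

print(promotion_candidates(loops))
```

A theme that clears the threshold is only a candidate. Deciding whether it becomes a named dimension or an executable check in the harness is still the designer's call.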

Preview Module 2 concretely. The critique loop in this module reviewed one artifact at a time. Module 2 applies the same structure to two classic methods most teams skip because of cost — heuristic evaluation and cognitive walkthroughs — run across an entire product surface. The new problems are scale problems: keeping criteria consistent across hundreds of screens and states, anchoring walkthroughs to defined user tasks rather than screens in isolation, and triaging finding sets large enough to drown a team if handled naively. The dimensions written in this module's exercise are the direct input.

If participants did the exercise, remind them to keep the page accessible: it gets used in Module 2's exercise, and the evidence-packet list becomes the baseline definition in Module 3's visual QA work.

Narration for this slide

Let's close the module. Critique under-delivers today because it is expensive in time, seniority, and coverage — and because nothing gets written down. The fix is structural: name the dimensions that define good for this artifact, let the agent run the inspection — thorough, literal, read-only — and have it return findings with severity, evidence, and the smallest fix. You hold the triage gate: fixes go to the agent, decisions stay with you, and recurring findings get written into the harness. Cadence is yours to choose, limited by your triage attention. In Module 2 we take this structure to scale — heuristic evaluations and cognitive walkthroughs across an entire product, and the triage discipline that keeps it workable. See you there.

Module transcript
Module 1, narrated slide by slide

Slide 1 · Critique is a loop, not a meeting

Welcome to Design Review and Critique with Agents. This first module is about the critique loop itself. In most teams, critique is a meeting — it happens when someone has time, it covers whatever got presented, and the quality of the feedback depends on who showed up. We are going to replace that with a loop: named dimensions that say what good means, an agent that inspects every state and screen against them, findings with evidence and severity, and a human triage gate where you decide what matters. The agent does the inspection. You keep the judgment. Let's start with why critique under-delivers today.

Slide 2 · Why critique under-delivers today

Why does critique under-deliver? Not because teams do not care — because it is expensive in exactly the resources they are short of. A proper review of one flow takes hours. The reviewers whose judgment matters most are the busiest people in the building. Critique covers whatever got presented this week, so the work that ships quietly through small changes never gets reviewed at all. Edge states and small screens rarely get looked at. And none of it is written down, so every session starts from zero. The result is that critique quality depends on who had spare attention that week. That is a staffing problem pretending to be a quality process.

Slide 3 · Named dimensions: what good means, written down

Here is the foundation of the whole loop: write down what good means before you ask for critique. If you ask an agent for general feedback you get general taste language — clean, modern, a bit inconsistent. If you name the dimensions, you get findings. These six are a strong default: task clarity, hierarchy, interaction states, trust and risk, system consistency, and accessibility. But they are not a universal checklist. A checkout needs trust and error recovery. A dashboard needs density and scan path. Choosing the dimensions that matter for this artifact and this user job is design work — and it stays yours. The agent applies them; it does not pick them.

Slide 4 · The agent as first reviewer

So what is the agent in this loop? A first reviewer, not a judge. Its strengths are exactly the things human review is short of: it is thorough — every screen, every state, every breakpoint, the same criteria every time. It is literal — it checks what the dimensions say. It is tireless — the third review this week is as careful as the first. And it is taste-free, which cuts both ways: it will never tell you which competent option is right for your brand, and you should not ask it to. That is why the critique pass is read-only — findings only, no edits. The agent owns inspection. You own judgment.

Slide 5 · The critique loop

Here is the loop the rest of this course builds on. You start with the artifact and its evidence packet — screenshots of states and breakpoints, the actual files, the copy, and the brief. The agent runs a read-only critique pass against your named dimensions and returns findings: severity, evidence, user impact, smallest fix. Then comes the part that stays human — the triage gate. You accept, reject, or defer each finding before anything gets fixed. The agent revises only what you approved, re-critiques with fresh screenshots, and you make the ship call. And notice the dashed line: findings that keep recurring get written into the harness, so the loop gets sharper every time you run it.

Slide 6 · Findings with evidence, not opinions

Here is what a finding should look like. Severity first: this one is important, not a blocker. Then evidence: at 390 pixels, the payment form appears before the plan summary, and desktop does not have this problem. That is a fact anyone can check in thirty seconds. Then user impact: a buyer might enter card details without seeing the final price. Then the smallest recommended fix: a compact plan summary above the form, not a redesign of the checkout. And an owner, because recommending the change and approving it are different jobs. Compare that to "the payment step feels cramped." Same instinct, but one of these can be triaged and acted on. The other is a vibe.

Slide 7 · Severity keeps the findings list usable

Severity is what stops the findings list from becoming homework. Keep it to four levels. Blockers mean the user cannot finish the job or might make a serious mistake, so they get fixed before anything else. Important means the user gets through, but trust, comprehension, or speed takes a hit. Polish is real but can wait. And question is the level teams forget: it is where the agent says, "I found an ambiguity and I am not going to guess." You will adjust severities at the triage gate, and that is fine. The point is not a perfect taxonomy; it is a fast, defensible answer to what gets fixed first.

Slide 8 · Feedback the agent can act on vs decisions that stay human

The triage gate is really a routing decision. Some findings the agent can act on: a criterion was violated, the evidence is right there, and the fix is mechanical — a missing focus state, a hardcoded colour, a label that truncates. Send those to the revision pass. Other findings are decisions: more than one competent answer exists, and choosing needs context that never appears in the files — brand, history, politics, strategy. Those stay with you. Mis-route in either direction and you pay for it. Humans doing mechanical fixes wastes the loop. Agents making judgment calls outsources taste, one plausible finding at a time. Tag every accepted finding as a fix or a decision, and route it accordingly.

Slide 9 · The critique contract

All of this gets packaged into what the article behind this module calls a critique contract. It tells the agent the user job, the evidence to inspect, the dimensions to check, and the exact format to return findings in. And it says what is off-limits: do not redesign the page, do not write production code, do not invent brand rules. That last part matters more than it looks — without it, agents drift into proposing new directions and burying the two findings you needed under polite commentary. Keep the contract in the repository, not in someone's chat history, and keep critique and revision as separate prompts. That separation is what makes your triage gate real.

Slide 10 · Critique cadence: per artifact, per sprint, per release

Once the inspection is cheap, cadence becomes a real choice. Per artifact: every new screen or substantial revision gets a loop before it merges. Per sprint: a sweep across whatever changed, looking for drift — the inconsistencies that creep in even when each change was reviewed on its own. Per release: the high-stakes flows get the deep pass — checkout, onboarding, billing. Same contract every time; what changes is scope and depth. One warning. The bottleneck has moved. It is no longer the inspection, it is your triage attention. Pick cadences you can actually keep up with, because findings nobody reads are worse than no findings at all.

Slide 11 · Worked example: one screen through a full loop

Let's trace one screen through the whole loop — the checkout payment step from the article behind this module. The agent got the user job, screenshots of every state on desktop and mobile, the route files, and the design harness, and was asked for findings only. It came back with nine: one blocker — the error state had no recovery path — three important, four polish, and one genuine question about where the renewal date had to appear. At the triage gate the designer accepted three findings, deferred one, rejected two as taste, and answered the question. The agent fixed only what was approved, re-critiqued with fresh screenshots, and the designer shipped. Under an hour of human attention, and a written trail of what was checked and why.

Slide 12 · Exercise: write the critique dimensions for your product

Time to make this yours. Pick one real artifact — a screen, a flow, or a recent PR that touched the interface. Write the user job in a sentence or two. Then choose five or six dimensions that actually matter for this artifact, and for each one, write what a violation would look like — a checkable fact, not a restatement of the dimension. List the evidence the agent would need: which screenshots, which states, which files. And mark which likely findings would be mechanical fixes and which would be judgment calls. Don't run it yet. Keep the page — in Module 2, these dimensions become the basis of a heuristic evaluation across your whole product.

Slide 13 · Summary, and the bridge to evaluation at scale

Let's close the module. Critique under-delivers today because it is expensive in time, seniority, and coverage — and because nothing gets written down. The fix is structural: name the dimensions that define good for this artifact, let the agent run the inspection — thorough, literal, read-only — and have it return findings with severity, evidence, and the smallest fix. You hold the triage gate: fixes go to the agent, decisions stay with you, and recurring findings get written into the harness. Cadence is yours to choose, limited by your triage attention. In Module 2 we take this structure to scale — heuristic evaluations and cognitive walkthroughs across an entire product, and the triage discipline that keeps it workable. See you there.