Agentic Design School
Module 6 of 6
45–55 minutes

Agentic Design Fundamentals

Critique, Quality Gates, and Limits

The closing module: how to critique agent output so it improves, which checks can be made executable, the failure modes that keep recurring, and an honest account of what agents are still bad at — finishing with a ship checklist and where to go next in the curriculum.

Duration: 45–55 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Run a structured critique loop with an agent using named dimensions rather than vibes.
  • Use screenshot evidence and visual QA to verify implementation against intent.
  • Encode anti-slop rules and executable gates that catch recurring failures automatically.
  • State clearly what agents are still bad at and design the workflow around those limits.
Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Critique, Quality Gates, and Limits

Agentic Design Fundamentals · Module 6 of 6

  • Visual QA: screenshots as evidence, not decoration
  • Critique against named dimensions, with severity and evidence
  • Executable gates and anti-slop rules that catch failures automatically
  • An honest account of what agents are still bad at
  • The ship checklist, the course retrospective, and where to go next

Modules 3 and 4 raised the quality of the agent's first draft. This module is about everything between that draft and the decision to ship.

Slide notes

This is the closing module, and it carries the part of the workflow that most teams skip: the structured review between a plausible first draft and the decision to ship. The earlier modules made the input better — clearer briefs, a harness that carries the durable rules, artifacts the agent can actually read. None of that removes the need for review; it just makes the review worth doing, because the draft arriving at the gate is close enough to be improved rather than discarded.

Frame the module around one claim and keep returning to it: critique is where the quality comes from. Agents produce output that is plausible by default, and plausible is precisely what makes unreviewed work dangerous — it reads as finished while carrying invented props, generic patterns, missing states, and quiet scope creep. The gates exist to catch those before a stakeholder does.

Flag the structure up front. The first half of the module is technique: visual QA, critique dimensions, executable checks, anti-slop rules, and the failure modes they catch. The second half is honesty and closure: what agents still cannot do, the ship checklist, the return to the Module 1 exercise, and where the curriculum goes from here. Both halves matter; a team that learns the techniques without the limits oversells the workflow and gets burned.

Narration for this slide

Welcome to the final module of Agentic Design Fundamentals. Everything so far has been about getting a better first draft out of the agent — the brief, the harness, the right artifacts. This module is about what happens between that draft and the decision to ship. We will cover visual QA with real screenshot evidence, critique against named dimensions instead of vibes, the checks you can make fully automatic, and the failure patterns worth automating against. Then we get honest about what agents are still bad at, build your ship checklist, and close the loop on the exercise you started back in Module 1. Critique is where the quality comes from. Let's do it properly.

Slide 2 of 13 · 16:9

Visual QA: screenshots as evidence, not decoration

A generated interface can compile, load, and still fail the design. Visual QA gives the review evidence instead of opinion.

  • Capture at fixed widths every run — 390, 768, and 1440 pixels — so findings stay comparable
  • Capture the states, not just the happy path: empty, loading, error, disabled, focus
  • Add an accessibility snapshot or axe report, and one paragraph restating the intent
  • Findings name their evidence — which file, which viewport, which check — or they go back
  • The agent reports; a human approves the fix plan before any code moves

The review stops being an opinion about the code and becomes a comparison between what a user would see and what the design was supposed to do.

Slide notes

The core argument here comes straight from the visual QA article: a type check and a code review cannot see hierarchy, rhythm, density, or a layout that collapses at phone width. The diff looks reasonable, the page loads, and the design has still failed. Visual QA closes that gap by giving the agent — and the human reading its findings — actual evidence: screenshots at fixed viewport widths, an accessibility snapshot of the rendered page, the output of automated checks, and a short restatement of what the page was supposed to do.

Two disciplines carry most of the value. First, fix the widths and never change them between runs; the moment every review uses different viewports, findings stop being comparable and the second review cannot tell you whether the first fix worked. Second, require every finding to name its evidence — the file, the viewport, the axe rule. Findings that cannot name evidence are either judgments, which is fine if labeled, or guesses, which get sent back. That single rule is what separates a fix-ready report from another vague to-do list.

Name the boundary explicitly: visual QA is a review activity, not a license to change the UI. Capable agents will helpfully fix their own findings the moment they produce them, and at that point taste decisions are being made silently inside a diff. The findings get prioritized, a human approves the fix plan, and only then does anything change. Mention also the practical caution from the article — full-page screenshots pasted into chat are extremely token-expensive; keep evidence on disk and let the agent read files or crops.
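The two disciplines above (fixed widths, and findings that name their evidence) can be sketched in a few lines. This is an illustrative sketch, not the tooling the article uses: the widths come from the slide, while the state names, the path layout, the `Finding` fields, and the rule for what counts as "named evidence" are all assumptions.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Tuple

# The widths come from the slide; state names and the path layout are assumptions.
WIDTHS = (390, 768, 1440)
STATES = ("default", "empty", "loading", "error", "disabled", "focus")


def capture_plan(route: str, out_dir: str = "qa/screenshots") -> List[str]:
    """List one screenshot path per width and state, identical on every run."""
    slug = route.strip("/").replace("/", "-") or "home"
    return [
        f"{out_dir}/{slug}--{state}--{width}.png"
        for width, state in product(WIDTHS, STATES)
    ]


@dataclass
class Finding:
    summary: str
    file: str = ""                  # evidence file, kept on disk
    viewport: Optional[int] = None  # width in pixels
    check: str = ""                 # e.g. an axe rule id or a script name
    is_judgment: bool = False       # a labeled judgment call, routed to a human

    def names_evidence(self) -> bool:
        # Assumed rule: a file plus either a viewport or a named check.
        return bool(self.file) and (self.viewport is not None or bool(self.check))


def sort_findings(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into (accepted, sent_back)."""
    accepted, sent_back = [], []
    for finding in findings:
        ok = finding.names_evidence() or finding.is_judgment
        (accepted if ok else sent_back).append(finding)
    return accepted, sent_back
```

Because the plan is derived from constants rather than typed per run, the second review captures exactly the same files as the first, which is what makes the two comparable. Writing paths instead of embedding images also keeps the evidence on disk, in line with the token-cost caution above.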

Narration for this slide

Here is the problem visual QA solves. A generated page can compile, pass the type check, and still fail the design — the hierarchy is wrong, the density is off, the layout collapses on a phone. None of that shows up in a diff. So we give the review evidence: screenshots at the same fixed widths every run, the important states and not just the happy path, an accessibility report, and one paragraph reminding everyone what the page was for. Then one rule does most of the work: every finding has to name its evidence — which file, which viewport, which check. And the agent reports findings; a human approves the fixes. That order is not negotiable.

Slide 3 of 13 · 16:9

Critique dimensions: named, not vibes

Ask for general feedback and you get taste language. Ask against named dimensions and you get findings you can act on.

Dimension | What the reviewer checks | Evidence that settles it
Hierarchy | Does the screen reveal the right information in the right order for the user job? | Screenshots at each width; the scan path against the brief's stated job
Density | Is the information density right for the surface: dashboard-tight or reading-loose? | Side-by-side captures against a reference surface or DESIGN.md density rules
Tone | Do copy, color, and emphasis match the product's voice rather than the genre average? | The brief's direction lines and the harness's named anti-patterns
States | Do empty, loading, error, disabled, and focus states exist and behave consistently? | State captures; a search for the state code paths in the implementation
Accessibility | Can keyboard, screen-reader, and low-vision users complete the job? | axe output, the accessibility snapshot, and a human keyboard walkthrough

Severity is assigned by user impact, never by how easy the fix looks — and judgment calls get routed to a human, not fixed by the agent.

Slide notes

This table is the working core of structured critique. The failure mode it prevents is the vibe check: ask an agent for general feedback and you get general taste language — clean, modern, needs polish — which nobody can act on. Naming the dimensions turns the critique into inspection: for each dimension there is a question to answer and evidence that settles it.

Walk one or two rows properly rather than all five quickly. Hierarchy is the highest-value one, because it is the dimension agents most often get subtly wrong — the components are right, the order of emphasis is not. States is the one agents most often simply omit; the happy path gets built and the empty, error, and disabled states quietly do not exist until someone asks. Accessibility is the row where automated evidence and human judgment have to share the work: axe settles contrast and labels, a keyboard walkthrough settles whether the focus order makes sense.

Two rules from the critique-loops article belong on top of this table. First, assign severity by user impact (blocker, important, polish, question) and never by how easy the fix looks, because ranking by ease is how trivial cosmetic fixes crowd out harder but important ones. Second, separate observable mismatch from design judgment. A missing aria-current is a mismatch; whether the display type scale earns its cost on small phones is a judgment call that goes to a human. A review that contains only one kind has usually been filtered; the honest mix, clearly labeled, is what a designer can act on. Finally, the dimensions should flex with the artifact: a checkout flow adds trust and recovery, a dashboard adds scan path and table behavior. The table is a starting set, not a universal checklist.
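The two rules can be made concrete with a small triage sketch. The severity names are the ones listed above; the `Finding` fields, the `fix_effort` field, and the queue names are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class Severity(IntEnum):
    # Ordered by user impact; lower value means handled first.
    BLOCKER = 0
    IMPORTANT = 1
    POLISH = 2
    QUESTION = 3


@dataclass
class Finding:
    dimension: str       # hierarchy, density, tone, states, accessibility
    summary: str
    severity: Severity
    kind: str            # "mismatch" (observable) or "judgment" (design call)
    fix_effort: int = 1  # recorded, but deliberately ignored when ordering


def route(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Order by severity only, then split mismatches from judgment calls.

    Returns (fix_plan, human_decisions). The fix plan still needs human
    approval before the agent changes anything.
    """
    ordered = sorted(findings, key=lambda f: f.severity)
    fix_plan = [f for f in ordered if f.kind == "mismatch"]
    human_decisions = [f for f in ordered if f.kind == "judgment"]
    return fix_plan, human_decisions
```

The sort key never looks at `fix_effort`, so a one-line cosmetic fix cannot jump ahead of a blocker just because it is cheap.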

Narration for this slide

If you ask an agent whether a design is good, you get adjectives. If you ask it to inspect against named dimensions, you get findings. Here are the five we use as a baseline: hierarchy — is the information in the right order for the job; density — is it dashboard-tight or reading-loose, deliberately; tone — does it sound and look like this product, not the genre average; states — do empty, loading, error, and disabled actually exist; and accessibility — can keyboard and screen-reader users finish the task. Each one has evidence that settles it. Then two rules: severity goes by user impact, not by how easy the fix is, and judgment calls get routed to a human — the agent does not fix taste.

Slide 4 of 13 · 16:9

Making checks executable

Anything you have corrected twice is a candidate for a check that runs without you.

  • Token audit: flag raw hex values, off-scale spacing, and hardcoded colors outside shared components
  • Type check: invented component props and wrong variants fail before anyone reviews them
  • Lint and structural verify scripts: naming, file ownership, and required content rules
  • Accessibility audit: axe-core or pa11y against the real rendered route, not the source
  • The agent runs all of these on its own output before a human ever looks

Executable checks do not replace critique. They clear the mechanical failures out of the way so the human review can spend itself on judgment.

Slide notes

The principle is leverage: a correction you make in conversation fixes one run; a correction you encode as a check fixes every run after it. The categories on the slide map to real tooling rather than aspiration. Token audits are simple searches — raw hex values, rgb and hsl literals, arbitrary spacing values, semantic-token bypasses — and this site runs exactly that as a script. Type checks catch the single most common implementation failure in agent output: an invented prop or a variant the component never had. Structural verify scripts enforce the boring contract — required fields present, naming conventions followed, files in the right place. Accessibility audits run axe-core or pa11y against the rendered route, which catches contrast, missing labels, and landmark problems mechanically.
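A token audit of the kind described can be very small. The sketch below is not the site's actual script; it is a minimal illustration that assumes a hypothetical spacing scale, a hypothetical `src/tokens/` directory where raw values are allowed, and CSS-like input passed in as strings.

```python
import re
from typing import Dict, List, Tuple

# Raw color literals: 3, 4, 6, or 8 digit hex, plus rgb()/hsl() functions.
RAW_COLOR = re.compile(
    r"#(?:[0-9a-fA-F]{3,4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})\b"
    r"|\b(?:rgb|hsl)a?\([^)]*\)"
)
SPACING_DECL = re.compile(r"\b(?:margin|padding|gap)[a-z-]*\s*:\s*([^;]+)")
PX_VALUE = re.compile(r"(\d+)px")

# Hypothetical spacing scale in pixels; substitute the real design tokens.
SPACING_SCALE = {0, 4, 8, 12, 16, 24, 32, 48, 64}


def audit(
    files: Dict[str, str],
    allowed_prefixes: Tuple[str, ...] = ("src/tokens/",),
) -> List[Tuple[str, int, str]]:
    """Return (path, line number, message) for each suspected token bypass."""
    findings = []
    for path, text in sorted(files.items()):
        if path.startswith(allowed_prefixes):
            continue  # token definitions are where raw values belong
        for number, line in enumerate(text.splitlines(), start=1):
            for match in RAW_COLOR.finditer(line):
                findings.append((path, number, f"raw color {match.group(0)}"))
            for decl in SPACING_DECL.finditer(line):
                for px in PX_VALUE.findall(decl.group(1)):
                    if int(px) not in SPACING_SCALE:
                        findings.append((path, number, f"off-scale spacing {px}px"))
    return findings
```

Each result already names a file and a line, so it arrives at review in the same evidence-first shape the critique expects. A regex pass like this will produce some false positives (an id selector such as `#fade` looks like a hex color), which is acceptable for a gate that a human reads.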

The order of operations matters as much as the checks themselves: the agent runs them on its own output before a human reviews anything. That changes what arrives at the review gate. Instead of spending your attention noticing a hardcoded color, you spend it on whether the hierarchy serves the job — the thing only you can judge.

Be equally clear about what executable checks cannot do. A clean axe report is a floor, not a ceiling — it cannot tell you whether the focus order makes sense or the alt text is useful. A passing token audit cannot tell you the design is any good; it can only tell you it is consistent with the system. Each gate catches what it encodes, and only the human catches the rest. That sentence is the bridge to the next two slides.

Narration for this slide

Here is the highest-leverage habit in this module: anything you have corrected twice should become a check that runs without you. Token audits catch raw colors and off-scale spacing. The type check catches invented component props before anyone reads the code. Verify scripts enforce naming and structure. An accessibility audit runs axe against the real rendered page. And critically, the agent runs all of these on its own output before you ever look. That does not replace your critique — a clean audit can still be a bad design. What it does is clear the mechanical failures out of the way, so your attention goes where only your attention works: judgment.

Slide 5 of 13 · 16:9

Anti-slop rules: the failure patterns worth automating against

Left to its defaults, an agent reproduces the centre of its training data. Anti-slop rules pull it back toward your product.

  • Purple-gradient hero treatments and generic SaaS landing-page layouts where product UI was asked for
  • Emoji used as bullet decoration in interface copy
  • Low-contrast badges and muted-on-muted text that fail the contrast check
  • Marketing bands, invented calls-to-action, and stock 'how it works' patterns nobody briefed
  • Each named pattern becomes a harness rule, and where possible an executable check

Slop is not a model failure. It is the statistically most common answer showing up where your product needed a specific one.

Slide notes

Define slop precisely, because the word gets used as a sneer and it is more useful as a diagnosis: slop is the centre of the training distribution arriving in your product. Purple gradients, generic SaaS hero layouts, emoji bullets, marketing bands with invented calls-to-action — these are not random errors, they are the most statistically common answers to underspecified requests. That is exactly why they respond to rules: the agent is not failing to follow your direction, it is filling the gaps you left with the public web's average.

The practical move is to name the patterns your team keeps rejecting and write them into the harness as explicit prohibitions — no gradient backgrounds, no emoji in interface copy, no marketing-style bands on product pages, no calls-to-action that do not exist in the product. Naming matters: a rule that says avoid generic design does nothing, while a rule that names the pattern gives the agent something it can actually check itself against. The Module 1 worked example showed this concretely — the unbriefed run produced a marketing band with a gradient and an invented CTA precisely because nothing said not to.

Then push as many of these rules as possible from prose into checks. A no-raw-colors rule can be a token audit. A contrast rule can be an axe assertion. A no-emoji rule can be a lint pattern. The ones that stay prose — tone, layout genre — still earn their place in the harness, because the agent reads them on every run and your reviewers stop having to repeat them. This is the anti-slop layer of the quality-gate pipeline on the next diagram.
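As an illustration of moving a rule from prose into a check, here is a sketch of two of them: a no-emoji lint and a contrast assertion. The contrast math follows the WCAG 2.x relative-luminance definition; in practice an axe audit on the rendered route would do this job, so treat the function as a stand-in. The emoji pattern is an approximation over a few Unicode blocks, not a complete emoji definition, and the function names are hypothetical.

```python
import re

# Approximate emoji coverage: pictographs plus the misc-symbols and dingbats blocks.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def has_emoji(copy: str) -> bool:
    """True if interface copy contains a character from the emoji ranges."""
    return EMOJI.search(copy) is not None


def _linear(channel: int) -> float:
    # sRGB channel (0-255) to linear light, per the WCAG 2.x definition.
    s = channel / 255
    return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4


def luminance(hex_color: str) -> float:
    """Relative luminance of a #rrggbb color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted((luminance(foreground), luminance(background)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_body_text(foreground: str, background: str) -> bool:
    # 4.5:1 is the WCAG AA threshold for normal-size text.
    return contrast_ratio(foreground, background) >= 4.5
```

A muted-on-muted badge is exactly the case this catches: two greys that look fine in a mockup can sit just under the threshold, and the check settles it without a debate about taste.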

Narration for this slide

Let's name the failure everyone has seen: the purple gradient hero, the generic SaaS layout, emoji as bullet points, a low-contrast badge, a marketing band with a call-to-action your product does not have. We call it slop, but it is really just the most common answer on the public web showing up where your product needed a specific one. Which means it responds to rules. Name the patterns your team keeps rejecting and write them into the harness as explicit prohibitions. Then push whatever you can into executable checks — token audits, contrast assertions, lint rules. Slop survives on vagueness. Named patterns and automated gates starve it.

Slide 6 of 13 · 16:9

Common failure modes, mapped to the gate that catches them

Every recurring failure has a gate that catches it cheaply. The expensive version is the stakeholder catching it instead.

Failure mode | What it looks like | The gate that catches it
Invented component props | A variant or prop the component never had; it reads fine in the diff | Type check, run by the agent before review
Generic patterns | Stock hero layouts, marketing bands, gradients where product UI was briefed | Anti-slop rules in the harness, plus the tone and hierarchy critique pass
Quiet scope creep | The agent 'improves' adjacent screens and renames things nobody asked about | Plan review and a scoped diff: flag any file outside the brief's scope
Token and system drift | Raw hex values, one-off spacing, a copied local component bypassing the shared one | Token audit and design-system audit scripts
Missing states | Happy path only; empty, error, loading, and disabled states do not exist | The states row of the critique, plus state captures in visual QA
Accessibility regressions | Low contrast, missing labels, no focus styles, no aria-current | axe audit on the rendered route, plus a human keyboard pass

The pattern: mechanical failures go to executable gates, judgment failures go to critique, and scope failures go to the plan review you already hold.

Slide notes

This table is the module's argument made operational: recurring failures are not reasons to distrust the workflow, they are inputs to it. Each row names a failure that shows up repeatedly in real agent output and the gate that catches it at its cheapest point.

A few rows deserve commentary. Invented props is the canonical example of why the agent runs the type check on itself — the failure is invisible in review because the code reads plausibly, and trivially visible to the compiler. Quiet scope creep is the one that surprises people: the agent helpfully refactors a neighbouring component, renames a prop for consistency, or restyles a screen it was never asked to touch, and the diff balloons. The catch is procedural, not technical — the plan review states the scope, and the diff review flags any file outside it. Token and system drift is the audit story from the design-system audits article: drift is usually quiet, it starts as one raw color or one copied component, and an agent-run audit finds it across the whole surface faster than any human sweep.
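The scope check is procedural, but the diff half of it is easy to script. A minimal sketch, assuming the brief's scope is written as glob patterns and the changed-file list comes from something like `git diff --name-only`; the function name and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def out_of_scope(changed_files: Iterable[str], scope_patterns: Iterable[str]) -> List[str]:
    """Return changed files that match none of the brief's scope patterns.

    Patterns use shell-style globs; with fnmatch, '*' also matches '/'.
    """
    patterns = list(scope_patterns)
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

An empty result means the diff stayed inside the brief. Anything else is a list of files to explain or revert, which is a far cheaper conversation than discovering a renamed prop in a neighbouring component after release.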

Close on the meta-pattern in the highlight: mechanical failures belong to executable gates, judgment failures belong to critique, and scope failures belong to the plan review from Module 2. If a failure keeps recurring and no gate catches it, that is not bad luck — it is a missing rule, which is the subject of the harness-update slide coming shortly.

Narration for this slide

Here are the failure modes you will actually see, mapped to the gate that catches each one cheaply. Invented component props — caught by the type check the agent runs on itself. Generic patterns — caught by anti-slop rules and the tone critique. Quiet scope creep, where the agent improves things nobody asked about — caught at the plan review and by flagging any file outside the brief's scope. Token drift — caught by the audit script. Missing states — caught by the states row of your critique. Accessibility regressions — caught by axe plus a keyboard pass. Notice the pattern: mechanical failures go to automated gates, judgment failures go to critique, scope failures go to the plan review. Every recurring failure should have an address.

Slide 7 of 13 · 16:9

The quality-gate pipeline

Six stages between a generated draft and the decision to ship. Some gates are automated, some are agent-run, and the ones that matter most stay human.

Pipeline diagram of the quality gates for agent design output. Agent-generated work flows into executable checks — token audit, type check, lint, and accessibility audit — marked as automated gates. It then passes through visual QA with screenshots at fixed widths, an agent-run step whose findings a human reads, and into structured critique against named dimensions: hierarchy, density, tone, states, and accessibility, marked as a human-led gate. Recurring critique feeds a harness update where repeated feedback becomes a rule, with a dashed line returning to generation, and the pipeline ends at the ship decision, which is always human. Green stripes mark automated gates, yellow marks agent-run steps, and blue marks human-led gates.
Generation and visual QA are agent-run; the executable checks are automated and run on every pass; structured critique, the harness update, and the ship decision are human-led. The dashed line is the compounding step: feedback that recurs becomes a rule, so the next generation starts from a stricter harness.

The pipeline is layered on purpose: each gate catches what it encodes, and only the human catches the rest.

Slide notes

Walk the pipeline left to right and name the ownership of each stage, because the colors on the diagram carry the argument. Generation is agent-run and produces something plausible — which is the problem the rest of the pipeline exists to solve. The executable checks are the automated gate: token audit, type check, lint, accessibility audit, all run by the agent on its own output, identical on every pass. Visual QA is agent-run but human-read: the agent captures screenshots at fixed widths, compares them against the stated intent, and writes prioritized findings; a person reads them before anything changes. Structured critique is human-led — the named dimensions from earlier in the module, with severity by user impact and judgment calls separated from mismatches.

The harness update is the stage teams skip and the one that compounds. When the same finding shows up twice, it stops being feedback and becomes a rule — in the harness as prose, or better, as another executable check. The dashed line back to generation is the point of the whole diagram: each loop through the pipeline should make the next loop's first draft better, because the harness got stricter. A team whose critique never reaches the harness gives the same feedback forever.

The ship decision stays human, full stop, and it is worth saying why even when every check passes: the checks verify what was encoded, and nobody has encoded whether this is the right thing to ship. That judgment is the accountability the Module 1 'what stays human' slide promised would not transfer.

Narration for this slide

Here is the whole module in one picture. A generated draft enters the pipeline. First the executable checks — tokens, types, lint, accessibility — fully automated, run by the agent on its own output. Then visual QA: screenshots at fixed widths compared against intent, with findings a human reads. Then structured critique against the named dimensions — that gate is yours. Then the step most teams skip: the harness update, where any feedback you have given twice becomes a rule, so the next draft starts better. And finally the ship decision, which is always human, because passing every check is not the same as being the right thing to ship. Automated gates catch what they encode. You catch the rest.

Slide 8 of 13 · 16:9

What agents are still bad at

An honest account of the limits, because designing the workflow around them works better than discovering them in production.

  • Ambiguity — an unclear brief gets filled with the most common answer, stated confidently
  • Organisational politics — stakeholder history, ownership, and risk never appear in the files
  • Novel taste — agents converge on the centre of the distribution; a new direction has to come from you
  • Long-horizon consistency — quality drifts across long sessions and between sessions without a harness
  • Knowing when to stop — agents will keep producing plausible output well past the point of usefulness

None of these are temporary bugs to wait out. They are the reasons the gates, the harness, and the human decision exist at all.

Slide notes

Resist the temptation to soften this slide, because the credibility of the whole course rests on it. The limits listed are not edge cases; they are structural, and the workflow taught in this course is shaped around them rather than in denial of them.

Take them in turn. Ambiguity: an agent does not flag a vague brief and wait — it fills the gap with the statistically most common answer and presents it with full confidence, which is exactly the slop mechanism from earlier. Organisational politics: the context that decides many design calls — who owns this surface, what failed last year, which stakeholder needs to be consulted before the navigation changes — never appears in the repository, so the agent cannot weigh it. Novel taste: agents are interpolation machines; they produce competent versions of things that exist, and a genuinely new direction still has to come from a person with a point of view. Long-horizon consistency: quality drifts over long sessions and across sessions, which is precisely why the harness exists — it is the memory the agent does not have. And knowing when to stop: the screenshot-iteration plateau from the visual QA work is a small example of a general truth — improvement flattens after a few rounds, and past that point more iterations produce churn, not quality.

The field is moving — canvases opening to agents, design systems becoming machine-readable contracts — but the public evidence says the human-approval ceiling is holding: agents propagate and verify; humans approve, decide, and own taste. Plan on that division continuing, and treat anyone selling full autonomy on these dimensions with suspicion.

Narration for this slide

Now the slide this course owes you: what agents are still bad at. They handle ambiguity badly — a vague brief gets filled with the most common answer, delivered confidently. They cannot read organisational politics, because none of it is in the files. They do not produce novel taste; they converge on the centre of the distribution, and a new direction has to come from you. They drift over long horizons — across a long session and between sessions — which is why the harness exists. And they do not know when to stop; they will keep producing plausible output long after it stopped improving. None of this is a bug that next year's model fixes. These limits are why the gates exist.

Slide 9 of 13 · 16:9

The ship checklist

What gets verified before agent-produced design work leaves the loop. Copy it, then make it yours.

  • Executable checks pass: token audit, type check, lint, and the accessibility audit on the rendered route
  • Visual QA done at the standard widths, states included, with no P0 or P1 finding open
  • Critique run against the named dimensions; every finding has evidence and a severity
  • Open judgment calls are logged with a named human owner — not silently 'fixed' by the agent
  • The diff stays inside the brief's scope; anything outside it is explained or reverted
  • Recurring feedback from this run has been written into the harness or turned into a check
  • A human has decided to ship, and knows they are the one answering for it

The checklist is short on purpose. If an item cannot be verified in minutes, it belongs in a gate earlier in the pipeline, not at the end.

Slide notes

This is the artifact participants should leave the course with, so present it as a starting point to edit rather than a standard to obey. The seven items deliberately mirror the pipeline: the first two are the automated and agent-run gates, the middle three are the human review, the sixth is the harness update, and the last is the decision. If a team's checklist diverges from their pipeline, one of the two is fiction.

A few items reward emphasis. No open P0 or P1 is the stopping condition borrowed from the visual QA workflow — polish-level findings can ship as logged debt, hierarchy and task blockers cannot. Judgment calls logged with an owner is the anti-pattern guard: the alternative is the agent quietly resolving taste questions inside a fix pass, which is how a product's design direction changes without anyone deciding it should. Scope held is the cheap defence against quiet creep — a one-line check of which files changed against what the brief named.

The sixth item is the one that makes the checklist compound rather than just gate: every shipped piece of work should leave the harness slightly stricter than it found it. And the last item is deliberately phrased as a person deciding, not a status flipping — it is the accountability thread that runs from Module 1 to here. Encourage teams to keep the list to roughly this length; checklists that grow past a screen stop being run.
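For teams that want the checklist in a form a script can read, one possible shape is sketched below. It mirrors the seven items on the slide; the `ShipReview` field names and the message wording are assumptions, and the last item is modelled as a named person rather than a boolean on purpose.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ShipReview:
    checks_passed: Dict[str, bool]     # e.g. {"token audit": True, "lint": True}
    open_severities: List[str]         # severities of findings still open
    findings_missing_evidence: int = 0
    judgment_calls_without_owner: int = 0
    unexplained_out_of_scope_files: List[str] = field(default_factory=list)
    harness_updated: bool = False
    decided_by: str = ""               # the human answering for the ship


def reasons_not_to_ship(review: ShipReview) -> List[str]:
    """Return one message per unmet checklist item; empty means ready."""
    reasons = []
    failed = sorted(name for name, ok in review.checks_passed.items() if not ok)
    if failed:
        reasons.append("executable checks failing: " + ", ".join(failed))
    if any(sev in ("P0", "P1") for sev in review.open_severities):
        reasons.append("open P0 or P1 finding")
    if review.findings_missing_evidence:
        reasons.append("findings without evidence or severity")
    if review.judgment_calls_without_owner:
        reasons.append("judgment calls without a named owner")
    if review.unexplained_out_of_scope_files:
        reasons.append("unexplained files outside the brief's scope")
    if not review.harness_updated:
        reasons.append("recurring feedback not written into the harness")
    if not review.decided_by.strip():
        reasons.append("no named human has decided to ship")
    return reasons
```

A script like this can only report whether the items were recorded as done. It cannot make the decision, which is why `decided_by` holds a name that a person has to supply.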

Narration for this slide

Here is the ship checklist — what gets verified before agent-produced work leaves the loop. The executable checks pass: tokens, types, lint, accessibility. Visual QA is done at the standard widths, states included, with no P0 or P1 still open. The critique ran against named dimensions, and every finding has evidence and a severity. Judgment calls are logged with a human owner, not quietly resolved by the agent. The diff stayed inside the brief's scope. Recurring feedback went into the harness. And a person decided to ship, knowing their name is on it. Seven items, deliberately short. If something on your version takes an hour to verify, it belongs in an earlier gate, not at the end.

Slide 10 of 13 · 16:9

Keeping the harness alive

A harness is not a setup task you finish. It is the place recurring critique goes to stop recurring.

  • Twice is the threshold: feedback given twice becomes a harness rule or an executable check
  • Prefer checks over prose — a rule the audit enforces never has to be repeated in review
  • Record approved exceptions, or the next audit will try to 'fix' an intentional variation
  • Prune as deliberately as you add: stale rules and bloat erode the agent's trust in the file
  • Decisions that live only in a conversation are decisions the next session never heard

The teams whose agent output keeps improving are not the ones with better prompts. They are the ones whose harness absorbs every lesson the critique produces.

Slide notes

This slide closes the loop the diagram drew with its dashed line. Module 4 built the harness; this slide is about what keeps it true. The operating rule is simple enough to be memorable: twice is the threshold. The first time you give a piece of feedback, it is critique. The second time, it is a missing rule — write it into the harness, or better, turn it into a check the audit runs. Teams that skip this step experience the most demoralising version of agent work: giving the same feedback forever, to a collaborator that genuinely cannot remember it.
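The threshold is simple enough to automate if reviewers tag their feedback. A minimal sketch, assuming a hypothetical feedback log of (run id, tag) pairs and reading "twice" as two distinct runs rather than two mentions in the same review:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rule_candidates(
    feedback_log: Iterable[Tuple[str, str]],
    threshold: int = 2,
) -> List[str]:
    """Return feedback tags that appeared in at least `threshold` distinct runs."""
    runs_by_tag: Dict[str, Set[str]] = defaultdict(set)
    for run_id, tag in feedback_log:
        runs_by_tag[tag].add(run_id)
    return sorted(tag for tag, runs in runs_by_tag.items() if len(runs) >= threshold)
```

The output is a list of candidates, not rules. Someone still decides whether each one becomes prose in the harness or an executable check.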

Two refinements keep the practice honest. Record approved exceptions alongside the rules — the design-system audit work makes this point sharply, because an audit that does not know a variation was intentional will keep flagging it, and eventually someone 'fixes' a deliberate decision. And prune as deliberately as you add: a harness that accumulates stale rules, contradictions, and restated documentation stops being trusted, by the agent and by the team. The instruction file is a design system for agent behaviour, and like any design system it rots without maintenance.
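Recording exceptions only helps if the audit actually reads them. One way this could look, as a sketch: audit findings are assumed to be (file, rule) pairs, and the `ApprovedException` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ApprovedException:
    file: str
    rule: str
    reason: str       # why the variation is intentional
    approved_by: str  # the human who signed it off


def apply_exceptions(
    findings: Iterable[Tuple[str, str]],
    exceptions: Iterable[ApprovedException],
) -> Tuple[List[Tuple[str, str]], List[ApprovedException]]:
    """Split audit findings into (still_open, waived_by_exception)."""
    index = {(e.file, e.rule): e for e in exceptions}
    still_open, waived = [], []
    for finding in findings:
        if finding in index:
            waived.append(index[finding])
        else:
            still_open.append(finding)
    return still_open, waived
```

Returning the waived list, rather than silently dropping those findings, keeps the exceptions visible so stale ones can be pruned along with stale rules.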

The last bullet is the cultural one: a decision that lives only in a conversation is a decision the next session never heard. Conversations are where decisions get made; the harness, the DESIGN.md, and the checks are where they get kept. That is also the honest answer to the long-horizon consistency limit from two slides ago — the consistency does not come from the model, it comes from the file.

Narration for this slide

One habit separates teams whose agent output keeps getting better from teams stuck giving the same feedback forever: the harness update. The rule is simple — feedback you have given twice becomes a harness rule, or better, an executable check. Prefer checks over prose, because a rule the audit enforces never needs repeating in review. Write down approved exceptions too, or the next audit will try to fix something you did on purpose. And prune — stale rules erode trust in the file. Remember: a decision that lives only in a conversation is one the next session never heard. The harness is where critique goes to stop recurring.

Slide 11 of 13 · 16:9

Course retrospective: the Module 1 task, run through the full pipeline

In Module 1 you sketched a task on paper. In Module 3 you ran it. Now run it through every gate this module added — and compare against that first page.

  • Take the Module 1 task (or its Module 3 result) and run the executable checks on it: tokens, types, accessibility
  • Capture screenshots at viewport widths of 390, 768, and 1440 pixels, including at least one non-happy-path state
  • Run a structured critique against the five dimensions; require evidence and severity per finding
  • Write your own ship checklist by editing the one from this module — cut or add no more than three items
  • Add at least one recurring piece of feedback to your harness as a rule or a check
  • Compare the result, and the process, against the page you sketched in Module 1

The comparison is the point: the gap between the Module 1 sketch and today's run is the course, made visible in your own work.

Slide notes

This exercise is the course's full circle, so frame it that way. In Module 1 participants sketched a task on paper before they had touched an agent — the brief, the gates they would hold, what could go wrong. In Module 3 they wrote the brief properly and ran it. This retrospective takes that same artifact and pushes it through the full pipeline: executable checks, visual QA at fixed widths, structured critique with evidence and severity, a harness update, and a personal ship checklist.
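The "visual QA at fixed widths" step is easier to hold constant if the capture plan is written down before anything is captured. The sketch below only builds the list of screenshots to take; it does not drive a browser. The widths come from the exercise, while the page name, state names, and filename pattern are hypothetical.

```python
from itertools import product

# Widths from the exercise; the same three are used on every run.
WIDTHS = (390, 768, 1440)

# Assumed label for the happy-path state.
HAPPY_PATH = "default"


def screenshot_plan(page: str, states: list[str]) -> list[str]:
    """Return one filename per (state, width) pair.

    Raises ValueError when the plan covers only the happy path.
    """
    if not any(state != HAPPY_PATH for state in states):
        raise ValueError("include at least one non-happy-path state")
    return [
        f"{page}--{state}--{width}.png"
        for state, width in product(states, WIDTHS)
    ]


if __name__ == "__main__":
    for name in screenshot_plan("settings", ["default", "error"]):
        print(name)
```

Because the filenames encode state and viewport, a critique finding can cite its evidence by name.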

The deliverables to insist on are the comparison and the checklist. The comparison against the Module 1 page is where the learning becomes visible: most participants find their original sketch named the right gates but wildly underestimated how much of the quality work could be made executable, and overestimated how much the agent would get right unprompted. The personal ship checklist matters because the one in this module is a default, not a doctrine — editing it forces a decision about what their context actually requires. The constraint of changing no more than three items is deliberate: it keeps the checklist short and the editing honest.
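The three-item constraint can be checked mechanically. The sketch below paraphrases the module's seven checklist items and counts cuts and additions against that default. The wording of the items, the function names, and the decision to count a reworded item as one cut plus one addition are all assumptions of this example.

```python
# From the exercise: cut or add no more than three items.
MAX_CHANGES = 3

# Paraphrase of the module's seven-item ship checklist.
DEFAULT_CHECKLIST = [
    "executable checks pass",
    "visual QA done at standard widths, no P0 or P1 open",
    "critique ran against named dimensions with evidence and severity",
    "judgment calls logged with a human owner",
    "diff stayed inside the brief's scope",
    "recurring feedback went into the harness",
    "a person decided to ship",
]


def checklist_changes(default: list[str], edited: list[str]) -> tuple[list[str], list[str]]:
    """Return (cut, added) items; a reworded item counts as one of each."""
    cut = [item for item in default if item not in edited]
    added = [item for item in edited if item not in default]
    return cut, added


def within_edit_budget(default: list[str], edited: list[str]) -> bool:
    """True when cuts plus additions stay within the three-item limit."""
    cut, added = checklist_changes(default, edited)
    return len(cut) + len(added) <= MAX_CHANGES


if __name__ == "__main__":
    # Hypothetical personal version: one item swapped for a team-specific one.
    mine = [item for item in DEFAULT_CHECKLIST if "harness" not in item]
    mine.append("copy reviewed by content design")
    print(checklist_changes(DEFAULT_CHECKLIST, mine))
    print(within_edit_budget(DEFAULT_CHECKLIST, mine))
```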

If running this live, the discussion question that produces the best conversation is the harness item: what single rule or check, added today, would have saved the most correction across the whole course's exercises? The answers tend to be small, specific, and immediately reusable — which is exactly the habit the course is trying to leave behind.

Narration for this slide

Time to close the loop you opened in Module 1. Take that original task — the one you sketched on paper, then ran in Module 3 — and push it through the full pipeline. Run the executable checks: tokens, types, accessibility. Capture screenshots at the three standard widths, including a state that is not the happy path. Critique it against the five dimensions, with evidence and severity for every finding. Edit this module's ship checklist into your own — change no more than three items. Add at least one recurring piece of feedback to your harness. Then put today's result next to the page you sketched in Module 1. That gap is the course, visible in your own work.

Slide 12 of 13 · 16:9

Where to go next in the curriculum

This course built the foundation: the loop, the brief, the harness, the gates. The next courses each go deep on one part of it.

  • Claude Code for Designers — hands-on depth in one platform: sessions, skills, MCP connections, and daily working habits
  • Design Systems for Agents — turning your design system into the machine-readable contract agents execute against
  • Agentic Prototyping — prototype-first workflows: from brief to testable artifact fast, without polluting production
  • Whichever you pick, the fundamentals transfer: the brief, the harness, the critique, and the gates come with you

Choose by your nearest bottleneck: tool fluency, system legibility, or speed to a testable artifact.

Slide notes

Keep this slide practical: the goal is to route people to the right next course, not to advertise all of them. The honest framing is that this course taught the loop and the next courses each deepen one part of it. Claude Code for Designers is the tool-depth course — for people who picked their platform and now want fluency: session habits, skills, MCP connections to canvases and browsers, and the daily mechanics this course deliberately stayed above. Design Systems for Agents goes deep on the harness side: tokens, component inventories, anti-patterns, and audits — the work of making a design system legible to machines, which the field-notes article argues is the lowest-regret investment a team can make this year because it pays off under every version of where the tooling goes. Agentic Prototyping is the speed course: prototype-first workflows that get from a brief to something testable in a session, kept deliberately separate from production code.

The selection heuristic in the highlight works in practice: pick by the nearest bottleneck. If the agent sessions themselves feel clumsy, take the platform course. If output keeps drifting from the system, take the design-systems course. If the team's problem is getting ideas in front of users fast enough, take the prototyping course.

Close with the reassurance that matters given how fast the platform layer churns: everything from this course — the brief, the harness, the critique dimensions, the gates — is written to the open-standards layer and travels with you regardless of which course, or which platform, comes next.

Narration for this slide

So where do you go from here? Three courses continue the curriculum, and each goes deep on one part of what you have just learned. Claude Code for Designers is the platform-depth course — sessions, skills, MCP connections, the daily mechanics. Design Systems for Agents is about making your design system the machine-readable contract agents execute against — tokens, inventories, audits. And Agentic Prototyping is the speed course: from brief to testable artifact in a session, without touching production. Pick by your nearest bottleneck — tool fluency, system legibility, or speed to something testable. And whichever you choose, the fundamentals you built here come with you.

Slide 13 of 13 · 16:9

Summary, and the close of the course

  • Critique is where the quality comes from: evidence, named dimensions, severity, and a human gate
  • Make checks executable — tokens, types, lint, accessibility — and let the agent run them on itself
  • Name the failure patterns; every recurring failure should have a gate that catches it
  • Be honest about the limits: ambiguity, politics, novel taste, and long-horizon consistency stay yours to manage
  • Feedback given twice becomes a harness rule; the ship decision is always a person

Six modules ago this was a vocabulary lesson. Now it is a working practice: a loop you can run, gates you can hold, and a harness that gets stricter every time you ship.

Slide notes

Recap the module first, then zoom out to the course. The module's argument in one line: agent output becomes good through review, and review becomes cheap through structure — evidence instead of opinion, named dimensions instead of vibes, executable checks instead of repeated corrections, and a harness that absorbs every lesson so it never has to be taught twice. The limits slide is part of the summary, not a footnote: the gates exist precisely because ambiguity, politics, novel taste, and long-horizon consistency do not transfer to the agent.
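"Evidence instead of opinion, named dimensions instead of vibes" can be made concrete as a record shape that a finding has to fit before it is accepted. This is a sketch under stated assumptions: the five dimensions and the P0 and P1 levels come from this module, while the P2 and P3 levels, the field names, and the function names are hypothetical.

```python
from dataclasses import dataclass

# The five baseline critique dimensions named in this module.
DIMENSIONS = {"hierarchy", "density", "tone", "states", "accessibility"}

# P0 and P1 appear in the ship checklist; P2 and P3 are assumed lower levels.
SEVERITIES = {"P0", "P1", "P2", "P3"}
BLOCKING = {"P0", "P1"}


@dataclass(frozen=True)
class Finding:
    dimension: str
    severity: str
    evidence: str  # for example a screenshot filename plus viewport, or a check name
    resolved: bool = False


def problems(finding: Finding) -> list[str]:
    """List the reasons a finding is not yet acceptable as written."""
    issues = []
    if finding.dimension not in DIMENSIONS:
        issues.append(f"unknown dimension: {finding.dimension!r}")
    if finding.severity not in SEVERITIES:
        issues.append(f"unknown severity: {finding.severity!r}")
    if not finding.evidence.strip():
        issues.append("finding names no evidence")
    return issues


def blocks_ship(findings: list[Finding]) -> bool:
    """True while any P0 or P1 finding is still open."""
    return any(f.severity in BLOCKING and not f.resolved for f in findings)


if __name__ == "__main__":
    finding = Finding("states", "P1", "settings--error--390.png")
    print(problems(finding))
    print(blocks_ship([finding]))
```

A clean result here only means the findings are well-formed and none is blocking. Whether to ship remains a person's decision.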

Then close the course. Module 1 gave the vocabulary and the loop; Module 2 slowed the loop to working speed; Module 3 made the brief carry intent; Module 4 built the harness that carries everything durable; Module 5 chose the artifacts and tools the loop runs on; and this module added the gates that decide what ships. The thread through all six is the same division of labour: the agent runs production, and the human holds the brief, the critique, and the decision — which is also where the value of the job is moving.

If participants did the exercises, the artifacts they should leave with are concrete: a brief template they have used, a harness file they have started, a critique rubric with named dimensions, and a ship checklist they have edited into their own. Point them at the next courses, and at the school's articles for the workflows this course could only gesture at — visual QA, design-system audits, and the field notes on where the tooling is heading.

Narration for this slide

Let's close the module, and the course. This module's claim was simple: critique is where the quality comes from. Give the review evidence, name the dimensions, assign severity by user impact, and make every mechanical check executable so the agent runs it on itself. Name the failure patterns, give each one a gate, and when feedback recurs, write it into the harness so it stops recurring. Stay honest about the limits — ambiguity, politics, novel taste, long horizons — because those are exactly why the gates and the final decision stay human. Six modules ago this was vocabulary. Now it is a practice you can run on Monday. Thank you for taking the course — and pick your next one by your nearest bottleneck.

Module transcript
Module 6, narrated slide by slide

Slide 1 · Critique, Quality Gates, and Limits

Welcome to the final module of Agentic Design Fundamentals. Everything so far has been about getting a better first draft out of the agent — the brief, the harness, the right artifacts. This module is about what happens between that draft and the decision to ship. We will cover visual QA with real screenshot evidence, critique against named dimensions instead of vibes, the checks you can make fully automatic, and the failure patterns worth automating against. Then we get honest about what agents are still bad at, build your ship checklist, and close the loop on the exercise you started back in Module 1. Critique is where the quality comes from. Let's do it properly.

Slide 2 · Visual QA: screenshots as evidence, not decoration

Here is the problem visual QA solves. A generated page can compile, pass the type check, and still fail the design — the hierarchy is wrong, the density is off, the layout collapses on a phone. None of that shows up in a diff. So we give the review evidence: screenshots at the same fixed widths every run, the important states and not just the happy path, an accessibility report, and one paragraph reminding everyone what the page was for. Then one rule does most of the work: every finding has to name its evidence — which file, which viewport, which check. And the agent reports findings; a human approves the fixes. That order is not negotiable.

Slide 3 · Critique dimensions: named, not vibes

If you ask an agent whether a design is good, you get adjectives. If you ask it to inspect against named dimensions, you get findings. Here are the five we use as a baseline: hierarchy — is the information in the right order for the job; density — is it dashboard-tight or reading-loose, deliberately; tone — does it sound and look like this product, not the genre average; states — do empty, loading, error, and disabled actually exist; and accessibility — can keyboard and screen-reader users finish the task. Each one has evidence that settles it. Then two rules: severity goes by user impact, not by how easy the fix is, and judgment calls get routed to a human — the agent does not fix taste.

Slide 4 · Making checks executable

Here is the highest-leverage habit in this module: anything you have corrected twice should become a check that runs without you. Token audits catch raw colors and off-scale spacing. The type check catches invented component props before anyone reads the code. Verify scripts enforce naming and structure. An accessibility audit runs axe against the real rendered page. And critically, the agent runs all of these on its own output before you ever look. That does not replace your critique — a clean audit can still be a bad design. What it does is clear the mechanical failures out of the way, so your attention goes where only your attention works: judgment.

Slide 5 · Anti-slop rules: the failure patterns worth automating against

Let's name the failure everyone has seen: the purple gradient hero, the generic SaaS layout, emoji as bullet points, a low-contrast badge, a marketing band with a call-to-action your product does not have. We call it slop, but it is really just the most common answer on the public web showing up where your product needed a specific one. Which means it responds to rules. Name the patterns your team keeps rejecting and write them into the harness as explicit prohibitions. Then push whatever you can into executable checks — token audits, contrast assertions, lint rules. Slop survives on vagueness. Named patterns and automated gates starve it.

Slide 6 · Common failure modes, mapped to the gate that catches them

Here are the failure modes you will actually see, mapped to the gate that catches each one cheaply. Invented component props — caught by the type check the agent runs on itself. Generic patterns — caught by anti-slop rules and the tone critique. Quiet scope creep, where the agent improves things nobody asked about — caught at the plan review and by flagging any file outside the brief's scope. Token drift — caught by the audit script. Missing states — caught by the states row of your critique. Accessibility regressions — caught by axe plus a keyboard pass. Notice the pattern: mechanical failures go to automated gates, judgment failures go to critique, scope failures go to the plan review. Every recurring failure should have an address.

Slide 7 · The quality-gate pipeline

Here is the whole module in one picture. A generated draft enters the pipeline. First the executable checks — tokens, types, lint, accessibility — fully automated, run by the agent on its own output. Then visual QA: screenshots at fixed widths compared against intent, with findings a human reads. Then structured critique against the named dimensions — that gate is yours. Then the step most teams skip: the harness update, where any feedback you have given twice becomes a rule, so the next draft starts better. And finally the ship decision, which is always human, because passing every check is not the same as being the right thing to ship. Automated gates catch what they encode. You catch the rest.

Slide 8 · What agents are still bad at

Now the slide this course owes you: what agents are still bad at. They handle ambiguity badly — a vague brief gets filled with the most common answer, delivered confidently. They cannot read organisational politics, because none of it is in the files. They do not produce novel taste; they converge on the centre of the distribution, and a new direction has to come from you. They drift over long horizons — across a long session and between sessions — which is why the harness exists. And they do not know when to stop; they will keep producing plausible output long after it stopped improving. None of this is a bug that next year's model fixes. These limits are why the gates exist.

Slide 9 · The ship checklist

Here is the ship checklist — what gets verified before agent-produced work leaves the loop. The executable checks pass: tokens, types, lint, accessibility. Visual QA is done at the standard widths, states included, with no P0 or P1 still open. The critique ran against named dimensions, and every finding has evidence and a severity. Judgment calls are logged with a human owner, not quietly resolved by the agent. The diff stayed inside the brief's scope. Recurring feedback went into the harness. And a person decided to ship, knowing their name is on it. Seven items, deliberately short. If something on your version takes an hour to verify, it belongs in an earlier gate, not at the end.

Slide 10 · Keeping the harness alive

One habit separates teams whose agent output keeps getting better from teams stuck giving the same feedback forever: the harness update. The rule is simple — feedback you have given twice becomes a harness rule, or better, an executable check. Prefer checks over prose, because a rule the audit enforces never needs repeating in review. Write down approved exceptions too, or the next audit will try to fix something you did on purpose. And prune — stale rules erode trust in the file. Remember: a decision that lives only in a conversation is one the next session never heard. The harness is where critique goes to stop recurring.

Slide 11 · Course retrospective: the Module 1 task, run through the full pipeline

Time to close the loop you opened in Module 1. Take that original task — the one you sketched on paper, then ran in Module 3 — and push it through the full pipeline. Run the executable checks: tokens, types, accessibility. Capture screenshots at the three standard widths, including a state that is not the happy path. Critique it against the five dimensions, with evidence and severity for every finding. Edit this module's ship checklist into your own — change no more than three items. Add at least one recurring piece of feedback to your harness. Then put today's result next to the page you sketched in Module 1. That gap is the course, visible in your own work.

Slide 12 · Where to go next in the curriculum

So where do you go from here? Three courses continue the curriculum, and each goes deep on one part of what you have just learned. Claude Code for Designers is the platform-depth course — sessions, skills, MCP connections, the daily mechanics. Design Systems for Agents is about making your design system the machine-readable contract agents execute against — tokens, inventories, audits. And Agentic Prototyping is the speed course: from brief to testable artifact in a session, without touching production. Pick by your nearest bottleneck — tool fluency, system legibility, or speed to something testable. And whichever you choose, the fundamentals you built here come with you.

Slide 13 · Summary, and the close of the course

Let's close the module, and the course. This module's claim was simple: critique is where the quality comes from. Give the review evidence, name the dimensions, assign severity by user impact, and make every mechanical check executable so the agent runs it on itself. Name the failure patterns, give each one a gate, and when feedback recurs, write it into the harness so it stops recurring. Stay honest about the limits — ambiguity, politics, novel taste, long horizons — because those are exactly why the gates and the final decision stay human. Six modules ago this was vocabulary. Now it is a practice you can run on Monday. Thank you for taking the course — and pick your next one by your nearest bottleneck.