Agentic Design School
Module 4 of 5
40–50 minutes

Design Review and Critique with Agents

Accessibility and Content Review

Two reviews that are chronically deferred, made routine: accessibility sweeps that go beyond automated contrast checks, and UX-writing review that holds copy to the product's voice with the same rigour as the visuals.

Duration: 40–50 minutes

Slides: 14 slides with notes and narration

Learning objectives

  • Run accessibility reviews covering structure, keyboard flow, and screen-reader sense, not just contrast.
  • Audit interface copy for voice, clarity, and consistency against a written standard.
  • Decide which findings agents can fix directly and which need design or legal judgment.

Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 14 · 16:9

The Reviews Everyone Defers

Design Review and Critique with Agents · Module 4 of 5

  • Accessibility review beyond the automated checker
  • Keyboard and screen-reader walkthroughs as agent runs
  • Content review: voice, clarity, terminology, errors
  • Which findings agents fix, and which need human judgment

Accessibility and content review fail for the same reason: they are everyone's responsibility and nobody's calendar slot. This module makes both routine.

Slide notes

Open by naming the pattern honestly. Almost no team disagrees that accessibility and copy quality matter, and almost every team defers the review of both. They are deferred because they are slow to do well by hand, because the people who can do them well are scarce, and because the work is mostly careful reading at scale — exactly the kind of work that loses to a deadline every time.

This module applies the structure built in Modules 1 to 3 — named criteria, findings with evidence, severity, and a human gate — to these two specific review types. The methods are not new. What changes is the cost: an agent can read every route and every string against the same standard in under an hour, which means the review can happen every release instead of once before an audit.

Set one expectation early: this module will be careful about what these reviews can and cannot prove. Passing an agent sweep does not make a product accessible, and a copy audit does not make the writing good. The sweeps make the mechanical part continuous so the human judgment is spent on decisions rather than on hunting.

Narration for this slide

Welcome to Module 4. This one is about the two reviews everyone agrees matter and almost everyone defers: accessibility and content. They get deferred for the same reason — done properly, both are careful reading at scale, and careful reading at scale always loses to the deadline. We are going to take the critique structure from the earlier modules — named criteria, findings with evidence, severity, a human gate — and apply it to these two reviews so they become routine passes rather than special projects. And we will be honest about the limits: the sweep finds the failures it can detect. The judgment about what to do with them stays with you.

Slide 2 of 14 · 16:9

Why these two reviews stay deferred

Neither review fails because people do not care. They fail because the work is structurally easy to postpone.

  • Both accumulate one component and one string at a time — no single change looks like the problem
  • Both need exhaustive reading: every route, every state, every string, against the same standard
  • Specialist time is scarce: one accessibility lead or content designer across many teams
  • The forcing functions arrive late: an external audit, a procurement deadline, a localisation push

By the time something forces the review, the findings arrive in the hundreds and the fix becomes a project instead of a habit.

Slide notes

The accumulation point deserves the most time. Nobody ships an inaccessible product or an incoherent voice on purpose. They ship it one decision at a time: an icon button without a label here, a low-contrast badge there, an engineer writing an error string at midnight, a marketer naming a feature. Each change is reasonable locally. The problems only become visible when someone reads the whole surface as a whole — and users are usually the first people who do.

The scarcity point matters for how the agent fits in. Most organisations have far fewer accessibility specialists and content designers than product teams. The traditional answer is to ration their time: they review the highest-risk flows and hope the rest holds. The agent does not replace them; it changes what their scarce time is spent on. Instead of reading 1,800 strings or tabbing through forty routes, they review a ranked findings list and make the calls only they can make.

The forcing-function point is the cost argument. When the trigger is an external audit or a statutory deadline, fixes happen under pressure, in bulk, and late — which is the most expensive way to do them. The case studies later in this module show the same fixes costing a fraction when they are caught as part of a routine sweep.

Narration for this slide

Why do these two reviews keep slipping? Not because anyone disputes that they matter. They slip because the problems accumulate one small decision at a time — an unlabelled icon button here, an error string written at midnight there — and no single change looks like the problem. Reviewing properly means reading everything against the same standard, and the people who can do that well are spread thin across many teams. So the review waits for a forcing function: an audit, a procurement requirement, a localisation push. And by then the findings arrive in the hundreds and the fix is a project. The agent's job here is to make the reading continuous, so the specialists spend their time on decisions instead.

Slide 3 of 14 · 16:9

Accessibility beyond the automated checker

Automated checkers are precise about what they can detect — and they detect well under half of what WCAG asks for.

  • Tools catch the mechanical part: contrast ratios, missing labels and alt text, invalid ARIA, unlabelled form fields
  • They cannot judge meaning: whether the accessible name describes the control, or the error tells the user what to do next
  • They cannot judge order: whether keyboard focus follows the task, or only the accident of the DOM
  • Agents review structure, names, focus order, and announcements — the part that needs reading, not measuring

Run the tools first, then the agent. Tool results anchor the agent pass and stop it re-deriving what a checker already proved.

Slide notes

Be precise about the division of labour, because over-claiming here damages credibility with anyone who has done accessibility work. Automated checkers like axe-core are excellent at what they measure: contrast ratios against the WCAG AA thresholds, missing alt attributes, invalid ARIA, form fields without a programmatic label, broken landmark structure. They are also famously incomplete — common estimates put automated coverage at well under half of WCAG criteria, and the gap is precisely the part that requires judgment about meaning.

That gap is where the agent review earns its place. An agent can read a component and judge whether the accessible name actually describes what the control does, whether the focus order follows the order of the task rather than the order someone happened to write the DOM, whether an error message tells the user what to do next, whether a status is communicated by colour alone, and whether a custom dropdown behaves like the native pattern it imitates. None of that is something a rule-based checker can decide.

The sequencing matters as much as the split. The accessibility-sweep workflow this module is built on runs the static checks first and hands the results to the agent. That keeps the agent from spending its effort confirming what a tool already proved, and it gives every agent finding a concrete anchor: this axe violation, in this file, on this route. Findings without anchors get argued with; findings with anchors get fixed.
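The mechanical side of that split is plain arithmetic, which is why a tool can prove it. The following is a minimal Python sketch of the WCAG relative-luminance and contrast-ratio formulas, assuming six-digit hex colours. The function names are illustrative and are not part of axe-core or the workflow library.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light, per the WCAG definition."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a colour written as '#RRGGBB'."""
    h = hex_colour.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colours, always lighter over darker."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground: str, background: str) -> bool:
    """WCAG 1.4.3 at level AA asks for at least 4.5:1 for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5


print(round(contrast_ratio("#000000", "#FFFFFF"), 1))  # 21.0
```

A check like this can prove that a ratio falls below the threshold. It cannot say whether the colour was an intentional exception, which is why the result reaches the agent as an anchor rather than a verdict.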

Narration for this slide

Let's start with accessibility, and with an honest claim about tools. Automated checkers are precise about what they can detect — contrast ratios, missing labels, invalid ARIA, unlabelled form fields. But they cover well under half of what WCAG actually asks for, and the missing half is the half that needs judgment. Does the accessible name describe the control, or is it the icon's file name? Does focus order follow the task, or just the accident of the DOM? Does the error message tell the user what to do next? That reading is what the agent pass does. Run the tools first, then the agent — the tool results give every agent finding a concrete anchor.

Slide 4 of 14 · 16:9

Three bands of review: tools, agents, judgment

Each band catches what it is good at and passes the rest down. The bottom band is decisions, not detection.

Three stacked bands. The top band is automated checks run by tools such as axe-core and Playwright: they catch contrast ratios, missing labels and alt text, invalid ARIA, and unlabelled form fields, but miss whether names make sense, whether focus order follows the task, and anything about copy meaning. The middle band is the agent-run semantic and content review: it catches names that do not describe the control, focus order against task order, error messages with no next step, colour-only signals, and terminology and voice drift, but misses real assistive technology behaviour, naming decisions, and legal wording. The bottom band is human judgment: severity calls and exceptions, naming and voice decisions, testing with real screen readers and users, and release trade-offs. Arrows show findings flowing down from each band as evidence for the next.
Tools run first and anchor the agent pass; the agent pass covers meaning, order, and copy at scale; ranked findings reach the human severity gate. Each band makes the next one cheaper — none replaces it.

The bands stack: tools make the agent pass cheaper, the agent pass makes human judgment cheaper, and nothing in the bottom band can be delegated upward.

Slide notes

Walk the diagram top to bottom. Band one is the automated checks: axe-core, contrast tools, scripted keyboard walks recorded by something like Playwright. Their character is precise but incomplete — they prove specific failures and miss everything that requires interpretation. Band two is the agent review, one reviewer per route or surface, covering both the accessibility semantics and the content rules: names that do not describe the control, focus order against task order, errors with no next step, colour-only signals, and terminology, voice, and readability drift. Band three is human judgment: severity calls and approved exceptions, naming and voice decisions, testing with real assistive technology and real users, and the release trade-off itself.

The arrows are the point of the picture. Findings flow downward as evidence for the next band, and each band exists to make the one below it cheaper. The tool results stop the agent re-deriving proven facts. The agent's ranked, located, evidenced findings stop the human spending their judgment on hunting.

The right-hand column of each band lists what that band misses, and the bottom band's list is the one to dwell on: severity decisions, naming decisions, real screen reader behaviour across browser combinations, and release trade-offs cannot be delegated to either tools or agents. That is not a temporary limitation to be engineered away; it is where accountability lives, and the same point recurs from Module 1 of this course.

Narration for this slide

Here is the whole module in one picture. Three bands. At the top, automated checks — axe-core, contrast tools, scripted keyboard walks. Precise, but they miss everything that needs interpretation. In the middle, the agent review: names, focus order, error messages, colour-only signals, and on the content side, terminology and voice drift — judgment about meaning, applied at scale. At the bottom, human judgment: severity calls, naming decisions, testing with real screen readers and real users, and the call on what blocks a release. The arrows matter most: each band passes its findings down as evidence, and each band makes the next one cheaper. Nothing in the bottom band moves up.

Slide 5 of 14 · 16:9

Keyboard and screen-reader walkthroughs as agent runs

The sweep pattern from the workflow library: capture static evidence, fan out one reviewer per route, merge and rank.

  • A scan script walks each route: axe-core results plus a recorded tab-by-tab focus order
  • One read-only reviewer agent per route, holding every route to the same WCAG 2.2 AA standard
  • Every finding carries route, file, criterion, severity, evidence, and a concrete fix
  • Findings a tool proved are separated from findings based on the agent's judgment
  • The scan script is infrastructure: it re-runs after every fix pass so the numbers move visibly

The recorded focus order is the quiet hero: it turns "keyboard users keep getting lost" into a reproducible sequence anyone can read.

Slide notes

This slide describes the accessibility-sweep workflow from the school's workflow library, so point participants there for the full prompts and the scan script. The structure is simple enough to remember: capture static evidence first, fan out one reviewer agent per route, merge the findings into one ranked list, and hold a human gate before anything is fixed.

The scan script does two jobs. It runs axe-core against each route in scope, and it performs a basic keyboard walk: tab through the page a few dozen times, record which element receives focus at each step, and note any element that takes focus without a visible indicator. That recorded focus order is worth emphasising because it converts the vaguest class of accessibility complaint, "keyboard users get lost", into a concrete sequence that can be compared against the visual and task order. In the dashboard case study from the workflow, the recorded order made a problem visible that nobody had been able to reproduce reliably: the side panel was rendered early in the DOM and pulled forward in the tab order ahead of the filters.
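Once the focus order is recorded, comparing it against the task order is mechanical. The following is a minimal sketch, assuming the scan script has already produced the recorded sequence as a list of element names. The function `focus_order_issues` and the `ResultsTable` name are hypothetical, and the expected task order has to be written down by a person.

```python
def focus_order_issues(recorded, expected):
    """Compare a recorded tab sequence against the order the task needs.

    recorded: element names in the order they received focus.
    expected: element names in the order the task reaches them.
    Returns a list of human-readable issues; empty means the orders agree.
    """
    rank = {name: position for position, name in enumerate(expected)}
    issues = []
    seen = set()
    furthest_rank, furthest_name = -1, None

    for step, name in enumerate(recorded, start=1):
        if name not in rank or name in seen:
            continue  # ignore elements outside the task and repeat visits
        seen.add(name)
        if rank[name] < furthest_rank:
            issues.append(
                f"step {step}: {name} receives focus after {furthest_name}, "
                f"but the task reaches {name} first"
            )
        else:
            furthest_rank, furthest_name = rank[name], name

    for name in expected:
        if name not in seen:
            issues.append(f"{name} never receives focus")
    return issues


# Hypothetical dashboard: the task starts at the filters, not the side panel.
task_order = ["FilterRail", "ResultsTable", "SidePanel"]
tab_order = ["SidePanel", "FilterRail", "ResultsTable"]
for issue in focus_order_issues(tab_order, task_order):
    print(issue)
```

The output is a list of steps rather than an impression, which is the form a reviewer agent can cite as evidence and a developer can reproduce.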

The reviewer agent is defined once, as a read-only subagent, and reused on every route and every run. That is what keeps the standard consistent — every route is held to the same criteria, in this sweep and the next one. Requiring the agent to separate tool-proven findings from judgment-based findings keeps the report honest, and requiring a concrete fix on every finding keeps it actionable. The fix passes themselves happen later, under the severity gate covered on the next slide.
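The finding format and the merge step can be sketched as a small data structure. This is an illustration only: the field names follow the list on the slide, while the `Finding` class, the `merge_and_rank` function, the routes, and the file names are hypothetical rather than taken from the workflow library.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


@dataclass(frozen=True)
class Finding:
    route: str
    file: str
    criterion: str     # for example "WCAG 1.4.3"
    severity: str      # "P0" to "P3"
    evidence: str
    fix: str
    tool_proven: bool  # True if a checker proved it, False if it rests on agent judgment


def merge_and_rank(per_route_findings):
    """Flatten per-route lists into one list: most severe first, tool-proven first within a tier."""
    merged = [f for findings in per_route_findings for f in findings]
    return sorted(
        merged,
        key=lambda f: (SEVERITY_RANK[f.severity], not f.tool_proven, f.route),
    )


# Hypothetical output from two reviewer runs.
payment = [
    Finding("/checkout/payment", "PlaceOrderButton.tsx", "WCAG 1.4.3", "P0",
            "text contrast below 4.5:1", "use the primary action token", True),
]
cart = [
    Finding("/checkout/cart", "DiscountField.tsx", "WCAG 1.4.1", "P0",
            "discount error signalled by colour only",
            "add an error message linked to the field", False),
    Finding("/checkout/cart", "CartItem.tsx", "WCAG 1.1.1", "P3",
            "product thumbnail has no alt attribute",
            "add alt text or mark the image decorative", True),
]

ranked = merge_and_rank([cart, payment])
for f in ranked:
    basis = "tool" if f.tool_proven else "judgment"
    print(f.severity, basis, f.route, f.file)
```

Keeping `tool_proven` as an explicit field is one way to make the separation visible in the merged report instead of leaving it to the wording of each finding.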

Narration for this slide

Here is how the accessibility sweep actually runs. First, a scan script walks every route in scope: it runs axe-core and records the keyboard focus order, tab by tab. That recording matters more than it sounds: it turns "keyboard users keep getting lost" into a sequence anyone can read and compare against the task order. Then one read-only reviewer agent per route checks the source against WCAG 2.2 AA (names, focus order, announcements, colour-only signals) with the scan results as anchors. Every finding names the route, the file, the criterion, the severity, the evidence, and a concrete fix. And the scan script is not throwaway: it re-runs after every fix pass, so you watch the numbers move instead of trusting that things feel better.

Slide 6 of 14 · 16:9

Findings that are fixes vs findings that are redesigns

The severity gate sorts the list before anything is changed: agents apply the mechanical fixes, humans own everything that changes the design or the wording owners care about.

Area | Agent can fix under review | Needs design, content, or legal judgment
Accessibility | Missing labels, off-token contrast, stray tabindex, unlinked error summaries | Reordering a flow, redesigning a colour-only status pattern, replacing a custom widget
Content | Sentence case, banned-term swaps with one canonical term, three-part error rewrites | Choosing the canonical term, voice changes, legal and regulatory wording
Gate | Fixed one severity tier at a time, re-verified after each pass | Routed to the owner; decisions and exceptions recorded so the next sweep does not relitigate them

Nothing below the gate gets fixed until everything above it is fixed and re-verified — that stops the agent polishing alt text while checkout is still unreachable by keyboard.

Slide notes

The severity scale used in both workflows is deliberately simple. P0 blocks task completion or fails AA on a core flow and is fixed before anything else. P1 is a serious barrier with a workaround, fixed in the same release. P2 is friction or inconsistency scheduled into normal work. P3 is advisory, fixed opportunistically. The human gate review confirms the P0s are real, demotes anything with an approved exception, and approves the first fix pass — and it is fast precisely because the findings arrive in a consistent format.

The fixes-versus-redesigns split is the learning objective of this module that teams most often get wrong. Mechanical fixes — adding a label, swapping an off-palette colour for the accessible token, removing a positive tabindex, rewriting an error to the three-part anatomy when the canonical term is already decided — are well within what an agent can apply, in passes, with re-verification after each pass. Anything that changes the design or the words a stakeholder owns is not: reordering a flow, redesigning how status is communicated, choosing whether the product says Transfer or Send, and any legal or regulatory wording. Routing those to the agent does not save time; it produces confident changes that get reverted.

The one-tier-at-a-time discipline is slower on paper and faster in practice. The common failure it prevents is a hundred small fixes landing at once with three of them breaking something nobody can isolate. Each tier is fixed, the scan re-runs, and only then does the next tier open. Recording exceptions matters for the same reason: an exception that is not written down resurfaces as a finding in every future sweep and erodes trust in the whole report.
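The one-tier-at-a-time rule is simple enough to state as code. The following is a minimal sketch under stated assumptions: findings are plain dictionaries, `verified` means the fix was confirmed by a re-run of the scan, and findings with an approved exception are skipped instead of demoted. The function names and finding ids are hypothetical.

```python
TIERS = ("P0", "P1", "P2", "P3")


def _is_open(finding, exceptions):
    """A finding is open if it is neither verified nor covered by a recorded exception."""
    return not finding["verified"] and finding["id"] not in exceptions


def open_tier(findings, exceptions):
    """Return the most severe tier that still has open findings, or None when all are closed."""
    for tier in TIERS:
        if any(f["severity"] == tier and _is_open(f, exceptions) for f in findings):
            return tier
    return None


def fix_pass(findings, exceptions):
    """Findings the agent may fix now: the open tier only, never anything below it."""
    tier = open_tier(findings, exceptions)
    return [f for f in findings if f["severity"] == tier and _is_open(f, exceptions)]


# Hypothetical merged list after the human gate review.
findings = [
    {"id": "A-01", "severity": "P0", "verified": False},
    {"id": "A-02", "severity": "P1", "verified": False},
    {"id": "A-07", "severity": "P0", "verified": False},
]
# Exceptions are written down with a reason so the next sweep does not raise them again.
exceptions = {"A-07": "approved by the design lead; reason recorded at the gate"}

print([f["id"] for f in fix_pass(findings, exceptions)])  # ['A-01']
```

In this sketch the P1 finding stays closed to the agent until A-01 is marked verified, which mirrors the rule that the scan re-runs before the next tier opens.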

Narration for this slide

Once the findings are merged and ranked, the human gate sorts them before anything changes. Two questions per finding. First, severity: does it block the task or fail AA on a core flow, or is it friction? Second, and this is the one teams get wrong: is it a fix or a redesign? Adding a label, swapping in the accessible token, rewriting an error into what happened, why, what next — agents can apply those, one severity tier at a time, re-verified after each pass. Reordering a flow, choosing between Transfer and Send, anything legal — those are decisions, and they go to the people who own them. And nothing below the gate gets touched until everything above it is fixed and verified.

Slide 7 of 14 · 16:9

Content review: voice, clarity, terminology, errors

Copy drifts because it is written by many hands over many years, and nobody ever reads it as a whole. Users do.

  • Terminology: the same action called Transfer, Send, and Pay reads like a product built by strangers
  • Voice and tone: address the user directly, active voice, no blame, no alarm words for routine states
  • Error messages: what happened, why if useful, and what to do next — all three parts
  • Clarity: readability targets and sentence case, checked, not hoped for
  • One audit agent per surface — buttons, empty states, errors, onboarding, emails — merged into a terminology map and a fix list

Reading every string against a written standard is exactly the exhaustive, low-judgment sweep agents are good at — and the part content designers never get time for.

Slide notes

Frame content review as the same discipline as the accessibility sweep applied to words. The drift mechanism is identical: designers label buttons in Figma, engineers write error strings at midnight, marketers name features, support writes macros. Each string makes sense locally. Together they call the same action three different things and explain the same error five different ways, and nobody notices because nobody ever reads the copy as a whole — except users, who learn the button says Transfer on one screen and meet Send on the next and wonder if it is a different feature.

The orchestration mirrors the accessibility sweep: the product area is split into surfaces — buttons and primary actions, empty states, error messages, onboarding, transactional emails — and each surface gets its own audit agent applying the same rules in the same reporting format. Bounded surfaces keep the reading careful; one agent reading everything gets sloppy by the third screen. A merge step then combines the per-surface findings into a terminology map, a severity-ranked fix list, and a localisation risk log.

Of the three artifacts, the terminology map is the keystone. The fix list gets worked through and forgotten; the map records the canonical term for each concept, every variant found, and where each variant lives, and it becomes the reference for every future string anyone writes. In the fintech case study from the workflow, the map showed the core money-movement action named Transfer in 14 places, Send in 9, and Pay in 6 — and support data showed the most common pre-contact search was whether send and transfer were different things.
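The shape of a terminology map can be sketched in a few lines. This is an illustration, assuming the strings are available as a mapping from locale key to text and that a person has already listed the variants to look for. The function name, the locale keys, and the sample strings are hypothetical.

```python
import re
from collections import defaultdict


def build_terminology_map(strings, variants_by_concept):
    """Map each concept to {variant: [locale keys where it appears]}.

    strings: {locale_key: text}
    variants_by_concept: {concept: [variant, ...]}
    Matching is whole-word and case-insensitive, so inflections such as
    "sending" are missed. That is a limitation of this sketch.
    """
    term_map = {concept: defaultdict(list) for concept in variants_by_concept}
    for key, text in strings.items():
        words = set(re.findall(r"[a-z']+", text.lower()))
        for concept, variants in variants_by_concept.items():
            for variant in variants:
                if variant.lower() in words:
                    term_map[concept][variant].append(key)
    return {concept: dict(found) for concept, found in term_map.items()}


# Hypothetical strings from three surfaces.
strings = {
    "home.cta": "Transfer money",
    "payee.cta": "Send to payee",
    "bills.cta": "Pay now",
    "help.title": "Payment help",
}
variants = {"money_movement": ["Transfer", "Send", "Pay"]}

for variant, keys in build_terminology_map(strings, variants)["money_movement"].items():
    print(variant, len(keys), keys)
```

The map only records where each variant lives. Which variant becomes canonical is still a decision for the content owner.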

Narration for this slide

Now the second review: the words. Copy drifts for a structural reason — it is written by many hands over many years, and nobody ever reads it as a whole. Users do, and they notice when the same action is called Transfer on one screen and Send on the next. The audit works like the accessibility sweep: split the product into surfaces — buttons, empty states, errors, onboarding, emails — give each surface its own agent with the same rules, then merge everything into a terminology map and a ranked fix list. The map is the artifact that keeps paying: the canonical term for every concept, every variant in use, and exactly where each one lives.

Slide 8 of 14 · 16:9

The voice standard: written down so it can be checked

Prose guidelines like "be friendly but professional" do not produce consistent findings. Rewrite the guide as rules an agent can apply.

  • Canonical terms and banned variants, per concept, with the contexts where each applies
  • Voice rules with examples: "you", not "the user"; active voice; no blame; no alarm words
  • Error-message anatomy: what happened, why if known and useful, what to do next
  • Readability and casing targets: a grade-level ceiling, sentence case for buttons and labels
  • Localisation flags: concatenated strings, embedded variables that change word order, idioms

The same rules file does double duty: it is the audit standard today and the reviewer for every new string written tomorrow.

Slide notes

This is the content-review equivalent of writing critique dimensions down in Module 1: the standard has to exist in a checkable form before an agent can apply it consistently. "Be friendly but professional" produces findings that are themselves vibes. "Use Transfer, never Send or Pay, for account-to-account movement" produces findings someone can accept or decline. The conversion work (turning a prose style guide into canonical terms, banned terms, voice rules with examples, error anatomy, readability ceilings, and localisation flags) is a few hours of a content designer's time, and it is the highest-leverage few hours in the whole workflow.

Get the content owner to sign off on the rules before the audit runs. The audit is only as legitimate as the standard it applies, and the fastest way to get a fix list ignored is to audit against rules nobody agreed to. Where the guide is silent or ambiguous, the agent should report the conflict rather than choose — those gaps feed the next revision of the guide.

The double-duty point is what makes the investment compound. Stored as a subagent definition, the same rules that audit the existing copy review every new string in pull requests, which catches drift one string at a time instead of 1,800 strings at a time. If a team has no style guide at all, run a lighter discovery audit first to surface the de facto conventions, write the guide from what is found, and then audit against it.
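To make "rules an agent can apply" concrete, here is a minimal sketch of a rules file expressed as data, with a checker that could run against one new string. Only the Transfer, Send, and Pay rule comes from this module. The `RULES` layout, the `check_string` function, and the locale keys are hypothetical, and the sketch ignores the per-context exceptions a real guide would carry.

```python
import re

# Hypothetical rules file, expressed as data.
RULES = {
    "canonical_terms": {"Transfer": ["Send", "Pay"]},
    "banned_phrases": ["the user"],
    "sentence_case_kinds": {"button", "label"},
    "proper_nouns": set(),
}


def check_string(key, text, kind, rules):
    """Return findings for one string; an empty list means no rule was broken."""
    findings = []
    words = re.findall(r"[A-Za-z']+", text)
    lowered = {w.lower() for w in words}

    # Banned variants, matched as whole words so "payee" does not trip "Pay".
    for canonical, banned in rules["canonical_terms"].items():
        for term in banned:
            if term.lower() in lowered:
                findings.append(f"{key}: uses '{term}'; canonical term is '{canonical}'")

    # Voice rule: address the reader as "you".
    for phrase in rules["banned_phrases"]:
        if re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE):
            findings.append(f"{key}: contains banned phrase '{phrase}'")

    # Sentence case: no capitalised word after the first, apart from proper nouns.
    if kind in rules["sentence_case_kinds"]:
        if any(w[0].isupper() and w not in rules["proper_nouns"] for w in words[1:]):
            findings.append(f"{key}: not in sentence case")
    return findings


for finding in check_string("btn.send", "Send Money", "button", RULES):
    print(finding)
```

A mechanical checker like this covers only the rules that reduce to string matching. Voice, blame, and whether an error explains what to do next still need the agent's reading against the written examples.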

Narration for this slide

For an agent to audit copy, the standard has to be written down in a form it can check. "Be friendly but professional" is not checkable. So the style guide gets rewritten as rules: the canonical term for each concept and the banned variants. Voice rules with examples: "you", not "the user"; active voice; no blame; no alarm words for routine states. The three-part error anatomy. A readability ceiling and sentence case. And localisation flags for concatenated strings and idioms. Get the content owner to sign off before the run, because the audit is only as legitimate as its standard. And here is the compounding part: the same rules file later reviews every new string in pull requests.

Slide 9 of 14 · 16:9

Localisation and length are part of content review

The cheapest time to fix a string is before it exists in four languages.

  • Concatenated strings and mid-sentence variables break when word order changes across languages
  • Idioms, humour, and culturally specific references rarely survive translation
  • Length: interface copy that barely fits in English overflows in German and wraps awkwardly in Japanese
  • Inconsistent source strings multiply translation cost — every variant gets translated separately
  • The audit produces a localisation risk log that goes straight to the translation vendor

Fixing the worst strings once, in English, before translation costs a fraction of fixing them in every target language afterwards.

Slide notes

Localisation readiness is the part of content review with the clearest financial argument, so use it when making the case for the audit internally. In the workflow's B2B case study, a platform preparing to localise into German, Japanese, and Brazilian Portuguese audited its 312 error strings first: 41 percent failed the three-part anatomy, and 38 strings were assembled by concatenating fragments at runtime — a pattern that translates badly because word order changes across languages. The vendor quoted lower because the source strings were being fixed before translation rather than after, and rewriting the worst strings once in English cost a fraction of fixing them in four languages later.

The technical risks the agent flags are specific and checkable: strings built by concatenation, embedded variables whose position assumes English word order, idioms and humour, and strings whose length leaves no room for expansion. Length is a design problem as much as a writing problem — a label that barely fits its button in English is a layout bug waiting to happen in German — which is why the localisation risk log is worth reading alongside the visual QA work from Module 3.

Keep the claims conservative: the audit flags risk, it does not guarantee translatability, and it says nothing about whether the translated copy reads well. That still needs native-speaker review. What it does is hand the vendor a clean source and a known list of problem strings, which is the difference between translating a product and translating its accumulated accidents.
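Some of those flags reduce to simple heuristics, sketched below. The assumptions are loud ones: variables are written in `{name}` style, a leading or trailing space is treated as a sign of runtime concatenation, and the 1.3 expansion factor is an illustrative rule of thumb rather than a figure from the workflow. The function name and locale keys are hypothetical, and idioms and humour are left to the agent's reading.

```python
import re

PLACEHOLDER = re.compile(r"\{\w+\}")  # assumes {name}-style variables


def localisation_risks(key, text, max_chars=None, expansion=1.3):
    """Return localisation risk notes for one source string.

    max_chars: the space the layout allows, if known.
    expansion: assumed growth factor for translated text (illustrative only).
    """
    risks = []
    trimmed = text.strip()

    if text != trimmed:
        risks.append("leading or trailing space suggests a fragment concatenated at runtime")

    # A variable that is neither first nor last in the sentence fixes the word order.
    body = trimmed.rstrip(".!?")
    for match in PLACEHOLDER.finditer(body):
        if 0 < match.start() and match.end() < len(body):
            risks.append(
                f"variable {match.group()} sits mid-sentence, "
                "so its position assumes English word order"
            )

    if max_chars is not None and len(trimmed) * expansion > max_chars:
        risks.append(
            f"no room for expansion: {len(trimmed)} characters "
            f"in a {max_chars}-character space"
        )
    return [f"{key}: {risk}" for risk in risks]


for risk in localisation_risks("cart.count", "You have {count} items in your cart."):
    print(risk)
```

Each note is a flag for the risk log, not a verdict. A mid-sentence variable can be perfectly translatable if the localisation framework lets translators reposition it.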

Narration for this slide

One more dimension of content review that pays for itself: localisation. Strings that are assembled from fragments at runtime, variables dropped into the middle of a sentence, idioms, humour — all of these break when word order and culture change. And length: copy that barely fits in English overflows in German. The audit flags all of this into a localisation risk log that goes straight to the translation vendor. The economics are simple. Fixing the worst strings once, in English, before translation, costs a fraction of fixing them in every target language afterwards. One team's vendor quoted lower simply because the source was being cleaned up first.

Slide 10 of 14 · 16:9

Good findings vs bad findings

A sweep is only as useful as its weakest finding. Vague findings get argued with; precise findings get fixed.

Weak finding | Fix-ready finding
Contrast is poor in places | P0: PlaceOrderButton uses #8FB4FF on white, 2.9:1; needs 4.5:1 (WCAG 1.4.3); use the primary action token
The dashboard is hard to use with a keyboard | P0: SidePanel renders before FilterRail, so tab order skips the filters (WCAG 2.4.3); reorder and remove tabindex=1
The error messages feel cold and technical | P1: errors.payee_invalid ("Error 4012: transaction declined") has no next step; suggest a rewrite naming the action to take
Terminology is inconsistent across the app | P1: account-to-account transfers are labelled Transfer (14 strings), Send (9), Pay (6); locations in the terminology map

A fix-ready finding names the criterion or rule, the exact location, the evidence, and the fix. Spot-check the output before the gate review and tighten the reviewer instructions if findings drift toward generalities.

Slide notes

This quality bar applies identically to both reviews, which is why the table mixes accessibility and content rows. The anatomy of a fix-ready finding is the same in both: severity, the exact location — file and line for code, locale key or screen for copy — the criterion or rule broken, the evidence, and a concrete fix the owner can accept or decline without further research. The weak versions in the left column are not wrong; they are just unactionable, and unactionable findings are what get a report politely ignored.

The practical habit to teach is the spot-check. Before the severity gate review, read a sample of findings and ask whether each one could be handed to a developer or a writer and fixed without a follow-up question. If findings are drifting toward generalities such as "add more ARIA" or "the copy is too long", the fix is not to argue with each finding but to tighten the reviewer agent's instructions: require the criterion, require the location, require the quoted string or the measured ratio, and require the suggested fix.
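Part of the spot-check can be automated as a presence test before a person reads the sample. The following is a minimal sketch, assuming findings arrive as dictionaries with the five fields named in the fix-ready anatomy. The function names are hypothetical, and a presence test says nothing about whether the evidence is correct.

```python
import random

REQUIRED_FIELDS = ("severity", "location", "criterion", "evidence", "fix")


def missing_fields(finding):
    """Names of required fields that are absent or blank in one finding."""
    return [
        name for name in REQUIRED_FIELDS
        if not str(finding.get(name) or "").strip()
    ]


def spot_check(findings, sample_size=10, seed=0):
    """Sample findings and return (finding, missing fields) for those that are not fix-ready."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-read later
    sample = rng.sample(findings, min(sample_size, len(findings)))
    return [(f, missing_fields(f)) for f in sample if missing_fields(f)]


weak = {"evidence": "Contrast is poor in places"}
ready = {
    "severity": "P0",
    "location": "PlaceOrderButton",
    "criterion": "WCAG 1.4.3",
    "evidence": "2.9:1 measured, 4.5:1 required",
    "fix": "use the primary action token",
}

for finding, gaps in spot_check([weak, ready], sample_size=2):
    print(finding["evidence"], "is missing", gaps)
```

If many sampled findings come back with gaps, that points at the reviewer instructions rather than at the individual findings.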

There is also a trust dimension. The first sweep a team runs sets expectations for every later one. A first report full of precise, verifiable findings builds the habit of acting on the output; a first report full of vibes teaches the team that the sweep is noise. It is worth holding the first run to a higher bar than feels necessary.

Narration for this slide

Whether the finding is about contrast or about copy, the quality bar is the same. "Contrast is poor in places" gets argued with. "The place-order button is two point nine to one against white, needs four and a half, here is the token to use" gets fixed. "The error messages feel cold" is a vibe. "This exact string, at this locale key, has no next step, here is a rewrite" is a decision someone can make in a minute. Before the gate review, spot-check a sample of findings and ask one question: could someone fix this without coming back to ask what it means? If not, tighten the reviewer's instructions, not the individual findings.

Slide 11 of 14 · 16:9

Worked example: one checkout flow through both reviews

Drawn from the two workflow case studies: a rebranded checkout swept for accessibility, and the same product area's copy audited before a brand refresh.

Stage | Accessibility sweep | Content audit
Scope and run | 5 checkout routes, one agent per route, about 50 minutes | 5 surfaces, about 1,840 strings, about 50 minutes
Headline findings | 23 contrast failures including the Place order button at 2.9:1; discount error signalled by colour only | 96 findings; the core action named Transfer, Send, and Pay across 29 strings
Human gate | P0s confirmed; off-palette colour traced to the rebrand, not the design system | Transfer chosen as canonical; Pay kept for bill payments; exceptions logged
Fix and re-verify | Token swap and an aria-describedby error message; re-run scan showed zero AA contrast failures on checkout | 27 strings renamed in two days; re-audit found 4 stragglers in email templates, no new variants

Both reviews finished inside an hour of agent time each. The expensive part — and the part that stayed human — was deciding the canonical term and confirming what counted as P0.

Slide notes

These figures come from the case studies in the two workflows this module is built on — an e-commerce checkout that started failing conversion benchmarks after a rebrand to a lighter palette, and a fintech payments area audited before a brand refresh. They are traced runs from those write-ups, not benchmarks, and the numbers will vary with codebase size and how much drift has accumulated. Say that plainly.

Walk the columns in parallel because the structure is the lesson. Both runs took roughly the same agent time. Both produced findings that were precise enough to fix quickly because every finding carried its location: the contrast failures named the component and the measured ratio, the terminology variants came with their locale keys, which is why 27 strings could be renamed in two days. Both had a human gate where the actual decisions happened — confirming that the off-palette button colour was a rebrand accident rather than an intentional exception, and choosing Transfer as the canonical term while keeping Pay for bill payments, a decision informed by support data showing users searching for whether send and transfer were different things.
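For participants who ask where a measured ratio comes from, here is a short sketch of the WCAG contrast-ratio calculation: relative luminance for each colour, then the lighter value plus 0.05 divided by the darker value plus 0.05. The case study does not state the button's colour values, so the colours below are stand-ins chosen for illustration and do not reproduce the 2.9:1 figure. The function names are also illustrative.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    # The threshold is printed as 0.03928 or 0.04045 depending on the WCAG
    # edition; for 8-bit channel values both give the same result.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a six-digit hex colour such as '#1a2b3c'."""
    value = hex_colour.lstrip("#")
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colours, from 1.0 (identical) up to 21.0."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground: str, background: str) -> bool:
    """AA for normal-size text requires a ratio of at least 4.5:1."""
    return contrast_ratio(foreground, background) >= 4.5


# Stand-in colours: a mid grey on white sits just above the AA threshold
# (about 4.54:1), while a lighter grey on white falls below it.
for colour in ("#767676", "#999999"):
    ratio = contrast_ratio(colour, "#ffffff")
    print(f"{colour} on white: {ratio:.2f}:1, AA normal text: {passes_aa_normal_text(colour, '#ffffff')}")
```

A finding that quotes a ratio computed this way, next to the component name, is what lets a developer verify it in seconds instead of debating it.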

Also point at the re-verification row. The accessibility fix was confirmed by re-running the same scan script, not by anyone's impression that it looked better. The content fix was confirmed by a re-audit two weeks later that found four stragglers in email templates and no new variants. Closing the loop with the same instrument that found the problem is what makes either review trustworthy enough to repeat every release.
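Closing the loop with the same instrument can be as simple as comparing two runs of the scan. The sketch below assumes each finding can be identified by route, rule, and target; those key fields and the sample data are hypothetical, and the real scan script's output format may differ.

```python
def finding_key(finding: dict) -> tuple[str, str, str]:
    """Identify a finding by where it is and which rule it fails (assumed fields)."""
    return (finding["route"], finding["rule"], finding["target"])


def compare_runs(before: list[dict], after: list[dict]) -> dict[str, list[tuple[str, str, str]]]:
    """Split findings into fixed, remaining, and newly introduced between two runs."""
    before_keys = {finding_key(f) for f in before}
    after_keys = {finding_key(f) for f in after}
    return {
        "fixed": sorted(before_keys - after_keys),
        "remaining": sorted(before_keys & after_keys),
        "new": sorted(after_keys - before_keys),
    }


# Hypothetical findings from a scan before and after a fix pass.
before = [
    {"route": "/checkout/review", "rule": "color-contrast", "target": "button.place-order"},
    {"route": "/checkout/payment", "rule": "color-contrast", "target": "a.help-link"},
]
after = [
    {"route": "/checkout/payment", "rule": "color-contrast", "target": "a.help-link"},
    {"route": "/checkout/payment", "rule": "label", "target": "input#promo"},
]

result = compare_runs(before, after)
for status, keys in result.items():
    print(f"{status}: {len(keys)}")
```

The "new" bucket matters as much as the "fixed" one: it shows whether a fix pass introduced a regression.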

Narration for this slide

Let's trace one product area through both reviews, using the case studies behind the two workflows. The accessibility sweep covered five checkout routes in about fifty minutes and found twenty-three contrast failures — including the place-order button at two point nine to one — plus a discount error signalled only by a red border. The content audit read about eighteen hundred strings in about the same time and found the core action named Transfer, Send, and Pay, sometimes in the same flow. In both cases the agent time was under an hour, the fixes landed within days because every finding carried its location, and a re-run confirmed the result. The slow part, and the human part, was the decisions: what counts as P0, and which word the product uses.

Slide 12 of 14 · 16:9

What these reviews cannot prove

Passing the sweep means the failures it can detect are gone. That is valuable, and it is still less than being accessible or well written.

  • Real assistive technology behaviour across screen reader and browser combinations still needs hands-on testing
  • Usability for disabled users still needs research with disabled users
  • Neither review certifies formal conformance — that requires a documented human evaluation
  • The audit cannot decide whether the standard is right: which term, which voice, which trade-off
  • Coverage stops at the inputs: hard-coded strings, third-party widgets, and unscanned routes are simply not seen

The agent can say a finding fails a criterion. Whether that blocks the release, given the audience and the deadline, is a decision the team owns.

Slide notes

This slide protects the credibility of everything before it, so deliver it without hedging. On the accessibility side: an agent sweep plus automated checks does not make a product accessible; it makes it free of the failures those passes can detect. Real assistive technology behaviour — how an actual screen reader announces an actual widget in a particular browser — still needs hands-on testing, and usability for disabled users still needs research with disabled users. Formal conformance claims require a documented human evaluation; do not let a green sweep report be presented as one.

On the content side: the audit measures consistency against a standard, and it cannot prove the standard is right. Whether Transfer or Send is the better word is a naming decision informed by research and brand, made by humans, and only then enforced by the audit. Readability scores are rough instruments for flagging outliers, not a substitute for testing comprehension. Legal and regulatory wording, and any rewrite that might lose a nuance compliance needs, route to the owners who can decide — and their decisions get recorded as exceptions so the next audit does not relitigate them.
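To make "enforced by the audit" concrete, here is a rough sketch of a terminology check that runs only after humans have chosen the canonical term. The locale keys, strings, and exception list are invented for illustration; the real audit is an agent reading against a rules file, and a word-match like this would be at most a cheap pre-pass.

```python
import re


def find_variants(
    strings: dict[str, str],
    canonical: str,
    banned_variants: list[str],
    exceptions: set[str],
) -> list[dict]:
    """Flag strings that use a banned variant, skipping recorded exceptions."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in banned_variants) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for key, text in sorted(strings.items()):
        if key in exceptions:
            continue  # a logged human decision, not relitigated on every run
        for match in pattern.finditer(text):
            hits.append(
                {"key": key, "found": match.group(0), "canonical": canonical, "text": text}
            )
    return hits


# Hypothetical locale keys and copy.
strings = {
    "transfer.cta": "Send money",
    "transfer.confirm.title": "Transfer complete",
    "transfer.review.button": "Pay now",
    "bills.cta": "Pay bill",
}
exceptions = {"bills.cta"}  # "Pay" kept for bill payments by decision

hits = find_variants(strings, "Transfer", ["Send", "Pay"], exceptions)
for hit in hits:
    print(f'{hit["key"]}: "{hit["found"]}" should be "{hit["canonical"]}" in "{hit["text"]}"')
```

The exceptions set is what stops the next audit from reopening decisions the owners have already made.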

Finally, coverage. Both reviews only see what was put in front of them: the routes in the scan list, the strings that were exported. Hard-coded copy, third-party widgets, and flows nobody added to the route list are invisible. A team that treats the sweep as exhaustive will be surprised by exactly the things it never looked at, which is an argument for keeping the scope list visible alongside the findings.
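One way to keep the scope list visible is to publish the gap between what exists and what was scanned alongside the findings. The sketch below assumes the team can produce a list of known routes from somewhere, such as the router configuration; that inventory and the route names are assumptions, and the sketch cannot see hard-coded strings or third-party widgets any better than the sweep can.

```python
def coverage_gaps(known_routes: list[str], scanned_routes: list[str]) -> dict[str, list[str]]:
    """Report routes the sweep never saw and scanned routes missing from the inventory."""
    known = set(known_routes)
    scanned = set(scanned_routes)
    return {
        "unscanned": sorted(known - scanned),
        "not_in_inventory": sorted(scanned - known),
    }


# Hypothetical route inventory and scan list.
known_routes = ["/cart", "/checkout/address", "/checkout/payment", "/checkout/review", "/checkout/done"]
scanned_routes = ["/checkout/address", "/checkout/payment", "/checkout/review"]

gaps = coverage_gaps(known_routes, scanned_routes)
print("Unscanned routes:", ", ".join(gaps["unscanned"]) or "none")
```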

Narration for this slide

Now the limits, stated plainly, because over-claiming here is how these practices lose credibility. Passing the sweep means the failures it can detect are gone — which is genuinely valuable, and which is less than accessible. Real screen reader behaviour still needs hands-on testing. Usability for disabled users still needs research with disabled users. Formal conformance needs a documented human evaluation, not a green report. On the content side, the audit can show you that three terms are in use; it cannot tell you which one is right for your users. And both reviews only see what you put in front of them — unscanned routes and unexported strings simply do not exist to the sweep. The agent detects. The team decides.

Slide 13 of 14 · 16:9

Exercise: a single-flow accessibility pass on your product

Pick one flow where failure is expensive — checkout, onboarding, a core form — and run a small version of the sweep on it this week.

  • List the routes in the flow and run an automated check on each — axe-core or your browser's accessibility tooling
  • Do one manual keyboard walk per route and write down the focus order you actually get
  • Give an agent the results plus the source and ask for findings: criterion, severity, evidence, concrete fix — read-only, no code changes
  • Sort the findings yourself: P0 to P3, and fix versus redesign — note where you disagreed with the agent's severity
  • Pick the three worst strings the pass surfaced and rewrite them against the three-part error anatomy

Keep the findings list and your severity calls. Module 5 wires checks like these into every pull request, and your list is the seed for those rules.

Slide notes

Keep the scope deliberately small: one flow, a handful of routes, an hour or two of total effort. The point of the exercise is not to fix the product this week; it is to experience each band of the diagram once — the tool pass, the agent pass, and your own gate review — on a surface where the findings will feel real.

The step most people are tempted to skip is the manual keyboard walk, and it is the one that produces the most surprise. Recording the focus order you actually get, rather than the one you assume, is usually the moment the value of the recorded evidence lands. The disagreement note in step four matters too: where your severity call differs from the agent's, that difference is exactly the judgment that the workflow keeps human, and writing it down makes it discussable rather than implicit.
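Once the focus order is written down, comparing it with the order the task needs is a mechanical step. This is a minimal sketch, assuming both orders are recorded as lists of element names; the names below are made up, and the expected order is something the participant writes by hand from the task, not something a tool provides.

```python
def compare_focus_order(expected: list[str], recorded: list[str]) -> dict:
    """Compare the focus order the task needs with the order a keyboard walk produced."""
    first_divergence = None
    for tab_stop, (want, got) in enumerate(zip(expected, recorded), start=1):
        if want != got:
            first_divergence = {"tab_stop": tab_stop, "expected": want, "recorded": got}
            break
    return {
        # None means the two orders agree for as long as both lists run.
        "first_divergence": first_divergence,
        # Controls the task needs that the walk never reached.
        "never_focused": [name for name in expected if name not in recorded],
        # Stops the walk hit that the task order did not anticipate.
        "unexpected": [name for name in recorded if name not in expected],
    }


# Hypothetical payment step: the order the task needs versus the order recorded.
expected = ["email", "card-number", "expiry", "cvc", "place-order"]
recorded = ["email", "promo-banner-link", "card-number", "expiry", "place-order"]

report = compare_focus_order(expected, recorded)
print(report)
```

A control in the never-focused list is usually the most serious result, because a keyboard user cannot complete the task without it.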

The last bullet pulls the content review in without requiring a full audit: the three worst strings are usually error messages, and rewriting them against the "what happened, why, what next" anatomy gives a concrete before-and-after to share with whoever owns the copy. If participants want to go further, the accessibility-sweep and content-audit workflows in the school's workflow library contain the full prompts, the scan script, and the reviewer agent definitions used in this module.

Narration for this slide

Your exercise for this module: one flow, this week. Pick somewhere failure is expensive — checkout, onboarding, your most important form. Run an automated check on each route. Then do something most people skip: a manual keyboard walk, writing down the focus order you actually get. Hand the results and the source to an agent and ask for findings — criterion, severity, evidence, fix — read-only, no changes. Then do the human part yourself: rank them, split fixes from redesigns, and note where you disagree with the agent's severity. Finally, take the three worst strings you found and rewrite them as what happened, why, and what to do next. Keep the list — you will need it in Module 5.

Slide 14 of 14 · 16:9

Summary, and what comes next

  • Accessibility and content review get deferred because they are exhaustive reading at scale — which is exactly what agents make routine
  • Three bands: tools prove the mechanical failures, agents review meaning and order, humans make severity, naming, and release decisions
  • Both reviews follow the same shape: capture evidence, fan out reviewers, merge and rank, hold a human gate, fix in passes, re-verify
  • Standards must be written to be checkable: WCAG criteria for the sweep, a rules-based voice standard for the copy
  • Neither review proves accessibility or good writing — they clear the detectable failures so judgment is spent on decisions

Module 5 takes everything this course has built — critique dimensions, evaluations, regression evidence, these two sweeps — and wires it into every pull request, with severity levels and human escalation.

Slide notes

Recap by tying the two reviews back to the course's central argument: critique improves when the mechanical part becomes continuous and the human part becomes deliberate. Accessibility and content were the clearest cases of reviews that stayed rare because they were expensive; the sweeps make them cheap enough to run every release, and the severity gate keeps the decisions — what blocks, what is excepted, what the product is called — exactly where they belong.

Remind participants of the artifacts worth keeping beyond the fix lists: the scan script and reviewer definitions on the accessibility side, and the terminology map plus the rules-based voice standard on the content side. Both compound: each subsequent run is comparable with the last, and the same standards reused at review time are what Module 5 builds on.

Preview Module 5 concretely. It is the end state the course has been pointing at: every change that touches the interface gets a design review automatically — token and component checks, screenshot evidence, and checks drawn from exactly the accessibility and content rules written in this module — with severity levels deciding what blocks, what warns, and what merely notes, and clear escalation for the findings that need a human designer. The exercise findings from this module become the seed rules for that per-PR review.

Narration for this slide

Let's close. Accessibility and content review get deferred because they are careful reading at scale — and careful reading at scale is precisely what agents do well. The model is three bands: tools prove the mechanical failures, agents review meaning, order, and voice, and humans make the calls on severity, naming, and release. Both reviews share one shape: capture evidence, fan out reviewers, merge and rank, hold the gate, fix in passes, re-verify. And both depend on standards written down in checkable form. Neither pass proves your product is accessible or well written — it clears the detectable failures so your judgment goes where only it can. Module 5 wires all of this into every pull request. See you there.

Module transcript
Module 4, narrated slide by slide

Slide 1: The Reviews Everyone Defers

Welcome to Module 4. This one is about the two reviews everyone agrees matter and almost everyone defers: accessibility and content. They get deferred for the same reason — done properly, both are careful reading at scale, and careful reading at scale always loses to the deadline. We are going to take the critique structure from the earlier modules — named criteria, findings with evidence, severity, a human gate — and apply it to these two reviews so they become routine passes rather than special projects. And we will be honest about the limits: the sweep finds the failures it can detect. The judgment about what to do with them stays with you.

Slide 2: Why these two reviews stay deferred

Why do these two reviews keep slipping? Not because anyone disputes that they matter. They slip because the problems accumulate one small decision at a time — an unlabelled icon button here, an error string written at midnight there — and no single change looks like the problem. Reviewing properly means reading everything against the same standard, and the people who can do that well are spread thin across many teams. So the review waits for a forcing function: an audit, a procurement requirement, a localisation push. And by then the findings arrive in the hundreds and the fix is a project. The agent's job here is to make the reading continuous, so the specialists spend their time on decisions instead.

Slide 3: Accessibility beyond the automated checker

Let's start with accessibility, and with an honest claim about tools. Automated checkers are precise about what they can detect — contrast ratios, missing labels, invalid ARIA, unlabelled form fields. But they cover well under half of what WCAG actually asks for, and the missing half is the half that needs judgment. Does the accessible name describe the control, or is it the icon's file name? Does focus order follow the task, or just the accident of the DOM? Does the error message tell the user what to do next? That reading is what the agent pass does. Run the tools first, then the agent — the tool results give every agent finding a concrete anchor.

Slide 4: Three bands of review: tools, agents, judgment

Here is the whole module in one picture. Three bands. At the top, automated checks — axe-core, contrast tools, scripted keyboard walks. Precise, but they miss everything that needs interpretation. In the middle, the agent review: names, focus order, error messages, colour-only signals, and on the content side, terminology and voice drift — judgment about meaning, applied at scale. At the bottom, human judgment: severity calls, naming decisions, testing with real screen readers and real users, and the call on what blocks a release. The arrows matter most: each band passes its findings down as evidence, and each band makes the next one cheaper. Nothing in the bottom band moves up.

Slide 5: Keyboard and screen-reader walkthroughs as agent runs

Here is how the accessibility sweep actually runs. First, a scan script walks every route in scope, running axe-core and recording the keyboard focus order, tab by tab. That recording matters more than it sounds: it turns "keyboard users keep getting lost" into a sequence anyone can read and compare against the task order. Then one read-only reviewer agent per route checks the source against WCAG 2.2 AA (names, focus order, announcements, colour-only signals), with the scan results as anchors. Every finding names the route, the file, the criterion, the severity, the evidence, and a concrete fix. And the scan script is not throwaway: it re-runs after every fix pass, so you watch the numbers move instead of trusting that things feel better.

Slide 6: Findings that are fixes vs findings that are redesigns

Once the findings are merged and ranked, the human gate sorts them before anything changes. Two questions per finding. First, severity: does it block the task or fail AA on a core flow, or is it friction? Second, and this is the one teams get wrong: is it a fix or a redesign? Adding a label, swapping in the accessible token, rewriting an error into what happened, why, what next — agents can apply those, one severity tier at a time, re-verified after each pass. Reordering a flow, choosing between Transfer and Send, anything legal — those are decisions, and they go to the people who own them. And nothing below the gate gets touched until everything above it is fixed and verified.

Slide 7: Content review: voice, clarity, terminology, errors

Now the second review: the words. Copy drifts for a structural reason — it is written by many hands over many years, and nobody ever reads it as a whole. Users do, and they notice when the same action is called Transfer on one screen and Send on the next. The audit works like the accessibility sweep: split the product into surfaces — buttons, empty states, errors, onboarding, emails — give each surface its own agent with the same rules, then merge everything into a terminology map and a ranked fix list. The map is the artifact that keeps paying: the canonical term for every concept, every variant in use, and exactly where each one lives.

Slide 8: The voice standard: written down so it can be checked

For an agent to audit copy, the standard has to be written down in a form it can check. "Be friendly but professional" is not checkable. So the style guide gets rewritten as rules: the canonical term for each concept and the banned variants. Voice rules with examples: "you", not "the user"; active voice; no blame; no alarm words for routine states. The three-part error anatomy. A readability ceiling and sentence case. And localisation flags for concatenated strings and idioms. Get the content owner to sign off before the run, because the audit is only as legitimate as its standard. And here is the compounding part: the same rules file later reviews every new string in pull requests.

Slide 9: Localisation and length are part of content review

One more dimension of content review that pays for itself: localisation. Strings that are assembled from fragments at runtime, variables dropped into the middle of a sentence, idioms, humour — all of these break when word order and culture change. And length: copy that barely fits in English overflows in German. The audit flags all of this into a localisation risk log that goes straight to the translation vendor. The economics are simple. Fixing the worst strings once, in English, before translation, costs a fraction of fixing them in every target language afterwards. One team's vendor quoted lower simply because the source was being cleaned up first.

Slide 10: Good findings vs bad findings

Whether the finding is about contrast or about copy, the quality bar is the same. "Contrast is poor in places" gets argued with. "The place-order button is two point nine to one against white, needs four and a half, here is the token to use" gets fixed. "The error messages feel cold" is a vibe. "This exact string, at this locale key, has no next step, here is a rewrite" is a decision someone can make in a minute. Before the gate review, spot-check a sample of findings and ask one question: could someone fix this without coming back to ask what it means? If not, tighten the reviewer's instructions, not the individual findings.

Slide 11: Worked example: one checkout flow through both reviews

Let's trace one product area through both reviews, using the case studies behind the two workflows. The accessibility sweep covered five checkout routes in about fifty minutes and found twenty-three contrast failures — including the place-order button at two point nine to one — plus a discount error signalled only by a red border. The content audit read about eighteen hundred strings in about the same time and found the core action named Transfer, Send, and Pay, sometimes in the same flow. In both cases the agent time was under an hour, the fixes landed within days because every finding carried its location, and a re-run confirmed the result. The slow part, and the human part, was the decisions: what counts as P0, and which word the product uses.

Slide 12: What these reviews cannot prove

Now the limits, stated plainly, because over-claiming here is how these practices lose credibility. Passing the sweep means the failures it can detect are gone — which is genuinely valuable, and which is less than accessible. Real screen reader behaviour still needs hands-on testing. Usability for disabled users still needs research with disabled users. Formal conformance needs a documented human evaluation, not a green report. On the content side, the audit can show you that three terms are in use; it cannot tell you which one is right for your users. And both reviews only see what you put in front of them — unscanned routes and unexported strings simply do not exist to the sweep. The agent detects. The team decides.

Slide 13: Exercise: a single-flow accessibility pass on your product

Your exercise for this module: one flow, this week. Pick somewhere failure is expensive — checkout, onboarding, your most important form. Run an automated check on each route. Then do something most people skip: a manual keyboard walk, writing down the focus order you actually get. Hand the results and the source to an agent and ask for findings — criterion, severity, evidence, fix — read-only, no changes. Then do the human part yourself: rank them, split fixes from redesigns, and note where you disagree with the agent's severity. Finally, take the three worst strings you found and rewrite them as what happened, why, and what to do next. Keep the list — you will need it in Module 5.

Slide 14: Summary, and what comes next

Let's close. Accessibility and content review get deferred because they are careful reading at scale — and careful reading at scale is precisely what agents do well. The model is three bands: tools prove the mechanical failures, agents review meaning, order, and voice, and humans make the calls on severity, naming, and release. Both reviews share one shape: capture evidence, fan out reviewers, merge and rank, hold the gate, fix in passes, re-verify. And both depend on standards written down in checkable form. Neither pass proves your product is accessible or well written — it clears the detectable failures so your judgment goes where only it can. Module 5 wires all of this into every pull request. See you there.