Slide 1 — The Reviews Everyone Defers
Welcome to Module 4. This one is about the two reviews everyone agrees matter and almost everyone defers: accessibility and content. They get deferred for the same reason — done properly, both are careful reading at scale, and careful reading at scale always loses to the deadline. We are going to take the critique structure from the earlier modules — named criteria, findings with evidence, severity, a human gate — and apply it to these two reviews so they become routine passes rather than special projects. And we will be honest about the limits: the sweep finds the failures it can detect. The judgment about what to do with them stays with you.
Slide 2 — Why these two reviews stay deferred
Why do these two reviews keep slipping? Not because anyone disputes that they matter. They slip because the problems accumulate one small decision at a time — an unlabelled icon button here, an error string written at midnight there — and no single change looks like the problem. Reviewing properly means reading everything against the same standard, and the people who can do that well are spread thin across many teams. So the review waits for a forcing function: an audit, a procurement requirement, a localisation push. And by then the findings arrive in the hundreds and the fix is a project. The agent's job here is to make the reading continuous, so the specialists spend their time on decisions instead.
Slide 3 — Accessibility beyond the automated checker
Let's start with accessibility, and with an honest claim about tools. Automated checkers are precise about what they can detect — contrast ratios, missing labels, invalid ARIA, unlabelled form fields. But they cover well under half of what WCAG actually asks for, and the missing half is the half that needs judgment. Does the accessible name describe the control, or is it the icon's file name? Does focus order follow the task, or just the accident of the DOM? Does the error message tell the user what to do next? That reading is what the agent pass does. Run the tools first, then the agent — the tool results give every agent finding a concrete anchor.
Slide 4 — Three bands of review: tools, agents, judgment
Here is the whole module in one picture. Three bands. At the top, automated checks — axe-core, contrast tools, scripted keyboard walks. Precise, but they miss everything that needs interpretation. In the middle, the agent review: names, focus order, error messages, colour-only signals, and on the content side, terminology and voice drift — judgment about meaning, applied at scale. At the bottom, human judgment: severity calls, naming decisions, testing with real screen readers and real users, and the call on what blocks a release. The arrows matter most: each band passes its findings down as evidence, and each band makes the next one cheaper. Nothing in the bottom band moves up.
Slide 5 — Keyboard and screen-reader walkthroughs as agent runs
Here is how the accessibility sweep actually runs. First, a scan script walks every route in scope: it runs axe-core and records the keyboard focus order, tab by tab. That recording matters more than it sounds: it turns "keyboard users keep getting lost" into a sequence anyone can read and compare against the task order. Then one read-only reviewer agent per route checks the source against WCAG 2.2 AA (names, focus order, announcements, colour-only signals) with the scan results as anchors. Every finding names the route, the file, the criterion, the severity, the evidence, and a concrete fix. And the scan script is not throwaway: it re-runs after every fix pass, so you watch the numbers move instead of trusting that things feel better.
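To make the recording and the finding format concrete, here is a minimal Python sketch. It assumes the scan has already produced the focus order as a list of element ids. The route, file name, element ids, and the `Finding` and `focus_order_mismatches` names are all hypothetical, chosen for illustration rather than taken from any real workflow.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class Finding:
    """One reviewer finding, carrying the six fields named in the narration."""
    route: str
    file: str
    criterion: str  # the WCAG success criterion the finding cites
    severity: str
    evidence: str
    fix: str


def focus_order_mismatches(recorded, expected):
    """Return (step, recorded_id, expected_id) for every tab step that differs.

    A missing step on either side shows up as None.
    """
    mismatches = []
    pairs = zip_longest(recorded, expected)
    for step, (got, want) in enumerate(pairs, start=1):
        if got != want:
            mismatches.append((step, got, want))
    return mismatches


# Hypothetical recording from a keyboard walk, and the order the task needs.
recorded = ["email", "promo-code", "card-number", "place-order"]
expected = ["email", "card-number", "promo-code", "place-order"]
mismatches = focus_order_mismatches(recorded, expected)

finding = Finding(
    route="/checkout/payment",   # hypothetical route
    file="PaymentForm.tsx",      # hypothetical file
    criterion="2.4.3 Focus Order",
    severity="P1",               # hypothetical tier; the human gate decides
    evidence=f"Tab steps out of task order: {mismatches}",
    fix="Move the promo code field after the card fields in the DOM.",
)
```

The comparison is deliberately positional and simple. Its only job is to turn a vague complaint into a list of steps that a reviewer, human or agent, can point at.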
Slide 6 — Findings that are fixes vs findings that are redesigns
Once the findings are merged and ranked, the human gate sorts them before anything changes. Two questions per finding. First, severity: does it block the task or fail AA on a core flow, or is it friction? Second, and this is the one teams get wrong: is it a fix or a redesign? Adding a label, swapping in the accessible token, rewriting an error into "what happened, why, what next": agents can apply those, one severity tier at a time, re-verified after each pass. Reordering a flow, choosing between Transfer and Send, anything legal: those are decisions, and they go to the people who own them. And nothing in a lower tier gets touched until everything in the tiers above it is fixed and verified.
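The gate logic can be sketched in a few lines of Python. This is an illustration under assumptions, not the workflow's actual tooling: the tier names beyond P0, the `TriagedFinding` type, and both function names are hypothetical.

```python
from dataclasses import dataclass

# Assumed tier names, highest severity first. Only P0 appears in the module.
SEVERITY_ORDER = ["P0", "P1", "P2"]


@dataclass(frozen=True)
class TriagedFinding:
    id: str
    severity: str  # set by a human at the gate
    kind: str      # "fix" (an agent may apply it) or "redesign" (an owner decides)


def next_fix_pass(findings, verified_ids):
    """Return (tier, fixes) for the highest tier that still has unverified fixes.

    Lower tiers are never returned while a higher tier has open work.
    """
    for tier in SEVERITY_ORDER:
        open_fixes = [
            f for f in findings
            if f.kind == "fix" and f.severity == tier and f.id not in verified_ids
        ]
        if open_fixes:
            return tier, open_fixes
    return None, []


def decisions_for_owners(findings):
    """Redesigns never enter a fix pass; they are routed to their owners."""
    return [f for f in findings if f.kind == "redesign"]


findings = [
    TriagedFinding("F1", "P0", "fix"),       # e.g. add a missing label
    TriagedFinding("F2", "P1", "fix"),       # e.g. rewrite an error string
    TriagedFinding("F3", "P0", "redesign"),  # e.g. reorder the flow
]
first_tier, first_pass = next_fix_pass(findings, verified_ids=set())
second_tier, second_pass = next_fix_pass(findings, verified_ids={"F1"})
```

Here the P1 fix only becomes available once the P0 fix is marked verified, and the redesign stays out of both passes however severe it is.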
Slide 7 — Content review: voice, clarity, terminology, errors
Now the second review: the words. Copy drifts for a structural reason — it is written by many hands over many years, and nobody ever reads it as a whole. Users do, and they notice when the same action is called Transfer on one screen and Send on the next. The audit works like the accessibility sweep: split the product into surfaces — buttons, empty states, errors, onboarding, emails — give each surface its own agent with the same rules, then merge everything into a terminology map and a ranked fix list. The map is the artifact that keeps paying: the canonical term for every concept, every variant in use, and exactly where each one lives.
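As a rough illustration of the terminology map, here is a Python sketch that records where each variant of a concept appears. It assumes strings are exported as a dictionary of key to text. The locale keys, the sample copy, and the choice of "Send" as the canonical term are hypothetical, and the whole-word matching is deliberately naive: it would miss inflected forms such as "paid".

```python
import re
from collections import defaultdict


def build_terminology_map(strings, concepts):
    """Map each canonical term to {term_in_use: [locale keys where it appears]}.

    strings:  {locale_key: text}
    concepts: {canonical_term: [known variants]}
    """
    term_map = {}
    for canonical, variants in concepts.items():
        locations = defaultdict(list)
        for term in [canonical, *variants]:
            # Whole-word, case-insensitive match; a naive stand-in for real review.
            pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
            for key, text in strings.items():
                if pattern.search(text):
                    locations[term].append(key)
        term_map[canonical] = dict(locations)
    return term_map


# Hypothetical exported strings.
strings = {
    "home.cta": "Transfer money",
    "confirm.title": "Send money now",
    "confirm.button": "Pay",
    "receipt.title": "Transfer complete",
}
# Hypothetical naming decision: which term is canonical is a human call.
concepts = {"Send": ["Transfer", "Pay"]}

term_map = build_terminology_map(strings, concepts)
```

The output answers the question the map exists for: which word is used where. Which word should win is still a decision for the content owner.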
Slide 8 — The voice standard: written down so it can be checked
For an agent to audit copy, the standard has to be written down in a form it can check. "Be friendly but professional" is not checkable. So the style guide gets rewritten as rules: the canonical term for each concept and the banned variants. Voice rules with examples: "you", not "the user"; active voice; no blame; no alarm words for routine states. The three-part error anatomy. A readability ceiling and sentence case. And localisation flags for concatenated strings and idioms. Get the content owner to sign off before the run, because the audit is only as legitimate as its standard. And here is the compounding part: the same rules file later reviews every new string in pull requests.
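To show what "checkable" can mean, here is a minimal Python sketch of a rules file plus a checker for the purely mechanical subset. Every value in `RULES` is a hypothetical placeholder: the banned variants, the alarm words, and the 20-word sentence ceiling are assumptions, not recommendations. Rules that need interpretation, such as blame or the error anatomy, are left to the agent pass and do not appear here.

```python
import re

# Hypothetical rules file, expressed as data so the same file can be reused later.
RULES = {
    "banned_terms": {"Transfer": "Send", "Pay": "Send"},  # variant -> canonical
    "banned_phrases": ["the user"],
    "alarm_words": ["fatal", "illegal", "abort"],
    "max_words_per_sentence": 20,  # assumed readability ceiling
}


def contains_word(term, text):
    """Whole-word, case-insensitive match."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


def check_string(key, text, rules=RULES):
    """Return a list of (key, rule, detail) violations for one string."""
    violations = []
    for variant, canonical in rules["banned_terms"].items():
        if contains_word(variant, text):
            violations.append((key, "banned_term", f"use '{canonical}', not '{variant}'"))
    for phrase in rules["banned_phrases"]:
        if contains_word(phrase, text):
            violations.append((key, "banned_phrase", phrase))
    for word in rules["alarm_words"]:
        if contains_word(word, text):
            violations.append((key, "alarm_word", word))
    for sentence in re.split(r"[.!?]+", text):
        if len(sentence.split()) > rules["max_words_per_sentence"]:
            violations.append((key, "sentence_too_long", sentence.strip()))
    return violations


bad = check_string("error.generic", "Fatal error: the user must Transfer again.")
good = check_string("send.ready", "You can send money once your card is verified.")
```

Keeping the rules as data rather than burying them in code is the design point: the content owner can read and sign off on the same object the checker consumes.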
Slide 9 — Localisation and length are part of content review
One more dimension of content review that pays for itself: localisation. Strings that are assembled from fragments at runtime, variables dropped into the middle of a sentence, idioms, humour — all of these break when word order and culture change. And length: copy that barely fits in English overflows in German. The audit flags all of this into a localisation risk log that goes straight to the translation vendor. The economics are simple. Fixing the worst strings once, in English, before translation, costs a fraction of fixing them in every target language afterwards. One team's vendor quoted lower simply because the source was being cleaned up first.
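A few of these risks can be flagged mechanically before an agent reads the strings for idiom and humour. The Python sketch below is one way to do that, under stated assumptions: placeholders use `{name}` syntax, a string with a lowercase first letter or stray edge whitespace is probably a fragment, and translated copy may run about 35 percent longer than English. That expansion factor is an assumed rule of thumb, not a figure from this module, and the flag names are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{[^{}]+\}")  # assumes {name}-style variables
EXPANSION_FACTOR = 1.35                  # assumed allowance for longer translations


def localisation_flags(text, max_chars=None):
    """Return a list of localisation risk flags for one source string."""
    flags = []
    stripped = text.strip()
    body_end = len(stripped.rstrip(".!?"))

    # A variable with words on both sides ties the sentence to English word order.
    for match in PLACEHOLDER.finditer(stripped):
        if match.start() > 0 and match.end() < body_end:
            flags.append("variable_mid_sentence")
            break

    # Edge whitespace or a lowercase start suggests runtime concatenation.
    if text != stripped or stripped[:1].islower():
        flags.append("likely_fragment")

    # Copy that barely fits in English is likely to overflow once translated.
    if max_chars is not None and len(stripped) * EXPANSION_FACTOR > max_chars:
        flags.append("length_risk")

    return flags


risk_log = {
    "inbox.count": localisation_flags("You have {count} new messages."),
    "cart.joiner": localisation_flags(" and then "),
    "checkout.cta": localisation_flags("Place order", max_chars=12),
}
```

These heuristics will produce false positives, which is acceptable for a risk log: the point is to give the translation vendor a short list to look at first, not a verdict.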
Slide 10 — Good findings vs bad findings
Whether the finding is about contrast or about copy, the quality bar is the same. "Contrast is poor in places" gets argued with. "The place-order button is two point nine to one against white, needs four and a half, here is the token to use" gets fixed. "The error messages feel cold" is a vibe. "This exact string, at this locale key, has no next step, here is a rewrite" is a decision someone can make in a minute. Before the gate review, spot-check a sample of findings and ask one question: could someone fix this without coming back to ask what it means? If not, tighten the reviewer's instructions, not the individual findings.
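The numbers in a good contrast finding come from a published formula, so anyone can reproduce them. Here is a short Python sketch of the WCAG 2.x contrast ratio calculation for two hex colours. The function names are my own, and the sample grey `#767676` is an illustrative colour, not the button colour from the case study.

```python
def relative_luminance(hex_colour):
    """Relative luminance of an sRGB colour given as '#rrggbb', per WCAG 2.x."""
    value = hex_colour.lstrip("#")
    channels = [int(value[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    red, green, blue = linear
    return 0.2126 * red + 0.7152 * green + 0.0722 * blue


def contrast_ratio(colour_a, colour_b):
    """Contrast ratio between two colours, from 1.0 (identical) up to 21.0."""
    lum_a = relative_luminance(colour_a)
    lum_b = relative_luminance(colour_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(colour_a, colour_b):
    """WCAG AA asks for at least 4.5:1 for normal-size text."""
    return contrast_ratio(colour_a, colour_b) >= 4.5


grey_on_white = contrast_ratio("#767676", "#ffffff")
```

A finding that quotes the measured ratio, the required ratio, and a replacement token that passes this check leaves nothing to argue about.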
Slide 11 — Worked example: one checkout flow through both reviews
Let's trace one product area through both reviews, using the case studies behind the two workflows. The accessibility sweep covered five checkout routes in about fifty minutes and found twenty-three contrast failures — including the place-order button at two point nine to one — plus a discount error signalled only by a red border. The content audit read about eighteen hundred strings in about the same time and found the core action named Transfer, Send, and Pay, sometimes in the same flow. In both cases the agent time was under an hour, the fixes landed within days because every finding carried its location, and a re-run confirmed the result. The slow part, and the human part, was the decisions: what counts as P0, and which word the product uses.
Slide 12 — What these reviews cannot prove
Now the limits, stated plainly, because over-claiming here is how these practices lose credibility. Passing the sweep means the failures it can detect are gone, which is genuinely valuable, and which is less than "accessible". Real screen reader behaviour still needs hands-on testing. Usability for disabled users still needs research with disabled users. Formal conformance needs a documented human evaluation, not a green report. On the content side, the audit can show you that three terms are in use; it cannot tell you which one is right for your users. And both reviews only see what you put in front of them: unscanned routes and unexported strings simply do not exist to the sweep. The agent detects. The team decides.
Slide 13 — Exercise: a single-flow accessibility pass on your product
Your exercise for this module: one flow, this week. Pick somewhere failure is expensive — checkout, onboarding, your most important form. Run an automated check on each route. Then do something most people skip: a manual keyboard walk, writing down the focus order you actually get. Hand the results and the source to an agent and ask for findings — criterion, severity, evidence, fix — read-only, no changes. Then do the human part yourself: rank them, split fixes from redesigns, and note where you disagree with the agent's severity. Finally, take the three worst strings you found and rewrite them as what happened, why, and what to do next. Keep the list — you will need it in Module 5.
Slide 14 — Summary, and what comes next
Let's close. Accessibility and content review get deferred because they are careful reading at scale — and careful reading at scale is precisely what agents do well. The model is three bands: tools prove the mechanical failures, agents review meaning, order, and voice, and humans make the calls on severity, naming, and release. Both reviews share one shape: capture evidence, fan out reviewers, merge and rank, hold the gate, fix in passes, re-verify. And both depend on standards written down in checkable form. Neither pass proves your product is accessible or well written — it clears the detectable failures so your judgment goes where only it can. Module 5 wires all of this into every pull request. See you there.