Agentic Design School
Module 2 of 5
45–55 minutes

Design Review and Critique with Agents

Heuristic Evaluation and Cognitive Walkthroughs at Scale

Two classic methods most teams skip because of cost, run across an entire product by agents: standard heuristics applied screen by screen, task-based walkthroughs step by step, and the human pass that turns findings into decisions.

Duration: 45–55 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Set up a heuristic evaluation across a full product surface with consistent criteria.
  • Run cognitive walkthroughs against defined user tasks rather than screens in isolation.
  • Triage large finding sets without drowning the team in low-severity noise.
Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Heuristic Evaluation and Cognitive Walkthroughs at Scale

Design Review and Critique with Agents · Module 2 of 5

  • The methods were never wrong — they were just expensive to run properly
  • Heuristic evaluation: the same criteria, applied to every screen and state
  • Cognitive walkthroughs: defined tasks, four questions at every step
  • Severity, evidence, and triage: turning hundreds of findings into decisions

Agents change the economics of inspection methods, not their logic. The rigour still comes from the method; the coverage now comes from the agents.

Slide notes

Module 1 built the critique loop: named dimensions, the agent as first reviewer, findings with evidence, and the human deciding what matters. This module takes that loop and points it at two formal inspection methods that pre-date all of this by decades — heuristic evaluation and the cognitive walkthrough. They were never discredited; they were rationed. Most teams run them once, on the flow everyone already worries about, and call the rest of the product covered.

The argument of the module is narrow and worth stating up front: agents change the cost of running these methods properly, and nothing else. The heuristics are still Nielsen's plus whatever the team has committed to in writing. The four walkthrough questions are still the four questions. The agents supply patient, consistent application across every flow and every task-persona combination — the part humans do unevenly by the fifth flow of the afternoon — and the humans still confirm severity and decide what gets fixed.

Flag the structure: heuristic evaluation first, walkthroughs second, then the parts they share (evidence rules, triage, and the human validation pass), followed by a worked example and an exercise small enough to run this week.

Narration for this slide

Welcome back. In Module 1 we built the critique loop: named dimensions, evidence, the human making the calls. This module points that loop at two methods that have been in the UX toolkit for over thirty years: heuristic evaluation and the cognitive walkthrough. The methods were never wrong. They were just expensive, expensive enough that most teams run them once, on one flow, and quietly skip the rest. Agents change those economics. The criteria stay the same, the four questions stay the same, but coverage stops being the limiting factor. What you have to get right instead is the setup, the evidence rules, and the triage. That is this module.

Slide 2 of 13 · 16:9

Heuristic evaluation: a refresher

An evaluator walks the interface against a short list of usability principles and records where it breaks them, with evidence.

  • Nielsen's ten heuristics are the standard starting set: visibility of status, consistency, error prevention, recognition over recall, and so on
  • It needs no participants — it predicts problems from principles rather than observing them
  • It catches a useful share of real problems cheaply, especially consistency and feedback failures
  • Its known weakness is coverage: done by hand, it gets applied to three flows, not the whole product

Heuristic evaluation predicts problems; usability testing observes them. Keep the two claims separate and the method stays honest.

Slide notes

Keep the refresher conservative. Heuristic evaluation, as defined by Nielsen and Molich, is a small group of evaluators inspecting an interface against a short list of usability principles and recording violations. It is cheap, it needs no participants, and it reliably surfaces a worthwhile share of real problems — particularly the procedural ones: inconsistent patterns, missing feedback, errors with no recovery path. It does not measure how often a problem occurs or how much it costs, and it cannot stand in for testing with real users. Those limits are part of the method, not a flaw agents introduce.

The practical weakness has always been labour. Walking every step of every flow, holding each screen against each heuristic, and writing findings precise enough that someone else can locate the problem is patient, repetitive work. A two-person team facing seven core flows either spends a week or evaluates the three flows everyone already worries about. That rationing — not any doubt about the method — is why most products carry years of unexamined usability debt.

If the audience includes people who have never run one by hand, it is worth saying that doing one screen manually at least once is the fastest way to develop a feel for what a good finding looks like. The agent-scaled version is the same activity with the tedium removed, and you review it better if you have felt the tedium.

Narration for this slide

Quick refresher first. Heuristic evaluation is an evaluator walking the interface against a short list of usability principles — Nielsen's ten are the standard set — and writing down where the interface breaks them, with evidence. It needs no participants, it is cheap, and it catches a real share of genuine problems, especially consistency and feedback failures. It also has honest limits: it predicts problems from principles, it does not observe them happening, and it cannot tell you how often or how badly. Its biggest practical weakness has always been coverage. Done by hand, it gets applied to the three flows everyone worries about, and the rest of the product never gets looked at.

Slide 3 of 13 · 16:9

Scaling it: every flow, the same contract

The scaled version is one evaluator agent per flow, all working from the same written heuristics file and the same evidence rules.

  • The heuristics file is the contract: Nielsen's ten plus five to eight project-specific heuristics, each with an ID, a definition, and an example
  • One agent per flow — fresh attention for flow seven, identical criteria to flow one
  • Evidence is captured up front: ordered screenshots per flow, plus access to the live build for missing states
  • A merge agent dedupes recurring problems and ranks the result; severities stay candidates until a human confirms them

heuristics.md (excerpt):
# Heuristics file: Lumen banking app, June 2026

Findings must reference exactly one heuristic ID.

## Nielsen's 10 (NN-01 to NN-10)
NN-01 Visibility of system status
NN-04 Consistency and standards
NN-09 Help users recognise, diagnose, and recover from errors
...

## Project heuristics (LUM-01 to LUM-05)
LUM-01 The current balance is visible within one tap from launch.
LUM-04 Fees and rates are shown before the user commits, not in the receipt.

The evaluation is only as good as the heuristics file. Keep the project additions short — twenty added heuristics means everything violates something.

Slide notes

The scaling move is structural, not clever prompting. Each key flow gets its own evaluator agent, and every agent works from the same three inputs: the heuristics file, a folder of ordered screenshots for its flow, and access to the live or staging build for the states the screenshots missed — errors, empty states, what happens after a wrong input. Because each flow gets a fresh agent, the seventh flow is evaluated with the same attention as the first, which is precisely where human evaluators degrade.

The heuristics file deserves the most discussion time. Nielsen's ten are the stable base; the project heuristics are the rules that are true for this product but not products in general — a banking app's rule that the balance is never hidden behind promotional content, a B2B tool's rule that admin actions are reversible. Each needs an ID, a one-sentence definition, and one example of a violation, because every finding will reference an ID and the merge stage dedupes partly by it. Keep the project list to five to eight; a long list dilutes everything.
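The rule that every finding references exactly one heuristic ID, and the five-to-eight limit on project heuristics, can both be checked mechanically before any human reads the output. The following is a minimal sketch, not part of the workflow as documented: the `Heuristic` class, the helper names, and the assumption that IDs always look like `NN-04` or `LUM-01` are all illustrative.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Heuristic:
    id: str          # e.g. "NN-04" or "LUM-01"
    definition: str  # one-sentence definition
    example: str     # one example of a violation


# Assumed ID format, inferred from the IDs shown in the excerpt.
ID_PATTERN = re.compile(r"\b(?:NN|LUM)-\d{2}\b")


def cited_ids(finding_text: str) -> list[str]:
    """Return the distinct heuristic IDs mentioned in a finding, in order."""
    seen: list[str] = []
    for match in ID_PATTERN.findall(finding_text):
        if match not in seen:
            seen.append(match)
    return seen


def references_exactly_one(finding_text: str, known_ids: set[str]) -> bool:
    """True when the finding cites one heuristic ID and that ID is in the file."""
    ids = cited_ids(finding_text)
    return len(ids) == 1 and ids[0] in known_ids


def project_list_in_range(heuristics: list[Heuristic], prefix: str = "LUM-",
                          low: int = 5, high: int = 8) -> bool:
    """True when the number of project heuristics is within the agreed range."""
    count = sum(1 for h in heuristics if h.id.startswith(prefix))
    return low <= count <= high
```

A finding that cites two IDs, or none, fails the first check and can be returned to the evaluator agent before the merge stage sees it.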

The other discipline is evidence. Capture the screenshots before the run with a small script, name them by flow and step, and require every finding to cite a filename or a described action in the build. The merge agent then combines all per-flow findings, dedupes problems that recur across flows into one finding with multiple locations, and ranks by candidate severity and frequency. The word candidate is doing real work there — the agent proposes, the human confirms, and that division is the design of the workflow rather than a disclaimer bolted on.
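Naming screenshots by flow and step is easier to enforce when one helper builds every filename. This is a sketch under assumptions: the one-folder-per-flow layout and the `NN-label.png` pattern are inferred from the single example filename `04-transfer-failed.png`, and the capture step itself (whatever tool takes the screenshot) is left out.

```python
import re
from pathlib import PurePosixPath


def slug(text: str) -> str:
    """Lowercase and replace runs of non-alphanumerics with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def screenshot_path(flow: str, step: int, label: str) -> PurePosixPath:
    """Build '<flow>/<step>-<label>.png' (hypothetical naming convention)."""
    return PurePosixPath(slug(flow)) / f"{step:02d}-{slug(label)}.png"


def cites_captured_screenshot(finding_text: str,
                              captured: list[PurePosixPath]) -> bool:
    """True when the finding mentions the filename of a captured screenshot.

    Findings that cite a described action in the build instead of a file
    will return False here and need a human check.
    """
    return any(path.name in finding_text for path in captured)
```

For example, `screenshot_path("Send money", 4, "Transfer failed")` yields `send-money/04-transfer-failed.png`.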

Narration for this slide

Here is the scaled version. You write the heuristics file once: Nielsen's ten, plus five to eight project heuristics — the rules that are true for your product specifically, each with an ID, a definition, and an example. You capture evidence up front: ordered screenshots for every flow, plus access to the staging build for the states screenshots miss. Then one evaluator agent per flow, all reading the same file, all bound by the same rule — every finding cites a heuristic ID and a piece of evidence. A merge agent dedupes the problems that recur across flows and ranks the result. And every severity is labelled a candidate, because confirming severity is your job, not the agent's.

Slide 4 of 13 · 16:9

The shape of an evaluation at scale

Both methods share the same pipeline: human-defined inputs, an agent fan-out, one merge, and a human gate before anything reaches a backlog.

[Diagram: pipeline of evaluation at scale. On the left, two human-led input cards: the product surface (key flows, screens, states, or user tasks worth walking, with screenshots and a live build) and the shared contract (a heuristics file or a task list with personas and the four walkthrough questions, plus evidence rules and a candidate severity scale). These fan out to four agent-run evaluator cards (one per flow for heuristic evaluation, one per task and persona for walkthroughs), which fan back into a single merge agent that dedupes recurring problems, groups them, ranks them by candidate severity and frequency, and lists what was checked and passed. The merged report passes through a human severity pass, where a design lead opens the evidence, confirms or adjusts severity, and rejects unsupported findings, before the confirmed set feeds a fix backlog, a redesign brief, or an archived baseline.]

The product surface and the shared contract are written by humans; the evaluators and the merge are agent-run; the severity pass is the gate. Nothing reaches a backlog or a client without passing through it.

The fan-out buys coverage. The merge buys a readable report. The severity pass is what makes the result trustworthy.

Slide notes

Walk the diagram left to right and name who owns each column. The two left cards are human work and they are where the quality is decided: choosing which flows or tasks count as the product surface, and writing the shared contract — the heuristics file for an evaluation, or the task list, personas, and four questions for a walkthrough, plus the evidence rules and the candidate severity scale that both methods use.

The middle two columns are agent-run. The fan-out is one evaluator per flow, or one walkthrough per task-and-persona cell, each returning evidence-backed candidate findings. The merge agent is what makes the output usable: without it you have several hundred scattered notes; with it you have recurring problems collapsed into single findings with multiple locations, grouped, ranked, and accompanied by a list of what was checked and found clean — which matters, because an evaluation that only reports problems gives no sense of coverage.

The right side is the human gate and the outputs. The severity pass is a sample-and-confirm activity, not a re-run of the evaluation; the next slides cover how to do it without it becoming a second audit. The confirmed set then feeds one of three things — a fix backlog with owners, a redesign brief, or an archived baseline the next run is compared against. If the team cannot say which of those three the run feeds, the run is not ready to start.
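The shape of the pipeline (fan-out, one merge, one human gate) can be written down as a small orchestration function. This is a structural sketch only: `evaluate`, `merge`, and `severity_pass` are hypothetical callables standing in for the evaluator agent, the merge agent, and the human review session, and findings are represented as plain dictionaries for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_evaluation(units: list[str],
                   evaluate: Callable[[str], list[dict]],
                   merge: Callable[[list[dict]], list[dict]],
                   severity_pass: Callable[[list[dict]], list[dict]]) -> list[dict]:
    """Fan out one evaluator per unit, merge once, then pass the human gate.

    A unit is a flow (heuristic evaluation) or a task-and-persona cell
    (walkthrough). Nothing is returned without going through severity_pass.
    """
    with ThreadPoolExecutor() as pool:
        # One independent evaluation per unit; map preserves input order.
        per_unit = list(pool.map(evaluate, units))
    candidates = [finding for findings in per_unit for finding in findings]
    merged = merge(candidates)
    return severity_pass(merged)
```

Keeping the gate as the final call in the function mirrors the rule in the diagram: there is no code path from the evaluators to a backlog that bypasses the severity pass.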

Narration for this slide

This diagram is the whole module on one screen, and it covers both methods. On the left, the human work: deciding what surface gets evaluated, and writing the shared contract — the heuristics file, or the tasks, personas, and four questions. In the middle, the agent work: a fan-out of evaluators, one per flow or per task and persona, each returning findings with evidence, and then a single merge that dedupes, groups, and ranks. On the right, the human gate: the severity pass, where you open the evidence, confirm or adjust severities, and reject what does not hold up. Only then does anything feed a backlog, a brief, or a baseline. Fan-out buys coverage, merge buys readability, and the gate buys trust.

Slide 5 of 13 · 16:9

Cognitive walkthroughs: four questions at every step

A walkthrough evaluates a task, not a screen: who is doing it, what they already know, and whether each step would make sense to them.

  • Q1 — Will the user try to achieve the right effect? Do they even know this step is needed?
  • Q2 — Will they notice that the correct action is available?
  • Q3 — Will they associate the action with the effect they want? Does the label mean to them what it means to the team?
  • Q4 — If they perform it, will they see that progress was made?
  • Any "no" makes the step a failure point, with a screenshot and a severity guess

The agent is simulating an evaluator applying a method, not simulating a user. The output is hypotheses to test, not user behaviour observed.

Slide notes

The cognitive walkthrough is the second classic method, and it answers a different question to heuristic evaluation. Heuristics ask whether the interface follows good principles; the walkthrough asks whether a specific person, with a specific level of knowledge, could learn their way through a specific task. It is the method for first-run experiences, onboarding, admin consoles where new staff must self-serve — anywhere learnability is the question.

The discipline is the four questions, asked at every step, in order, without skipping the ones that feel obvious. Each answer is a yes or no with a one-sentence justification grounded in what is visible on the screen, plus a screenshot reference. Any no makes the step a failure point with a severity guess: blocks the task, causes hesitation or error, or minor friction. Write the questions into a per-step template the agents must follow, so every step of every task gets the same treatment and the merge stage can compare like with like.
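One way to make the per-step template enforceable is to give it a fixed structure that rejects incomplete records. The sketch below assumes a simple in-memory representation; the class and field names are illustrative, and only the four-question rule, the justification and screenshot requirement, and the three severity guesses come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITY_GUESSES = ("blocks the task", "causes hesitation or error", "minor friction")


@dataclass
class Answer:
    yes: bool
    justification: str  # one sentence grounded in what is visible on screen
    screenshot: str     # screenshot reference


@dataclass
class StepRecord:
    step: int
    answers: list[Answer]                 # Q1..Q4, in order
    severity_guess: Optional[str] = None  # required only for failure points

    @property
    def failed_questions(self) -> list[str]:
        """Labels of the questions answered 'no', e.g. ['Q2', 'Q4']."""
        return [f"Q{i}" for i, a in enumerate(self.answers, start=1) if not a.yes]

    @property
    def is_failure_point(self) -> bool:
        """Any 'no' makes the step a failure point."""
        return bool(self.failed_questions)

    def validate(self) -> None:
        """Raise ValueError if the record does not follow the template."""
        if len(self.answers) != 4:
            raise ValueError("all four questions must be answered")
        if any(not a.justification or not a.screenshot for a in self.answers):
            raise ValueError("every answer needs a justification and a screenshot")
        if self.is_failure_point and self.severity_guess not in SEVERITY_GUESSES:
            raise ValueError("a failure point needs a severity guess")
```

Because every step produces the same record shape, the merge stage can compare steps across tasks and personas without interpreting free text.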

State the caveat plainly and early, because it is the one this method most needs: the agent is simulating an evaluator applying an inspection method, not simulating a user's mind. The verdicts are predictions under the method. The way to confirm the ones that matter is usability testing with real participants — and the walkthrough's most useful output is often a sharper test plan, because it tells you exactly which screens and steps to watch.

Narration for this slide

The second method is the cognitive walkthrough, and it evaluates tasks, not screens. You define who is doing the task and what they already know, and at every step you ask four questions. Will they try to achieve the right effect — do they even know this step is needed? Will they notice the correct action is available? Will they connect the action to their goal — does the label mean to them what it means to the team? And if they do it, will they see that progress was made? Any no makes that step a failure point, with a screenshot and a severity guess. One thing to say plainly: the agent is simulating an evaluator running a method, not simulating a user. These are predictions. Usability testing is how you confirm the ones that matter.

Slide 6 of 13 · 16:9

Tasks and personas: the anchors of a walkthrough

The task list is the answer key the interface is checked against. The personas are knowledge profiles, not posters.

  • Each task names a starting point and the correct action sequence — writing that sequence often surfaces the first findings on its own
  • A persona is mostly a statement of what the person already knows: domain, product, and the conventions the interface assumes
  • Two or three personas per run — typically a domain novice, a practitioner new to this product, and a power user
  • One walkthrough agent per task-and-persona cell; nine tasks across three personas is twenty-seven walkthroughs
  • Reuse the same tasks and personas across runs so results stay comparable

The novice failing a step the expert sails through is not noise — that difference is the finding, and it locates the problem as learnability.

Slide notes

Two human-authored inputs decide the quality of a walkthrough run, and neither can be delegated. The first is the task list. Each task needs a starting point and the correct action sequence — the answer key the walkthrough checks the interface against. Writing that sequence is itself diagnostic: if the team cannot agree what the correct sequence is, or cannot describe a task in a sentence, that is a finding before any agent runs. Choose tasks where learnability matters: the first-run path, the things new staff must do unaided, the tasks that generate support tickets.

The second input is the personas, and for a walkthrough a persona is almost entirely a knowledge profile: what they know about the domain, what they know about this product, what conventions they can be assumed to recognise, and what they have already done — installed the CLI, received the invite email, hold an account. Resist demographic decoration; the four questions never ask how old the persona is. Two or three personas is usually enough, and more multiplies cost faster than insight, because the run size is the product of tasks and personas.

The payoff of running multiple personas is differential findings. A step that fails for the novice and passes for the expert is precisely located as a learnability gap rather than a flaw in the flow itself, which changes the fix — often a label or one line of explanation rather than a redesign. Keep the task list and persona files in the repository and reuse them; rerunning the same grid after each significant release turns the walkthrough into a cheap learnability regression check.
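The run size and the differential findings both fall out of treating the run as a grid. A minimal sketch, assuming results are stored as a mapping from `(task, persona, step)` to a pass or fail verdict; that storage format and the function names are hypothetical.

```python
from itertools import product


def walkthrough_grid(tasks, personas):
    """One walkthrough per task-and-persona cell: the run size is the product."""
    return list(product(tasks, personas))


def learnability_gaps(results, novice, expert):
    """Steps that fail for the novice persona but pass for the expert persona.

    results maps (task, persona, step) -> True (passed) or False (failed).
    """
    gaps = [
        (task, step)
        for (task, persona, step), passed in results.items()
        if persona == novice and not passed
        and results.get((task, expert, step)) is True
    ]
    return sorted(gaps)
```

With nine tasks and three personas, `walkthrough_grid` returns twenty-seven cells, which matches the count on the slide. Steps that fail for both personas are deliberately excluded from `learnability_gaps`, since those point at the flow itself.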

Narration for this slide

Two inputs anchor a walkthrough, and both are yours to write. First, the task list — each task with its starting point and the correct sequence of actions. That sequence is the answer key, and just writing it down often surfaces the first findings, because the team discovers it cannot agree on what the correct path is. Second, the personas — and for this method a persona is a knowledge profile, not a poster. What do they know about the domain, about this product, about the conventions the interface assumes? Two or three is enough. Then it is one agent per task and persona combination. And the most useful findings are the differences: when the novice fails a step the expert sails through, you have located a learnability problem precisely.

Slide 7 of 13 · 16:9

Severity, confidence, and evidence on every finding

A finding is useful when someone who was not in the run can locate the problem, see the evidence, and understand what to do about it.

Aspect | Weak finding | Useful finding
Criterion | "The send-money flow is confusing" | Names one heuristic ID or one walkthrough question that failed
Evidence | None (restated opinion) | Cites a screenshot filename or a described action in the build
Severity | Stated as fact | A candidate rating on an agreed scale, awaiting human confirmation
Location | "The app is inconsistent" | Recurs in 4 flows: one finding, four listed locations
Suggested fix | "Redesign the navigation" | One or two sentences, scoped within the existing design system

The fastest quality check: every finding names its criterion and its evidence. Findings that cannot do both go back.

Slide notes

This table is the quality bar for both methods, and it is the same bar Module 1 set for critique findings: criterion, evidence, location, candidate severity, scoped suggestion. The difference at scale is volume — a seven-flow evaluation or a twenty-seven-cell walkthrough produces hundreds of candidate findings, and the only way the team can review that many honestly is if every one of them is checkable without re-running the work. That is what the evidence rule buys.

Walk a couple of the contrasts. "The send-money flow is confusing" is an opinion; "NN-09, send-money step 4, screenshot 04-transfer-failed.png: a failed transfer shows error code TRX-409 with no explanation or retry path" is a finding someone can open, verify, and fix. "The app is inconsistent" is a vibe; "destructive actions use three different confirmation patterns, recurring in four flows, locations listed" is one deduplicated finding with a scoped fix. Severity stated as fact is the most dangerous weak pattern, because agents will assign severity with unearned confidence if the prompt allows it — which is why the scale is agreed in advance, every rating is labelled candidate, and confirmation is structurally a human step.

Confidence is worth a sentence too: agents over-report near-duplicates and occasionally stretch a heuristic to cover an aesthetic preference. Requiring the criterion ID makes the stretch visible — if the finding does not genuinely fit the heuristic it cites, it is either a judgment to be labelled as such or a guess to be rejected at the severity pass.
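The "criterion plus evidence" check is simple enough to run automatically on every finding before the merge. The sketch below assumes a hypothetical `Finding` record with the five fields from the table; the 1 to 4 severity scale is inferred from the severity levels mentioned elsewhere in this module and should be replaced by whatever scale the team has agreed.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    criterion: str = ""                                 # heuristic ID or walkthrough question
    evidence: list[str] = field(default_factory=list)   # filenames or described actions
    locations: list[str] = field(default_factory=list)
    candidate_severity: int = 0                         # 0 = unset; assumed 1-4 scale
    suggested_fix: str = ""


def missing_fields(finding: Finding) -> list[str]:
    """Names of the two mandatory fields the finding fails to supply."""
    missing = []
    if not finding.criterion.strip():
        missing.append("criterion")
    if not finding.evidence:
        missing.append("evidence")
    return missing


def quality_check(finding: Finding) -> str:
    """'accept' when criterion and evidence are both present, else 'send back'."""
    return "accept" if not missing_fields(finding) else "send back"
```

This only confirms that a criterion and evidence are named; whether the evidence actually supports the claim is still decided at the severity pass.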

Narration for this slide

Every finding, from either method, has to clear the same bar. It names the criterion — a heuristic ID or the walkthrough question that failed. It cites evidence — a screenshot filename or an action in the build that anyone can reproduce. Its severity is a candidate on a scale you agreed in advance, not a fact. If it recurs, it is one finding with multiple locations, not five copies. And the suggested fix is a sentence or two inside your existing design system, not a call to redesign the navigation. The fastest quality check is brutal and simple: criterion plus evidence. A finding that cannot name both is either a judgment to label as such, or a guess to send back.

Slide 8 of 13 · 16:9

Triage: from hundreds of findings to a workable queue

The risk of evaluation at scale is not missing problems — it is burying the team in low-severity noise.

  • Dedupe first: recurring problems become one finding with multiple locations, ranked partly by frequency
  • Group by heuristic or by screen — clusters scope one fix instead of dozens of task-level patches
  • Rank by candidate severity, then frequency; read the top of the list, sample the rest
  • Note what was checked and found clean — coverage is part of the report, not just violations
  • Decide the destination before the run: fix backlog, redesign brief, test plan, or archived baseline

A list of 96 findings is not a plan. Triage is what turns an evaluation into three blockers, a consolidation theme, and a baseline.

Slide notes

Scale creates a new failure mode the manual versions of these methods never had: the team receives hundreds of candidate findings, feels obliged to process all of them, and either burns a sprint on severity-1 friction or shelves the whole report. Triage is the answer, and most of it is mechanical enough for the merge agent to do before any human reads a line.

The sequence: dedupe, group, rank. Deduplication collapses the same unlabelled icon button found in four flows into one finding with four locations — and frequency then becomes part of the ranking signal. Grouping by heuristic or by screen is what surfaces the patterns worth more than any individual finding: nineteen consistency findings traceable to three competing button patterns is a consolidation workstream, not nineteen tickets; twenty-two failure points clustered on one permissions screen is one redesign, not nine task-level fixes. Ranking by candidate severity then frequency tells the human reviewer where to spend the confirmation effort.
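The dedupe, group, rank sequence can be sketched in a few functions. This is a simplification under stated assumptions: findings are dictionaries with `criterion`, `problem`, `location`, and `severity` keys, and two findings count as duplicates only when criterion and problem text match exactly. A merge agent would match near-duplicates more loosely, which this sketch does not attempt.

```python
from collections import defaultdict


def dedupe(findings):
    """Collapse repeated problems into one finding with multiple locations."""
    merged = {}
    for f in findings:
        key = (f["criterion"], f["problem"])  # assumed exact-match duplicate key
        entry = merged.setdefault(key, {
            "criterion": f["criterion"],
            "problem": f["problem"],
            "severity": f["severity"],
            "locations": [],
        })
        # Keep the highest candidate severity seen for this problem.
        entry["severity"] = max(entry["severity"], f["severity"])
        if f["location"] not in entry["locations"]:
            entry["locations"].append(f["location"])
    return list(merged.values())


def group_by_criterion(findings):
    """Group deduplicated findings by the heuristic or question they cite."""
    groups = defaultdict(list)
    for f in findings:
        groups[f["criterion"]].append(f)
    return dict(groups)


def rank(findings):
    """Order by candidate severity, then by number of locations (frequency)."""
    return sorted(findings, key=lambda f: (-f["severity"], -len(f["locations"])))
```

Run on the unlabelled icon button example, four raw findings from four flows become one entry whose `locations` list has four items, and that length is what feeds the frequency part of the ranking.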

Two more habits keep the queue workable. First, the report should list what was checked and found clean — heuristics with zero findings, tasks that passed for every persona — because coverage is half the value and it is invisible if only violations are reported. Second, decide before the run what the output feeds: a fix backlog with owners, a redesign brief, a usability test plan, or an archived baseline for the next comparable run. Findings without a destination become a document nobody opens twice.
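Listing what was checked and found clean is a set difference between the criteria in the contract and the criteria that appear in the findings. A minimal sketch, assuming findings are dictionaries with a `criterion` key; the function name is illustrative.

```python
def found_clean(checked_ids, findings):
    """Return the checked criteria that no finding cites, in contract order."""
    violated = {f["criterion"] for f in findings}
    return [criterion for criterion in checked_ids if criterion not in violated]
```

The same function works for walkthroughs if `checked_ids` holds task names and each finding records the task it came from under the same key.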

Narration for this slide

Here is the new problem scale creates: the run works, and you get back ninety-six findings. A list that long is not a plan — it is a way to demoralise a team. So triage. The merge stage dedupes recurring problems into single findings with multiple locations, groups them by heuristic or by screen, and ranks by candidate severity and then frequency. The groups are usually worth more than the individual findings — nineteen consistency violations that all trace back to three competing button patterns is one consolidation effort, not nineteen tickets. Make sure the report also says what was checked and came back clean. And decide before the run where the output goes: a backlog, a redesign brief, a test plan, or a baseline. Findings without a destination just become a long document.

Slide 9 of 13 · 16:9

The human evaluation pass: validate a sample, not everything

Confirming severity is a gate, not a second audit. The reviewer's job is to check that the evidence holds and the ranking can be trusted.

  • Open the evidence for every severity-3 and severity-4 candidate — these are the ones that drive decisions
  • Sample the middle and low severities rather than re-reading all of them
  • Reject findings whose evidence does not hold up when you open the screenshot
  • Mark duplicates the merge missed, and adjust severities you disagree with — your adjustment is the record of judgment
  • Expect a non-zero rejection rate: confirming everything means the review was a rubber stamp

The rejection rate is information, not friction. An evaluation where every candidate survives has not been reviewed.

Slide notes

The severity pass is where the module's recurring theme — agents propose, humans decide — becomes a concrete working session, and it is worth being precise about its size. For a seven-flow evaluation the agent run itself typically finishes well inside the hour; budget most of the one-to-two-hour total for this pass. It is not a re-run of the evaluation. The reviewer opens the evidence behind every high-severity candidate, samples the middle and low bands, confirms or adjusts severities, marks duplicates the merge missed, and rejects findings whose evidence does not support the claim when the screenshot is actually opened.

The rejection rate deserves emphasis because teams misread it. Rejecting some findings is not the workflow failing — it is the review doing its job, and it is information about how the agents are misfiring: too many near-duplicates suggests the merge instructions need tightening, heuristics stretched to cover aesthetic preferences suggests the heuristics file needs sharper definitions or the evaluator agent needs a firmer rule. A pass that confirms every candidate has not reviewed anything; it has rubber-stamped the agent's judgment and attached a human name to it.
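The review queue and the rejection rate can both be computed from the merged report. This sketch assumes a 1 to 4 candidate severity scale with 3 and 4 treated as high; the 25 percent sample of the lower bands is a hypothetical default, not a figure from the workflow, and should be set by the reviewer.

```python
import math
import random


def review_queue(findings, sample_fraction=0.25, seed=0):
    """Every severity-3 and severity-4 candidate, plus a sample of the rest."""
    high = [f for f in findings if f["severity"] >= 3]
    rest = [f for f in findings if f["severity"] < 3]
    # Round up so a non-empty lower band always contributes at least one item.
    k = math.ceil(len(rest) * sample_fraction)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return high + rng.sample(rest, k)


def rejection_rate(confirmed: int, rejected: int) -> float:
    """Share of reviewed candidates that the human reviewer rejected."""
    total = confirmed + rejected
    if total == 0:
        raise ValueError("no findings were reviewed")
    return rejected / total
```

A rate of exactly zero is the signal the notes warn about: it usually means the evidence was not opened.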

The pass is also where accountability transfers. Once the design lead has confirmed severities, the report stops being agent output and becomes the team's assessment — which is exactly what you want before it reaches a backlog, an executive, or a client. Archive the confirmed set together with the heuristics file and the evidence folder, so the next run is compared against this one rather than against memory.

Narration for this slide

After the merge comes the human pass, and the key is to size it correctly: it is a gate, not a second audit. Open the evidence behind every high-severity candidate, because those are the findings that will drive decisions. Sample the middle and the low end rather than re-reading everything. Reject findings whose evidence does not hold up when you actually open the screenshot — and expect to reject some. That rejection rate is information: it tells you how the agents are misfiring and it proves the review happened. A pass that confirms every single candidate is a rubber stamp. Once you have confirmed severities, the report stops being agent output and becomes the team's assessment, with your name on it. That is the point of the gate.

Slide 10 of 13 · 16:9

Worked example: a 7-flow evaluation before a redesign

A fintech team ran the heuristic workflow across seven flows to establish a usability-debt baseline before redesign work began.

Stage | What happened
Setup | 7 flows, Nielsen's 10 plus 5 project heuristics, screenshots captured per flow plus staging access
Fan-out | One evaluator agent per flow returned 96 candidate findings
Merge | Deduped to 61 findings, grouped by heuristic, ranked by candidate severity and frequency
Severity pass | 54 confirmed, 7 rejected for evidence that did not hold up
Headline pattern | 19 of 54 findings were consistency (NN-04), traced to three competing button and confirmation patterns
Destination | 3 blockers fixed, a consolidation workstream added to the redesign brief, the run archived as the baseline

The blockers were half-known already. The pattern — one consistency problem wearing nineteen disguises — is what the team could not have seen flow by flow.

Slide notes

This run is documented in the heuristic-evaluation workflow on the school's site; treat the numbers as one traced run, not a benchmark. The team was about to start a major redesign and wanted a baseline of usability debt before any new design work began — one of the classic moments to reach for the method, alongside before/after comparisons of the same flow and an agency's structured first read of a prospect's product.

Walk the funnel: 96 candidate findings from the fan-out, 61 after the merge deduplicated problems recurring across flows, 54 confirmed at the severity pass with 7 rejected because the evidence did not support the claim. The confirmed set included three severity-4 task blockers — two in the send-money flow, including fees that appeared only on the receipt screen in violation of a project heuristic, and a failed transfer that showed a raw error code with no recovery path — plus an onboarding step that silently timed out and discarded entered data.

The most useful output was not the blockers, which the team half-knew about. It was the heuristic-level pattern: nineteen of the fifty-four confirmed findings referenced consistency, and almost all traced back to three different button and confirmation patterns that had accumulated over four years. That cluster gave the redesign brief a consolidation workstream it would not otherwise have had — and it is exactly the kind of cross-flow pattern a manual evaluation of three flows would never reveal. The archived run became the baseline the redesigned flows were later measured against, which is what makes a one-off audit compound over time.
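The funnel numbers from this run are worth checking as arithmetic, because the same three ratios are useful to track across future runs. The figures below are the ones reported for this traced run; the variable names are illustrative, and the derived percentages describe this run only, not a benchmark.

```python
candidates = 96   # findings returned by the fan-out
merged = 61       # findings after deduplication
confirmed = 54    # findings confirmed at the severity pass
rejected = 7      # findings rejected at the severity pass
consistency = 19  # confirmed findings that cite NN-04

# The severity pass accounts for every merged finding.
assert confirmed + rejected == merged

collapsed_by_dedupe = candidates - merged
rejection_rate = rejected / merged
consistency_share = consistency / confirmed

print(collapsed_by_dedupe)         # 35
print(f"{rejection_rate:.1%}")     # 11.5%
print(f"{consistency_share:.1%}")  # 35.2%
```

Roughly a third of the confirmed findings sitting under one heuristic is what made the consistency cluster visible as a single workstream.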

Narration for this slide

Let's trace one real run, from the workflow this module is built on. A fintech team, seven flows, about to start a redesign. They wrote the heuristics file — Nielsen's ten plus five project rules — captured screenshots per flow, and ran one evaluator per flow. Ninety-six candidate findings came back. The merge deduped that to sixty-one. The severity pass confirmed fifty-four and rejected seven whose evidence did not hold up. Three were genuine blockers, and the team half-knew about those already. The finding that mattered was the pattern: nineteen consistency violations, almost all traceable to three competing button and confirmation patterns built up over four years. That became a consolidation workstream in the redesign brief — something no flow-by-flow review would have surfaced — and the whole run became the baseline the redesign was later measured against.

Slide 11 of 13 · 16:9

What these methods cannot prove

Both methods predict problems from principles. Neither observes real people, and the agents add caveats of their own.

  • Neither measures frequency or impact — findings are hypotheses with evidence attached, not observed behaviour
  • Neither replaces usability testing; the walkthrough's best output is often a sharper test plan
  • Walkthrough verdicts are an evaluator's predictions under a method, not a simulation of users
  • Agents over-report near-duplicates, stretch heuristics to cover preferences, and assign severity with unearned confidence if allowed
  • Framing stays human: which flows count, whether a project heuristic is a real commitment, and what a client or executive sees

Honesty about what the run cannot claim is part of the deliverable — the method page in an audit often does more work than the findings.

Slide notes

Close the methods section with the limits, because the credibility of everything else depends on stating them unprompted. Heuristic evaluation and cognitive walkthroughs are inspection methods: they predict problems from principles and structured questions. They cannot tell you how often a problem occurs, how much it costs, or whether real users actually stumble where the method says they should. Usability testing and analytics are how predictions get confirmed, and the workflows are explicit that the walkthrough in particular pairs naturally with a usability test booked before the report is written.

The agent-specific failure modes are worth naming because the severity pass is designed around them: over-reporting near-duplicates, stretching a heuristic to cover an aesthetic preference, and assigning severity with unearned confidence when the prompt does not forbid it. None of these is fatal; all of them are why severity is structurally a human decision and why the rejection rate at the gate is information.

The last bullet is the one that protects teams politically. Which flows count as key, whether a project heuristic reflects a genuine commitment or an aspiration, and what gets shown to a client or an executive are judgment calls the evaluation cannot make. The agency case in the workflow is instructive: the page that explained the method, what was and was not evaluated, and whose judgment the severities reflect did more work in the pitch than the findings did. Honesty about limits is not hedging — it is what makes the rest of the report believable.

Narration for this slide

Before the exercise, the limits — because stating them is part of the deliverable. Both methods predict problems; neither observes them. They cannot tell you how often something happens or how much it costs, and they do not replace usability testing — in fact the walkthrough's best output is often a sharper test plan. The agents add their own caveats: they over-report near-duplicates, they sometimes stretch a heuristic to cover a preference, and they will rate severity with unearned confidence if you let them. That is exactly why the human gate exists. And the framing stays yours: which flows count, which heuristics are real commitments, and what a client or an executive actually sees. Saying all of this out loud in the report is what makes the findings worth believing.

Slide 12 of 13 · 16:9

Exercise: three tasks and a one-screen walkthrough

Set up the smallest honest version of both methods for your own product. Budget about thirty minutes; the point is the setup, not the run.

  • Write three user tasks worth walking, each with a starting point and the correct action sequence
  • Define one persona as a knowledge profile: what they know about the domain, the product, and what they have already done
  • Pick one screen from the first task and answer the four questions for it yourself, by hand, with a screenshot
  • Draft your project heuristics: three to five rules that are true for your product but not for products in general, each with an ID and an example
  • Note which output the full run would feed: fix backlog, redesign brief, test plan, or baseline

Doing one screen by hand is what calibrates you to review five hundred agent-evaluated steps later. Keep the page — it becomes the contract for your first scaled run.

Slide notes

The exercise deliberately stays manual. Everything produced here (the task list, the persona, the project heuristics, the destination decision) is the human-authored contract from the left side of the diagram, and it transfers directly into a scaled run when the participant is ready. Steer people towards tasks that are real but bounded: things new users must do unaided or things that generate support tickets, not "redesign our app".

The one-screen manual walkthrough is the most important step. Answering the four questions by hand for a single screen takes about ten minutes and does two things: it surfaces how much discipline the questions actually demand — most people discover they want to skip Q1 because it feels obvious — and it calibrates the participant for reviewing agent output later. Someone who has produced one careful per-step record can tell at a glance whether an agent's record is grounded in the screen or generated from vibes.
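
One way to keep the hand-written record in the same shape an agent would later return is to write it as a small structure. The sketch below is an assumption about what such a record could look like, not the workflow's format; the class, field names, task, and filename are all hypothetical, and the four questions are paraphrased from this module.

```python
from dataclasses import dataclass

# The four walkthrough questions, paraphrased from this module.
QUESTIONS = (
    "Q1: Will the user try to achieve the right effect?",
    "Q2: Will the user notice that the correct action is available?",
    "Q3: Will the user connect the action to their goal?",
    "Q4: Will the user see that progress was made?",
)


@dataclass
class StepRecord:
    """One walkthrough step for one persona (hypothetical shape)."""
    task: str
    step: int
    screenshot: str  # evidence: the file the answers refer to
    answers: tuple[bool, bool, bool, bool]  # one answer per question, in order
    notes: str = ""

    def failed_questions(self) -> list[str]:
        return [q for q, ok in zip(QUESTIONS, self.answers) if not ok]

    def is_failure_point(self) -> bool:
        # Any "no" makes the step a failure point.
        return not all(self.answers)


record = StepRecord(
    task="Send money to a new payee",  # hypothetical task
    step=3,
    screenshot="send-money-03.png",    # hypothetical filename
    answers=(True, True, False, True),
    notes="The label uses an internal term this persona has not seen.",
)
print(record.is_failure_point())
print(record.failed_questions())
```

Requiring all four answers, including Q1, is the point of the structure: a record cannot be completed by skipping the question that feels obvious.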

The project heuristics step is where the exercise connects back to Module 1's named dimensions: the rules that are true for this product specifically, written down with an ID and an example so they can be cited and checked. Common discoveries are that the team has strong opinions it has never written anywhere, and that some "rules" turn out to be aspirations the product violates everywhere — both useful findings before any agent runs. Ask participants to keep the page; it becomes the inputs file for the first scaled evaluation, and Module 5 will assume those criteria exist when wiring review into pull requests.
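
If the page is going to become an inputs file, it helps to check the draft for the things that make a rule citable: a unique ID and an example. A minimal sketch, assuming a simple entry shape; the `Heuristic` class, the `PH-` prefix, and the three sample rules are hypothetical illustrations, not the school's file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heuristic:
    """One project heuristic (hypothetical shape)."""
    id: str
    rule: str
    example: str


def check_heuristics(entries: list[Heuristic]) -> list[str]:
    """Return problems that would make the entries hard to cite or check."""
    problems = []
    seen = set()
    for h in entries:
        label = h.id.strip() or "(no ID)"
        if not h.id.strip():
            problems.append(f"{label}: missing ID")
        elif h.id in seen:
            problems.append(f"{label}: duplicate ID")
        else:
            seen.add(h.id)
        if not h.example.strip():
            problems.append(f"{label}: missing example")
    # The exercise asks for three to five project heuristics.
    if not 3 <= len(entries) <= 5:
        problems.append(f"expected 3 to 5 entries, got {len(entries)}")
    return problems


# Hypothetical draft with two deliberate mistakes.
draft = [
    Heuristic("PH-01", "Fees are shown before the user confirms a transfer.",
              "The review screen lists the fee above the confirm button."),
    Heuristic("PH-02", "Every failed action offers a way to recover.", ""),
    Heuristic("PH-01", "Destructive actions ask for confirmation.",
              "Deleting a payee opens a confirmation dialog."),
]
for problem in check_heuristics(draft):
    print(problem)
```

The check says nothing about whether a rule is a real commitment or an aspiration; that judgment stays with the team.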

Narration for this slide

Your turn, and you do not need an agent for any of it. Write three tasks worth walking — real ones, with a starting point and the correct sequence of actions. Define one persona as a knowledge profile: what they know, what they have already done. Then pick one screen from the first task and answer the four questions yourself, by hand, with a screenshot in front of you. It takes ten minutes and it calibrates you for reviewing agent output later. Draft three to five project heuristics — rules true for your product specifically, each with an ID and an example. And decide what a full run would feed: a backlog, a brief, a test plan, or a baseline. Keep the page. It is the contract for your first scaled run.

Slide 13 of 13 · 16:9

Summary, and the bridge to regression evidence

  • Heuristic evaluation and cognitive walkthroughs were rationed by cost, not discredited — agents change the economics, not the logic
  • The contract is written once: a heuristics file with IDs, or tasks, knowledge personas, and the four questions
  • Every finding carries a criterion, evidence, a location, and a candidate severity — humans confirm, adjust, and reject
  • Triage by deduping, grouping, and ranking; the clusters are usually worth more than any single finding
  • Both methods predict problems; usability testing confirms them, and the framing stays human

Module 3 swaps principles for pixels: screenshot baselines and regression sweeps, so visual review runs on evidence instead of memory.

Slide notes

Recap by walking the pipeline rather than the bullets in isolation: the human-authored contract on the left, the fan-out and merge in the middle, the severity gate on the right, and a named destination for the confirmed findings. The two methods differ in what they ask — principles per screen versus four questions per task step — but they share the same shape, the same evidence rules, and the same division of labour, which is why one diagram covered both.
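
The shared shape can be sketched as a finding record plus a merge step. This is a simplified illustration under stated assumptions: the class and field names are hypothetical, the sample findings are invented, and duplicates are matched on an exact (criterion, problem) key, whereas a merge agent would also have to judge near-duplicates.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    criterion: str           # heuristic ID or the walkthrough question that failed
    problem: str             # short normalised description, used as the merge key
    location: str            # flow or screen where the problem was seen
    evidence: str            # screenshot filename or a reproducible action
    candidate_severity: int  # agent's guess on the agreed scale; a human confirms it


@dataclass
class MergedFinding:
    criterion: str
    problem: str
    candidate_severity: int
    locations: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)


def merge_and_rank(findings: list[Finding]) -> list[MergedFinding]:
    """Collapse recurring problems into one finding with several locations,
    then rank by candidate severity and, within a severity, by frequency."""
    merged: dict[tuple[str, str], MergedFinding] = {}
    for f in findings:
        key = (f.criterion, f.problem)
        if key not in merged:
            merged[key] = MergedFinding(f.criterion, f.problem, f.candidate_severity)
        m = merged[key]
        # Keep the highest severity any evaluator proposed for this problem.
        m.candidate_severity = max(m.candidate_severity, f.candidate_severity)
        m.locations.append(f.location)
        m.evidence.append(f.evidence)
    return sorted(merged.values(),
                  key=lambda m: (-m.candidate_severity, -len(m.locations)))


# Invented sample findings; filenames and the NN-09 ID are hypothetical.
candidates = [
    Finding("NN-04", "primary button style differs", "send-money", "send-money-02.png", 2),
    Finding("NN-04", "primary button style differs", "onboarding", "onboarding-05.png", 3),
    Finding("NN-09", "raw error code shown", "send-money", "send-money-07.png", 4),
]
for m in merge_and_rank(candidates):
    print(m.candidate_severity, m.criterion, m.problem, m.locations)
```

The output of this step is still a list of candidates. Nothing here confirms a severity; that happens at the human gate.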

If participants did the exercise, the page they produced is the left-hand column of that diagram for their own product. The natural next step is to run the smallest scaled version: three flows or three tasks against one persona, with the severity pass done properly, before scaling to the whole surface. The workflows referenced throughout this module (heuristic evaluation at scale and cognitive walkthrough at scale) contain the prompts, agent definitions, and traced case studies to copy from.

Preview Module 3 concretely. This module evaluated the product against principles and tasks; the next one evaluates it against itself over time. Screenshot baselines for the surfaces that matter, agent-run capture sweeps across states and breakpoints, and diff reports that distinguish genuine regressions from intended change — so the answer to "did anything break visually" stops depending on whoever last looked at the screen and starts depending on evidence.

Narration for this slide

Let's close. The methods were never the problem — the cost was. Agents change that economics: the same heuristics, the same four questions, applied to every flow, every task, every persona, with evidence attached to every finding. Your work moves to the contract — the heuristics file, the tasks, the personas — and to the gate, where you confirm severities, reject what does not hold up, and decide what the findings feed. Triage turns hundreds of candidates into a handful of decisions, and the clusters usually matter more than any single finding. And both methods predict; testing with real users confirms. In Module 3 we move from principles to pixels: screenshot baselines and regression sweeps, so visual review runs on evidence instead of memory. See you there.

Module transcript
Module 2, narrated slide by slide

Slide 1: Heuristic Evaluation and Cognitive Walkthroughs at Scale

Welcome back. In Module 1 we built the critique loop — named dimensions, evidence, the human making the calls. This module points that loop at two methods that have been in the UX toolkit for over thirty years: heuristic evaluation and the cognitive walkthrough. The methods were never wrong. They were just expensive — expensive enough that most teams run them once, on one flow, and quietly skip the rest. Agents change that economics. The criteria stay the same, the four questions stay the same, but coverage stops being the limiting factor. What you have to get right instead is the setup, the evidence rules, and the triage. That is this module.

Slide 2: Heuristic evaluation: a refresher

Quick refresher first. Heuristic evaluation is an evaluator walking the interface against a short list of usability principles — Nielsen's ten are the standard set — and writing down where the interface breaks them, with evidence. It needs no participants, it is cheap, and it catches a real share of genuine problems, especially consistency and feedback failures. It also has honest limits: it predicts problems from principles, it does not observe them happening, and it cannot tell you how often or how badly. Its biggest practical weakness has always been coverage. Done by hand, it gets applied to the three flows everyone worries about, and the rest of the product never gets looked at.

Slide 3: Scaling it: every flow, the same contract

Here is the scaled version. You write the heuristics file once: Nielsen's ten, plus five to eight project heuristics — the rules that are true for your product specifically, each with an ID, a definition, and an example. You capture evidence up front: ordered screenshots for every flow, plus access to the staging build for the states screenshots miss. Then one evaluator agent per flow, all reading the same file, all bound by the same rule — every finding cites a heuristic ID and a piece of evidence. A merge agent dedupes the problems that recur across flows and ranks the result. And every severity is labelled a candidate, because confirming severity is your job, not the agent's.

Slide 4: The shape of an evaluation at scale

This diagram is the whole module on one screen, and it covers both methods. On the left, the human work: deciding what surface gets evaluated, and writing the shared contract — the heuristics file, or the tasks, personas, and four questions. In the middle, the agent work: a fan-out of evaluators, one per flow or per task and persona, each returning findings with evidence, and then a single merge that dedupes, groups, and ranks. On the right, the human gate: the severity pass, where you open the evidence, confirm or adjust severities, and reject what does not hold up. Only then does anything feed a backlog, a brief, or a baseline. Fan-out buys coverage, merge buys readability, and the gate buys trust.

Slide 5: Cognitive walkthroughs: four questions at every step

The second method is the cognitive walkthrough, and it evaluates tasks, not screens. You define who is doing the task and what they already know, and at every step you ask four questions. Will they try to achieve the right effect — do they even know this step is needed? Will they notice the correct action is available? Will they connect the action to their goal — does the label mean to them what it means to the team? And if they do it, will they see that progress was made? Any no makes that step a failure point, with a screenshot and a severity guess. One thing to say plainly: the agent is simulating an evaluator running a method, not simulating a user. These are predictions. Usability testing is how you confirm the ones that matter.

Slide 6: Tasks and personas: the anchors of a walkthrough

Two inputs anchor a walkthrough, and both are yours to write. First, the task list — each task with its starting point and the correct sequence of actions. That sequence is the answer key, and just writing it down often surfaces the first findings, because the team discovers it cannot agree on what the correct path is. Second, the personas — and for this method a persona is a knowledge profile, not a poster. What do they know about the domain, about this product, about the conventions the interface assumes? Two or three is enough. Then it is one agent per task and persona combination. And the most useful findings are the differences: when the novice fails a step the expert sails through, you have located a learnability problem precisely.

Slide 7: Severity, confidence, and evidence on every finding

Every finding, from either method, has to clear the same bar. It names the criterion — a heuristic ID or the walkthrough question that failed. It cites evidence — a screenshot filename or an action in the build that anyone can reproduce. Its severity is a candidate on a scale you agreed in advance, not a fact. If it recurs, it is one finding with multiple locations, not five copies. And the suggested fix is a sentence or two inside your existing design system, not a call to redesign the navigation. The fastest quality check is brutal and simple: criterion plus evidence. A finding that cannot name both is either a judgment to label as such, or a guess to send back.

Slide 8: Triage: from hundreds of findings to a workable queue

Here is the new problem scale creates: the run works, and you get back ninety-six findings. A list that long is not a plan — it is a way to demoralise a team. So triage. The merge stage dedupes recurring problems into single findings with multiple locations, groups them by heuristic or by screen, and ranks by candidate severity and then frequency. The groups are usually worth more than the individual findings — nineteen consistency violations that all trace back to three competing button patterns is one consolidation effort, not nineteen tickets. Make sure the report also says what was checked and came back clean. And decide before the run where the output goes: a backlog, a redesign brief, a test plan, or a baseline. Findings without a destination just become a long document.

Slide 9: The human evaluation pass: validate a sample, not everything

After the merge comes the human pass, and the key is to size it correctly: it is a gate, not a second audit. Open the evidence behind every high-severity candidate, because those are the findings that will drive decisions. Sample the middle and the low end rather than re-reading everything. Reject findings whose evidence does not hold up when you actually open the screenshot — and expect to reject some. That rejection rate is information: it tells you how the agents are misfiring and it proves the review happened. A pass that confirms every single candidate is a rubber stamp. Once you have confirmed severities, the report stops being agent output and becomes the team's assessment, with your name on it. That is the point of the gate.

Slide 10: Worked example: a 7-flow evaluation before a redesign

Let's trace one real run, from the workflow this module is built on. A fintech team, seven flows, about to start a redesign. They wrote the heuristics file — Nielsen's ten plus five project rules — captured screenshots per flow, and ran one evaluator per flow. Ninety-six candidate findings came back. The merge deduped that to sixty-one. The severity pass confirmed fifty-four and rejected seven whose evidence did not hold up. Three were genuine blockers, and the team half-knew about those already. The finding that mattered was the pattern: nineteen consistency violations, almost all traceable to three competing button and confirmation patterns built up over four years. That became a consolidation workstream in the redesign brief — something no flow-by-flow review would have surfaced — and the whole run became the baseline the redesign was later measured against.

Slide 11: What these methods cannot prove

Before the exercise, the limits — because stating them is part of the deliverable. Both methods predict problems; neither observes them. They cannot tell you how often something happens or how much it costs, and they do not replace usability testing — in fact the walkthrough's best output is often a sharper test plan. The agents add their own caveats: they over-report near-duplicates, they sometimes stretch a heuristic to cover a preference, and they will rate severity with unearned confidence if you let them. That is exactly why the human gate exists. And the framing stays yours: which flows count, which heuristics are real commitments, and what a client or an executive actually sees. Saying all of this out loud in the report is what makes the findings worth believing.

Slide 12: Exercise: three tasks and a one-screen walkthrough

Your turn, and you do not need an agent for any of it. Write three tasks worth walking — real ones, with a starting point and the correct sequence of actions. Define one persona as a knowledge profile: what they know, what they have already done. Then pick one screen from the first task and answer the four questions yourself, by hand, with a screenshot in front of you. It takes ten minutes and it calibrates you for reviewing agent output later. Draft three to five project heuristics — rules true for your product specifically, each with an ID and an example. And decide what a full run would feed: a backlog, a brief, a test plan, or a baseline. Keep the page. It is the contract for your first scaled run.

Slide 13: Summary, and the bridge to regression evidence

Let's close. The methods were never the problem — the cost was. Agents change that economics: the same heuristics, the same four questions, applied to every flow, every task, every persona, with evidence attached to every finding. Your work moves to the contract — the heuristics file, the tasks, the personas — and to the gate, where you confirm severities, reject what does not hold up, and decide what the findings feed. Triage turns hundreds of candidates into a handful of decisions, and the clusters usually matter more than any single finding. And both methods predict; testing with real users confirms. In Module 3 we move from principles to pixels: screenshot baselines and regression sweeps, so visual review runs on evidence instead of memory. See you there.