Slide 1 — Heuristic Evaluation and Cognitive Walkthroughs at Scale
Welcome back. In Module 1 we built the critique loop: named dimensions, evidence, the human making the calls. This module points that loop at two methods that have been in the UX toolkit for over thirty years: heuristic evaluation and the cognitive walkthrough. The methods were never wrong. They were just expensive, expensive enough that most teams run them once, on one flow, and quietly skip the rest. Agents change those economics. The criteria stay the same, the four questions stay the same, but coverage stops being the limiting factor. What you have to get right instead is the setup, the evidence rules, and the triage. That is this module.
Slide 2 — Heuristic evaluation: a refresher
Quick refresher first. Heuristic evaluation is an evaluator walking the interface against a short list of usability principles — Nielsen's ten are the standard set — and writing down where the interface breaks them, with evidence. It needs no participants, it is cheap, and it catches a real share of genuine problems, especially consistency and feedback failures. It also has honest limits: it predicts problems from principles, it does not observe them happening, and it cannot tell you how often or how badly. Its biggest practical weakness has always been coverage. Done by hand, it gets applied to the three flows everyone worries about, and the rest of the product never gets looked at.
Slide 3 — Scaling it: every flow, the same contract
Here is the scaled version. You write the heuristics file once: Nielsen's ten, plus five to eight project heuristics — the rules that are true for your product specifically, each with an ID, a definition, and an example. You capture evidence up front: ordered screenshots for every flow, plus access to the staging build for the states screenshots miss. Then one evaluator agent per flow, all reading the same file, all bound by the same rule — every finding cites a heuristic ID and a piece of evidence. A merge agent dedupes the problems that recur across flows and ranks the result. And every severity is labelled a candidate, because confirming severity is your job, not the agent's.
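If it helps to see the heuristics file as data, here is a minimal Python sketch. The ID scheme ("N" for Nielsen, "P" for project rules), the field names, and the project rule itself are assumptions made for illustration, not a required format. The only things taken from the method are that each heuristic carries an ID, a definition, and an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heuristic:
    id: str          # hypothetical scheme: "N4" = Nielsen #4, "P1" = project rule 1
    definition: str
    example: str


def validate_heuristics(heuristics):
    """Return a list of problems; an empty list means the file is usable."""
    problems = []
    seen = set()
    for h in heuristics:
        if h.id in seen:
            problems.append(f"duplicate id: {h.id}")
        seen.add(h.id)
        if not h.definition.strip():
            problems.append(f"{h.id}: missing definition")
        if not h.example.strip():
            problems.append(f"{h.id}: missing example")
    return problems


# Two of Nielsen's ten plus one invented project rule, as a sample.
HEURISTICS = [
    Heuristic("N1", "Visibility of system status",
              "A spinner appears while a payment is processing."),
    Heuristic("N4", "Consistency and standards",
              "The same action uses the same label on every screen."),
    Heuristic("P1", "Destructive actions ask for confirmation (hypothetical rule)",
              "Deleting a payee shows a confirm dialog."),
]

print(validate_heuristics(HEURISTICS))
```

Running a check like this before the fan-out is cheap, and it means every evaluator agent is reading a file where each ID resolves to exactly one rule.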
Slide 4 — The shape of an evaluation at scale
This diagram is the whole module on one screen, and it covers both methods. On the left, the human work: deciding what surface gets evaluated, and writing the shared contract — the heuristics file, or the tasks, personas, and four questions. In the middle, the agent work: a fan-out of evaluators, one per flow or per task and persona, each returning findings with evidence, and then a single merge that dedupes, groups, and ranks. On the right, the human gate: the severity pass, where you open the evidence, confirm or adjust severities, and reject what does not hold up. Only then does anything feed a backlog, a brief, or a baseline. Fan-out buys coverage, merge buys readability, and the gate buys trust.
Slide 5 — Cognitive walkthroughs: four questions at every step
The second method is the cognitive walkthrough, and it evaluates tasks, not screens. You define who is doing the task and what they already know, and at every step you ask four questions. Will they try to achieve the right effect — do they even know this step is needed? Will they notice the correct action is available? Will they connect the action to their goal — does the label mean to them what it means to the team? And if they do it, will they see that progress was made? Any no makes that step a failure point, with a screenshot and a severity guess. One thing to say plainly: the agent is simulating an evaluator running a method, not simulating a user. These are predictions. Usability testing is how you confirm the ones that matter.
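The four questions map naturally onto a small record per step. The sketch below is one possible shape, written in Python; the class name, the field names, and the screenshot filename are hypothetical. It records the four answers in order and treats any "no" as a failure point, as described above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The four walkthrough questions, in the order they are asked at each step.
QUESTIONS = (
    "Will they try to achieve the right effect?",
    "Will they notice the correct action is available?",
    "Will they connect the action to their goal?",
    "Will they see that progress was made?",
)


@dataclass
class StepResult:
    step: str
    answers: Tuple[bool, bool, bool, bool]  # one answer per question, in order
    screenshot: str                         # evidence for this step
    severity_guess: Optional[str] = None    # a candidate, not a confirmed value

    def failed_questions(self):
        """Return the text of every question answered 'no'."""
        return [q for q, ok in zip(QUESTIONS, self.answers) if not ok]

    def is_failure_point(self):
        """Any single 'no' makes the step a failure point."""
        return len(self.failed_questions()) > 0


result = StepResult(
    step="Tap 'Add payee'",
    answers=(True, False, True, True),
    screenshot="transfer-03.png",  # hypothetical filename
    severity_guess="medium",
)
print(result.is_failure_point(), result.failed_questions())
```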
Slide 6 — Tasks and personas: the anchors of a walkthrough
Two inputs anchor a walkthrough, and both are yours to write. First, the task list — each task with its starting point and the correct sequence of actions. That sequence is the answer key, and just writing it down often surfaces the first findings, because the team discovers it cannot agree on what the correct path is. Second, the personas — and for this method a persona is a knowledge profile, not a poster. What do they know about the domain, about this product, about the conventions the interface assumes? Two or three is enough. Then it is one agent per task and persona combination. And the most useful findings are the differences: when the novice fails a step the expert sails through, you have located a learnability problem precisely.
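To make the fan-out and the "differences" idea concrete, here is a small Python sketch. It assumes results arrive as a mapping from (task, persona, step) to pass or fail; that shape, and the task and persona names, are invented for the example. It builds one run per task and persona combination, then lists the steps a novice fails and an expert passes.

```python
from itertools import product


def plan_runs(tasks, personas):
    """One walkthrough run per task and persona combination."""
    return list(product(tasks, personas))


def learnability_gaps(results, novice, expert):
    """Steps the novice fails while the expert passes.

    results maps (task, persona, step) -> True if the step passed.
    """
    gaps = []
    for (task, persona, step), passed in results.items():
        if persona != novice or passed:
            continue
        if results.get((task, expert, step)) is True:
            gaps.append((task, step))
    return sorted(gaps)


runs = plan_runs(["send money", "add payee"], ["novice", "expert"])

results = {
    ("send money", "novice", "confirm"): False,
    ("send money", "expert", "confirm"): True,   # expert passes: a gap
    ("add payee", "novice", "open form"): False,
    ("add payee", "expert", "open form"): False,  # both fail: not a gap
}
print(len(runs), learnability_gaps(results, "novice", "expert"))
```

A step that both personas fail is still a finding, just not a learnability one; the comparison only isolates the problems that prior knowledge hides.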
Slide 7 — Severity, confidence, and evidence on every finding
Every finding, from either method, has to clear the same bar. It names the criterion — a heuristic ID or the walkthrough question that failed. It cites evidence — a screenshot filename or an action in the build that anyone can reproduce. Its severity is a candidate on a scale you agreed in advance, not a fact. If it recurs, it is one finding with multiple locations, not five copies. And the suggested fix is a sentence or two inside your existing design system, not a call to redesign the navigation. The fastest quality check is brutal and simple: criterion plus evidence. A finding that cannot name both is either a judgment to label as such, or a guess to send back.
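The criterion-plus-evidence check is simple enough to automate before a human reads anything. The Python sketch below is one way to do it; the `Finding` fields, the return labels, and the sample values are assumptions for illustration. A finding that names both passes; anything else is flagged for labelling as a judgment or sending back as a guess.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    criterion: str                 # heuristic ID or the walkthrough question that failed
    evidence: List[str]            # screenshot filenames or reproducible actions
    candidate_severity: int        # a candidate on the agreed scale, not a fact
    locations: List[str] = field(default_factory=list)
    suggested_fix: str = ""


def quality_check(finding):
    """Return 'pass' only if the finding names a criterion and cites evidence."""
    has_criterion = bool(finding.criterion.strip())
    has_evidence = any(e.strip() for e in finding.evidence)
    if has_criterion and has_evidence:
        return "pass"
    return "label as judgment or send back"


good = Finding("N4", ["payees-02.png"], 2, ["payees", "transfer"])
vague = Finding("N4", [], 3)
print(quality_check(good), "|", quality_check(vague))
```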
Slide 8 — Triage: from hundreds of findings to a workable queue
Here is the new problem scale creates: the run works, and you get back ninety-six findings. A list that long is not a plan — it is a way to demoralise a team. So triage. The merge stage dedupes recurring problems into single findings with multiple locations, groups them by heuristic or by screen, and ranks by candidate severity and then frequency. The groups are usually worth more than the individual findings — nineteen consistency violations that all trace back to three competing button patterns is one consolidation effort, not nineteen tickets. Make sure the report also says what was checked and came back clean. And decide before the run where the output goes: a backlog, a redesign brief, a test plan, or a baseline. Findings without a destination just become a long document.
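The merge logic can be sketched in a few lines of Python. Two assumptions are worth saying out loud: this version treats findings as duplicates only when the heuristic and the problem text match exactly, where a real merge agent would have to judge near-duplicates, and it assumes a numeric severity scale where higher means worse. The dictionary keys and sample data are invented for the example.

```python
from collections import defaultdict


def merge_findings(raw):
    """Dedupe by (heuristic, problem), collect locations, rank the result.

    Ranking is by candidate severity first, then by number of locations.
    """
    merged = {}
    for f in raw:
        key = (f["heuristic"], f["problem"])
        entry = merged.setdefault(key, {
            "heuristic": f["heuristic"],
            "problem": f["problem"],
            "locations": [],
            "severity": f["severity"],
        })
        if f["location"] not in entry["locations"]:
            entry["locations"].append(f["location"])
        # Keep the highest candidate severity seen for this problem.
        entry["severity"] = max(entry["severity"], f["severity"])
    return sorted(merged.values(),
                  key=lambda e: (-e["severity"], -len(e["locations"])))


def group_by_heuristic(findings):
    """Group merged findings so clusters are visible at a glance."""
    groups = defaultdict(list)
    for f in findings:
        groups[f["heuristic"]].append(f)
    return dict(groups)


raw = [
    {"heuristic": "N4", "problem": "competing button styles", "location": "transfer", "severity": 2},
    {"heuristic": "N4", "problem": "competing button styles", "location": "payees", "severity": 2},
    {"heuristic": "N1", "problem": "no progress feedback", "location": "transfer", "severity": 3},
    {"heuristic": "N4", "problem": "inconsistent date format", "location": "history", "severity": 2},
]
ranked = merge_findings(raw)
groups = group_by_heuristic(ranked)
print([f["problem"] for f in ranked], {k: len(v) for k, v in groups.items()})
```

The grouping step is what surfaces clusters like the consistency example: several findings under one heuristic is a prompt to look for a shared cause.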
Slide 9 — The human evaluation pass: validate a sample, not everything
After the merge comes the human pass, and the key is to size it correctly: it is a gate, not a second audit. Open the evidence behind every high-severity candidate, because those are the findings that will drive decisions. Sample the middle and the low end rather than re-reading everything. Reject findings whose evidence does not hold up when you actually open the screenshot — and expect to reject some. That rejection rate is information: it tells you how the agents are misfiring and it proves the review happened. A pass that confirms every single candidate is a rubber stamp. Once you have confirmed severities, the report stops being agent output and becomes the team's assessment, with your name on it. That is the point of the gate.
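Here is one way to size the pass in Python. The threshold for "high severity", the 25 percent sample rate, and the fixed random seed are all assumptions chosen for the example; the method only says to open every high-severity candidate and sample the rest. The second function computes the rejection rate from the pass.

```python
import random


def review_queue(findings, high_threshold=3, sample_rate=0.25, seed=0):
    """Every high-severity candidate, plus a random sample of the rest."""
    high = [f for f in findings if f["severity"] >= high_threshold]
    rest = [f for f in findings if f["severity"] < high_threshold]
    if not rest:
        return high
    # Always sample at least one lower-severity finding.
    k = min(len(rest), max(1, round(len(rest) * sample_rate)))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return high + rng.sample(rest, k)


def rejection_rate(reviewed, rejected):
    """Share of reviewed findings whose evidence did not hold up."""
    return rejected / reviewed if reviewed else 0.0


severities = [3, 3, 1, 1, 1, 1, 2, 2, 2, 2]
findings = [{"id": i, "severity": s} for i, s in enumerate(severities)]
queue = review_queue(findings)
print(len(queue), rejection_rate(reviewed=len(queue), rejected=1))
```

A rejection rate of zero across a run is worth a second look for the reason given above: it may mean the review did not actually open the evidence.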
Slide 10 — Worked example: a 7-flow evaluation before a redesign
Let's trace one real run, from the workflow this module is built on. A fintech team, seven flows, about to start a redesign. They wrote the heuristics file — Nielsen's ten plus five project rules — captured screenshots per flow, and ran one evaluator per flow. Ninety-six candidate findings came back. The merge deduped that to sixty-one. The severity pass confirmed fifty-four and rejected seven whose evidence did not hold up. Three were genuine blockers, and the team half-knew about those already. The finding that mattered was the pattern: nineteen consistency violations, almost all traceable to three competing button and confirmation patterns built up over four years. That became a consolidation workstream in the redesign brief — something no flow-by-flow review would have surfaced — and the whole run became the baseline the redesign was later measured against.
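The numbers in that run are easy to sanity-check, and the check is worth keeping next to the report. The short Python snippet below uses only the counts stated above; the variable names are mine, and computing the rejection rate against the merged total assumes every merged finding went through the severity pass, which is what the confirmed and rejected counts imply.

```python
raw_findings = 96   # candidates returned by the seven evaluators
merged = 61         # after the merge deduped recurring problems
confirmed = 54      # severities confirmed in the human pass
rejected = 7        # evidence did not hold up

# The pass accounts for every merged finding.
assert confirmed + rejected == merged

duplicates_removed = raw_findings - merged
rejection_rate = rejected / merged

print(f"duplicates removed: {duplicates_removed}")
print(f"rejection rate: {rejection_rate:.1%}")
```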
Slide 11 — What these methods cannot prove
Before the exercise, the limits — because stating them is part of the deliverable. Both methods predict problems; neither observes them. They cannot tell you how often something happens or how much it costs, and they do not replace usability testing — in fact the walkthrough's best output is often a sharper test plan. The agents add their own caveats: they over-report near-duplicates, they sometimes stretch a heuristic to cover a preference, and they will rate severity with unearned confidence if you let them. That is exactly why the human gate exists. And the framing stays yours: which flows count, which heuristics are real commitments, and what a client or an executive actually sees. Saying all of this out loud in the report is what makes the findings worth believing.
Slide 12 — Exercise: three tasks and a one-screen walkthrough
Your turn, and you do not need an agent for any of it. Write three tasks worth walking — real ones, with a starting point and the correct sequence of actions. Define one persona as a knowledge profile: what they know, what they have already done. Then pick one screen from the first task and answer the four questions yourself, by hand, with a screenshot in front of you. It takes ten minutes and it calibrates you for reviewing agent output later. Draft three to five project heuristics — rules true for your product specifically, each with an ID and an example. And decide what a full run would feed: a backlog, a brief, a test plan, or a baseline. Keep the page. It is the contract for your first scaled run.
Slide 13 — Summary, and the bridge to regression evidence
Let's close. The methods were never the problem; the cost was. Agents change those economics: the same heuristics, the same four questions, applied to every flow, every task, every persona, with evidence attached to every finding. Your work moves to the contract (the heuristics file, the tasks, the personas) and to the gate, where you confirm severities, reject what does not hold up, and decide what the findings feed. Triage turns hundreds of candidates into a handful of decisions, and the clusters usually matter more than any single finding. And both methods predict; testing with real users confirms. In Module 3 we move from principles to pixels: screenshot baselines and regression sweeps, so visual review runs on evidence instead of memory. See you there.