Section 1
Why walkthroughs are underused
The cognitive walkthrough is one of the oldest inspection methods in usability practice: pick a task, define who is doing it and what they already know, and at every step ask whether the user will try the right action, notice it, connect it to their goal, and understand the feedback. It finds learnability problems early and cheaply, before any participant is recruited.
Teams use it less than they should because it is tedious in exactly the way agents are not. Doing it properly means stepping through every screen of every task for every persona, answering the same four questions each time, and writing down the evidence. Most teams do it once for one task, in a meeting, from memory of the interface rather than from the interface itself.
This workflow runs the method at scale: many tasks, several personas, the four questions answered at every step, with screenshots attached. One thing must be said plainly before anything else — the agents are simulating an evaluator applying a method, not simulating users. The output is a structured set of hypotheses about where people will struggle, and the way to confirm those hypotheses is usability testing with actual users.
Section 2
When to reach for this workflow
Use it when learnability is the question: first-run experiences, onboarding flows, admin consoles where new staff must self-serve, and any flow you are about to put in front of usability test participants and want to walk through systematically first. It works on live products, staging environments, and clickable prototypes the agent can drive in a browser.
It is a Foundation-level workflow because the method carries the rigor: the four questions, asked at every step, do most of the work. What you need is a list of tasks worth walking, two or three honest persona definitions, and a running version of the interface.
- A first-run or onboarding flow before it ships or before a usability study.
- A product area with many similar tasks where coverage matters more than depth on one.
- Preparation for usability testing: the walkthrough chooses what to watch for.
- A regression check after a navigation or terminology change.
Section 3
The four questions, written down once
The method's value comes from answering the same four questions at every step, in order, without skipping the ones that feel obvious. Write them into a template the walkthrough agents must follow, so every step of every task gets the same treatment and the merge stage can compare like with like.
Each answer needs a yes or no, a one-sentence justification grounded in what is visible on the screen, and a screenshot reference. A walkthrough finding without the screen it refers to is an opinion that cannot be checked.
# Cognitive walkthrough — per-step record
Task: <task name>
Persona: <persona name and knowledge level>
Step N: <the correct action at this point>
Screenshot: <file captured at this step>
Q1. Will the user try to achieve the right effect?
Does the persona, with their knowledge, even know this step is needed?
Answer: yes / no — one sentence grounded in what the screen shows.
Q2. Will the user notice that the correct action is available?
Is the control visible without scrolling, hidden in a menu, or ambiguous?
Answer: yes / no — one sentence.
Q3. Will the user associate the correct action with the effect they want?
Does the label or icon mean to this persona what it means to the team?
Answer: yes / no — one sentence.
Q4. If the correct action is performed, will the user see that progress was made?
Is the feedback visible, immediate, and in the persona's language?
Answer: yes / no — one sentence.
Verdict: pass / failure point (any "no" makes this step a failure point).
Severity guess: blocks the task / causes hesitation or error / minor friction.
Section 4
Personas are knowledge profiles, not posters
For a walkthrough, a persona is mostly a statement of what the person already knows: their familiarity with the domain, with this product, and with the conventions the interface assumes. A novice persona who has never seen the product will fail steps an expert sails through, and that difference is the finding.
Two or three personas are usually enough: typically a domain novice, an experienced practitioner new to this product, and, where relevant, an existing power user. Write each as a short paragraph plus a list of things they do and do not know. Resist demographic decoration; the walkthrough questions never ask how old the persona is.
- Define each persona by what they know: domain knowledge, product familiarity, assumed conventions.
- Two or three personas per run; more multiplies cost faster than insight.
- State explicitly what the persona has already done (installed the CLI, received an invite email, holds an account).
- Reuse the same persona definitions across runs so results stay comparable.
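A knowledge-profile persona can be kept as plain data and rendered into the text a walkthrough agent reads. The sketch below is one possible shape, not a required format: the field names (`knows`, `doesNotKnow`, `alreadyDone`), the `renderPersona` helper, and the example entries are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical persona shape: knowledge and prior actions only, no demographics.
const novice = {
  name: "Domain novice",
  summary: "A developer who has never used a static analysis tool.",
  knows: ["git basics", "how to install a CLI"], // illustrative entries
  doesNotKnow: ["what a baseline is", "this product's navigation"],
  alreadyDone: ["installed the CLI", "holds an account"],
}

// Render the profile as the plain text a walkthrough agent would be given.
function renderPersona(p) {
  const list = (title, items) =>
    title + ":\n" + items.map((item) => "- " + item).join("\n")
  return [
    "Persona: " + p.name,
    p.summary,
    list("Knows", p.knows),
    list("Does not know", p.doesNotKnow),
    list("Has already done", p.alreadyDone),
  ].join("\n\n")
}

console.log(renderPersona(novice))
```

Keeping the profile as data makes it easy to reuse the same definition across runs, which is what keeps results comparable.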
Section 5
The orchestration pattern: one agent per task and persona
This runs as a Claude Code dynamic workflow: a JavaScript script that Claude writes and runs in the background, orchestrating subagents while intermediate results — every per-step record and screenshot path — stay in script variables rather than in Claude's own context. A nine-task, three-persona run produces twenty-seven walkthroughs and a few hundred step records; without that separation the later walkthroughs would be done from a context full of the earlier ones.
Each cell of the task-by-persona grid gets one walkthrough agent. The agent drives the interface through a browser connection (Playwright MCP or Chrome DevTools MCP), captures a screenshot at each step, and answers the four questions in the template before moving on. A merge agent then groups every failure point by screen and step, so a screen that breaks six different tasks shows up as one heavily flagged screen rather than six scattered notes.
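In the workflow this grouping is delegated to a merge agent, so there is no fixed merge code. As a rough illustration of the logic that agent is asked to perform, the sketch below groups failure points by screen and orders screens by how many task-persona cells they break. The failure-point shape and the `mergeByScreen` name are assumptions made for the example.

```javascript
// Hypothetical failure-point shape: { screen, step, task, persona, question }.
function mergeByScreen(failurePoints) {
  const byScreen = new Map()
  for (const fp of failurePoints) {
    if (!byScreen.has(fp.screen)) {
      byScreen.set(fp.screen, { screen: fp.screen, cells: new Set(), questions: {} })
    }
    const entry = byScreen.get(fp.screen)
    // One cell is one task-persona pair; a Set avoids counting a cell twice.
    entry.cells.add(fp.task + " | " + fp.persona)
    entry.questions[fp.question] = (entry.questions[fp.question] || 0) + 1
  }
  // Screens that break the most task-persona cells come first.
  return [...byScreen.values()]
    .map((e) => ({ screen: e.screen, cellsBroken: e.cells.size, questions: e.questions }))
    .sort((a, b) => b.cellsBroken - a.cellsBroken)
}

const merged = mergeByScreen([
  { screen: "permissions", step: 3, task: "assign role", persona: "new admin", question: "Q2" },
  { screen: "permissions", step: 5, task: "invite user", persona: "new admin", question: "Q4" },
  { screen: "permissions", step: 3, task: "assign role", persona: "office manager", question: "Q2" },
  { screen: "sso", step: 2, task: "configure SSO", persona: "office manager", question: "Q1" },
])
console.log(merged)
```

Counting distinct cells rather than raw failure points keeps one long task with many failing steps from outweighing a screen that breaks several different tasks.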
Workflows can run up to 16 agents concurrently and up to 1,000 in a run, and they are resumable if a long run is interrupted. The finished prompt can be saved to .claude/workflows/ in the project (or ~/.claude/workflows/ for personal use) as a reusable command like /walkthrough. You trigger a run by including the word workflow in the prompt or via /effort ultracode, and the walkthrough agent definition lives in .claude/agents/*.md so the four-question discipline travels with the project.
[Diagram: Define tasks → Define personas → Walk each task per persona → Capture screenshots per step → Merge failures by screen → Human review → Plan usability testing]
Tasks and personas form a grid; each cell is one walkthrough; failures merge by screen before humans review.
Section 6
The workflow prompt
Give the workflow the task list, the persona files, the question template, and the URL of the running product or prototype. The two rules that protect the output: every answer must be grounded in a captured screenshot, and the agent must never claim to know what real users will do — the verdicts are the evaluator's predictions under the method, nothing more.
Run this as a workflow.
Input: ./tasks.md lists 9 tasks, each with a starting point and the correct action sequence. ./personas/ contains 3 persona files describing knowledge levels. ./walkthrough-questions.md is the per-step template. The product runs at http://localhost:3000 with the test accounts listed in tasks.md.
Stage 1 - Walkthroughs: For each task and persona combination, launch one agent. It steps through the task in the browser, captures a screenshot at every step into ./output/screens/, and fills in the four-question template per step from that persona's knowledge level. Answers must reference what is visible in the captured screenshot. Any "no" answer makes the step a failure point with a severity guess.
Stage 2 - Merge: A merge agent groups all failure points by screen and step across tasks and personas. For each screen it lists: which tasks and personas fail there, which of the four questions fail most often, the severity guesses, and the screenshot references.
Stage 3 - Report: Write walkthrough-report.md with the merged failure points ordered by how many task-persona combinations each screen breaks, plus a short list of steps that passed for experts but failed for novices.
Rules: these are evaluator predictions, not user behavior; say so in the report header. Do not propose redesigns; describe the failure and the question it fails. Every failure cites a screenshot file.
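The rules in the prompt (a yes or no per question, a justification, a screenshot for every step, and any "no" making the step a failure point) can also be checked mechanically once the agents return their records. The sketch below assumes a hypothetical JSON shape for a step record; `checkStepRecord` and its field names are illustrative, not something the workflow defines.

```javascript
// Hypothetical step record:
// { step, screenshot, answers: { Q1: { answer: "yes" | "no", why }, ..., Q4: {...} } }
const QUESTIONS = ["Q1", "Q2", "Q3", "Q4"]

function checkStepRecord(record) {
  const answers = record.answers || {}
  const problems = []
  for (const q of QUESTIONS) {
    const a = answers[q]
    if (!a || (a.answer !== "yes" && a.answer !== "no")) {
      problems.push(q + " needs a yes or no answer")
    } else if (!a.why || a.why.trim() === "") {
      problems.push(q + " needs a one-sentence justification")
    }
  }
  if (!record.screenshot) problems.push("missing screenshot reference")
  // Any "no" makes the step a failure point, as the template's verdict line states.
  const failedQuestions = QUESTIONS.filter((q) => answers[q] && answers[q].answer === "no")
  return {
    verdict: failedQuestions.length > 0 ? "failure point" : "pass",
    failedQuestions,
    problems, // an empty array means the record is well formed
  }
}

const result = checkStepRecord({
  step: 4,
  screenshot: "./output/screens/screen-04.png",
  answers: {
    Q1: { answer: "yes", why: "The persona wants to see results." },
    Q2: { answer: "yes", why: "The primary button is visible without scrolling." },
    Q3: { answer: "no", why: "Nothing on screen defines the term baseline." },
    Q4: { answer: "yes", why: "A progress indicator appears after the click." },
  },
})
console.log(result)
```

A check like this only confirms that a record is complete; whether the justification actually matches the screenshot still needs a human look.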
Section 7
The walkthrough agent, defined once
The subagent definition is where the method's discipline lives. Keeping it in .claude/agents/ means every walkthrough in every future run answers the same four questions in the same order, and the merge stage stays comparable across runs and product areas.
---
name: walkthrough-evaluator
description: Performs a cognitive walkthrough of one task with one persona, answering the four walkthrough questions at every step with screenshot evidence. Use during walkthrough workflows.
tools: Read, Write, Bash
model: sonnet
---
You perform a cognitive walkthrough of exactly one task with exactly one persona.
Method:
- Adopt only the knowledge stated in the persona file. Do not use your own knowledge of the product or of design conventions the persona would not know.
- At each step, capture a screenshot before answering, then answer the four questions from walkthrough-questions.md in order. Ground every answer in what the screenshot shows.
- Any "no" answer makes the step a failure point. Record a severity guess: blocks the task, causes hesitation or error, or minor friction.
- If you cannot complete a step at all from the persona's knowledge, record where you got stuck and stop the task there; an abandoned task is a finding.
- You are an evaluator applying an inspection method. Never describe your verdicts as what users will do; they are predictions to test with real users.
Section 8
What the orchestration script roughly does
Claude writes the orchestration script when you trigger the workflow. The sketch below shows the grid fan-out and the merge with an agent() pseudo-API; it is illustrative, not the literal generated code.
const fs = require("node:fs")

// parseTasks and agent() are placeholders: parseTasks stands for whatever turns
// tasks.md into { text, url } objects, and agent() is the pseudo-API for launching a subagent.
const tasks = parseTasks(fs.readFileSync("./tasks.md", "utf8")) // 9 tasks
const personas = fs.readdirSync("./personas").map((f) => "./personas/" + f) // 3 personas
const template = fs.readFileSync("./walkthrough-questions.md", "utf8")

// Stage 1: one walkthrough agent per task x persona cell.
const cells = tasks.flatMap((task) => personas.map((persona) => ({ task, persona })))
const walkthroughs = await Promise.all(
  cells.map(({ task, persona }) =>
    agent(
      "Cognitive walkthrough of one task with one persona.\n" +
        "Task:\n" + task.text + "\nPersona file: " + persona + "\n" +
        "Template:\n" + template + "\n" +
        "Drive the product at " + task.url + ", screenshot every step into ./output/screens/, " +
        "and return the per-step records as JSON.",
      { model: "sonnet" }
    )
  )
)

// Stage 2: merge failure points by screen and step.
const report = await agent(
  "Group these walkthrough failure points by screen and step. For each screen list the " +
    "tasks and personas that fail there, which of the four questions fail most often, " +
    "severity guesses, and screenshot references. Order screens by how many task-persona " +
    "combinations they break. State clearly that these are evaluator predictions, not user behavior.\n\n" +
    JSON.stringify(walkthroughs),
  { model: "opus" }
)

// Stage 3: write the merged report to disk.
fs.writeFileSync("./output/walkthrough-report.md", report)
Section 9
Step by step through one product area
Write the task list first, with the correct action sequence for each task. That sequence is the answer key the walkthrough checks the interface against, and writing it often surfaces the first findings on its own. Define the personas as knowledge profiles, point the workflow at a running build, and let it run; a run of nine tasks across three personas typically completes within an hour, with most of the time spent in the browser-driving stage.
Read the merged report by screen, not by task. Pick the two or three most-flagged failure points and check them yourself against the screenshots before sharing anything; if the agent misread a screen, you want to find that before the team does. Then decide what each surviving failure point becomes: a fix you are confident about, a question for the usability test, or a known tradeoff you accept.
Plan the usability test from the report. The walkthrough tells you which tasks and screens to include and what to watch for; the test with real participants is what turns predictions into findings. The usability test prep workflow on this site picks up exactly where this one ends.
[Diagram: Write task list → Define knowledge personas → Run the walkthroughs → Read report by screen → Verify top failure points → Sort fix, test, accept → Plan usability test]
The task list is the answer key, the merged report is read by screen, and the usability test verifies the predictions.
Section 10
Case study: a developer tool's first-run experience
A developer tools team walked its first-run experience — install, authenticate, connect a repository, run the first analysis — with a novice persona (a developer who had never used a static analysis tool) and an expert persona (a developer migrating from a competitor). Twelve tasks, two personas, twenty-four walkthroughs, run in just over an hour against a staging build.
The novice persona failed Q3 repeatedly on terminology: the product asked users to create a baseline before showing any results, and nothing on the screen explained what a baseline was or why results were withheld until one existed. The expert persona passed the same steps without hesitation, which located the problem precisely as a learnability gap rather than a flaw in the flow itself. The merged report flagged the baseline screen in nine of the twelve novice walkthroughs.
The team made two changes before the usability study — a one-line explanation on the baseline screen and a renamed primary button — and kept the screen in the study script to check whether the fix held. Three of five novice participants still hesitated there, but all five recovered, where the walkthrough had predicted abandonment; a useful reminder that the method predicts friction better than it predicts what people do about it.
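The comparison that isolated the baseline screen, a step the novice fails while the expert passes it, is simple enough to sketch. The code below assumes a hypothetical per-step outcome shape and a `learnabilityGaps` helper invented for this example; the sample data is illustrative, not the team's actual records.

```javascript
// Hypothetical outcome shape: { task, step, persona, verdict },
// where verdict is "pass" or "failure point".
function learnabilityGaps(outcomes, novice, expert) {
  const key = (o) => o.task + " #" + o.step
  // Steps the expert persona got through without a failure point.
  const expertPassed = new Set(
    outcomes.filter((o) => o.persona === expert && o.verdict === "pass").map(key)
  )
  // Keep only novice failures on steps the expert passed.
  return outcomes
    .filter((o) => o.persona === novice && o.verdict === "failure point")
    .filter((o) => expertPassed.has(key(o)))
    .map(key)
}

const gaps = learnabilityGaps(
  [
    { task: "run first analysis", step: 4, persona: "novice", verdict: "failure point" },
    { task: "run first analysis", step: 4, persona: "expert", verdict: "pass" },
    { task: "connect repository", step: 2, persona: "novice", verdict: "failure point" },
    { task: "connect repository", step: 2, persona: "expert", verdict: "failure point" },
  ],
  "novice",
  "expert"
)
console.log(gaps)
```

A step that fails for both personas is left out on purpose: it points at the flow itself rather than at missing knowledge.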
Section 11
Case study: an enterprise admin console
An enterprise software team walked nine common administration tasks — inviting users, assigning roles, configuring SSO, setting data retention, and similar — with three personas: a new IT administrator, an experienced administrator from a different suite, and a non-technical office manager who had been handed admin rights.
The merge stage made the headline finding hard to miss: 6 of the 9 tasks failed at the same permissions screen, across all three personas, mostly on Q2 and Q4. The screen required selecting a role before the relevant settings became visible, gave no indication that hidden settings existed, and saved changes with no confirmation beyond the button briefly disabling. Twenty-two of the run's thirty-one failure points sat on that one screen.
Because the failures clustered, the team scoped one redesign instead of nine task-level fixes, and the follow-up walkthrough on the revised screen cleared all but four failure points. The remaining four were Q1 failures for the office-manager persona — not knowing the task was theirs to do at all — which no screen redesign fixes and which the team took to onboarding and documentation instead.
Section 12
Case study: a payee-creation flow before a usability study
A mobile banking team walked its add-a-payee flow with two personas — a customer adding their first payee and a customer who pays bills weekly — the week before a scheduled usability study, mainly to sharpen the study script. Five tasks covering domestic payees, international payees, and editing an existing payee, ten walkthroughs, under an hour of runtime against a TestFlight build driven through a device mirror.
The walkthrough predicted three failure points: a Q4 failure after submitting a new payee, where the confirmation screen did not say when the payee would become available to pay; a Q3 failure on the distinction between BSB and SWIFT fields for the first-time persona; and a Q2 failure on the edit affordance, which was reachable only by swiping a list row with no visible hint.
The usability study confirmed the first two with five of six participants and did not confirm the third — most participants found the swipe gesture quickly, having learned it elsewhere in the app. The team's own retrospective note is the right summary of the method: the walkthrough made the study sharper and cheaper, and the study corrected the walkthrough where the evaluator's prediction was wrong.
Section 13
Good vs bad walkthrough output
A weak walkthrough report reads like a generic heuristic review: confident statements about what users will find confusing, no method visible, no evidence attached. A strong one shows its work — the persona, the question that failed, the screenshot, and the modest framing that this is a prediction awaiting a usability test.
Weak: Users will find the onboarding confusing
Strong: Q3 failure, novice persona, step 4: the label "Create baseline" does not connect to the goal of seeing results; nothing on screen-04.png defines the term
Weak: The permissions page has usability issues
Strong: Q2 failure across 6 of 9 tasks, all personas: role-specific settings are hidden until a role is selected and no indicator shows they exist; screens 11-13 attached
Weak: Users won't know the payee was added (stated as fact)
Strong: Q4 failure, both personas, step 6: the confirmation screen does not say when the payee becomes available; prediction to verify in next week's usability sessions
Weak: We recommend redesigning the navigation (no failure point cited)
Strong: No redesign proposed; 22 of 31 failure points cluster on one screen, which scopes the fix discussion for the team
Method-grounded predictions can be checked against screens and tested with users; vague claims about users cannot.
Section 14
Limits: what this workflow cannot prove
A cognitive walkthrough — run by a human evaluator or simulated by an agent — predicts where people are likely to struggle. It does not observe behavior, it cannot measure how often a problem occurs or how severe it is in practice, and it systematically misses problems of motivation, trust, and context that only show up with real people in real situations.
The agent adds its own caveat on top of the method's: it is simulating an evaluator's discipline, not a user's mind, and it can apply a persona's stated knowledge only as well as the persona is written. Findings are hypotheses. The decisions about which ones to fix immediately, which to put in front of participants, and which to accept are made by the team, ideally with the usability test booked before the walkthrough report is written.
- Cannot observe real user behavior or measure task success; only usability testing does that.
- Cannot judge motivation, trust, or willingness to continue; the four questions do not ask.
- Cannot rank severity reliably; the severity guesses order the conversation, not the roadmap.
- Cannot exceed the quality of the task list and persona definitions it is given.
Section 15
The reusable walkthrough workflow
Save the question template, the persona files, the evaluator agent definition, and the prompt in the repository, and save the prompt to .claude/workflows/ as a /walkthrough command. Rerunning the same tasks and personas after each significant release turns the walkthrough from a one-off review into a cheap learnability regression check.
1. Write the task list with the correct action sequence and starting point for each task.
2. Define 2-3 personas as knowledge profiles: what they know, what they have already done.
3. Confirm the four-question template and the severity guesses it allows.
4. Run the workflow: one walkthrough agent per task-persona cell, screenshots at every step.
5. Merge failure points by screen and step; order screens by how many combinations they break.
6. Verify the top 2-3 failure points yourself against the screenshots.
7. Sort findings into: fix now, test with users, accept as a tradeoff.
8. Feed the surviving questions into the usability test plan and rerun the walkthrough after fixes ship.
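Used as a regression check, the rerun is most useful when its failure points are compared with the previous run's. The sketch below shows one way to do that comparison, assuming both runs saved failure points in the same hypothetical shape; `diffRuns` and the field names are illustrative.

```javascript
// Hypothetical failure-point shape: { task, persona, step, question }.
function diffRuns(previous, current) {
  const key = (fp) => [fp.task, fp.persona, fp.step, fp.question].join(" | ")
  const before = new Set(previous.map(key))
  const after = new Set(current.map(key))
  return {
    resolved: [...before].filter((k) => !after.has(k)), // failed before, passes now
    introduced: [...after].filter((k) => !before.has(k)), // new since the last run
    persisting: [...after].filter((k) => before.has(k)), // still failing
  }
}

const diff = diffRuns(
  [
    { task: "assign role", persona: "new admin", step: 3, question: "Q2" },
    { task: "invite user", persona: "new admin", step: 5, question: "Q4" },
  ],
  [
    { task: "invite user", persona: "new admin", step: 5, question: "Q4" },
    { task: "set retention", persona: "office manager", step: 1, question: "Q1" },
  ]
)
console.log(diff)
```

Matching on step number assumes the action sequence is unchanged between runs; if a release adds or removes steps, the comparison needs a more stable identifier, such as a screen name.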