Agentic Design School
UX workflow
Foundation

Cognitive Walkthrough at Scale

A workflow for running structured cognitive walkthroughs across many tasks and personas at once: each agent steps through one task with one persona's knowledge level, answers the four classic walkthrough questions at every step with screenshot evidence, and a merge agent groups failure points by screen so the team sees where journeys break.

Orchestration: Per-task walkthrough agents with merge stage

Typical run: 1–2 hours per product area

Last reviewed: 2026-06-02


Section 1

Why walkthroughs are underused

The cognitive walkthrough is one of the oldest inspection methods in usability practice: pick a task, define who is doing it and what they already know, and at every step ask whether the user will try the right action, notice it, connect it to their goal, and understand the feedback. It finds learnability problems early and cheaply, before any participant is recruited.

Teams use it less than they should because it is tedious for people in exactly the way it is not for agents. Doing it properly means stepping through every screen of every task for every persona, answering the same four questions each time, and writing down the evidence. Most teams do it once for one task, in a meeting, from memory of the interface rather than from the interface itself.

This workflow runs the method at scale: many tasks, several personas, the four questions answered at every step, with screenshots attached. One thing must be said plainly before anything else — the agents are simulating an evaluator applying a method, not simulating users. The output is a structured set of hypotheses about where people will struggle, and the way to confirm those hypotheses is usability testing with actual users.

Section 2

When to reach for this workflow

Use it when learnability is the question: first-run experiences, onboarding flows, admin consoles where new staff must self-serve, and any flow you are about to put in front of usability test participants and want to walk through systematically first. It works on live products, staging environments, and clickable prototypes the agent can drive in a browser.

It is a Foundation-level workflow because the method carries the rigor: the four questions, asked at every step, do most of the work. What you need is a list of tasks worth walking, two or three honest persona definitions, and a running version of the interface.

  • A first-run or onboarding flow before it ships or before a usability study.
  • A product area with many similar tasks where coverage matters more than depth on one.
  • Preparation for usability testing: the walkthrough chooses what to watch for.
  • A regression check after a navigation or terminology change.

Section 3

The four questions, written down once

The method's value comes from answering the same four questions at every step, in order, without skipping the ones that feel obvious. Write them into a template the walkthrough agents must follow, so every step of every task gets the same treatment and the merge stage can compare like with like.

Each answer needs a yes or no, a one-sentence justification grounded in what is visible on the screen, and a screenshot reference. A walkthrough finding without the screen it refers to is an opinion that cannot be checked.

walkthrough-questions.md (the per-step template)
# Cognitive walkthrough — per-step record

Task: <task name>
Persona: <persona name and knowledge level>
Step N: <the correct action at this point>
Screenshot: <file captured at this step>

Q1. Will the user try to achieve the right effect?
    Does the persona, with their knowledge, even know this step is needed?
    Answer: yes / no — one sentence grounded in what the screen shows.

Q2. Will the user notice that the correct action is available?
    Is the control visible without scrolling, hidden in a menu, or ambiguous?
    Answer: yes / no — one sentence.

Q3. Will the user associate the correct action with the effect they want?
    Does the label or icon mean to this persona what it means to the team?
    Answer: yes / no — one sentence.

Q4. If the correct action is performed, will the user see that progress was made?
    Is the feedback visible, immediate, and in the persona's language?
    Answer: yes / no — one sentence.

Verdict: pass / failure point (any "no" makes this step a failure point).
Severity guess: blocks the task / causes hesitation or error / minor friction.
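
The verdict rule in the template (any "no" makes the step a failure point) is mechanical enough to check in code once a record has been filled in. The sketch below assumes a hypothetical JSON shape for one record; the field names (`answers`, `value`, `justification`, `screenshot`) and the function `verdictFor` are illustrative and not part of the template itself.

```javascript
const QUESTIONS = ["q1", "q2", "q3", "q4"]

// Derive the verdict for one per-step record.
// Throws if an answer, justification, or screenshot reference is missing,
// because a record without them cannot be checked against the screen.
function verdictFor(record) {
  if (!record.screenshot) {
    throw new Error(`Step ${record.step}: missing screenshot reference`)
  }
  for (const q of QUESTIONS) {
    const answer = record.answers[q]
    if (!answer || (answer.value !== "yes" && answer.value !== "no")) {
      throw new Error(`Step ${record.step}: ${q} needs a yes or no answer`)
    }
    if (!answer.justification) {
      throw new Error(`Step ${record.step}: ${q} needs a one-sentence justification`)
    }
  }
  // Any "no" makes the step a failure point.
  const failedQuestions = QUESTIONS.filter((q) => record.answers[q].value === "no")
  return {
    verdict: failedQuestions.length > 0 ? "failure point" : "pass",
    failedQuestions,
  }
}

// Hypothetical record, loosely modeled on a terminology failure.
const example = {
  task: "Run the first analysis",
  persona: "novice",
  step: 4,
  screenshot: "screen-04.png",
  answers: {
    q1: { value: "yes", justification: "The page heading names the goal of seeing results." },
    q2: { value: "yes", justification: "The primary button is visible without scrolling." },
    q3: { value: "no", justification: "Nothing on the screen defines the label on the button." },
    q4: { value: "yes", justification: "A progress indicator appears after the click." },
  },
}

console.log(verdictFor(example))
```

Deriving the verdict in code rather than trusting the agent's own "Verdict" line is a design choice: it keeps the rule identical across every step of every walkthrough.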

Section 4

Personas are knowledge profiles, not posters

For a walkthrough, a persona is mostly a statement of what the person already knows: their familiarity with the domain, with this product, and with the conventions the interface assumes. A novice persona who has never seen the product will fail steps an expert sails through, and that difference is the finding.

Two or three personas is usually enough — typically a domain novice, an experienced practitioner new to this product, and where relevant an existing power user. Write each as a short paragraph plus a list of things they do and do not know. Resist demographic decoration; the walkthrough questions never ask how old the persona is.

  • Define each persona by what they know: domain knowledge, product familiarity, assumed conventions.
  • Two or three personas per run; more multiplies cost faster than insight.
  • State explicitly what the persona has already done (installed the CLI, received an invite email, holds an account).
  • Reuse the same persona definitions across runs so results stay comparable.
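
One way to keep personas honest is to store them as data with only knowledge fields and render the persona file from that. This is a minimal sketch under assumed names: the object shape, the `renderPersona` function, and the sample entries under `knows` are all hypothetical.

```javascript
// Hypothetical persona: knowledge and prior actions only, no demographics.
const novice = {
  name: "Domain novice",
  summary: "A developer who has never used a static analysis tool.",
  knows: ["Git basics", "How to install a CLI"],
  doesNotKnow: ["What a baseline is", "Why results might be withheld"],
  alreadyDone: ["Installed the CLI", "Holds an account"],
}

// Render the profile as the markdown file a walkthrough agent would read.
// Throws if any of the three knowledge lists is missing or empty.
function renderPersona(p) {
  for (const key of ["knows", "doesNotKnow", "alreadyDone"]) {
    if (!Array.isArray(p[key]) || p[key].length === 0) {
      throw new Error(`Persona "${p.name}" needs a non-empty ${key} list`)
    }
  }
  const bullets = (items) => items.map((item) => `- ${item}`).join("\n")
  return [
    `# Persona: ${p.name}`,
    "",
    p.summary,
    "",
    "## Knows",
    bullets(p.knows),
    "",
    "## Does not know",
    bullets(p.doesNotKnow),
    "",
    "## Has already done",
    bullets(p.alreadyDone),
  ].join("\n")
}

console.log(renderPersona(novice))
```

Because the shape has no field for age, role title, or a photo, demographic decoration has nowhere to go.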

Section 5

The orchestration pattern: one agent per task and persona

This runs as a Claude Code dynamic workflow: a JavaScript script that Claude writes and runs in the background, orchestrating subagents while intermediate results — every per-step record and screenshot path — stay in script variables rather than in Claude's own context. A nine-task, three-persona run produces twenty-seven walkthroughs and a few hundred step records; without that separation the later walkthroughs would be done from a context full of the earlier ones.

Each cell of the task-by-persona grid gets one walkthrough agent. The agent drives the interface through a browser connection (Playwright MCP or Chrome DevTools MCP), captures a screenshot at each step, and answers the four questions in the template before moving on. A merge agent then groups every failure point by screen and step, so a screen that breaks six different tasks shows up as one heavily flagged screen rather than six scattered notes.

Workflows can run up to 16 agents concurrently and up to 1,000 in a run, and they are resumable if a long run is interrupted. The finished prompt can be saved to .claude/workflows/ in the project (or ~/.claude/workflows/ for personal use) as a reusable command like /walkthrough. You trigger a run by including the word "workflow" in the prompt or via /effort ultracode. The walkthrough agent definition lives in .claude/agents/*.md, so the four-question discipline travels with the project.
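
The task-by-persona grid can be sketched in a few lines. The names `slug` and `buildGrid`, and the idea of giving each cell its own screenshot folder, are assumptions made for illustration; the workflow itself only requires that screenshots be captured per step.

```javascript
// Turn a name into a filesystem-safe slug, e.g. "Invite a user" -> "invite-a-user".
function slug(name) {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "")
}

// Build one cell per task x persona combination.
// The per-cell screenshot folder is an assumption of this sketch.
function buildGrid(taskNames, personaNames) {
  return taskNames.flatMap((task) =>
    personaNames.map((persona) => {
      const id = `${slug(task)}__${slug(persona)}`
      return { id, task, persona, screenshotDir: `./output/screens/${id}/` }
    })
  )
}

// Placeholder task names; real ones would come from the task list.
const taskNames = Array.from({ length: 9 }, (_, i) => `Task ${i + 1}`)
const personaNames = ["New IT administrator", "Experienced administrator", "Office manager"]

const grid = buildGrid(taskNames, personaNames)
console.log(grid.length) // 27 walkthroughs for 9 tasks and 3 personas
console.log(grid[0].id)  // "task-1__new-it-administrator"
```

A stable cell id makes it easier to trace a failure point in the merged report back to the walkthrough and screenshots that produced it.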

Diagram: Walkthrough fan-out and merge

1. Define tasks
2. Define personas
3. Walk each task per persona
4. Capture screenshots per step
5. Merge failures by screen
6. Human review
7. Plan usability testing

Tasks and personas form a grid; each cell is one walkthrough; failures merge by screen before humans review.

Section 6

The workflow prompt

Give the workflow the task list, the persona files, the question template, and the URL of the running product or prototype. The two rules that protect the output: every answer must be grounded in a captured screenshot, and the agent must never claim to know what real users will do — the verdicts are the evaluator's predictions under the method, nothing more.

Cognitive walkthrough workflow prompt
Run this as a workflow.

Input: ./tasks.md lists 9 tasks, each with a starting point and the correct action sequence. ./personas/ contains 3 persona files describing knowledge levels. ./walkthrough-questions.md is the per-step template. The product runs at http://localhost:3000 with the test accounts listed in tasks.md.

Stage 1 - Walkthroughs: For each task and persona combination, launch one agent. It steps through the task in the browser, captures a screenshot at every step into ./output/screens/, and fills in the four-question template per step from that persona's knowledge level. Answers must reference what is visible in the captured screenshot. Any "no" answer makes the step a failure point with a severity guess.

Stage 2 - Merge: A merge agent groups all failure points by screen and step across tasks and personas. For each screen it lists: which tasks and personas fail there, which of the four questions fail most often, the severity guesses, and the screenshot references.

Stage 3 - Report: Write walkthrough-report.md with the merged failure points ordered by how many task-persona combinations each screen breaks, plus a short list of steps that passed for experts but failed for novices.

Rules: these are evaluator predictions, not user behavior; say so in the report header. Do not propose redesigns; describe the failure and the question it fails. Every failure cites a screenshot file.
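
The grouping that Stage 2 asks the merge agent for can also be expressed as plain code, which is useful for checking the agent's merged report. This is a sketch, not the workflow's actual merge: the failure-point shape and the function name `mergeByScreen` are assumptions, and it groups by screen only, since step numbers differ from task to task.

```javascript
// failurePoints: flat list of
// { screen, step, task, persona, failedQuestions, severity, screenshot }
function mergeByScreen(failurePoints) {
  const byScreen = new Map()
  for (const fp of failurePoints) {
    if (!byScreen.has(fp.screen)) {
      byScreen.set(fp.screen, {
        screen: fp.screen,
        combos: new Set(),   // distinct task / persona combinations failing here
        questionCounts: {},  // how often each of the four questions fails here
        severities: [],
        screenshots: [],
      })
    }
    const entry = byScreen.get(fp.screen)
    entry.combos.add(`${fp.task} / ${fp.persona}`)
    for (const q of fp.failedQuestions) {
      entry.questionCounts[q] = (entry.questionCounts[q] || 0) + 1
    }
    entry.severities.push(fp.severity)
    entry.screenshots.push(fp.screenshot)
  }
  // Order screens by how many task-persona combinations they break.
  return [...byScreen.values()]
    .map((e) => ({ ...e, combos: [...e.combos].sort(), comboCount: e.combos.size }))
    .sort((a, b) => b.comboCount - a.comboCount)
}

// Hypothetical failure points for illustration.
const sampleFailures = [
  { screen: "invite", step: 2, task: "Invite a user", persona: "office manager",
    failedQuestions: ["Q1"], severity: "blocks the task", screenshot: "screen-03.png" },
  { screen: "permissions", step: 3, task: "Assign a role", persona: "new IT admin",
    failedQuestions: ["Q2"], severity: "blocks the task", screenshot: "screen-11.png" },
  { screen: "permissions", step: 5, task: "Configure SSO", persona: "new IT admin",
    failedQuestions: ["Q2", "Q4"], severity: "causes hesitation or error", screenshot: "screen-12.png" },
]

for (const s of mergeByScreen(sampleFailures)) {
  console.log(s.screen, s.comboCount, s.questionCounts)
}
```

Counting distinct task-persona combinations, rather than raw failure points, keeps a screen with many steps in one task from outranking a screen that breaks many tasks.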

Section 7

The walkthrough agent, defined once

The subagent definition is where the method's discipline lives. Keeping it in .claude/agents/ means every walkthrough in every future run answers the same four questions in the same order, and the merge stage stays comparable across runs and product areas.

.claude/agents/walkthrough-evaluator.md
---
name: walkthrough-evaluator
description: Performs a cognitive walkthrough of one task with one persona, answering the four walkthrough questions at every step with screenshot evidence. Use during walkthrough workflows.
tools: Read, Write, Bash
model: sonnet
---

You perform a cognitive walkthrough of exactly one task with exactly one persona.

Method:
- Adopt only the knowledge stated in the persona file. Do not use your own knowledge of the product or of design conventions the persona would not know.
- At each step, capture a screenshot before answering, then answer the four questions from walkthrough-questions.md in order. Ground every answer in what the screenshot shows.
- Any "no" answer makes the step a failure point. Record a severity guess: blocks the task, causes hesitation or error, or minor friction.
- If you cannot complete a step at all from the persona's knowledge, record where you got stuck and stop the task there; an abandoned task is a finding.
- You are an evaluator applying an inspection method. Never describe your verdicts as what users will do; they are predictions to test with real users.
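
Two of the rules above can be spot-checked after a run: every record cites a screenshot that exists, and no justification states user behavior as fact. The sketch below is a crude lint under assumed names (`lintRecords`, the record shape, the `exists` option); the phrase pattern is a rough heuristic and will miss or over-flag some wording.

```javascript
const fs = require("node:fs")
const path = require("node:path")

// Rough heuristic for claims about what users will do.
const USER_CLAIM = /\busers?\s+(will|won't|will not|would)\b/i

// records: per-step records with { task, persona, step, screenshot, answers }.
// Returns a list of human-readable problems; an empty list means nothing was flagged.
function lintRecords(records, { screensDir = "./output/screens", exists = fs.existsSync } = {}) {
  const problems = []
  for (const r of records) {
    const where = `${r.task} / ${r.persona} / step ${r.step}`
    if (!r.screenshot) {
      problems.push(`${where}: no screenshot cited`)
    } else if (!exists(path.join(screensDir, r.screenshot))) {
      problems.push(`${where}: screenshot ${r.screenshot} not found`)
    }
    for (const [q, answer] of Object.entries(r.answers || {})) {
      if (USER_CLAIM.test(answer.justification || "")) {
        problems.push(`${where}: ${q} states user behavior as fact`)
      }
    }
  }
  return problems
}
```

Anything the lint flags is a prompt for a human to reread the record against its screenshot, not an automatic rejection.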

Section 8

What the orchestration script roughly does

Claude writes the orchestration script when you trigger the workflow. The sketch below shows the grid fan-out and the merge with an agent() pseudo-API; it is illustrative, not the literal generated code.

Dynamic workflow sketch (pseudo-code)
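const fs = require("node:fs")

// parseTasks() and agent() are placeholders in this sketch: parseTasks() is assumed to turn
// tasks.md into [{ text, url }, ...], and agent() stands in for launching one subagent.
const tasks = parseTasks(fs.readFileSync("./tasks.md", "utf8"))          // 9 tasks
const personas = fs.readdirSync("./personas").map((f) => "./personas/" + f) // 3 personas
const template = fs.readFileSync("./walkthrough-questions.md", "utf8")
fs.mkdirSync("./output/screens", { recursive: true }) // screenshots and the report are written under ./output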

// Stage 1: one walkthrough agent per task x persona cell.
const cells = tasks.flatMap((task) => personas.map((persona) => ({ task, persona })))
const walkthroughs = await Promise.all(
  cells.map(({ task, persona }) =>
    agent(
      "Cognitive walkthrough of one task with one persona.\n" +
        "Task:\n" + task.text + "\nPersona file: " + persona + "\n" +
        "Template:\n" + template + "\n" +
        "Drive the product at " + task.url + ", screenshot every step into ./output/screens/, " +
        "and return the per-step records as JSON.",
      { model: "sonnet" }
    )
  )
)

// Stage 2: merge failure points by screen and step.
const report = await agent(
  "Group these walkthrough failure points by screen and step. For each screen list the " +
    "tasks and personas that fail there, which of the four questions fail most often, " +
    "severity guesses, and screenshot references. Order screens by how many task-persona " +
    "combinations they break. State clearly that these are evaluator predictions, not user behavior.\n\n" +
    JSON.stringify(walkthroughs),
  { model: "opus" }
)

fs.writeFileSync("./output/walkthrough-report.md", report)
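
The sketch above hands all twenty-seven cells to Promise.all at once. Whether the workflow runtime queues them behind its 16-agent limit is not something this page confirms, so the helper below shows one way a script could cap concurrency itself. The name `mapWithLimit` is hypothetical, and the worker is any async function.

```javascript
// Run worker(item, index) over items with at most `limit` calls in flight.
// Results come back in the same order as items.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0
  // Each runner pulls the next unclaimed index until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner())
  await Promise.all(runners)
  return results
}

// Usage inside an async context, with the cells and agent() from the sketch above:
//   const walkthroughs = await mapWithLimit(cells, 16, (cell) => agent(promptFor(cell)))
// promptFor() is a hypothetical helper that builds the per-cell prompt.
```

If one worker rejects, the returned promise rejects while the other runners finish their current item; a long run may prefer to catch errors inside the worker and record them as findings instead.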

Section 9

Step by step through one product area

Write the task list first, with the correct action sequence for each task. That sequence is the answer key the walkthrough checks the interface against, and writing it often surfaces the first findings on its own. Define the personas as knowledge profiles, point the workflow at a running build, and let it run; a run of nine tasks across three personas typically completes in one to two hours, with most of the time spent in the browser-driving stage.

Read the merged report by screen, not by task. Pick the two or three most-flagged failure points and check them yourself against the screenshots before sharing anything; if the agent misread a screen, you want to find that before the team does. Then decide what each surviving failure point becomes: a fix you are confident about, a question for the usability test, or a known tradeoff you accept.

Plan the usability test from the report. The walkthrough tells you which tasks and screens to include and what to watch for; the test with real participants is what turns predictions into findings. The usability test prep workflow on this site picks up exactly where this one ends.

Diagram: One product area, end to end

1. Write task list
2. Define knowledge personas
3. Run the walkthroughs
4. Read report by screen
5. Verify top failure points
6. Sort fix, test, accept
7. Plan usability test

The task list is the answer key, the merged report is read by screen, and the usability test verifies the predictions.

Section 10

Case study: a developer tool's first-run experience

A developer tools team walked its first-run experience — install, authenticate, connect a repository, run the first analysis — with a novice persona (a developer who had never used a static analysis tool) and an expert persona (a developer migrating from a competitor). Twelve tasks, two personas, twenty-four walkthroughs, run in just over an hour against a staging build.

The novice persona failed Q3 repeatedly on terminology: the product asked users to create a baseline before showing any results, and nothing on the screen explained what a baseline was or why results were withheld until one existed. The expert persona passed the same steps without hesitation, which located the problem precisely as a learnability gap rather than a flaw in the flow itself. The merged report flagged the baseline screen in nine of the twelve novice walkthroughs.

The team made two changes before the usability study — a one-line explanation on the baseline screen and a renamed primary button — and kept the screen in the study script to check whether the fix held. Three of five novice participants still hesitated there, but all five recovered, where the walkthrough had predicted abandonment; a useful reminder that the method predicts friction better than it predicts what people do about it.
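
The pattern in this case, a step the novice fails and the expert passes, can be pulled out of the per-step records mechanically. The sketch below assumes each record carries `task`, `step`, `persona`, and `verdict` fields; those names and the function `learnabilityGaps` are illustrative.

```javascript
// Return the steps where the novice persona hit a failure point
// and the expert persona passed the same step of the same task.
function learnabilityGaps(records, novice, expert) {
  const key = (r) => `${r.task}#${r.step}`
  const expertPassed = new Set(
    records.filter((r) => r.persona === expert && r.verdict === "pass").map(key)
  )
  return records
    .filter((r) => r.persona === novice && r.verdict === "failure point" && expertPassed.has(key(r)))
    .map((r) => ({ task: r.task, step: r.step }))
}

// Hypothetical records for illustration.
const sampleRecords = [
  { task: "Run first analysis", step: 4, persona: "novice", verdict: "failure point" },
  { task: "Run first analysis", step: 4, persona: "expert", verdict: "pass" },
  { task: "Connect a repository", step: 2, persona: "novice", verdict: "failure point" },
  { task: "Connect a repository", step: 2, persona: "expert", verdict: "failure point" },
]

console.log(learnabilityGaps(sampleRecords, "novice", "expert"))
```

Steps that both personas fail are left out on purpose: they point at the flow itself rather than at a knowledge gap.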

Section 11

Case study: an enterprise admin console

An enterprise software team walked nine common administration tasks — inviting users, assigning roles, configuring SSO, setting data retention, and similar — with three personas: a new IT administrator, an experienced administrator from a different suite, and a non-technical office manager who had been handed admin rights.

The merge stage made the headline finding hard to miss: 6 of the 9 tasks failed at the same permissions screen, across all three personas, mostly on Q2 and Q4. The screen required selecting a role before the relevant settings became visible, gave no indication that hidden settings existed, and saved changes with no confirmation beyond the button briefly disabling. Twenty-two of the run's thirty-one failure points sat on that one screen.

Because the failures clustered, the team scoped one redesign instead of nine task-level fixes, and the follow-up walkthrough on the revised screen cleared all but four failure points. The remaining four were Q1 failures for the office-manager persona — not knowing the task was theirs to do at all — which no screen redesign fixes and which the team took to onboarding and documentation instead.

Section 12

Case study: a payee-creation flow before a usability study

A mobile banking team walked its add-a-payee flow with two personas — a customer adding their first payee and a customer who pays bills weekly — the week before a scheduled usability study, mainly to sharpen the study script. Five tasks covering domestic payees, international payees, and editing an existing payee, ten walkthroughs, under an hour of runtime against a TestFlight build driven through a device mirror.

The walkthrough predicted three failure points: a Q4 failure after submitting a new payee, where the confirmation screen did not say when the payee would become available to pay; a Q3 failure on the distinction between BSB and SWIFT fields for the first-time persona; and a Q2 failure on the edit affordance, which was reachable only by swiping a list row with no visible hint.

The usability study confirmed the first two with five of six participants and did not confirm the third — most participants found the swipe gesture quickly, having learned it elsewhere in the app. The team's own retrospective note is the right summary of the method: the walkthrough made the study sharper and cheaper, and the study corrected the walkthrough where the evaluator's prediction was wrong.

Section 13

Good vs bad walkthrough output

A weak walkthrough report reads like a generic heuristic review: confident statements about what users will find confusing, no method visible, no evidence attached. A strong one shows its work — the persona, the question that failed, the screenshot, and the modest framing that this is a prediction awaiting a usability test.

Table: Walkthrough quality comparison

Bad: Users will find the onboarding confusing
Good: Q3 failure, novice persona, step 4: the label "Create baseline" does not connect to the goal of seeing results; nothing on screen-04.png defines the term

Bad: The permissions page has usability issues
Good: Q2 failure across 6 of 9 tasks, all personas: role-specific settings are hidden until a role is selected and no indicator shows they exist; screens 11-13 attached

Bad: Users won't know the payee was added (stated as fact)
Good: Q4 failure, both personas, step 6: the confirmation screen does not say when the payee becomes available; prediction to verify in next week's usability sessions

Bad: We recommend redesigning the navigation (no failure point cited)
Good: No redesign proposed; 22 of 31 failure points cluster on one screen, which scopes the fix discussion for the team

Method-grounded predictions can be checked against screens and tested with users; vague claims about users cannot.

Section 14

Limits: what this workflow cannot prove

A cognitive walkthrough — run by a human evaluator or simulated by an agent — predicts where people are likely to struggle. It does not observe behavior, it cannot measure how often a problem occurs or how severe it is in practice, and it systematically misses problems of motivation, trust, and context that only show up with real people in real situations.

The agent adds its own caveat on top of the method's: it is simulating an evaluator's discipline, not a user's mind, and it can apply a persona's stated knowledge only as well as the persona is written. Findings are hypotheses. The decisions about which ones to fix immediately, which to put in front of participants, and which to accept are made by the team, ideally with the usability test booked before the walkthrough report is written.

  • Cannot observe real user behavior or measure task success; only usability testing does that.
  • Cannot judge motivation, trust, or willingness to continue; the four questions do not ask.
  • Cannot rank severity reliably; the severity guesses order the conversation, not the roadmap.
  • Cannot exceed the quality of the task list and persona definitions it is given.

Section 15

The reusable walkthrough workflow

Save the question template, the persona files, the evaluator agent definition, and the prompt in the repository, and save the prompt to .claude/workflows/ as a /walkthrough command. Rerunning the same tasks and personas after each significant release turns the walkthrough from a one-off review into a cheap learnability regression check.

Cognitive walkthrough workflow
1. Write the task list with the correct action sequence and starting point for each task.
2. Define 2-3 personas as knowledge profiles: what they know, what they have already done.
3. Confirm the four-question template and the severity guesses it allows.
4. Run the workflow: one walkthrough agent per task-persona cell, screenshots at every step.
5. Merge failure points by screen and step; order screens by how many combinations they break.
6. Verify the top 2-3 failure points yourself against the screenshots.
7. Sort findings into: fix now, test with users, accept as a tradeoff.
8. Feed the surviving questions into the usability test plan and rerun the walkthrough after fixes ship.
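
Using the walkthrough as a regression check comes down to comparing the failure points of two runs. This is a minimal sketch that assumes each failure point has `task`, `persona`, and `screen` fields and that those names stay stable between runs; `diffRuns` is a hypothetical helper.

```javascript
// Compare two runs' failure points and report what changed.
function diffRuns(previous, current) {
  const key = (fp) => `${fp.task} | ${fp.persona} | ${fp.screen}`
  const before = new Set(previous.map(key))
  const after = new Set(current.map(key))
  return {
    introduced: [...after].filter((k) => !before.has(k)), // new since the last run
    resolved: [...before].filter((k) => !after.has(k)),   // no longer predicted
    persisting: [...after].filter((k) => before.has(k)),  // still predicted
  }
}

// Hypothetical runs for illustration.
const lastRelease = [
  { task: "Assign a role", persona: "new IT admin", screen: "permissions" },
  { task: "Invite a user", persona: "office manager", screen: "invite" },
]
const thisRelease = [
  { task: "Invite a user", persona: "office manager", screen: "invite" },
  { task: "Set data retention", persona: "new IT admin", screen: "retention" },
]

console.log(diffRuns(lastRelease, thisRelease))
```

A "resolved" entry only means the evaluator no longer predicts the failure; it still needs the same human check against screenshots as any other finding.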

