Section 1
Why heuristic evaluation stopped scaling
Heuristic evaluation is one of the oldest tools in the UX kit: a small group of evaluators walks the interface against a short list of usability principles and writes down where it breaks them. It is cheap, it needs no participants, and it catches a surprising share of real problems. The trouble is coverage. A two-person team evaluating a product with seven core flows either spends a week on it or quietly evaluates the three flows everyone already worries about and calls it done.
The work itself is patient and procedural: open the flow, walk every step, hold each screen against each heuristic, and write down what you observe with enough evidence that someone else can find the same problem. That is exactly the shape of work agents do consistently and humans do unevenly on the fifth flow of the afternoon.
This workflow fans one agent out per flow, all working from the same heuristics file, the same screenshots, and the same evidence rules. A merge agent then dedupes findings that recur across flows and ranks the result. The agents flag candidate violations; humans confirm severity and decide what gets fixed. That division is not a disclaimer; it is the design of the workflow.
Section 2
When to reach for this workflow
Reach for it when the team needs a structured, evidence-backed picture of usability debt across a whole product area, not a deep read of one screen. The classic moments are before a redesign kickoff, when comparing an old and new version of the same flow, and when an agency needs a credible audit of a prospect's product without weeks of access.
Do not reach for it as a substitute for usability testing. Heuristic evaluation predicts problems from principles; testing observes problems happening to real people. The two answer different questions, and this workflow is honest about which one it answers.
- Pre-redesign audit: establish the baseline of known problems before new design work starts.
- Before/after comparison: evaluate the old and new flow against the same heuristics to show what improved and what regressed.
- Pre-sales or onboarding audit: an agency producing a structured first read of a prospective client's product.
- Quarterly usability debt review across all key flows, run with the same heuristics file each time so results are comparable.
Section 3
The heuristics file is the contract
The evaluation is only as good as the heuristics it runs against. Start from Nielsen's ten — they are stable, well documented, and most designers can argue about them fluently — and add a project-specific heuristics file the team maintains: the rules that are true for your product but not for products in general. A banking app might add a heuristic about never hiding the balance behind promotion content; a B2B tool might add one about admin actions always being reversible.
Each heuristic needs an ID, a one-sentence definition, and one example of a violation. The IDs matter because every finding will reference one, and the merge agent dedupes partly by heuristic. Keep the project-specific list short; five to eight added heuristics is workable, twenty means everything violates something.
# Heuristics file: Lumen banking app, June 2026

Findings must reference exactly one heuristic ID. If a problem fits none, propose NEW-XX with a one-sentence definition.

## Nielsen's 10 (NN-01 to NN-10)

NN-01 Visibility of system status - the UI keeps users informed about what is happening.
NN-02 Match between system and the real world - words and concepts familiar to the user.
NN-03 User control and freedom - clearly marked exits, undo, and redo.
NN-04 Consistency and standards - same words and patterns mean the same thing everywhere.
NN-05 Error prevention - the design prevents problems before they happen.
NN-06 Recognition rather than recall - options visible, instructions retrievable.
NN-07 Flexibility and efficiency of use - accelerators for experienced users.
NN-08 Aesthetic and minimalist design - no irrelevant or rarely needed information.
NN-09 Help users recognize, diagnose, and recover from errors - plain-language errors with a way out.
NN-10 Help and documentation - easy to search, focused on the task.

## Project heuristics (LUM-01 to LUM-05)

LUM-01 The current balance is visible within one tap from launch and never behind promotional content.
LUM-02 Any money movement shows the amount, recipient, and date on a confirmation step before commit.
LUM-03 Destructive or irreversible actions require explicit confirmation that names what will happen.
LUM-04 Fees and exchange rates are shown before the user commits, not in the receipt.
LUM-05 Every error state names the next action the user can take, in the user's language not the system's.
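Because every finding must reference exactly one heuristic ID, the ID list can be checked mechanically before a human reads anything. The sketch below is one possible way to do that, not part of the workflow itself. It assumes each heuristic sits on its own line starting with its ID, and that a proposed new heuristic looks like NEW- followed by two characters. The function names parseHeuristicIds and hasValidHeuristic are hypothetical.

```javascript
// Hypothetical helper: collect heuristic IDs such as NN-01 or LUM-04 from the
// heuristics file text. Assumes each definition line starts with its ID.
function parseHeuristicIds(text) {
  const ids = new Set()
  for (const line of text.split("\n")) {
    const match = line.match(/^([A-Z]+-\d{2})\s/)
    if (match) ids.add(match[1])
  }
  return ids
}

// A finding passes when it names exactly one known ID, or proposes a new one
// in the NEW-XX form (assumed here to be NEW- plus two letters or digits).
function hasValidHeuristic(finding, knownIds) {
  const id = finding.heuristic_id
  if (typeof id !== "string") return false
  return knownIds.has(id) || /^NEW-[A-Z0-9]{2}$/.test(id)
}

// Sample input: three lines in the shape of the heuristics file.
const sampleHeuristics = [
  "## Project heuristics (LUM-01 to LUM-05)",
  "NN-04 Consistency and standards - same words and patterns mean the same thing everywhere.",
  "LUM-04 Fees and exchange rates are shown before the user commits, not in the receipt.",
].join("\n")

const knownIds = parseHeuristicIds(sampleHeuristics)
console.log([...knownIds])
console.log(hasValidHeuristic({ heuristic_id: "LUM-04" }, knownIds))
console.log(hasValidHeuristic({ heuristic_id: "NN-04, NN-09" }, knownIds))
```

A finding that lists two IDs in one field fails the check, which matches the file's rule of exactly one heuristic per finding.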
Section 4
The orchestration pattern: fan out per flow, merge once
This runs as a Claude Code dynamic workflow: a JavaScript script that Claude writes and runs in the background, orchestrating subagents while the intermediate results — every screenshot list, every per-flow findings array — stay in script variables rather than in Claude's own context. That separation is what lets seven flows be evaluated with the same care as one; nothing read for the checkout flow crowds out the attention available for settings.
Each flow gets its own evaluation agent. Workflows can run up to 16 agents concurrently and up to 1,000 in a run, and a run is resumable, so an evaluation interrupted halfway picks up where it stopped instead of restarting. You trigger the workflow by including the word workflow in the prompt or via /effort ultracode, and once the prompt is stable you can save it to .claude/workflows/ in the project (or ~/.claude/workflows/ for personal use) so it becomes a reusable command like /heuristic-audit. The evaluator and merge agents themselves are defined as markdown files in .claude/agents/, which keeps their instructions versioned next to the heuristics file.
After the fan-out, a single merge agent receives every per-flow finding, dedupes problems that recur across flows (the same unlabeled icon button in four places is one finding with four locations, not four findings), and ranks the result by candidate severity and frequency. The merged report is what humans review.
[Figure: workflow stages in order: Define flows, Agree heuristics file, Capture evidence, Evaluate per flow, Merge and dedupe, Confirm severity, Plan fixes.]
Flows are evaluated in parallel against the same heuristics file, then merged, deduped, and handed to humans for severity confirmation.
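The merge step can be pictured as a grouping and a sort. In the workflow the merge agent judges which findings are the same problem by reading them; the sketch below substitutes a much cruder rule, an exact match on heuristic ID plus normalized observation text, purely to show the shape of the output. The function name mergeFindings and the sample findings are hypothetical.

```javascript
// Hypothetical stand-in for the merge agent: findings with the same heuristic
// and the same observation text collapse into one entry with many locations.
function mergeFindings(findings) {
  const merged = new Map()
  for (const f of findings) {
    const key = f.heuristic_id + "|" + f.observation.trim().toLowerCase()
    if (!merged.has(key)) {
      merged.set(key, {
        heuristic_id: f.heuristic_id,
        observation: f.observation,
        candidate_severity: f.candidate_severity,
        locations: [],
      })
    }
    const entry = merged.get(key)
    // Keep the highest candidate severity seen for this problem.
    entry.candidate_severity = Math.max(entry.candidate_severity, f.candidate_severity)
    entry.locations.push({ flow: f.flow, step: f.step, evidence: f.evidence })
  }
  // Rank by candidate severity first, then by how many places the problem recurs.
  return [...merged.values()].sort(
    (a, b) =>
      b.candidate_severity - a.candidate_severity ||
      b.locations.length - a.locations.length
  )
}

// Invented sample data in the findings shape the workflow prompt asks for.
const sampleFindings = [
  { flow: "settings", step: 2, heuristic_id: "NN-06", observation: "Unlabeled icon button", evidence: "02-profile.png", candidate_severity: 2 },
  { flow: "send-money", step: 4, heuristic_id: "NN-09", observation: "Raw error code with no recovery path", evidence: "04-transfer-failed.png", candidate_severity: 4 },
  { flow: "statements", step: 1, heuristic_id: "NN-06", observation: "unlabeled icon button ", evidence: "01-list.png", candidate_severity: 2 },
]

const mergedReport = mergeFindings(sampleFindings)
console.log(JSON.stringify(mergedReport, null, 2))
```

Exact-match keys will miss near-duplicates that are worded differently, which is one reason the merge is given to an agent and then checked by a human.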
Section 5
Give every agent the same evidence
Each evaluation agent works from two kinds of evidence: a folder of screenshots covering every step of its flow, and access to the live build through Playwright (directly via a capture script, or interactively via Playwright MCP) so it can check states the screenshots missed — errors, empty states, what happens after a wrong input. Findings must name their evidence: a screenshot filename, or a step the agent performed in the live build.
Capture the screenshots before the run with a small script so the evidence is identical across flows and across reruns. Name files by flow and step, and keep them on disk; agents read paths, not pasted images, which keeps the context cheap and the evidence reusable for the next audit.
import { chromium } from "playwright"
import { mkdir, readFile } from "node:fs/promises"

// flows.json maps each flow to the ordered routes (or actions) that make it up.
// Each step is expected to carry a route and a slug.
// Output: ./evidence/<flow>/<step>-<slug>.png at a fixed 1440px width.
// BASE_URL is an assumed override; the default is the staging build.
const baseUrl = process.env.BASE_URL ?? "https://staging.example.com"
const flows = JSON.parse(await readFile("./flows.json", "utf8"))

const browser = await chromium.launch()
const page = await browser.newPage({ viewport: { width: 1440, height: 1200 } })

for (const [flow, steps] of Object.entries(flows)) {
  await mkdir("./evidence/" + flow, { recursive: true })
  let i = 0
  for (const step of steps) {
    i += 1
    await page.goto(baseUrl + step.route, { waitUntil: "networkidle" })
    // Zero-padded step number keeps the files in flow order on disk.
    const name = String(i).padStart(2, "0") + "-" + step.slug + ".png"
    await page.screenshot({ path: "./evidence/" + flow + "/" + name, fullPage: true })
  }
}

await browser.close()
console.log("Captured evidence for " + Object.keys(flows).length + " flows")
Section 6
The workflow prompt
The prompt names the flows, the heuristics file, the evidence folder, and the rules that protect the output: every finding carries a heuristic ID, named evidence, a candidate severity, and a suggested fix; severity is explicitly labeled candidate; and nothing in the UI changes as a result of the run.
Run this as a workflow.
Input: ./flows.json lists 7 key flows (onboarding, login, send-money, pay-bill, statements, settings, support). ./heuristics.md is the heuristics file (Nielsen NN-01 to NN-10 plus project heuristics LUM-01 to LUM-05). ./evidence/<flow>/ holds ordered screenshots for each flow. The staging build is at https://staging.example.com and may be inspected with Playwright.
Stage 1 - Evaluation: For each flow, launch one agent that walks the flow step by step using the screenshots and, where states are missing (errors, empty states, validation), the staging build. It returns findings as { flow, step, heuristic_id, observation, evidence, candidate_severity (1-4), suggested_fix }. Every finding must cite a screenshot filename or a described action in the staging build. Do not report a finding without evidence. Do not assign final severity; mark all severities as candidate.
Stage 2 - Merge: A merge agent combines all findings, dedupes problems that recur across flows into one finding with multiple locations, groups by heuristic, and ranks by candidate severity then frequency. It also lists heuristics with zero findings, so the team can see what was checked and passed.
Output: findings.md (the merged, ranked report), findings.csv (one row per finding for the team's tracker), and coverage.md (which flows, steps, and heuristics were evaluated). Do not change any UI or code as part of this run.
Section 7
What the orchestration script roughly does
Claude writes the orchestration script itself when you run the prompt; the sketch below shows the shape with an agent() pseudo-API so you can see where the fan-out happens and where the evidence rules are enforced. It is illustrative, not the literal generated code.
const fs = require("node:fs")

const flows = JSON.parse(fs.readFileSync("./flows.json", "utf8"))
const heuristics = fs.readFileSync("./heuristics.md", "utf8")

// Stage 1: one evaluation agent per flow, run in parallel.
const perFlow = await Promise.all(
  Object.keys(flows).map((flow) =>
    agent(
      "You are running a heuristic evaluation of one flow.\n" +
        heuristics +
        "\nFlow: " + flow + ". Evidence folder: ./evidence/" + flow + "/\n" +
        "Walk every step. Return JSON findings: { flow, step, heuristic_id, " +
        "observation, evidence, candidate_severity, suggested_fix }. " +
        "Every finding must cite a screenshot filename or a staging-build action. " +
        "Severities are candidates only.",
      { model: "sonnet" }
    )
  )
)

// Stage 2: merge agent dedupes recurring problems and ranks the result.
const report = await agent(
  "Merge these heuristic findings. Dedupe problems that recur across flows into " +
    "one finding with multiple locations. Group by heuristic, rank by candidate " +
    "severity then frequency, and list heuristics with zero findings.\n\n" +
    JSON.stringify(perFlow.flat()),
  { model: "opus" }
)

fs.writeFileSync("./output/findings.md", report)
Section 8
Define the evaluator as a subagent
Keeping the evaluator's instructions in a subagent definition under .claude/agents/ means the evidence rules live in one versioned place instead of being re-typed into every prompt. The workflow prompt stays short, and the next audit uses the same evaluator the last one did.
---
name: heuristic-evaluator
description: Evaluates one product flow against the shared heuristics file and returns evidence-backed candidate findings.
tools: Read, Glob, Bash
model: sonnet
---

You evaluate exactly one flow against the heuristics file you are given.

Rules:
- Walk the flow step by step in the order the screenshots are numbered.
- Every finding references exactly one heuristic ID from the file.
- Every finding cites evidence: a screenshot filename, or a described action in the staging build.
- Report what you observe, not what you assume happens on steps you did not see.
- Severity is a candidate rating from 1 (cosmetic) to 4 (blocks the task); a human confirms it later.
- Suggested fixes are one or two sentences and stay within the existing design system.
- If a problem fits no heuristic, propose NEW-XX with a one-sentence definition rather than forcing a fit.
- Do not modify any code or UI.
Section 9
Step by step through one audit
List the flows with the team and write down what each one is for; a flow nobody can describe in a sentence is not ready to evaluate. Update the heuristics file, capture the evidence with the script, and run the workflow. For seven flows the run itself usually finishes inside half an hour. A full audit takes one to two hours, so budget the remainder for the human pass.
The human pass is severity confirmation. Walk the merged findings, confirm or adjust each candidate severity, mark duplicates the merge agent missed, and reject findings whose evidence does not hold up when you open the screenshot. Expect to reject some. An evaluation that confirms every candidate finding has not been reviewed; it has been rubber-stamped.
Close by deciding what the audit feeds: a fix backlog, a redesign brief, or a client deliverable. Archive the heuristics file, the evidence folder, and the confirmed findings together so the next audit compares against this one instead of starting from memory.
[Figure: review cycle: Merged candidate findings, Open the evidence, Confirm or adjust severity, Reject unsupported findings, Approve the report, Re-audit after fixes (feeds next cycle).]
Candidate findings cycle through human confirmation before anything reaches a backlog or a client.
Section 10
Case study: a 7-flow fintech app before a redesign kickoff
A fintech team about to start a major redesign ran the workflow across seven flows — onboarding, login, send money, pay a bill, statements, settings, and support — to establish a baseline of usability debt before any new design work began. The fan-out produced 96 candidate findings; the merge agent deduped them to 61; the human severity pass confirmed 54 and rejected 7 whose evidence did not hold up.
The confirmed set broke down as 3 severity-4 findings (task blockers), 14 severity-3, 26 severity-2, and 11 severity-1. Two of the three blockers were in send money: fees appeared only on the receipt screen, violating the project heuristic LUM-04, and a failed transfer showed a raw error code with no recovery path, violating NN-09. The third was an onboarding step that silently timed out and discarded entered data.
The most useful output for the redesign was not the blockers, which the team half-knew about, but the heuristic-level pattern: 19 of the 54 findings referenced consistency (NN-04), almost all traceable to three different button and confirmation patterns that had accumulated over four years. The redesign brief gained a consolidation workstream it would not otherwise have had, and the audit became the baseline the redesigned flows were later measured against.
Section 11
Case study: old checkout vs new checkout, same heuristics
An e-commerce team used the workflow to compare the live checkout against a redesigned checkout running on staging, evaluating both against the identical heuristics file in the same run — two flows, two agents, one merge. The point was to make the comparison symmetric: same heuristics, same evidence rules, same evaluator definition.
The old checkout carried 17 confirmed findings, the new one 9. Eight of the old findings were genuinely resolved, including the highest-severity one (a coupon error that emptied the cart, NN-09 and NN-05). But the comparison also caught two regressions the team had not noticed: the new design removed the order summary from the payment step (NN-06, recognition over recall) and the new express-pay button used a different label than the same action elsewhere in the product (NN-04).
Both regressions were fixed before launch. The before/after table — findings by heuristic for each version — went into the launch review as the evidence that the redesign had improved things and as the record of what it had traded away.
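A before/after table of findings by heuristic is a small counting exercise once both versions have confirmed findings in the same shape. The sketch below shows one way to build it. The function names countByHeuristic and compareVersions are hypothetical, and the sample findings are invented for illustration rather than taken from the team's audit.

```javascript
// Count confirmed findings per heuristic ID.
function countByHeuristic(findings) {
  const counts = {}
  for (const f of findings) {
    counts[f.heuristic_id] = (counts[f.heuristic_id] ?? 0) + 1
  }
  return counts
}

// Build one row per heuristic that appears in either version.
// A positive change means the new version has more findings for that heuristic.
function compareVersions(oldFindings, newFindings) {
  const before = countByHeuristic(oldFindings)
  const after = countByHeuristic(newFindings)
  const ids = [...new Set([...Object.keys(before), ...Object.keys(after)])].sort()
  return ids.map((id) => {
    const b = before[id] ?? 0
    const a = after[id] ?? 0
    return { heuristic_id: id, before: b, after: a, change: a - b }
  })
}

// Invented sample data: three findings on the old flow, two on the new one.
const oldCheckout = [
  { heuristic_id: "NN-09" },
  { heuristic_id: "NN-05" },
  { heuristic_id: "NN-04" },
]
const newCheckout = [
  { heuristic_id: "NN-04" },
  { heuristic_id: "NN-06" },
]

const comparison = compareVersions(oldCheckout, newCheckout)
console.table(comparison)
```

Counts alone can hide a regression when one problem is fixed and another appears under the same heuristic, so the table supports the finding-by-finding review rather than replacing it.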
Section 12
Case study: an agency pre-sales heuristic audit
A design agency used the workflow to produce a heuristic audit of a prospective client's booking product as part of a pitch, working only from the public product and a trial account. Three flows — search, booking, and account management — were evaluated in an afternoon, with a senior designer spending about 90 minutes confirming severities and rewriting the executive summary in the agency's voice.
The deliverable was deliberately scoped: 28 confirmed findings, the five most consequential illustrated with annotated screenshots, and an explicit page on method — which heuristics were used, what was and was not evaluated, and that severities reflect the agency's judgment without access to the client's analytics or research. That honesty page did more work in the pitch than the findings did; it showed the prospective client how the agency reasons.
The agency won the engagement, and the audit's heuristics file became the starting point for the client's own project heuristics during onboarding. The same workflow now runs as a saved /heuristic-audit command on every new business opportunity where a trial account is available.
Section 13
Good vs bad findings
A finding is useful when someone who was not in the run can locate the problem, see the evidence, and understand what to do about it. The fastest quality check is to require every finding to name its heuristic and its evidence; findings that cannot do both are either judgments to be labeled as such or guesses to be sent back.
Weak: The send money flow is confusing and should be simplified
Useful: NN-09, send-money step 4 (04-transfer-failed.png): a failed transfer shows error code TRX-409 with no explanation or retry path; candidate severity 4; fix: plain-language error naming the cause and a retry action

Weak: Fees should be more transparent
Useful: LUM-04, send-money step 3 (03-review.png): the FX fee appears only on the receipt, not on the review step before commit; candidate severity 3; fix: show fee and rate on the review step

Weak: The app is inconsistent
Useful: NN-04, recurs in 4 flows (locations listed): destructive actions use three different confirmation patterns; candidate severity 2; fix: adopt the dialog pattern from settings everywhere
Useful findings cite a heuristic, evidence, and a scoped fix; weak findings restate opinions.
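The evidence half of that quality check can be partly automated before the human pass. The sketch below is a hypothetical pre-filter, not something the workflow ships with. It assumes evidence is either a screenshot filename ending in .png or a free-text description of a staging-build action, and that the caller supplies the list of filenames actually present in the evidence folder. The function name checkEvidence is an invention for this example.

```javascript
// Hypothetical pre-filter: list the reasons a finding should be sent back.
// An empty array means the finding is ready for human severity confirmation.
function checkEvidence(finding, screenshotFiles) {
  const problems = []
  const evidence = String(finding.evidence ?? "").trim()
  if (evidence === "") {
    problems.push("no evidence cited")
  } else if (evidence.endsWith(".png") && !screenshotFiles.includes(evidence)) {
    // A cited screenshot must exist, or nobody can reopen it.
    problems.push("screenshot not found: " + evidence)
  }
  const severity = finding.candidate_severity
  if (!Number.isInteger(severity) || severity < 1 || severity > 4) {
    problems.push("candidate_severity must be an integer from 1 to 4")
  }
  return problems
}

// Filenames assumed to be present in ./evidence/send-money/.
const evidenceFiles = ["03-review.png", "04-transfer-failed.png"]

const usefulFinding = {
  heuristic_id: "NN-09",
  evidence: "04-transfer-failed.png",
  candidate_severity: 4,
}
const weakFinding = {
  observation: "The send money flow is confusing and should be simplified",
  candidate_severity: "high",
}

console.log(checkEvidence(usefulFinding, evidenceFiles))
console.log(checkEvidence(weakFinding, evidenceFiles))
```

A check like this only proves that evidence was named and exists. Whether the screenshot actually shows the problem remains a human call.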
Section 14
Limits: what a heuristic evaluation cannot prove
Heuristic evaluation predicts problems from principles. It cannot tell you how often a problem occurs, how much it costs, or whether real users stumble where the heuristics say they should. Findings are hypotheses with evidence attached; usability testing and analytics are how you confirm the ones that matter.
Agents add their own failure modes: they over-report near-duplicates, they sometimes stretch a heuristic to cover an aesthetic preference, and they will assign severity with unearned confidence if the prompt lets them. That is why severity is structurally a human decision in this workflow, and why the rejection rate during confirmation is information, not friction.
Humans also own the framing. Which flows count as key, whether a project heuristic reflects a real commitment or an aspiration, and what gets shown to a client or an executive are judgment calls the evaluation cannot make.
- Cannot measure frequency or impact; it predicts problems, it does not observe them.
- Cannot replace usability testing for novel patterns or unfamiliar audiences.
- Cannot assign final severity; agents propose, humans confirm.
- Cannot judge brand, tone, or strategy questions that fall outside the heuristics file.
Section 15
The reusable evaluation workflow
Save the prompt to .claude/workflows/ and the evaluator and merge agent definitions to .claude/agents/, and the audit becomes a command the team reruns each quarter, before each redesign, and on each new business opportunity — against the same heuristics file, so the results stay comparable over time.
1. List the key flows and write one sentence about what each is for.
2. Update the heuristics file: Nielsen's 10 plus the project heuristics the team maintains.
3. Capture evidence: ordered screenshots per flow at a fixed width, plus staging access.
4. Run the workflow: one evaluation agent per flow, all using the same heuristics file and evidence rules.
5. Merge and dedupe: recurring problems become one finding with multiple locations.
6. Confirm severity by hand; reject findings whose evidence does not hold up.
7. Decide what the audit feeds: fix backlog, redesign brief, or client deliverable.
8. Archive the heuristics file, evidence, and confirmed findings as the baseline for the next audit.
Sources
Sources & further reading
- Claude Code dynamic workflows documentation
- Subagents vs workflows: who holds the plan (CloudYeti)
- The Agentic Designer (book)
- Nielsen Norman Group: 10 usability heuristics for user interface design
- Nielsen Norman Group: how to conduct a heuristic evaluation
- Nielsen Norman Group: severity ratings for usability problems

