Agentic Design School
UX workflow
Intermediate

Usability Test Prep and Synthesis

A two-stage workflow that prepares a usability study — research questions, test plan, tasks, screener, discussion guide, prototype check — and then synthesizes the sessions afterwards, with one coding agent per session, a synthesis merge, and a challenge pass that hunts for over-claims.

Orchestration: Two-stage workflow: prep loop and synthesis fan-out

Typical run: 2–3 hours across a study

Last reviewed: 2026-06-02


Section 1

Why studies fail before and after the sessions

Usability studies rarely fail in the room. They fail beforehand, when the tasks do not actually exercise the design decisions the team is unsure about, and afterwards, when the sessions sit unsynthesized for two weeks and the findings shrink to whatever the observers happen to remember. The moderation in between is usually the strongest part.

Both failure points are preparation and digestion problems, and both are made of patient, structured work: turning a vague goal into research questions, turning research questions into tasks a participant can attempt without being led, screening the right people in, and then coding hours of notes and transcripts against the questions the study was supposed to answer. Agents are good at exactly this kind of work when the rules are explicit.

This workflow runs in two stages around the sessions. Stage one prepares the study and is reviewed by the researcher before a single participant is recruited. Stage two synthesizes the sessions: one agent codes each session against the research questions, a synthesis agent merges, and a challenge agent hunts for over-claims and disconfirming evidence. The sessions themselves stay human work — moderation, observation, and judgment do not delegate.

Section 2

When to reach for this workflow

Use it for any study with a defined design question and more than a couple of sessions: moderated studies of five to eight participants, unmoderated tests with dozens of recordings, and benchmark studies that repeat the same tasks across products or releases. The bigger the gap between session count and synthesis time, the more this workflow pays back.

Skip the synthesis stage if you ran two sessions and watched both; you do not need a pipeline to remember two conversations. And do not use the prep stage to avoid talking to your team — the research questions still come out of a conversation about what the team would do differently depending on the answer.

  • Moderated studies of 5–8 participants on a new design or navigation model.
  • Unmoderated prototype tests with 15–30 participants and auto-generated transcripts.
  • Benchmark studies repeating the same tasks across products, releases, or competitors.
  • Studies where the prototype's readiness is uncertain and needs checking against the tasks before sessions.

Section 3

Anonymize before any agent sees anything

Nothing from a session reaches an agent until it is anonymized: participant names become IDs, employers and locations are stripped unless they are the subject of the study, and the mapping between IDs and identities lives in a file outside the agent's working folder. The same applies to recordings — agents work from transcripts and notes, not video.
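The name-to-ID step can be sketched as a small script that runs before any file lands in the agent's working folder. This is a minimal sketch, not a complete anonymizer: the function names and the sample mapping are hypothetical, and it only replaces names you list, so employers and locations still need a human pass.

```javascript
// Hypothetical helper: replaces known participant names with study IDs.
// The name-to-ID mapping should be loaded from a file kept outside the
// agent's working folder; it is inlined here only for illustration.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function anonymize(text, nameToId) {
  // Replace longer names first so a full name wins over a first name.
  const names = Object.keys(nameToId).sort((a, b) => b.length - a.length);
  let out = text;
  for (const name of names) {
    const pattern = new RegExp("\\b" + escapeRegExp(name) + "\\b", "gi");
    out = out.replace(pattern, () => nameToId[name]);
  }
  return out;
}

// Invented example data.
const mapping = { "Dana Whitfield": "P01", "Dana": "P01" };
const sample = "Dana Whitfield opened the sidebar. Dana then searched.";
console.log(anonymize(sample, mapping));
```

A scripted pass like this is a first filter; reading the result before it reaches an agent is still the researcher's job.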

Confirm that your consent language covers automated analysis of transcripts and that your tooling agreement keeps the data out of model training. These are decisions the researcher owns before the study runs, not afterthoughts during synthesis. If consent does not cover automated analysis, the synthesis stage of this workflow is not available to you; the prep stage, which works from the design decision and the prototype rather than participant data, still is.

Section 4

The orchestration pattern: a prep loop, then a synthesis fan-out

Both stages run as Claude Code dynamic workflows: JavaScript scripts that Claude writes and runs in the background, orchestrating subagents while intermediate results — draft tasks, per-session codings, the merged theme set — stay in script variables instead of Claude's own context. For synthesis that separation is what keeps session eight coded as carefully as session one.

Stage one is a small loop rather than a fan-out: a planning agent drafts the research questions, test plan, task scenarios, screener, and discussion guide; a critique agent checks the drafts against each other (does every task trace to a research question, does any task lead the participant, does the screener actually exclude the wrong people); the planning agent revises; and a prototype-check agent walks the prototype against the final tasks to confirm every step a task requires actually exists. Stage two fans out one coding agent per session, then runs a synthesis agent and a challenge agent in sequence.

Workflows can run up to 16 agents concurrently and up to 1,000 per run, and runs are resumable, which matters when an unmoderated study produces thirty transcripts. You trigger a workflow by including the word "workflow" in the prompt or via /effort ultracode. Stable prompts can be saved to .claude/workflows/ in the project (or ~/.claude/workflows/ for personal use) as commands like /test-prep and /test-synthesis. The session-coder and challenge agents are defined as markdown files in .claude/agents/ so their evidence rules are versioned alongside the codebook.
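To make the idea of a concurrency cap concrete, here is a generic sketch of running thirty jobs with at most sixteen in flight. It is not how the workflow runtime is implemented; mapWithLimit and the stand-in worker are hypothetical names, and the runtime is assumed to enforce its own limit without any help from your script.

```javascript
// Runs worker(item) for every item with at most `limit` calls in flight.
// Results come back in input order regardless of completion order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}

// Stand-in for a coding agent: waits briefly, then returns a label.
const transcripts = Array.from({ length: 30 }, (_, i) => "P" + String(i + 1).padStart(2, "0"));
mapWithLimit(transcripts, 16, async (id) => {
  await new Promise((resolve) => setTimeout(resolve, 5));
  return id + " coded";
}).then((out) => console.log(out.length + " sessions coded"));
```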

Diagram: Study pipeline stages
1. Define research questions
2. Prep loop: plan, critique, revise
3. Prototype readiness check
4. Run sessions (human)
5. Code per session
6. Merge and challenge
7. Team review

The prep loop runs before recruitment; the synthesis fan-out runs after the last session; humans gate both.

Section 5

Stage one: the prep prompt

The prep prompt starts from the design decision the team is unsure about, because that is what research questions are for. The output is a draft package the researcher edits — the agent's job is to make the first version complete and internally consistent, not to own the study design.

Study prep workflow prompt (stage one)
Run this as a workflow.

Context: We are testing a new sidebar navigation model for our project management product. The design decision in question: whether users can find archived projects, cross-project search, and notification settings under the new structure. Prototype: Figma file linked in ./study/prototype.md. Sessions: 6 moderated remote sessions, 45 minutes each.

Stage 1 - Draft: A planning agent drafts: research-questions.md (3-5 questions, each tied to a decision the team will make differently depending on the answer), test-plan.md (method, participants, schedule, roles, recording and consent notes), tasks.md (one task per research question, written as a goal in the participant's words with no interface vocabulary, plus success criteria), screener.md (questions with accept/reject logic), and discussion-guide.md (intro script, warm-up, task order, follow-up probes, wrap-up).

Stage 2 - Critique: A critique agent checks the package: every task maps to a research question, no task names a UI element or leads the participant, the screener excludes people who could not plausibly use the product, the guide's probes are open questions, and the timing fits 45 minutes. The planning agent revises until the critique passes.

Stage 3 - Prototype check: A prototype-readiness agent walks each task against the prototype description and flags any screen, state, or interaction a task requires that the prototype does not support, so we fix the prototype or the task before recruitment.

Output the five study files plus prototype-gaps.md. Do not invent participant quotas or accessibility requirements; flag them as decisions for the researcher.
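The draft, critique, and revise stages of the prompt above amount to a bounded loop. The sketch below shows that shape under stated assumptions: agent is injected so the loop can be exercised with a stub, the "PASS" reply is an invented convention for a critique with no issues, and the round limit is a guess at a sensible default. The script Claude actually writes may look quite different.

```javascript
// Sketch only: `agent(role, prompt)` is a hypothetical async function that
// returns text. In a real run it would be the runtime's subagent call.
async function prepLoop(agent, context, maxRounds = 3) {
  let draft = await agent("plan", "Draft the study package.\n" + context);
  for (let round = 1; round <= maxRounds; round++) {
    const critique = await agent("critique", "Check this package:\n" + draft);
    if (critique.trim() === "PASS") {
      return { draft, rounds: round, passed: true };
    }
    draft = await agent(
      "plan",
      "Revise the package.\nIssues:\n" + critique + "\nPackage:\n" + draft
    );
  }
  // Stop after maxRounds so a critique that never passes cannot loop forever.
  return { draft, rounds: maxRounds, passed: false };
}

// Stub agent for illustration: the first critique fails, the second passes.
function makeStubAgent() {
  let critiques = 0;
  return async (role, prompt) => {
    if (role === "critique") {
      critiques += 1;
      return critiques === 1 ? "T2 names a UI element" : "PASS";
    }
    return prompt.startsWith("Revise") ? "draft v2" : "draft v1";
  };
}

prepLoop(makeStubAgent(), "Sidebar navigation study").then((result) => console.log(result));
```

A result with passed set to false is a signal for the researcher to read the last critique, not a reason to raise the round limit.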

Section 6

Review the prep package like you wrote it

The researcher reads every file and edits with intent. The most common fixes are the same ones you would make to a junior researcher's draft: research questions that are really feature requests, tasks that quietly tell the participant where to click, and screener questions a motivated respondent can game. The critique agent catches many of these; it does not catch the ones that depend on knowing your users.
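Some of the critique's checks are mechanical enough to run as a lint before you read the package. A rough sketch, assuming tasks have already been parsed into objects with id, rq, and text fields; the field names and the interface word list are hypothetical, and the substring match is deliberately crude.

```javascript
// Flags tasks that map to no known research question, research questions
// that no task exercises, and task wording that contains interface terms.
function lintTasks(tasks, rqIds, uiWords) {
  const issues = [];
  const covered = new Set();
  for (const task of tasks) {
    if (!rqIds.includes(task.rq)) {
      issues.push(task.id + ": maps to unknown research question " + task.rq);
    } else {
      covered.add(task.rq);
    }
    const lower = task.text.toLowerCase();
    for (const word of uiWords) {
      if (lower.includes(word.toLowerCase())) {
        issues.push(task.id + ': mentions interface term "' + word + '"');
      }
    }
  }
  for (const rq of rqIds) {
    if (!covered.has(rq)) issues.push(rq + ": no task exercises this question");
  }
  return issues;
}

// Example with an assumed word list for a sidebar navigation study.
const uiWords = ["sidebar", "avatar", "menu", "click"];
const draftTasks = [
  { id: "T1", rq: "RQ1", text: "A project you finished last quarter has a file you need. Find it." },
  { id: "T3", rq: "RQ3", text: "Open the avatar menu and turn off email notifications." },
];
console.log(lintTasks(draftTasks, ["RQ1", "RQ2", "RQ3"], uiWords));
```

The lint cannot tell whether a task leads the participant in subtler ways; that judgment stays with the critique agent and the researcher.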

The prototype-gaps file is the highest-leverage output of stage one. A task that dead-ends in an unbuilt screen costs you a participant slot and forty-five minutes of everyone's time; finding it three days before the sessions costs nothing.

discussion-guide.md (excerpt of the drafted structure)
# Discussion guide: sidebar navigation study, June 2026

## Intro (5 min)
- Thanks, recording consent confirmed on the form; remind them they can stop any time.
- "We're testing the design, not you. There are no wrong answers."
- "Please think aloud as you go - tell me what you expect before you click."

## Warm-up (5 min)
- How they organize project work today; how many projects they touch in a week.

## Tasks (30 min, order rotated across participants)
- T1 (RQ1): "A project you finished last quarter has a file you need. Find it."
  Probe if stuck >2 min: "Where would you expect finished projects to live?"
- T2 (RQ2): "You remember a comment about budget approval but not which project. Find it."
- T3 (RQ3): "You're getting too many emails from this tool. Change that."

## Wrap-up (5 min)
- "If you could change one thing about what you used today, what would it be?"
- Anything we should have asked about?
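Two properties of a guide like this one can be checked mechanically: the section budgets add up to the session length, and the task order rotates across participants. The sketch below assumes section headings follow the "## Name (N min)" pattern used in the excerpt; both function names are hypothetical.

```javascript
// Sums the "(N min)" budgets found in the guide's "## " headings.
function totalMinutes(guide) {
  let total = 0;
  for (const line of guide.split("\n")) {
    const match = line.match(/^##\s.*\((\d+)\s*min\b/);
    if (match) total += Number(match[1]);
  }
  return total;
}

// Rotates the task list so consecutive participants start on different tasks.
function rotatedOrder(taskIds, participantIndex) {
  const shift = taskIds.length ? participantIndex % taskIds.length : 0;
  return taskIds.slice(shift).concat(taskIds.slice(0, shift));
}

const headings = [
  "## Intro (5 min)",
  "## Warm-up (5 min)",
  "## Tasks (30 min, order rotated across participants)",
  "## Wrap-up (5 min)",
].join("\n");

console.log(totalMinutes(headings)); // 45
for (let i = 0; i < 6; i++) {
  console.log("P0" + (i + 1), rotatedOrder(["T1", "T2", "T3"], i).join(" > "));
}
```

With six participants and three tasks, this simple rotation puts each task first twice. It balances starting position only; it does not counterbalance every possible ordering.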

Section 7

Stage two: the synthesis prompt

After the last session, point the synthesis workflow at the anonymized notes and transcripts and the research questions. Two rules protect the output and are stated explicitly: quotes must be verbatim and attributed to a participant ID, and the agents must never invent or paraphrase a finding into existence. A finding that cannot point at a session does not exist.

Synthesis workflow prompt (stage two)
Run this as a workflow.

Input: ./sessions contains anonymized notes and transcripts for P01-P06. ./study/research-questions.md lists RQ1-RQ4. ./study/tasks.md defines the tasks and success criteria.

Stage 1 - Coding: For each session, launch one agent that reads only that session's files. It returns: per-task outcome (success, partial, fail, not attempted) with the moment that decided it, observations coded against RQ1-RQ4, and verbatim quotes with participant ID and timestamp or line reference. Quotes must appear word for word in the source. If a quote cannot be found verbatim, do not report it.

Stage 2 - Synthesis: Merge the coded sessions into findings per research question. Each finding states: a one-sentence claim, which participants support it (by ID), task outcomes that relate to it, 2-3 verbatim quotes, and an explicit count (e.g. 4 of 6). Also produce a task-outcome table across all participants.

Stage 3 - Challenge: A separate agent reviews the findings against all coded sessions and flags: claims supported by fewer than 3 of 6 participants, quotes that do not appear verbatim in the sources, language that generalizes beyond this sample ("users want"), and disconfirming evidence the synthesis did not mention.

Output: findings.md, task-outcomes.md, evidence-table.md (every quote with participant and location), and challenges.md. Do not propose design changes; list open questions instead.
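Two of the challenge flags in the prompt above, thin support and generalizing language, are simple enough to pre-check in code before the challenge agent runs. A minimal sketch, assuming findings arrive as objects with claim and participants fields; the shape, the function name, and the word pattern are all assumptions, and disconfirming evidence still needs the agent and a human reader.

```javascript
// Flags findings with too few supporting participants or with wording
// that generalizes beyond the sample ("users want", "people need").
function flagFindings(findings, sampleSize, minSupport = 3) {
  const generalizing = /\b(users|people|everyone|customers)\s+(want|need|prefer|expect|like)\b/i;
  const flags = [];
  for (const finding of findings) {
    const support = new Set(finding.participants).size;
    if (support < minSupport) {
      flags.push({ claim: finding.claim, issue: "thin evidence: " + support + " of " + sampleSize });
    }
    if (generalizing.test(finding.claim)) {
      flags.push({ claim: finding.claim, issue: "generalizes beyond the sample" });
    }
  }
  return flags;
}

const exampleFindings = [
  { claim: "Users want the promo code field moved to payment", participants: ["P03", "P11"] },
  { claim: "4 of 6 participants failed the notification-settings task", participants: ["P01", "P03", "P04", "P06"] },
];
console.log(flagFindings(exampleFindings, 6));
```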

Section 8

What the orchestration script roughly does

Claude writes the orchestration script when you run the prompt. The sketch below uses an agent() pseudo-API to show where the per-session fan-out happens and where the challenge pass sits. It is illustrative, not the literal generated code.

Dynamic workflow sketch (pseudo-code, synthesis stage)
const fs = require("node:fs")

// agent() is a pseudo-API standing in for the runtime's subagent call;
// this sketch assumes it resolves to the agent's text output.
async function main() {
  const questions = fs.readFileSync("./study/research-questions.md", "utf8")
  const tasks = fs.readFileSync("./study/tasks.md", "utf8")

  // Group notes and transcripts by participant so each session gets one agent.
  // Assumes file names start with the participant ID, e.g. P01-notes.md.
  const sessions = {}
  for (const file of fs.readdirSync("./sessions").filter((f) => f.endsWith(".md"))) {
    const id = (file.match(/^P\d+/) || [file])[0]
    if (!sessions[id]) sessions[id] = []
    sessions[id].push("./sessions/" + file)
  }
  const sessionCount = Object.keys(sessions).length

  // Stage 1: one coding agent per session.
  const coded = await Promise.all(
    Object.values(sessions).map((files) =>
      agent(
        "Code one usability session against the research questions and tasks.\n" +
          questions + "\n" + tasks + "\nSession files: " + files.join(", ") + "\n" +
          "Return JSON: { participant, task_outcomes, observations: [{ rq, note, " +
          "verbatim_quote, location }] }. Quotes must be verbatim or omitted.",
        { model: "sonnet" }
      )
    )
  )

  // Stage 2: synthesis merges per-RQ findings with coverage counts.
  const findings = await agent(
    "Merge these coded sessions into findings per research question, with " +
      "participant IDs, verbatim quotes, and counts out of " + sessionCount + ".\n" +
      JSON.stringify(coded),
    { model: "opus" }
  )

  // Stage 3: challenge pass for over-claims and ignored disconfirming evidence.
  const challenges = await agent(
    "Act as a skeptical second researcher. Flag thin evidence, non-verbatim " +
      "quotes, over-generalization, and disconfirming evidence the findings " +
      "ignore.\n\nFindings:\n" + findings + "\n\nCoded sessions:\n" + JSON.stringify(coded),
    { model: "opus" }
  )

  // Create the output folder if it does not exist yet, then write the results.
  fs.mkdirSync("./output", { recursive: true })
  fs.writeFileSync("./output/findings.md", findings)
  fs.writeFileSync("./output/challenges.md", challenges)
}

main().catch((err) => {
  console.error(err)
  process.exitCode = 1
})
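The task-outcome table does not need an agent at all: once each coder's JSON is parsed, the counts are a plain tally. A sketch, assuming task_outcomes is an object keyed by task ID (the sketch above does not pin that shape down) and that coder output has already been passed through JSON.parse.

```javascript
const OUTCOMES = ["success", "partial", "fail", "not attempted"];

// Counts outcomes per task across all coded sessions.
function tallyOutcomes(coded) {
  const table = {};
  for (const session of coded) {
    for (const [task, outcome] of Object.entries(session.task_outcomes)) {
      if (!OUTCOMES.includes(outcome)) {
        throw new Error(session.participant + " " + task + ": unknown outcome " + outcome);
      }
      if (!table[task]) {
        table[task] = { success: 0, partial: 0, fail: 0, "not attempted": 0 };
      }
      table[task][outcome] += 1;
    }
  }
  return table;
}

// Invented example data for two sessions.
const exampleCoded = [
  { participant: "P01", task_outcomes: { T1: "success", T3: "fail" } },
  { participant: "P02", task_outcomes: { T1: "success", T3: "partial" } },
];
console.log(tallyOutcomes(exampleCoded));
```

Computing counts in code rather than asking the synthesis agent to count keeps "4 of 6" from depending on a model's arithmetic.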

Section 9

Define the session coder as a subagent

The session coder's rules go in a subagent definition under .claude/agents/ so every study uses the same evidence standard without re-typing it. The challenge agent gets its own definition for the same reason.

.claude/agents/session-coder.md
---
name: session-coder
description: Codes one usability session's notes and transcript against the study's research questions and task success criteria.
tools: Read, Glob
model: sonnet
---

You code exactly one session. You never read other sessions.

Rules:
- Score each task outcome (success, partial, fail, not attempted) and name the moment in the session that decided it.
- Code observations against the research question IDs you were given; do not invent new questions.
- Quotes are copied verbatim with participant ID and a timestamp or line reference. If you cannot find the exact words, report the observation without a quote.
- Distinguish what the participant did from what the participant said they would do.
- Note moderator prompts that may have helped the participant, so the synthesis can weigh the outcome.
- Never infer emotion or intent that the participant did not express.
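Rules in a subagent definition are instructions, not guarantees, so it can help to validate what each coder returns before synthesis sees it. The sketch below checks a parsed coding against a few of the rules above; the field names follow the JSON shape in the orchestration sketch, and the P-number ID pattern is an assumption based on the P01-P06 naming.

```javascript
const ALLOWED_OUTCOMES = new Set(["success", "partial", "fail", "not attempted"]);

// Returns a list of rule violations for one coded session (empty if clean).
function validateCoding(coding, rqIds) {
  const problems = [];
  if (!/^P\d+$/.test(coding.participant || "")) {
    problems.push("participant must be an ID such as P01");
  }
  for (const [task, outcome] of Object.entries(coding.task_outcomes || {})) {
    if (!ALLOWED_OUTCOMES.has(outcome)) {
      problems.push(task + ": unknown outcome " + JSON.stringify(outcome));
    }
  }
  (coding.observations || []).forEach((obs, i) => {
    if (!rqIds.includes(obs.rq)) {
      problems.push("observation " + i + ": unknown research question " + obs.rq);
    }
    if (obs.verbatim_quote && !obs.location) {
      problems.push("observation " + i + ": quote has no timestamp or line reference");
    }
  });
  return problems;
}

const exampleCoding = {
  participant: "P04",
  task_outcomes: { T3: "fail" },
  observations: [
    { rq: "RQ3", note: "Looked in avatar menu first", verbatim_quote: "I'd look under my little face icon, honestly", location: "18:32" },
  ],
};
console.log(validateCoding(exampleCoding, ["RQ1", "RQ2", "RQ3", "RQ4"]));
```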

Section 10

Step by step through one study

Run the prep workflow a week before recruitment, edit the package, and fix whatever the prototype check flagged. Run the sessions as you normally would; the only addition is consistent note files per participant, named by ID, dropped into the sessions folder along with the anonymized transcripts.

Run the synthesis workflow the same day the last session ends, while the sessions are fresh enough for you to catch a coding that misreads what happened. Read the challenge file before the findings file, then spot-check: pick five quotes at random from the evidence table and confirm each exists verbatim in the named session. One fabricated quote means the coding stage reruns with a stricter prompt and the whole evidence table is rechecked.
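The spot-check itself can be scripted so the sample is actually random rather than the five quotes you remember best. A minimal sketch, assuming the evidence table has been parsed into rows with participant and quote fields and that each participant's source text is available as a string; it normalizes whitespace only, so any other difference counts as a mismatch.

```javascript
function normalize(s) {
  return s.replace(/\s+/g, " ").trim();
}

// Picks up to n rows at random using a partial Fisher-Yates shuffle.
function sampleRows(rows, n, random = Math.random) {
  const pool = rows.slice();
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// Marks each sampled quote as verbatim only if it appears in that
// participant's source text word for word.
function spotCheck(rows, sources, n = 5, random = Math.random) {
  return sampleRows(rows, n, random).map((row) => {
    const source = normalize(sources[row.participant] || "");
    const quote = normalize(row.quote);
    return { ...row, verbatim: quote !== "" && source.includes(quote) };
  });
}

// Illustrative data: one real quote and one that is not in the source.
const sources = { P04: "18:32 I'd look under my little face icon, honestly. 18:40 Hmm, not here." };
const rows = [
  { participant: "P04", quote: "I'd look under my little face icon, honestly" },
  { participant: "P04", quote: "The settings were impossible to find" },
];
console.log(spotCheck(rows, sources));
```

Any row that comes back with verbatim set to false triggers the rerun described above.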

Across the study the agent-facing work totals two to three hours: roughly an hour for prep and editing, and one to two hours for synthesis, review, and the team readout. The sessions themselves are on top of that, as they always were.

Diagram: Synthesis review loop
1. Code sessions
2. Merge findings
3. Challenge pass
4. Spot-check quotes
5. Team readout
6. Decide and document (feeds next cycle)

The challenge file and the quote spot-check sit between the coded sessions and anything the team acts on.

Section 11

Case study: 6 moderated sessions on a new navigation model

A B2B team tested a new sidebar navigation with six participants over two days. The prep loop drafted four research questions and three tasks; the prototype check flagged that the cross-project search task required a results state the Figma prototype did not have, which the designer added two days before the first session instead of discovering live in session one.

Synthesis the evening after the last session produced a clear split: the archived-projects task succeeded for 5 of 6 participants, while the notification-settings task failed for 4 of 6, with three participants looking in their personal avatar menu first — quotes and timestamps attached for each. The challenge pass demoted one finding the synthesis had stated too strongly: a claim that participants preferred the new sidebar overall rested on two enthusiastic comments, both prompted by the moderator's wrap-up question.

The team shipped the new navigation with notification settings moved under the avatar menu, matching where participants looked, and logged the preference question as something a follow-up survey could answer properly. Total researcher time on prep and synthesis was about two and a half hours.

Section 12

Case study: an unmoderated checkout test with 20 participants

A retail team ran an unmoderated test of a redesigned checkout prototype with 20 participants through a panel tool, which produced 20 short recordings with auto-generated transcripts and task timings. Nobody on the team was going to watch seven hours of recordings carefully; that is exactly the situation the synthesis fan-out exists for.

One coding agent per transcript scored the three tasks and coded against the two research questions; the run took under an hour. The merged findings showed the promo-code step was where completions died: 7 of 20 participants stalled there, and 5 of those 7 said some version of expecting the code field to be on the payment step rather than the cart. The task-outcome table also surfaced that the 4 participants on older Android devices accounted for most of the slowest completion times, which the transcripts alone would not have shown.

The challenge pass flagged the right caveat: panel participants are not the team's actual customers, and 20 unmoderated sessions cannot establish a drop-off rate. The team treated the promo-code finding as a strong hypothesis, confirmed it against funnel analytics within the week, and moved the field. The fix was live before the original synthesis would have been scheduled.

Section 13

Case study: an agency benchmark across three client products

An agency running a usability benchmark for a client portfolio tested the same five tasks across three of the client's products, four participants per product, twelve sessions in all. The prep loop's main value was symmetry: the same research questions, tasks, and success criteria applied to all three products, drafted once and adjusted only where a product genuinely lacked a feature.

Synthesis ran per product and then a final merge compared them. Product B's account-creation task failed for 3 of 4 participants against 0 of 4 and 1 of 4 for the others, traced in every failing session to an email-verification loop that dead-ended on mobile. Because the evidence table carried participant IDs and timestamps for all twelve sessions, the agency's report could put the three products side by side without anyone re-watching recordings to check a disputed claim.

The challenge pass earned its place in the client meeting: when a stakeholder questioned whether the benchmark was unfair to product B, the agency could show that the tasks, criteria, and coding rules were identical across products, and that the per-product sample of four was explicitly labeled too small for percentage claims. The benchmark now reruns each release cycle with the saved /test-synthesis command.

Section 14

Good vs bad synthesis output

The single test is traceability: a finding passes only if it can answer which participants, which sessions, which words. Agents under pressure to be helpful will otherwise produce findings that sound like the design rationale and quotes nobody said.

Table: Findings quality comparison

Bad: Users found the new navigation intuitive overall
Good: 5 of 6 participants (P01-P03, P05, P06) completed the archived-projects task without prompting; outcomes and timestamps in task-outcomes.md

Bad: One participant said the settings were impossible to find
Good: P04, 18:32: "I'd look under my little face icon, honestly" - while failing the notification-settings task

Bad: Users want the promo code field moved to payment
Good: 7 of 20 participants stalled at the promo-code step; 5 said they expected it on the payment step; flagged as a hypothesis to confirm with funnel analytics

Bad: The redesign tested well and is ready to ship
Good: Notification-settings task failed for 4 of 6 participants; open question logged for the team's decision, with disconfirming evidence noted from P02

Traceable findings can be checked against sessions; vague or invented findings cannot.

Section 15

Limits: what this workflow cannot prove

Six sessions describe what happened to six people. Counts like 4 of 6 are honest descriptions of the sample, not estimates of prevalence, and nothing in this workflow turns them into statistics for a roadmap deck. Unmoderated panels add their own bias: participants who test products for incentives behave differently from your customers.
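To see why 4 of 6 is a description and not an estimate, it helps to compute how wide the uncertainty would be if someone did treat it as one. The sketch below uses the Wilson score interval, a standard choice for small samples; picking it is this example's assumption, not part of the workflow, and it further assumes a random sample, which a usability study rarely is.

```javascript
// Wilson score interval for a proportion, 95% by default (z = 1.96).
function wilsonInterval(successes, n, z = 1.96) {
  if (n <= 0) throw new Error("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { low: center - margin, high: center + margin };
}

const { low, high } = wilsonInterval(4, 6);
// For 4 of 6 this comes out at roughly 0.30 to 0.90.
console.log(low.toFixed(2) + " to " + high.toFixed(2));
```

An interval running from about 30% to about 90% says almost nothing about prevalence, which is the point: the count belongs in the findings, and the percentage stays out of the roadmap deck.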

Agents inherit whatever the moderator did in the room. A leading probe produces a confident quote that the coding agent will faithfully attribute; the workflow can flag moderator prompts, but only a human who was there can judge how much they shaped the outcome. The same goes for ethics: consent, anonymization, and how findings are represented to stakeholders are decisions the researcher signs.

And the workflow cannot decide what to build. It can show that a task failed and where participants looked instead; whether that justifies changing the design, delaying the release, or running another study is the team's call, made with the evidence in front of them.

  • Cannot turn small samples into statistical claims about all users.
  • Cannot correct for sample bias, panel effects, or moderator influence on its own.
  • Cannot make consent and privacy decisions; those are settled before any agent runs.
  • Cannot decide design or roadmap changes; it produces evidence and open questions.

Section 16

The reusable study workflow

Save the two prompts to .claude/workflows/ as /test-prep and /test-synthesis, and the coder and challenge agents to .claude/agents/. The next study starts from the same templates and the same evidence standard, which is what makes findings comparable across studies and quarters.

Usability test prep and synthesis workflow
1. Agree the design decision at stake and run the prep workflow: research questions, test plan, tasks, screener, discussion guide.
2. Edit the package by hand and fix everything the prototype-readiness check flagged.
3. Recruit with the screener and run the sessions yourself; keep one note file per participant ID.
4. Anonymize notes and transcripts; confirm consent covers automated analysis.
5. Run the synthesis workflow: one coding agent per session against the research questions.
6. Read the challenge file first, then spot-check five random quotes against the sources.
7. Hold the team readout from findings.md and task-outcomes.md; record decisions and open questions.
8. Archive the study package, evidence table, and decisions for the next study or benchmark cycle.
