Section 1
Why studies fail before and after the sessions
Usability studies rarely fail in the room. They fail beforehand, when the tasks do not actually exercise the design decisions the team is unsure about, and afterwards, when the sessions sit unsynthesized for two weeks and the findings shrink to whatever the observers happen to remember. The moderation in between is usually the strongest part.
Both failure points are preparation and digestion problems, and both are made of patient, structured work: turning a vague goal into research questions, turning research questions into tasks a participant can attempt without being led, screening the right people in, and then coding hours of notes and transcripts against the questions the study was supposed to answer. Agents are good at exactly this kind of work when the rules are explicit.
This workflow runs in two stages around the sessions. Stage one prepares the study and is reviewed by the researcher before a single participant is recruited. Stage two synthesizes the sessions: one agent codes each session against the research questions, a synthesis agent merges, and a challenge agent hunts for over-claims and disconfirming evidence. The sessions themselves stay human work — moderation, observation, and judgment do not delegate.
Section 2
When to reach for this workflow
Use it for any study with a defined design question and more than a couple of sessions: moderated studies of five to eight participants, unmoderated tests with dozens of recordings, and benchmark studies that repeat the same tasks across products or releases. The bigger the gap between session count and synthesis time, the more this workflow pays back.
Skip the synthesis stage if you ran two sessions and watched both; you do not need a pipeline to remember two conversations. And do not use the prep stage to avoid talking to your team — the research questions still come out of a conversation about what the team would do differently depending on the answer.
- Moderated studies of 5–8 participants on a new design or navigation model.
- Unmoderated prototype tests with 15–30 participants and auto-generated transcripts.
- Benchmark studies repeating the same tasks across products, releases, or competitors.
- Studies where the prototype's readiness is uncertain and needs checking against the tasks before sessions.
Section 3
Anonymize before any agent sees anything
Nothing from a session reaches an agent until it is anonymized: participant names become IDs, employers and locations are stripped unless they are the subject of the study, and the mapping between IDs and identities lives in a file outside the agent's working folder. The same applies to recordings — agents work from transcripts and notes, not video.
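As a rough illustration of the name-to-ID step, here is a minimal sketch. The `anonymize` helper, the mapping shape, and the sample names are all hypothetical, and the mapping is assumed to be loaded from a file kept outside the agent's working folder. Matching known names does not catch employers, locations, or indirect identifiers, so a human read-through is still needed.

```javascript
// Hypothetical helper: replaces known participant names with stable IDs.
// The name-to-ID mapping should live outside the agent's working folder.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
}

function anonymize(transcript, nameToId) {
  let result = transcript
  // Replace longer names first so a full name is handled before a first name.
  const names = Object.keys(nameToId).sort((a, b) => b.length - a.length)
  for (const name of names) {
    const pattern = new RegExp("\\b" + escapeRegExp(name) + "\\b", "gi")
    // A function replacement avoids special handling of "$" in the ID.
    result = result.replace(pattern, () => nameToId[name])
  }
  return result
}

// Invented sample data, for illustration only.
const mapping = { "Dana Lee": "P01", "Dana": "P01" }
const sample = "Moderator: Thanks, Dana. Dana Lee: I'd look in settings."
console.log(anonymize(sample, mapping))
// prints "Moderator: Thanks, P01. P01: I'd look in settings."
```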
Confirm that your consent language covers automated analysis of transcripts and that your tooling agreement keeps the data out of model training. These are decisions the researcher owns before the study runs, not afterthoughts during synthesis. If consent does not cover automated analysis, the synthesis stage of this workflow is not available to you; the prep stage still is, because it touches no participant data.
Section 4
The orchestration pattern: a prep loop, then a synthesis fan-out
Both stages run as Claude Code dynamic workflows: JavaScript scripts that Claude writes and runs in the background, orchestrating subagents while intermediate results — draft tasks, per-session codings, the merged theme set — stay in script variables instead of Claude's own context. For synthesis that separation is what keeps session eight coded as carefully as session one.
Stage one is a small loop rather than a fan-out: a planning agent drafts the research questions, test plan, task scenarios, screener, and discussion guide; a critique agent checks the drafts against each other (does every task trace to a research question, does any task lead the participant, does the screener actually exclude the wrong people); the planning agent revises; and a prototype-check agent walks the prototype against the final tasks to confirm every step a task requires actually exists. Stage two fans out one coding agent per session, then runs a synthesis agent and a challenge agent in sequence.
Workflows can run up to 16 agents concurrently and up to 1,000 per run, and runs are resumable, which matters when an unmoderated study produces thirty transcripts. You trigger a workflow by including the word workflow in the prompt or via /effort ultracode, and the stable prompts can be saved to .claude/workflows/ in the project (or ~/.claude/workflows/ for personal use) as commands like /test-prep and /test-synthesis. The session-coder and challenge agents are defined as markdown files in .claude/agents/ so their evidence rules are versioned alongside the codebook.
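The workflow runtime manages the concurrency cap itself, so none of this needs to be written by hand. Purely to make the idea concrete, the sketch below shows how thirty transcripts might pass through a cap of 16 in-flight workers. `mapWithLimit` and the stand-in worker are hypothetical, not part of any Claude Code API.

```javascript
// Hypothetical sketch: run `worker` over `items` with at most `limit` in flight.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0
  // Each lane pulls the next unclaimed item until none are left.
  async function runLane() {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => runLane())
  await Promise.all(lanes)
  return results
}

async function demo() {
  // Thirty invented participant IDs, P01 to P30.
  const transcripts = Array.from({ length: 30 }, (_, i) => "P" + String(i + 1).padStart(2, "0"))
  let inFlight = 0
  let peak = 0
  const coded = await mapWithLimit(transcripts, 16, async (id) => {
    inFlight++
    peak = Math.max(peak, inFlight)
    await new Promise((resolve) => setTimeout(resolve, 5)) // stand-in for an agent call
    inFlight--
    return { participant: id }
  })
  return { count: coded.length, peak }
}

demo().then((summary) => console.log(summary)) // count is 30, peak is 16
```

Results come back in input order regardless of which worker finishes first, which keeps the coded sessions aligned with their participant IDs.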
Figure: the workflow from design decision to team review. Define research questions -> Prep loop: plan, critique, revise -> Prototype readiness check -> Run sessions (human) -> Code per session -> Merge and challenge -> Team review.
The prep loop runs before recruitment; the synthesis fan-out runs after the last session; humans gate both.
Section 5
Stage one: the prep prompt
The prep prompt starts from the design decision the team is unsure about, because that is what research questions are for. The output is a draft package the researcher edits — the agent's job is to make the first version complete and internally consistent, not to own the study design.
Run this as a workflow.
Context: We are testing a new sidebar navigation model for our project management product. The design decision in question: whether users can find archived projects, cross-project search, and notification settings under the new structure. Prototype: Figma file linked in ./study/prototype.md. Sessions: 6 moderated remote sessions, 45 minutes each.
Stage 1 - Draft: A planning agent drafts: research-questions.md (3-5 questions, each tied to a decision the team will make differently depending on the answer), test-plan.md (method, participants, schedule, roles, recording and consent notes), tasks.md (one task per research question, written as a goal in the participant's words with no interface vocabulary, plus success criteria), screener.md (questions with accept/reject logic), and discussion-guide.md (intro script, warm-up, task order, follow-up probes, wrap-up).
Stage 2 - Critique: A critique agent checks the package: every task maps to a research question, no task names a UI element or leads the participant, the screener excludes people who could not plausibly use the product, the guide's probes are open questions, and the timing fits 45 minutes. The planning agent revises until the critique passes.
Stage 3 - Prototype check: A prototype-readiness agent walks each task against the prototype description and flags any screen, state, or interaction a task requires that the prototype does not support, so we fix the prototype or the task before recruitment.
Output the five study files plus prototype-gaps.md. Do not invent participant quotas or accessibility requirements; flag them as decisions for the researcher.
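The draft, critique, and revise stages the prompt describes could be sketched roughly as follows. This is not the generated script: `agent` is an injected stand-in function, the JSON verdict shape is an assumption, and the cap on rounds is an added safeguard that the prompt itself does not specify. Issues left over when the cap is hit go to the researcher rather than being silently dropped.

```javascript
// Hypothetical sketch of the draft -> critique -> revise loop.
// `agent` is an injected async function (prompt -> text), not a real API.
async function runPrepLoop(agent, brief, maxRounds = 3) {
  let draft = await agent("Draft the study package for: " + brief)
  let issues = []
  for (let round = 1; round <= maxRounds; round++) {
    // Assumed verdict shape: { pass: boolean, issues: string[] }.
    const verdict = JSON.parse(
      await agent("Critique this package. Reply with JSON { pass, issues }.\n" + draft)
    )
    if (verdict.pass) return { draft, rounds: round, unresolved: [] }
    issues = verdict.issues
    if (round < maxRounds) {
      draft = await agent("Revise to fix these issues:\n" + issues.join("\n") + "\n\n" + draft)
    }
  }
  // The cap was reached: hand the remaining issues to the researcher.
  return { draft, rounds: maxRounds, unresolved: issues }
}

// Stub agent for illustration: the first critique fails, the second passes.
function makeStubAgent() {
  let critiques = 0
  return async (prompt) => {
    if (prompt.startsWith("Critique")) {
      critiques++
      return JSON.stringify(
        critiques === 1
          ? { pass: false, issues: ["T3 names the Settings menu"] }
          : { pass: true, issues: [] }
      )
    }
    return prompt.startsWith("Revise") ? "package v2" : "package v1"
  }
}

runPrepLoop(makeStubAgent(), "sidebar navigation study").then((result) => console.log(result))
// draft is "package v2", rounds is 2, unresolved is empty
```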
Section 6
Review the prep package like you wrote it
The researcher reads every file and edits with intent. The most common fixes are the same ones you would make to a junior researcher's draft: research questions that are really feature requests, tasks that quietly tell the participant where to click, and screener questions a motivated respondent can game. The critique agent catches many of these; it does not catch the ones that depend on knowing your users.
The prototype-gaps file is the highest-leverage output of stage one. A task that dead-ends in an unbuilt screen costs you a participant slot and forty-five minutes of everyone's time; finding it three days before the sessions costs nothing.
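At its core, the prototype check is a set difference between what the tasks need and what the prototype has. The sketch below assumes a hypothetical data shape in which each task lists the screens it requires and the prototype description lists the screens that exist. In practice the readiness agent works from prose descriptions, so treat this as an illustration of the logic rather than the mechanism.

```javascript
// Hypothetical data shapes: each task lists the screens it needs,
// and the prototype description lists the screens that exist.
function findPrototypeGaps(tasks, prototypeScreens) {
  const built = new Set(prototypeScreens.map((screen) => screen.toLowerCase()))
  return tasks
    .map((task) => ({
      task: task.id,
      missing: task.requiredScreens.filter((screen) => !built.has(screen.toLowerCase())),
    }))
    .filter((gap) => gap.missing.length > 0)
}

// Invented sample data, for illustration only.
const studyTasks = [
  { id: "T1", requiredScreens: ["Sidebar", "Archive list", "Project files"] },
  { id: "T3", requiredScreens: ["Sidebar", "Notification settings"] },
]
const prototypeScreens = ["Sidebar", "Archive list", "Project files"]

// Reports one gap: T3 is missing "Notification settings".
console.log(findPrototypeGaps(studyTasks, prototypeScreens))
```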
# Discussion guide: sidebar navigation study, June 2026
## Intro (5 min)
- Thanks, recording consent confirmed on the form; remind them they can stop any time.
- "We're testing the design, not you. There are no wrong answers."
- "Please think aloud as you go - tell me what you expect before you click."
## Warm-up (5 min)
- How they organize project work today; how many projects they touch in a week.
## Tasks (30 min, order rotated across participants)
- T1 (RQ1): "A project you finished last quarter has a file you need. Find it."
  Probe if stuck >2 min: "Where would you expect finished projects to live?"
- T2 (RQ2): "You remember a comment about budget approval but not which project. Find it."
- T3 (RQ3): "You're getting too many emails from this tool. Change that."
## Wrap-up (5 min)
- "If you could change one thing about what you used today, what would it be?"
- Anything we should have asked about?
Section 7
Stage two: the synthesis prompt
After the last session, point the synthesis workflow at the anonymized notes and transcripts and the research questions. Two rules protect the output and are stated explicitly: quotes must be verbatim and attributed to a participant ID, and the agents must never invent or paraphrase a finding into existence. A finding that cannot point at a session does not exist.
Run this as a workflow.
Input: ./sessions contains anonymized notes and transcripts for P01-P06. ./study/research-questions.md lists RQ1-RQ4. ./study/tasks.md defines the tasks and success criteria.
Stage 1 - Coding: For each session, launch one agent that reads only that session's files. It returns: per-task outcome (success, partial, fail, not attempted) with the moment that decided it, observations coded against RQ1-RQ4, and verbatim quotes with participant ID and timestamp or line reference. Quotes must appear word for word in the source. If a quote cannot be found verbatim, do not report it.
Stage 2 - Synthesis: Merge the coded sessions into findings per research question. Each finding states: a one-sentence claim, which participants support it (by ID), task outcomes that relate to it, 2-3 verbatim quotes, and an explicit count (e.g. 4 of 6). Also produce a task-outcome table across all participants.
Stage 3 - Challenge: A separate agent reviews the findings against all coded sessions and flags: claims supported by fewer than 3 of 6 participants, quotes that do not appear verbatim in the sources, language that generalizes beyond this sample ("users want"), and disconfirming evidence the synthesis did not mention.
Output: findings.md, task-outcomes.md, evidence-table.md (every quote with participant and location), and challenges.md. Do not propose design changes; list open questions instead.
Section 8
What the orchestration script roughly does
Claude writes the orchestration script when you run the prompt. The sketch below uses an agent() pseudo-API to show where the per-session fan-out happens and where the challenge pass sits. It is illustrative, not the literal generated code.
const fs = require("node:fs")

// agent() is a pseudo-API standing in for however the workflow launches a
// subagent. It is assumed to take a prompt plus options and resolve to text.
async function main() {
  const sessions = fs.readdirSync("./sessions").filter((f) => f.endsWith(".md"))
  const questions = fs.readFileSync("./study/research-questions.md", "utf8")
  const tasks = fs.readFileSync("./study/tasks.md", "utf8")

  // Stage 1: one coding agent per session, launched in parallel.
  const coded = await Promise.all(
    sessions.map((file) =>
      agent(
        "Code one usability session against the research questions and tasks.\n" +
          questions + "\n" + tasks + "\nSession file: ./sessions/" + file + "\n" +
          "Return JSON: { participant, task_outcomes, observations: [{ rq, note, " +
          "verbatim_quote, location }] }. Quotes must be verbatim or omitted.",
        { model: "sonnet" }
      )
    )
  )

  // Stage 2: synthesis merges per-RQ findings with coverage counts.
  const findings = await agent(
    "Merge these coded sessions into findings per research question, with " +
      "participant IDs, verbatim quotes, and counts out of " + sessions.length + ".\n" +
      JSON.stringify(coded),
    { model: "opus" }
  )

  // Stage 3: challenge pass for over-claims and ignored disconfirming evidence.
  const challenges = await agent(
    "Act as a skeptical second researcher. Flag thin evidence, non-verbatim " +
      "quotes, over-generalization, and disconfirming evidence the findings " +
      "ignore.\n\nFindings:\n" + findings + "\n\nCoded sessions:\n" + JSON.stringify(coded),
    { model: "opus" }
  )

  // Create the output folder if it is missing, then write both reports.
  fs.mkdirSync("./output", { recursive: true })
  fs.writeFileSync("./output/findings.md", findings)
  fs.writeFileSync("./output/challenges.md", challenges)
}

main().catch((err) => {
  console.error(err)
  process.exitCode = 1
})
Section 9
Define the session coder as a subagent
The session coder's rules go in a subagent definition under .claude/agents/ so every study uses the same evidence standard without re-typing it. The challenge agent gets its own definition for the same reason.
---
name: session-coder
description: Codes one usability session's notes and transcript against the study's research questions and task success criteria.
tools: Read, Glob
model: sonnet
---
You code exactly one session. You never read other sessions.
Rules:
- Score each task outcome (success, partial, fail, not attempted) and name the moment in the session that decided it.
- Code observations against the research question IDs you were given; do not invent new questions.
- Quotes are copied verbatim with participant ID and a timestamp or line reference. If you cannot find the exact words, report the observation without a quote.
- Distinguish what the participant did from what the participant said they would do.
- Note moderator prompts that may have helped the participant, so the synthesis can weigh the outcome.
- Never infer emotion or intent that the participant did not express.
Section 10
Step by step through one study
Run the prep workflow a week before recruitment, edit the package, and fix whatever the prototype check flagged. Run the sessions as you normally would; the only addition is consistent note files per participant, named by ID, dropped into the sessions folder along with the anonymized transcripts.
Run the synthesis workflow the same day the last session ends, while the sessions are fresh enough for you to catch a coding that misreads what happened. Read the challenge file before the findings file, then spot-check: pick five quotes at random from the evidence table and confirm each exists verbatim in the named session. One fabricated quote means the coding stage reruns with a stricter prompt and the whole evidence table is rechecked.
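The quote spot-check is easy to script. The sketch below assumes hypothetical shapes: evidence rows of the form `{ participant, quote }` and a map from participant ID to transcript text. It normalizes whitespace only, so any change in wording counts as a failure, and it returns the sampled rows whose quotes were not found.

```javascript
// Hypothetical shapes: evidence rows { participant, quote } and a map of
// participant ID -> transcript text. Whitespace is normalized; wording is not.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim()
}

function isVerbatim(quote, transcript) {
  return normalizeWhitespace(transcript).includes(normalizeWhitespace(quote))
}

function sampleRows(rows, count, random = Math.random) {
  // Partial Fisher-Yates shuffle on a copy, so the evidence table is untouched.
  const pool = rows.slice()
  const take = Math.min(count, pool.length)
  for (let i = 0; i < take; i++) {
    const j = i + Math.floor(random() * (pool.length - i))
    const picked = pool[j]
    pool[j] = pool[i]
    pool[i] = picked
  }
  return pool.slice(0, take)
}

// Returns the sampled rows whose quote is not found word for word.
function spotCheck(rows, transcripts, count = 5) {
  return sampleRows(rows, count).filter(
    (row) => !isVerbatim(row.quote, transcripts[row.participant] ?? "")
  )
}

// Invented sample data: the second quote is a paraphrase and should fail.
const transcripts = { P04: "18:32  I'd look under my little face icon,\nhonestly" }
const evidence = [
  { participant: "P04", quote: "I'd look under my little face icon, honestly" },
  { participant: "P04", quote: "The settings were impossible to find" },
]
console.log(spotCheck(evidence, transcripts))
```

A non-empty result is the signal to rerun the coding stage and recheck the whole evidence table, not just the sampled rows.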
Across the study the agent-facing work totals two to three hours: roughly an hour around prep and editing, and one to two hours for synthesis, review, and the team readout. The sessions themselves are on top of that, as they always were.
Figure: Code sessions -> Merge findings -> Challenge pass -> Spot-check quotes -> Team readout -> Decide and document (feeds next cycle).
The challenge file and the quote spot-check sit between the coded sessions and anything the team acts on.
Section 12
Case study: an unmoderated checkout test with 20 participants
A retail team ran an unmoderated test of a redesigned checkout prototype with 20 participants through a panel tool, which produced 20 short recordings with auto-generated transcripts and task timings. Nobody on the team was going to watch seven hours of recordings carefully; that is exactly the situation the synthesis fan-out exists for.
One coding agent per transcript scored the three tasks and coded against the two research questions; the run took under an hour. The merged findings showed the promo-code step was where completions died: 7 of 20 participants stalled there, and 5 of those 7 said some version of expecting the code field to be on the payment step rather than the cart. The task-outcome table also surfaced that the 4 participants on older Android devices accounted for most of the slowest completion times, which the transcripts alone would not have shown.
The challenge pass flagged the right caveat: panel participants are not the team's actual customers, and 20 unmoderated sessions cannot establish a drop-off rate. The team treated the promo-code finding as a strong hypothesis, confirmed it against funnel analytics within the week, and moved the field. The fix was live before the original synthesis would have been scheduled.
Section 13
Case study: an agency benchmark across three client products
An agency running a usability benchmark for a client portfolio tested the same five tasks across three of the client's products, four participants per product, twelve sessions in all. The prep loop's main value was symmetry: the same research questions, tasks, and success criteria applied to all three products, drafted once and adjusted only where a product genuinely lacked a feature.
Synthesis ran per product and then a final merge compared them. Product B's account-creation task failed for 3 of 4 participants against 0 of 4 and 1 of 4 for the others, traced in every failing session to an email-verification loop that dead-ended on mobile. Because the evidence table carried participant IDs and timestamps for all twelve sessions, the agency's report could put the three products side by side without anyone re-watching recordings to check a disputed claim.
The challenge pass earned its place in the client meeting: when a stakeholder questioned whether the benchmark was unfair to product B, the agency could show that the tasks, criteria, and coding rules were identical across products, and that the per-product sample of four was explicitly labeled too small for percentage claims. The benchmark now reruns each release cycle with the saved /test-synthesis command.
Section 14
Good vs bad synthesis output
The single test is traceability: every finding survives the question of which participants, which sessions, which words. Agents under pressure to be helpful will otherwise produce findings that sound like the design rationale and quotes nobody said.
Bad: Users found the new navigation intuitive overall
Good: 5 of 6 participants (P01-P03, P05, P06) completed the archived-projects task without prompting; outcomes and timestamps in task-outcomes.md

Bad: One participant said the settings were impossible to find
Good: P04, 18:32: "I'd look under my little face icon, honestly" - while failing the notification-settings task

Bad: Users want the promo code field moved to payment
Good: 7 of 20 participants stalled at the promo-code step; 5 said they expected it on the payment step; flagged as a hypothesis to confirm with funnel analytics

Bad: The redesign tested well and is ready to ship
Good: Notification-settings task failed for 4 of 6 participants; open question logged for the team's decision, with disconfirming evidence noted from P02

Traceable findings can be checked against sessions; vague or invented findings cannot.
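Part of the traceability test can be automated as a pre-check before the challenge agent runs. The sketch below assumes a hypothetical finding shape of `{ claim, participants, quotes }`. The phrase list and the minimum-support threshold are assumptions to tune per study, and a clean result does not mean a finding is true, only that it is checkable.

```javascript
// Hypothetical finding shape: { claim, participants: [...], quotes: [...] }.
// The phrase list and the minimum-support threshold are assumptions to tune.
const GENERALIZING = [
  /\busers (want|found|prefer|expect)\b/i,
  /\boverall\b/i,
  /\bready to ship\b/i,
]

function lintFinding(finding, sampleSize, minSupport = 3) {
  const flags = []
  const supporters = finding.participants ?? []
  if (supporters.length === 0) {
    flags.push("no participant IDs")
  } else if (supporters.length < minSupport) {
    flags.push("thin evidence: " + supporters.length + " of " + sampleSize)
  }
  if ((finding.quotes ?? []).length === 0) flags.push("no verbatim quotes")
  if (GENERALIZING.some((pattern) => pattern.test(finding.claim))) {
    flags.push("generalizes beyond the sample")
  }
  return flags
}

// Illustrative inputs; the participant list in the second one is invented.
const vague = { claim: "Users found the new navigation intuitive overall", participants: [], quotes: [] }
const traceable = {
  claim: "Notification-settings task failed for 4 of 6 participants",
  participants: ["P01", "P03", "P04", "P06"],
  quotes: ["I'd look under my little face icon, honestly"],
}
console.log(lintFinding(vague, 6)) // three flags
console.log(lintFinding(traceable, 6)) // no flags
```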
Section 15
Limits: what this workflow cannot prove
Six sessions describe what happened to six people. Counts like 4 of 6 are honest descriptions of the sample, not estimates of prevalence, and nothing in this workflow turns them into statistics for a roadmap deck. Unmoderated panels add their own bias: participants who test products for incentives behave differently from your customers.
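To see why a count like 4 of 6 says little about prevalence, it can help to compute how wide a conventional interval around it would be. The sketch below uses the Wilson score interval, a standard formula for a proportion. It is a swapped-in illustration, not part of the workflow, and it assumes a random sample, which a usability study is not, so even this wide range is optimistic.

```javascript
// Wilson score interval for a proportion; z = 1.96 corresponds to about 95%.
function wilsonInterval(successes, n, z = 1.96) {
  const p = successes / n
  const z2 = z * z
  const denom = 1 + z2 / n
  const center = (p + z2 / (2 * n)) / denom
  const margin = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom
  return { low: center - margin, high: center + margin }
}

const { low, high } = wilsonInterval(4, 6)
console.log(low.toFixed(2), high.toFixed(2)) // prints "0.30 0.90"
```

A range of roughly 30% to 90% is compatible with almost any story about the wider user base, which is why the count belongs in the findings as a description of the sample and nothing more.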
Agents inherit whatever the moderator did in the room. A leading probe produces a confident quote that the coding agent will faithfully attribute; the workflow can flag moderator prompts, but only a human who was there can judge how much they shaped the outcome. The same goes for ethics: consent, anonymization, and how findings are represented to stakeholders are decisions the researcher signs.
And the workflow cannot decide what to build. It can show that a task failed and where participants looked instead; whether that justifies changing the design, delaying the release, or running another study is the team's call, made with the evidence in front of them.
- Cannot turn small samples into statistical claims about all users.
- Cannot correct for sample bias, panel effects, or moderator influence on its own.
- Cannot make consent and privacy decisions; those are settled before any agent runs.
- Cannot decide design or roadmap changes; it produces evidence and open questions.
Section 16
The reusable study workflow
Save the two prompts to .claude/workflows/ as /test-prep and /test-synthesis, and the coder and challenge agents to .claude/agents/. The next study starts from the same templates and the same evidence standard, which is what makes findings comparable across studies and quarters.
1. Agree the design decision at stake and run the prep workflow: research questions, test plan, tasks, screener, discussion guide.
2. Edit the package by hand and fix everything the prototype-readiness check flagged.
3. Recruit with the screener and run the sessions yourself; keep one note file per participant ID.
4. Anonymize notes and transcripts; confirm consent covers automated analysis.
5. Run the synthesis workflow: one coding agent per session against the research questions.
6. Read the challenge file first, then spot-check five random quotes against the sources.
7. Hold the team readout from findings.md and task-outcomes.md; record decisions and open questions.
8. Archive the study package, evidence table, and decisions for the next study or benchmark cycle.