Section 1
Why surveys fail twice
Surveys fail in two places, and most teams only watch the second. The first failure happens before launch: a leading question, a double-barreled item, an answer scale with no honest middle, or a question order that primes everything after it. Once 1,200 people have answered a flawed question, no amount of analysis can repair it.
The second failure happens after the export lands: the team eyeballs a chart, quotes a percentage from memory, skips the open-text answers because there are 800 of them, and writes a report that says what the loudest stakeholder expected. The numbers in that report were often never computed at all — they were estimated by whoever was summarizing.
This workflow attacks both failures. Stage one is a critique loop that hunts for bias in the draft before anyone answers it. Stage two is an analysis fan-out where agents write and run scripts against the export, code the free-text answers against a codebook, and pass every claim through a challenge agent before it reaches the report. The model never estimates a number; it only reports what its code computed.
Section 2
When to reach for this workflow
Use it when the survey will inform a real decision — pricing, prioritization, a roadmap bet — and the team will be quoted on the numbers later. It earns its setup cost from roughly 200 responses upward, and it earns it twice if the survey has open-text questions nobody plans to read.
Skip it for a three-question pulse check with 30 respondents. The critique stage is still worth running on any survey, because a leading question costs the same to fix whether 30 or 3,000 people will see it, but the analysis fan-out is overkill below a couple of hundred rows.
- Pricing, packaging, and willingness-to-pay surveys where wording bias is expensive.
- Feature prioritization surveys that need per-segment cuts (plan tier, role, tenure).
- NPS or satisfaction follow-ups with large open-text volumes.
- Any survey whose results will be quoted to leadership with specific percentages.
Section 3
The orchestration pattern: critique loop, then fan-out
Both stages run as Claude Code dynamic workflows: a JavaScript script that Claude writes and runs in the background, orchestrating subagents while intermediate results stay in script variables rather than in Claude's own context. That separation matters in stage two especially — a 1,200-row export with open-text columns would swamp any context window, so the rows live on disk and in script variables, and only computed summaries and coded excerpts come back to the conversation.
Workflows can run up to 16 agents concurrently and up to 1,000 in a run, and a run is resumable, which matters when open-text coding involves hundreds of small jobs. You trigger a workflow by including the word workflow in the prompt or via /effort ultracode. Once the prompt is stable, save it to .claude/workflows/ in the project (or ~/.claude/workflows/ for personal use) and it becomes a reusable command like /survey-critique or /survey-analyze. The subagent definitions — the critique agent, the coder, the challenge agent — live as markdown files in .claude/agents/.
Stage one is a loop: draft, critique, revise, critique again, until the critique agent has nothing left above a nitpick. Stage two is a fan-out: one agent per segment computes distributions and cross-tabs by writing and running a script, one agent codes the open text against a codebook, and a challenge agent checks every claim against sample size and selection bias before the report is assembled.
Map research questions to constructs
Draft questions
Critique pass
Revise wording and order
Pilot read-aloud
Launch gate
feeds next cycleThe design stage cycles between drafting and critique until the launch gate; nothing ships while the critique agent still finds bias.
Section 4
Stage one starts with constructs, not questions
Before drafting a single question, write down the research questions and the constructs they imply. A research question like why are trial users not converting decomposes into constructs: perceived value, price sensitivity, blockers during setup, comparison with alternatives. Each construct earns one or two questions; anything that does not map to a construct is a question someone wanted to ask, not a question the study needs.
This mapping is the cheapest quality gate in the whole workflow, and it is the one the critique agent checks first. A survey that skips it tends to grow to 40 questions, and long surveys buy worse data twice — through drop-off and through the fatigue of whoever finishes.
Section 5
The critique agent and what it hunts for
The critique agent reads the draft the way a hostile methodologist would. It looks for leading questions that smuggle the desired answer into the wording, double-barreled items that ask two things and allow one answer, missing or unbalanced answer options, scales without a genuine neutral or a not-applicable escape, and ordering effects where an early question primes a later one. It also checks every question against the construct map and flags anything unmapped.
Each finding comes back with the offending wording quoted, the bias named, and a rewrite proposed. The loop runs until the critique pass returns only judgment calls — and those go to the researcher, not back into the loop.
--- name: survey-critic description: Reviews survey drafts for leading questions, double-barreled items, missing answer options, scale problems, and ordering bias. Use during survey design before launch. tools: Read, Grep --- You are a survey methodologist reviewing a draft before launch. For every question, check: 1. Leading or loaded wording — does the question suggest its own answer? 2. Double-barreled items — does it ask two things with one answer? 3. Answer options — exhaustive, mutually exclusive, balanced, with a neutral and a not-applicable option where honest answers need them? 4. Scale design — labeled endpoints, consistent direction, no agree/disagree stacking where construct-specific wording would be clearer? 5. Ordering effects — does an earlier question prime this one? Do sensitive or demographic questions appear before trust is earned? 6. Construct mapping — which construct from the brief does this question serve? Flag any question that serves none. Report each finding as: question number, verbatim wording, the bias by name, why it matters, and a proposed rewrite. Separate defects from judgment calls. Do not rewrite the survey yourself; the researcher owns the revision.
Section 6
The pilot read-aloud pass
Before launch, a final agent does a read-aloud pass: it answers the survey as three or four named personas — a rushed respondent on a phone, a non-native English speaker, a power user, someone who hit the exact problem the survey is about — and narrates where each one hesitates, misreads, or cannot find their honest answer among the options.
This is a cheap simulation, not a substitute for piloting with five real people, and the report says so. But it reliably catches the question that makes sense to the team and to nobody else, and it catches answer lists that have no option for the most common real situation.
Section 7
The stage-one workflow prompt
Paste the prompt below into Claude Code with the draft and the brief in the project folder. The word workflow in the first line is what makes Claude write an orchestration script and run the critique loop in the background instead of doing one shallow pass inline.
Run this as a workflow. Input: ./survey/brief.md (research questions and constructs) and ./survey/draft-v1.md (the question draft). Stage 1 - Construct check: verify every question maps to a construct in the brief and every construct has at least one question. List gaps and orphans. Stage 2 - Critique loop: launch the survey-critic agent against the current draft. For each finding it reports, propose a revised wording in ./survey/draft-v2.md, then run the critic again on the revision. Repeat up to 3 rounds or until the critic reports only judgment calls. Stage 3 - Pilot read-aloud: launch an agent that answers the final draft as 4 personas (rushed mobile respondent, non-native English speaker, power user, recently churned user) and reports where each hesitates, misreads, or has no honest answer option. Output: ./survey/critique-report.md (all findings by round, with verbatim wording and named biases), ./survey/draft-final.md, and ./survey/open-questions.md listing the judgment calls a human must decide. Do not launch or send the survey; the researcher owns the launch decision.
Section 8
Stage two: numbers come from scripts, not from the model
When the responses arrive, export them as CSV and resist the temptation to paste rows into chat. The analysis stage has one inviolable rule: every number in the report was computed by a script the workflow wrote and ran against the export. The model is good at writing the script and terrible at being the calculator, and the difference between 38 percent and roughly 40 percent is exactly the kind of thing it will smooth over if allowed to estimate.
The fan-out gives each segment its own analysis agent — by plan tier, by role, by tenure, whatever the brief defines — and each one runs the same analysis script with a different filter. A separate open-text agent codes the free-text answers against a codebook, the same discipline as interview coding: verbatim excerpts, response IDs, counts. Finally a challenge agent reads every claim destined for the report and checks it against sample size per cell, response rate, and who is missing from the sample.
Design decision
Clean and validate export
Design decision
Per-segment analysis scripts
Design decision
Cross-tabs
Design decision
Open-text coding
Design decision
Challenge pass
Design decision
Report assembly
After export, segments are analyzed in parallel by scripts, open text is coded against a codebook, and a challenge pass screens every claim before the report.
Section 9
What the orchestration script roughly does
Claude writes the orchestration script itself when you trigger the workflow. The sketch below shows the shape with an agent() pseudo-API so you can see where the rules sit — which agents run in parallel, what stays in variables, and where the challenge pass intervenes. It is illustrative, not the literal generated code.
const { execSync } = require("node:child_process")
const fs = require("node:fs")
const segments = ["free", "pro", "enterprise"]
// Stage 1: per-segment analysis. Each agent writes a script, runs it, and
// returns only the computed tables — the raw rows never enter context.
const segmentResults = await Promise.all(
segments.map((segment) =>
agent(
"Write and run a Node script against ./data/responses.csv filtered to " +
"plan_tier=" + segment + ". Compute n, completion rate, answer " +
"distributions per closed question, and cross-tabs against satisfaction. " +
"Return the script path and the computed tables. Never report a number " +
"your script did not print.",
{ model: "sonnet" }
)
)
)
// Stage 2: open-text coding against the codebook, batched by question.
const openText = await agent(
"Code the free-text answers in ./data/responses.csv (columns q9_text, q12_text) " +
"against ./analysis/codebook.md. Return { response_id, code, verbatim_excerpt } " +
"plus counts per code. Excerpts must be copied exactly.",
{ model: "sonnet" }
)
// Stage 3: challenge agent screens every claim before the report.
const challenges = await agent(
"Review these draft findings against sample sizes and selection bias. Flag any " +
"claim resting on a cell under n=50, any percentage quoted without its base, " +
"and any generalization beyond who actually answered.\n\n" +
JSON.stringify({ segmentResults, openText }),
{ model: "opus" }
)
fs.writeFileSync("./output/findings.md", JSON.stringify(segmentResults, null, 2))
fs.writeFileSync("./output/challenges.md", challenges)Section 10
The analysis script the agents run
The per-segment agents do not need a statistics package for most product surveys; counting, percentages with their bases, and cross-tabs cover the bulk of the work, and a plain Node script keeps everything inspectable. The repo version reads the CSV, filters by segment, and prints distributions and cross-tabs as markdown tables, so the report can paste them without retyping a single number.
Keep the script in the repo and let the workflow extend it rather than rewrite it. When a stakeholder later asks where a number came from, the answer is a file and a command, not a conversation.
import { readFile } from "node:fs/promises"
const [, , csvPath, segmentColumn, segmentValue] = process.argv
const rows = parseCsv(await readFile(csvPath, "utf8"))
const subset = segmentValue ? rows.filter((r) => r[segmentColumn] === segmentValue) : rows
console.log("# Segment: " + (segmentValue ?? "all") + " (n=" + subset.length + ")")
for (const question of closedQuestions(rows)) {
const counts = tally(subset, question)
console.log("\n## " + question + " (base n=" + counts.total + ")")
for (const [option, count] of counts.entries) {
const pct = ((count / counts.total) * 100).toFixed(1)
console.log("- " + option + ": " + count + " (" + pct + "%)")
}
}
// Cross-tab any closed question against another, e.g. priority x plan_tier
// node analyse-survey.mjs responses.csv plan_tier pro --crosstab q4 q1Section 11
Step by step through one survey project
Write the brief: research questions, constructs, segments you will cut by, and the decision the survey informs. Draft the questions, run the stage-one workflow, and revise until the critique loop returns only judgment calls. Pilot with five real people anyway — the read-aloud personas are a rehearsal, not a substitute. Launch.
When responses close, export the CSV, drop it into the project, and run the stage-two workflow. Read the challenge file before the findings file, every time. Spot-check three numbers by running the analysis script yourself with the same arguments, and spot-check three open-text excerpts against the raw export. Then write the report — the human writes the recommendation; the workflow supplies the evidence.
Across a typical project this is two to three hours of your attention: roughly an hour around the design loop, an hour around the analysis run and spot-checks, and the rest writing the report you would have had to write anyway, now with tables you can defend.
Section 12
Case study: a pricing sentiment survey with 1,200 responses
A SaaS team surveyed 1,200 customers about a planned price change. The closed questions looked reassuring: 64 percent selected acceptable or better for the new price point, and the team was ready to write that up as broad acceptance.
The open-text coding changed the conclusion. Of 814 free-text answers, 41 percent carried the code CONDITIONAL-ACCEPT — acceptance explicitly tied to a missing condition, most often an annual-plan discount or a feature that was still in beta — and another 17 percent were coded GRANDFATHERING, existing customers assuming the change would not apply to them. The challenge agent then flagged that the sample skewed heavily toward engaged customers who open product emails, so the 64 percent could not be read as a rate across the whole base.
The report that went to leadership said something more honest and more useful: acceptance is conditional, the conditions are nameable, and the silent majority of the customer base is not represented in this sample. The price change shipped two months later with an annual discount and a grandfathering window, and the support volume the team had braced for did not materialize.
Section 13
Case study: an NPS follow-up survey with three leading questions
A team drafted a follow-up survey for NPS detractors. The stage-one critique loop flagged three leading questions in the first pass, including the opener: what frustrated you most about the recent changes to the dashboard? — which presumes frustration, presumes the dashboard, and presumes the recent changes are the cause. Two other items stacked agree/disagree statements that all leaned the same direction.
The rewrite asked what prompted the score, with the dashboard as one option among several and an open-text field first. In the live results, only 31 percent of detractors mentioned the dashboard at all; the largest coded theme, at 38 percent, was billing surprises that the original draft never asked about. The critique loop did not just clean up wording — it prevented the team from confirming a theory the data did not support.
Section 14
Case study: feature prioritization segmented by plan tier
A product team ran a prioritization survey across 2,300 responses and cut the analysis by plan tier: free, pro, and enterprise. The aggregate ranking put an integrations marketplace first, and that is what would have shipped if the aggregate were the only table.
The per-segment fan-out told a different story. Enterprise respondents (n=212) ranked audit logs and permissions first by a wide margin, and they represented most of the revenue at stake; the marketplace ranking was driven almost entirely by free-tier respondents (n=1,460), who also had the lowest conversion intent in the cross-tabs. The challenge agent added the necessary caveat that the enterprise cell, while large enough to report, was dominated by admins rather than end users, because the survey went out through admin contacts.
The roadmap decision — permissions first, marketplace later — was a human call about revenue and strategy. What the workflow contributed was a set of per-segment tables nobody had to argue about, and a recorded caveat about who in the enterprise accounts had actually been heard.
Section 15
Good vs bad survey findings
The test for every claim in the report is whether it names its base, its segment, and the script or codebook entry it came from. A finding that cannot do that is either an estimate the model made — which this workflow forbids — or a generalization the sample cannot carry.
Most customers are fine with the new pricing
64.2% of respondents (771 of 1,201) selected acceptable or better; 41% of open-text answers coded CONDITIONAL-ACCEPT — see analyse-survey output and codebook counts
Users want an integrations marketplace
Marketplace ranked first among free-tier respondents (n=1,460) and fourth among enterprise (n=212); cross-tab table 3
Detractors are unhappy about the dashboard redesign
31% of detractor open-text answers mention the dashboard; the largest theme is billing surprises at 38% (codebook BILL-SURPRISE, 117 of 308 coded answers)
Roughly 70–80% would recommend the product
No recommendation question was asked in this survey; the claim has no source and was removed by the challenge pass
Good findings carry their base, their segment, and their source; bad findings round, generalize, and float free of the data.
Section 16
Limits: what this workflow cannot prove
A survey describes the people who answered it. The challenge agent will keep saying so, and it is right: response rates, panel composition, and who never saw the invitation all limit what the percentages mean, and no amount of analysis converts a convenience sample into the customer base. Statistical inference beyond plain description — significance tests, weighting, margin-of-error claims — is a judgment call for someone who knows the methods, not something to delegate to a script the workflow improvised.
Stated preference is not behavior. People over-report willingness to pay and under-report the workarounds they tolerate; the pricing case study above ended well partly because the team treated the survey as one input next to actual upgrade behavior. And the ethics of asking — what you collect, how long you keep it, whether respondents were told how their open text would be analyzed — are decisions the researcher owns before any agent touches the data.
Finally, the critique loop reduces wording bias; it does not certify the survey as unbiased. A human methodologist reviewing the final draft is still the stronger gate when the stakes justify it.
- Cannot generalize beyond who answered; selection bias survives every script.
- Cannot turn stated preference into predicted behavior.
- Cannot replace a statistician where significance, weighting, or margin of error matter.
- Cannot make the launch or roadmap decision; it can only put defensible tables in front of it.
Section 17
The reusable survey workflow
Save the two prompts to .claude/workflows/ as /survey-critique and /survey-analyze, keep the critic and challenge agent definitions in .claude/agents/, and keep the analysis script and codebook template in the repo. The next survey starts from a known-good setup instead of a blank document, and the evidence standard stays the same from study to study.
1. Write the brief: research questions, constructs, segments, and the decision at stake. 2. Draft questions mapped to constructs; cut anything unmapped. 3. Run the critique loop until only judgment calls remain; decide those yourself. 4. Run the persona read-aloud pass, then pilot with five real people and launch. 5. Export responses to CSV; never paste raw rows into chat. 6. Run the analysis fan-out: per-segment scripts, cross-tabs, open-text coding against the codebook. 7. Read the challenge file first; spot-check three numbers and three excerpts against the export. 8. Write the report yourself, with every claim carrying its base, segment, and source.
Sources

