Module 4 of 6

45–55 minutes

Surveys, Experiments, and Funnel Diagnosis

Quantitative work where agents help with design and analysis but cannot supply rigour: writing surveys that do not lead, reading experiment results without inventing significance, and diagnosing funnel drop-off with hypotheses tied to evidence.

Duration: 45–55 minutes

Slides: 13 slides with notes and narration

Learning objectives

Use agents to draft and critique survey instruments against known bias patterns.
Read experiment results with an agent without overstating what the data supports.
Run funnel and drop-off diagnosis that produces ranked, testable hypotheses.

Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Surveys, Experiments, and Funnel Diagnosis

Agentic Design Research · Module 4 of 6

Agents speed up the work; they do not supply the rigour
Survey design and agent critique against known bias patterns
Experiment readouts that say only what the data supports
Funnel diagnosis that ends in ranked, testable hypotheses

The rule for the whole module: every number comes from a script the agent ran, and every claim carries its base, its segment, and its source.

Slide notes

This module moves the course from qualitative to quantitative evidence, and the framing has to be set in the first minute: agents make quantitative work faster, but they do not make it more rigorous. Speed and rigour are different properties. An agent will happily draft a leading survey question, estimate a percentage it never computed, or describe a correlational drop-off as a cause — quickly and fluently. The discipline in this module is about preventing exactly that.

The three workflows behind the module are the school's survey design and analysis workflow, the experiment design and results readout workflow, and the funnel and drop-off diagnosis workflow. They share one rule that recurs on almost every slide: numbers come from code the agent writes and runs against an export, never from the model's recollection or estimate. The model is good at writing the script and bad at being the calculator.

Set expectations on depth too. This is not a statistics course. Where the work needs significance testing, weighting, or contested methodological calls, the honest answer is to involve someone who knows the methods. The module teaches designers and researchers to use agents for the parts they can own — instrument critique, scripted analysis, evidence-linked hypotheses — and to recognise the boundary where statistical expertise takes over.

Narration for this slide

Welcome to Module 4. So far this course has been mostly about qualitative evidence — packets, teardowns, synthesis. This module is about the quantitative side: surveys, experiments, and funnels. And it opens with the line the whole module hangs on: agents speed up this work, but they do not supply the rigour. An agent will draft a biased question or quote a number it never computed just as fluently as it does good work. So we are going to build the habits that stop that — critique loops before launch, numbers from scripts only, and readouts that say exactly what the data supports and nothing more. Let's start with surveys.

Slide 2 of 13 · 16:9

Where agents help, and where rigour stays human

The split is the same across surveys, experiments, and funnels: agents do coverage and computation; humans own the question, the launch, and the decision.

Agents draft instruments, critique them against bias patterns, and write analysis scripts
Agents run the scripts and report only what the code printed
Humans frame the question, decide ethics and consent, and own the launch
Humans read the readout, weigh the caveats, and make the decision
Statistical judgment calls — significance, weighting, contested methods — go to someone who knows the methods

If a number in the report was never computed by code, it is an estimate the model made — and it does not belong in the report.

Slide notes

This slide establishes the division of labour the rest of the module assumes. On the agent side: drafting survey questions mapped to constructs, critiquing drafts against named bias patterns, computing sample sizes with a script, writing and running analysis scripts against exports, coding open-text answers against a codebook, and attacking draft readouts as a skeptic. All of that is coverage and computation, and agents are genuinely good at it.

On the human side: the research question and the decision it informs, the ethics of what gets asked and tested, the launch decision, the interpretation of caveats, and the final call. None of that transfers. The middle ground — significance tests, weighting, margin-of-error claims, choosing between analytical framings — is not agent territory and not necessarily designer territory either; it belongs to someone who knows the methods, and the honest move is to say so in the report rather than improvise.

The highlight line is the operational rule that makes the rest enforceable. Models smooth numbers when allowed to estimate: 38 per cent becomes roughly 40 per cent, a cell of 51 becomes about 50. The fix is structural, not motivational — the workflow only reports what its scripts printed, and the human spot-checks a few numbers by re-running the script.

Narration for this slide

Here is the division of labour for everything in this module. Agents draft the survey, critique it, compute the sample size, write the analysis scripts, run them, and even attack the draft readout as a skeptic. Humans frame the question, own the ethics and the launch, read the caveats, and make the decision. And the genuinely statistical calls — significance, weighting, contested methods — go to someone who knows the methods, not to an improvised script. The one rule that makes all of this work: if a number was not computed by code, it is an estimate the model made, and it does not go in the report.

Slide 3 of 13 · 16:9

Survey design: constructs first, questions second

Surveys fail before launch more often than after it. A leading question answered by 1,200 people cannot be repaired by analysis.

Start from the decision and the constructs it implies — not from a list of questions people want to ask
Each construct earns one or two questions; anything unmapped gets cut
Watch for leading wording, double-barreled items, and unbalanced scales
Give honest answers somewhere to go: a real neutral, a not-applicable option
Question order matters — early questions prime later ones

The construct map is the cheapest quality gate in the whole survey: it kills the questions the study never needed and exposes the ones it is missing.

Slide notes

Surveys fail in two places, and most teams only watch the second. The first failure happens before launch: a leading question that smuggles the desired answer into the wording, a double-barreled item that asks two things and allows one answer, a scale with no honest middle or no not-applicable escape, or an ordering effect where an early question primes everything after it. Once people have answered a flawed question, the data is flawed permanently — there is no analysis pass that recovers it.

The discipline that prevents most of this is mundane: before drafting a single question, write down the research questions and the constructs they imply. A question like "why are trial users not converting?" decomposes into constructs (perceived value, price sensitivity, setup blockers, comparison with alternatives), and each construct earns one or two survey items. Anything that does not map to a construct is a question someone wanted to ask, not a question the study needs. Surveys that skip this step tend to grow to forty questions, and long surveys buy worse data twice: through drop-off and through the fatigue of whoever finishes.
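The construct map can be checked mechanically before anyone debates wording. The sketch below is one possible shape for that check, not the workflow's own tooling: the question IDs, the mapping, and the `check_construct_map` helper are all invented for illustration, and only the four construct names come from the example above.

```python
# Constructs from the trial-conversion example; everything else is hypothetical.
CONSTRUCTS = [
    "perceived value",
    "price sensitivity",
    "setup blockers",
    "comparison with alternatives",
]

# Draft questions mapped to the construct each one is meant to measure.
DRAFT = {
    "Q1": "perceived value",
    "Q2": "price sensitivity",
    "Q3": "price sensitivity",
    "Q4": "setup blockers",
    "Q5": None,  # a question someone wanted to ask; it maps to no construct
}


def check_construct_map(constructs, draft, max_items=2):
    """Report unmapped questions, uncovered constructs, and over-covered constructs."""
    counts = {c: 0 for c in constructs}
    unmapped = []
    for qid, construct in draft.items():
        if construct in counts:
            counts[construct] += 1
        else:
            unmapped.append(qid)
    missing = [c for c, n in counts.items() if n == 0]
    over = [c for c, n in counts.items() if n > max_items]
    return {"unmapped": unmapped, "missing": missing, "over": over}


report = check_construct_map(CONSTRUCTS, DRAFT)
print(report)
```

Here the check would cut Q5 and expose that nothing yet measures comparison with alternatives, which are the two outcomes the construct map exists to produce.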

The bias patterns named on the slide are standard survey methodology, not agent-specific. The next slide is where the agent earns its place: as a critic that checks the draft against these patterns systematically, before anyone answers it.

Narration for this slide

Surveys fail twice, and the expensive failure is the first one — before launch. A leading question, a double-barreled item, a scale with no honest middle: once twelve hundred people have answered it, no analysis can repair it. The discipline that prevents most of this is simple. Start from the decision and the constructs it implies, give each construct one or two questions, and cut anything that does not map. Then check the wording: does the question suggest its own answer, does it ask two things at once, does every honest answer have somewhere to go, and does an early question prime a later one. That checklist is exactly what we hand the agent next.

Slide 4 of 13 · 16:9

The agent as survey critic

The critique loop runs before launch: draft, critique, revise, critique again — until only judgment calls remain.

A critic agent reads the draft like a hostile methodologist
It hunts leading wording, double-barreled items, missing options, scale problems, ordering effects
Every finding quotes the offending wording, names the bias, and proposes a rewrite
Defects loop back into revision; judgment calls go to the researcher
A persona read-aloud pass catches questions that make sense only to the team

The critique loop reduces wording bias; it does not certify the survey as unbiased. Pilot with five real people anyway.

Slide notes

The survey workflow's first stage is a critique loop. A critic subagent — defined once and reused across studies — reads the draft the way a hostile methodologist would: it checks every question for leading or loaded wording, double-barreled construction, missing or unbalanced answer options, scale problems such as unlabelled endpoints or stacked agree/disagree items, and ordering effects. It also checks every question against the construct map and flags anything unmapped. Each finding comes back with the offending wording quoted verbatim, the bias named, and a rewrite proposed. The loop runs until the critic returns only judgment calls, and those go to the researcher rather than back into the loop — the researcher owns the revision.
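One way to keep the critic honest is to give each finding a fixed shape and reject any finding whose quoted wording does not appear in the draft verbatim. This is a minimal sketch under that assumption; the `Finding` fields, the `triage` helper, and the sample question are hypothetical, not the critic subagent's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    question_id: str
    quoted_wording: str    # must appear verbatim in the draft question
    bias: str              # for example "leading" or "double-barreled"
    proposed_rewrite: str
    kind: str              # "defect" or "judgment_call"


def triage(findings, draft):
    """Split findings into defects, judgment calls, and findings with a bad quote."""
    defects, judgment_calls, rejected = [], [], []
    for f in findings:
        if f.quoted_wording not in draft.get(f.question_id, ""):
            rejected.append(f)          # the critic misquoted the draft
        elif f.kind == "defect":
            defects.append(f)           # loops back into revision
        else:
            judgment_calls.append(f)    # goes to the researcher
    return defects, judgment_calls, rejected


# Invented draft and findings, for illustration only.
draft = {"Q1": "How much do you love our fast and easy setup?"}
findings = [
    Finding("Q1", "How much do you love", "leading",
            "How would you rate the setup process?", "defect"),
    Finding("Q1", "fast and easy", "double-barreled",
            "Ask about speed and ease in separate items.", "defect"),
    Finding("Q1", "our quick setup", "leading", "n/a", "defect"),
]

defects, judgment_calls, rejected = triage(findings, draft)
loop_again = bool(defects)  # the loop ends when only judgment calls remain
print(len(defects), len(judgment_calls), len(rejected), loop_again)
```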

A second cheap pass before launch is the persona read-aloud: an agent answers the survey as a handful of named personas — a rushed respondent on a phone, a non-native English speaker, a power user, someone who hit the exact problem the survey is about — and narrates where each one hesitates, misreads, or cannot find an honest answer among the options. This is a simulation, not a substitute for piloting with real people, and the report should say so. But it reliably catches the question that makes sense to the team and to nobody else.

The caveat on the slide matters: the critique loop reduces wording bias, it does not certify the instrument. When the stakes justify it, a human methodologist reviewing the final draft is still the stronger gate, and a pilot with five real respondents is always worth the hour it costs.

Narration for this slide

Here is where the agent earns its place in survey design — as a critic, before launch. A critic agent reads the draft like a hostile methodologist: leading wording, double-barreled items, missing answer options, broken scales, ordering effects. Every finding quotes the actual wording, names the bias, and proposes a rewrite. You loop — draft, critique, revise, critique again — until only judgment calls remain, and those are yours to decide. Then a persona read-aloud pass answers the survey as a rushed mobile respondent or a non-native speaker and tells you where they get stuck. None of this certifies the survey as unbiased. It just removes the defects you would otherwise discover after twelve hundred people answered. Pilot with five real people anyway.

Slide 5 of 13 · 16:9

Analysis: numbers come from scripts, not from the model

When the export lands, the temptation is to paste rows into chat. Resist it. The rows stay on disk; only computed tables come back.

Every number in the report is computed by a script the agent wrote and ran against the export
Per-segment agents run the same analysis script with different filters
Open-text answers are coded against a codebook: verbatim excerpts, response IDs, counts
A challenge pass screens every claim against sample size per cell and who is missing from the sample
The human spot-checks: re-run three numbers, re-read three excerpts against the raw export

The model is good at writing the analysis script and bad at being the calculator. Keep those two jobs separate and the numbers stay defensible.

Slide notes

The analysis stage of the survey workflow has one inviolable rule: every number in the report was computed by a script the workflow wrote and ran against the export. The model never estimates a number, because the difference between 38 per cent and roughly 40 per cent is exactly the kind of thing it will smooth over if allowed to. For most product surveys the scripts are not statistics packages — counting, percentages with their bases, and cross-tabs cover the bulk of the work, and a plain script keeps everything inspectable. When a stakeholder later asks where a number came from, the answer is a file and a command, not a conversation.
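A plain script for "percentages with their bases" can be very small. The sketch below uses only the standard library; the column name, the inline rows standing in for an export file, and the `distribution` helper are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Stand-in for an export on disk; in practice this would be open("export.csv").
EXPORT = io.StringIO(
    "response_id,pricing_view\n"
    "r1,acceptable\n"
    "r2,acceptable\n"
    "r3,too expensive\n"
    "r4,\n"
    "r5,acceptable\n"
)


def distribution(rows, column):
    """Counts and percentages for one closed question, with the base stated.

    Blank answers are excluded from the base and reported separately.
    """
    answers = [r[column].strip() for r in rows if r[column].strip()]
    base = len(answers)
    counts = Counter(answers)
    table = {
        answer: {"n": n, "pct": round(100 * n / base, 1)}
        for answer, n in counts.most_common()
    }
    return {"column": column, "base": base, "blank": len(rows) - base, "table": table}


rows = list(csv.DictReader(EXPORT))
result = distribution(rows, "pricing_view")
for answer, cell in result["table"].items():
    print(f"{answer}: {cell['pct']}% ({cell['n']} of {result['base']})")
print(f"blank: {result['blank']}")
```

Because every percentage is printed next to its count and base, the report can quote the line as printed, and a spot-check is a re-run of the same file with the same column name.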

The structure is a fan-out: each segment defined in the brief — plan tier, role, tenure — gets its own analysis agent running the same script with a different filter, and a separate agent codes the open-text answers against a codebook with the same discipline as interview coding from Module 3: verbatim excerpts, response IDs, counts per code. Open text is where conclusions change; in the workflow's pricing case study, 64 per cent of closed-question answers looked like acceptance, but coding 814 free-text answers showed much of that acceptance was explicitly conditional on a discount or a missing feature.
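The open-text discipline (verbatim excerpts, response IDs, counts per code) can also be enforced by a script rather than trusted. This is a sketch under assumed data shapes: the response texts, code names, and the `tally_codes` helper are invented, and a real codebook would live in its own file.

```python
from collections import defaultdict

# Hypothetical raw answers keyed by response ID.
raw = {
    "r1": "Fine if the annual discount stays.",
    "r2": "Too expensive for a small team.",
    "r3": "Acceptable once SSO is included.",
}

# Hypothetical output of the coding agent: one row per coded excerpt.
coded = [
    {"response_id": "r1", "code": "conditional_acceptance",
     "excerpt": "Fine if the annual discount stays."},
    {"response_id": "r2", "code": "price_too_high",
     "excerpt": "Too expensive for a small team."},
    {"response_id": "r3", "code": "conditional_acceptance",
     "excerpt": "once SSO is included"},
    {"response_id": "r3", "code": "conditional_acceptance",
     "excerpt": "Acceptable when SSO ships"},  # a paraphrase, not verbatim
]


def tally_codes(coded, raw):
    """Count distinct responses per code, keeping only verbatim excerpts."""
    by_code = defaultdict(set)
    not_verbatim = []
    for row in coded:
        if row["excerpt"] in raw.get(row["response_id"], ""):
            by_code[row["code"]].add(row["response_id"])
        else:
            not_verbatim.append(row)
    counts = {
        code: {"n": len(ids), "response_ids": sorted(ids)}
        for code, ids in by_code.items()
    }
    return counts, not_verbatim


counts, not_verbatim = tally_codes(coded, raw)
print(counts)
print(f"{len(not_verbatim)} excerpt(s) failed the verbatim check")
```

Counting distinct response IDs rather than excerpts keeps one talkative respondent from inflating a code.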

The last gate before the report is a challenge pass: an agent that reads every draft claim and checks it against sample size per cell, response rate, and who never saw the invitation. Read the challenge file before the findings file, every time. Then do the human spot-check: re-run a few numbers with the same script and arguments, and check a few quoted excerpts against the raw export.
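Part of the challenge pass is mechanical and can be sketched as a screen of each draft claim against the base of the cell it rests on. The threshold of 30 below is an arbitrary placeholder that a team would agree in the brief, not a statistical rule, and the claims, segment names, and bases are invented.

```python
def challenge_cells(claims, bases, min_base=30):
    """Return an objection for every claim resting on a missing or small cell."""
    objections = []
    for claim in claims:
        base = bases.get(claim["segment"])
        if base is None:
            objections.append((claim["text"], "segment not found in computed tables"))
        elif base < min_base:
            objections.append(
                (claim["text"], f"base is {base}, under the agreed minimum of {min_base}")
            )
    return objections


# Hypothetical bases as printed by the analysis script, per segment.
bases = {"all respondents": 900, "admins": 140, "free tier": 12}

# Hypothetical draft claims, each naming the segment it is about.
claims = [
    {"text": "Admins rate setup as harder than other roles", "segment": "admins"},
    {"text": "Free-tier users prefer monthly billing", "segment": "free tier"},
    {"text": "Agency owners want white-labelling", "segment": "agency owners"},
]

objections = challenge_cells(claims, bases)
for text, reason in objections:
    print(f"CHALLENGE: {text!r}: {reason}")
```

A screen like this covers only cell size; response rate and who never saw the invitation still need the agent's and the researcher's attention.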

Narration for this slide

Now the responses arrive, and the temptation is to paste a thousand rows into chat and ask what they say. Don't. The rows stay on disk. The agent writes an analysis script, runs it against the export, and reports only what the script printed — distributions with their bases, cross-tabs per segment, and open-text answers coded against a codebook with verbatim excerpts and counts. Then a challenge pass reads every claim and asks the awkward questions: how big is that cell, what was the response rate, who never saw the invitation. You read the challenge file before the findings, and you spot-check three numbers and three quotes yourself. The model writes the script. The script does the arithmetic. That separation is what makes the numbers defensible.

Slide 6 of 13 · 16:9

Loop diagram of quantitative research with agents in six steps. A human frames the question and the decision it informs. The agent helps design the instrument or experiment and critiques it against bias patterns, with sample sizes computed by script. The team collects the data through real survey or experiment tooling and exports it as flat files. The agent runs the analysis with scripts and stated assumptions, including a skeptic pass. The readout passes through a human review gate where numbers are spot-checked and claims compared to the pre-registration, before a human decision. A dashed feedback line shows the decision raising the next question.

Framing the question, collecting the data, the readout review, and the decision are human-led. Instrument design support and the scripted analysis are agent-run. The dashed line closes the loop: the decision raises the next question.

Slide notes

Walk the loop and name the owner of each step. Framing the question is human: the decision at stake, the constructs or metrics, the segments, and the ethics of asking or testing at all. Design support is agent-run: drafting the instrument or the experiment plan, critiquing it against bias patterns, computing the sample size with a script, and assembling a pre-registration for a human to sign. Data collection is owned by the team and its real tooling — the agent does not launch surveys and does not touch production traffic; it designs and reads out. The analysis is agent-run but scripted: every number from code, bases and segments and intervals reported, assumptions stated, and a skeptic or challenge pass attacking the draft before any human reads it.

The readout review gate is the human step most teams skip when they are in a hurry, and it is where laundering gets caught: spot-check numbers against the export, compare what the readout claims against what the pre-registration said would be reported, and strip anything the sample cannot carry. Only then does the decision happen — and the decision is recorded separately from the evidence, so next year nobody has to reconstruct whether the data made the call or the team did.
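Comparing a readout against its pre-registration is mostly a diff, so it can be scripted before the human review. The sketch below assumes both documents are available as simple dictionaries; the field names, metric names, and numbers are hypothetical.

```python
def compare_to_prereg(prereg, readout):
    """List the ways a readout departs from what was pre-registered."""
    objections = []
    if readout["primary_metric"] != prereg["primary_metric"]:
        objections.append(
            f"primary metric changed from {prereg['primary_metric']} "
            f"to {readout['primary_metric']}"
        )
    extra = sorted(set(readout["segments"]) - set(prereg["segments"]))
    if extra:
        objections.append(f"segments not pre-registered: {', '.join(extra)}")
    fraction = readout["sample_per_arm"] / prereg["sample_per_arm"]
    if fraction < 1:
        objections.append(f"stopped at {fraction:.0%} of pre-registered sample")
    return objections


# Invented example values.
prereg = {
    "primary_metric": "trial_start_rate",
    "segments": ["new users", "returning users"],
    "sample_per_arm": 10000,
}
readout = {
    "primary_metric": "trial_start_rate",
    "segments": ["new users", "returning users", "mobile safari"],
    "sample_per_arm": 4000,
}

objections = compare_to_prereg(prereg, readout)
for line in objections:
    print(line)
```

The script only surfaces drift; whether a departure is acceptable stays with the reviewer at the gate.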

The dashed line matters for how research functions in a team: the decision raises the next question, which is how this connects back to the research packet structure from Module 1 and forward to the insight-to-brief work in Module 6.

Narration for this slide

Here is the loop that carries everything in this module. You frame the question and the decision it informs. The agent helps design the instrument or the experiment — drafting, critiquing, computing the sample size with a script. Your team collects the data with real tooling; the agent never owns the launch. Then the agent runs the analysis: scripts only, assumptions stated, and a skeptic pass that attacks the draft before you see it. You hold the readout gate — spot-check the numbers, compare the claims to what was pre-registered, strip what the sample cannot carry. And then a human makes the decision, recorded separately from the evidence. The dashed line is the part people forget: the decision raises the next question, and the loop starts again.

Slide 7 of 13 · 16:9

Experiment readouts: decided before launch, attacked before reading

The discipline lives in two documents: a pre-registration signed before launch, and a readout the skeptic agent has already attacked.

Hypothesis with a because clause: because we observed X, we believe Y will move metric Z for population W
One primary metric, explicit guardrails, sample size computed by a script
Pre-registration locks the segments, run length, and ship/kill rules before launch
After the run, the analysis agent computes results with code; a skeptic agent attacks the draft
The readout separates what the data shows from what the team decides

Most readout dishonesty is not fabrication — it is drift between what the team said it would do and what the readout quietly does instead.

Slide notes

Experiment programs rarely fail at running tests; they fail at being honest about them. The failure modes have names: peeking — checking significance daily and stopping on a good day; HARKing — hypothesising after the results are known; segment fishing — slicing until something clears the bar; and novelty effects mistaken for durable lift. None of these require bad intent, only a deadline and an unguarded readout.

The design stage forces the discipline to exist before launch. The hypothesis must carry a because clause that points at real evidence — a coded ticket theme, a usability finding, a drop-off the funnel work surfaced. There is one primary metric and a small set of guardrails with their harm directions stated, because guardrails are where redesigns quietly do damage and also where unexpected wins hide. The minimum detectable effect and the sample size come from a script with its inputs written down; if the script says nine weeks of traffic and the team has three, that is a design conversation now, not a surprise after an inconclusive readout. All of it goes into a one-page pre-registration that a human signs.
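A sample size script with its inputs written down might look like the sketch below. It uses a common normal-approximation formula for comparing two proportions, which is an assumption here and not necessarily what the workflow uses; the baseline rate, minimum detectable effect, and traffic figures are invented, and the method choice is exactly the kind of call to confirm with someone who knows the methods.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided test of two proportions.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  minimum detectable effect in absolute terms, e.g. 0.02
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def weeks_needed(n_per_arm, arms, weekly_traffic):
    """Whole weeks of traffic needed to fill every arm."""
    return ceil(n_per_arm * arms / weekly_traffic)


# Inputs written down next to the result (all hypothetical).
inputs = {"baseline": 0.10, "mde_abs": 0.02, "alpha": 0.05, "power": 0.80}
n = sample_size_per_arm(**inputs)
weeks = weeks_needed(n, arms=2, weekly_traffic=1000)
print(inputs, "->", n, "per arm,", weeks, "weeks at 1000 users per week")
```

Halving the detectable effect roughly quadruples the required sample, which is why the run-length conversation belongs before launch.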

After the run reaches its pre-registered length, the analysis agent computes per-arm results, intervals, guardrails, and only the pre-registered segments — with code. Then a skeptic agent receives the draft, the pre-registration, and the script outputs, and is prompted to assume the conclusion is wrong: it checks the stopping rule, hypothesis drift, the number of segments examined versus pre-registered, novelty and seasonality in the trend, guardrail damage, and practical significance. Its objections are published in the readout with their dispositions, whether or not they change the conclusion. The readout's final section — what the team decides — is left for the humans.
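For the per-arm results and intervals, a sketch of what the analysis script could print is shown below. It uses a simple normal-approximation interval for the difference between two conversion rates, which is an assumption on our part; the counts are invented, and the choice of interval method is one to confirm with someone who knows the methods.

```python
from math import sqrt
from statistics import NormalDist


def diff_with_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Conversion per arm and the difference (B minus A) with an approximate interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return {
        "control": {"conversions": conv_a, "base": n_a, "rate": p_a},
        "variant": {"conversions": conv_b, "base": n_b, "rate": p_b},
        "diff": diff,
        "low": diff - z * se,
        "high": diff + z * se,
    }


# Hypothetical arm counts.
result = diff_with_interval(conv_a=400, n_a=4000, conv_b=440, n_b=4000)
print(f"control {result['control']['rate']:.1%} of {result['control']['base']}")
print(f"variant {result['variant']['rate']:.1%} of {result['variant']['base']}")
print(f"difference {result['diff']:+.2%} "
      f"(approx. 95% interval {result['low']:+.2%} to {result['high']:+.2%})")
```

In this invented case the interval spans zero, so the honest data section reports the range rather than a lift.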

Narration for this slide

Experiments go wrong in predictable ways: peeking at the dashboard and stopping on a good day, rewriting the hypothesis after the results are in, slicing segments until something looks significant. The protection is two documents. Before launch, a pre-registration: the hypothesis with its because clause, one primary metric, the guardrails, a sample size computed by a script, and the ship and kill rules — signed by a human. After the run, a readout where every number came from code and where a skeptic agent has already attacked the draft: did the run stop early, did the hypothesis drift, how many segments were fished, is the effect big enough to matter. The objections get published. And the readout keeps two sections separate: what the data shows, and what the team decides.

Slide 8 of 13 · 16:9

The laundering temptation: weak data, confident claims

The arithmetic is rarely the problem. The words around the numbers are what permit the reader to believe more than the data supports.

Over-claimed	Honest
"Most customers are fine with the new pricing"	64% selected acceptable or better (771 of 1,201); 41% of open-text answers coded as conditional acceptance
"Pricing variant drives a 9% lift in trial starts"	Test stopped at 40% of pre-registered sample; wide interval, decaying daily trend; shipped as a decision under uncertainty with a holdback
"New-user navigation success up 14%"	No overall effect across 84,000 sessions; an underpowered positive signal in one segment, flagged for a dedicated follow-up test
"The data clearly supports shipping"	The data section reports the results; the decision section records that the team chose to ship, and why

Every claim must name its base, its segment, and its source. A claim that cannot do that is either an estimate or a generalisation the sample cannot carry.

Slide notes

Laundering is the right word for what happens when weak data goes in and confident claims come out, and it rarely involves anyone lying. It involves rounding a percentage upward in conversation, dropping the base from a quote, promoting a single segment's result into the headline, and letting "roughly" and "most" carry weight the numbers never earned. Agents make this worse if unguarded, because they generate confident summary prose by default, and better if guarded, because the challenge and skeptic passes are tireless about exactly these failures.

Walk a couple of rows. The pricing row is from the survey workflow's case study: 64 per cent acceptable or better looked like broad acceptance until the open-text coding showed much of that acceptance was explicitly conditional — and the challenge pass added that the sample skewed towards engaged customers, so the figure could not be read as a rate across the whole base. The navigation row is the classic segment over-claim: no overall effect, one pre-registered segment moved, and the smallest cell at that — fragile evidence under any reasonable multiple-comparisons view. The honest version, an underpowered signal worth a dedicated follow-up, led to a smaller but real confirmed effect two months later.

The test in the highlight is mechanical enough to apply in review: base, segment, source. Train yourself and the team to ask those three of every quoted number, and most laundering dies in the readout review gate rather than in front of leadership.
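Because the test is mechanical, it can be applied to a draft before the review meeting. The sketch below assumes each claim is recorded with optional base, segment, and source fields; the `Claim` structure and the file path are hypothetical, while the first claim's figures are taken from the pricing row above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    base: Optional[int] = None      # how many responses or sessions it rests on
    segment: Optional[str] = None   # who it is about
    source: Optional[str] = None    # the script output or file it came from


def missing_fields(claim):
    """Names of the three required fields that are empty (a base of 0 counts as empty)."""
    return [name for name in ("base", "segment", "source") if not getattr(claim, name)]


claims = [
    Claim(
        text="64% selected acceptable or better (771 of 1,201)",
        base=1201,
        segment="all respondents",
        source="outputs/pricing_distribution.txt",  # hypothetical path
    ),
    Claim(text="Most customers are fine with the new pricing"),
]

for claim in claims:
    gaps = missing_fields(claim)
    verdict = "ok" if not gaps else "missing " + ", ".join(gaps)
    print(f"{claim.text!r}: {verdict}")
```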

Narration for this slide

Here is the temptation this module exists to resist: laundering weak data into confident claims. Nobody lies. They just round up, drop the base, promote one segment into the headline, and let the word most do work the numbers never earned. Look at the difference. Most customers are fine with the new pricing — versus sixty-four per cent of twelve hundred respondents, with forty-one per cent of the open text coded as conditional acceptance. A nine per cent lift — versus a test stopped at forty per cent of its planned sample with a decaying trend. The honest versions are longer, and they are the ones you can defend a year later. The test is simple: every claim names its base, its segment, and its source — or it goes.

Slide 9 of 13 · 16:9

Funnel diagnosis: the numbers locate the wound, the screens explain it

Funnels live in analytics tools; screens live in design tools. Diagnosis fails when those two never sit on the same desk.

Work from a flat export with agreed step definitions — not from a dashboard glance
An analytics agent computes step conversion overall and per segment: device, plan, source
The two or three worst steps get walkthrough agents driving the real product at desktop and phone widths
Walkthroughs capture screenshots, verbatim copy, error states, load behaviour — observations, not explanations
Broken instrumentation masquerades as user behaviour; let the workflow say when the tracking itself is suspect

A 60 per cent drop tells you where to look. Only looking at the screen — on a phone, with the error states triggered — tells you what to look at.

Slide notes

The structural problem funnel diagnosis solves is that the number and the interface that produced it are rarely on the same desk. The analytics say 60 per cent drop at email verification, so the team shortens the email copy — but nobody opened the verification screen on a phone, saw the link expire, or noticed the resend button fail silently. The number locates the wound; only the screen explains it.

The workflow runs in two connected stages. First, an analytics agent computes step conversion from a flat export — one row per user, one column per step, plus segment columns — overall and cut by device, plan, and source. The step definitions, time window, and whether steps can be skipped are decisions the researcher writes into the brief before any agent runs, because they change every downstream number. Second, the workflow dispatches walkthrough agents only to the two or three steps where the drop-off or the segment gap is largest. Those agents drive the actual product in a browser, at desktop and phone widths, in a test environment: capture the screen, read the verbatim copy, attempt the step, trigger the plausible error states, note what loads slowly, and check whether progress survives leaving and returning. Everything is recorded as an observation with evidence — a screenshot path, quoted copy — and explicitly not yet as an explanation.
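The analytics stage can be sketched against the export shape described above: one row per user, one column per step, plus segment columns. The step names, the `1`/`0` encoding, and the six sample rows are assumptions for illustration, and the sketch treats steps as strictly sequential, which is one of the decisions the brief has to make.

```python
import csv
import io

STEPS = ["signup", "verify_email", "create_project"]  # hypothetical step columns

# Stand-in for the flat export file.
EXPORT = io.StringIO(
    "user_id,device,signup,verify_email,create_project\n"
    "u1,desktop,1,1,1\n"
    "u2,desktop,1,1,0\n"
    "u3,desktop,1,0,0\n"
    "u4,mobile,1,1,0\n"
    "u5,mobile,1,0,0\n"
    "u6,mobile,1,0,0\n"
)


def step_conversion(rows, steps):
    """Conversion between consecutive steps, each with its base."""
    out = []
    for prev, nxt in zip(steps, steps[1:]):
        reached = [r for r in rows if r[prev] == "1"]
        converted = [r for r in reached if r[nxt] == "1"]
        base = len(reached)
        pct = round(100 * len(converted) / base, 1) if base else None
        out.append({"from": prev, "to": nxt, "base": base,
                    "converted": len(converted), "pct": pct})
    return out


def by_segment(rows, steps, column):
    """Run the same computation once per value of a segment column."""
    values = sorted({r[column] for r in rows})
    return {v: step_conversion([r for r in rows if r[column] == v], steps) for v in values}


rows = list(csv.DictReader(EXPORT))
overall = step_conversion(rows, STEPS)
per_device = by_segment(rows, STEPS, "device")
print(overall)
print(per_device)
```

The per-segment cut reuses the overall function with a filter, so a device gap like the one in the worked example shows up from the same code path as the headline number.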

Two caveats belong on this slide. Walkthrough agents work in test environments, so they miss what only happens with real payments, real email deliverability, and real accounts. And sometimes the most important hypothesis is about the instrumentation: a step event firing twice, or not firing on mobile. Broken tracking masquerades as user behaviour, and the workflow should be allowed to say so.

Narration for this slide

Funnel diagnosis fails for a structural reason: the funnel lives in an analytics tool and the screens live somewhere else, and the conversation between them is a meeting where someone guesses. The workflow puts both on the same desk. An analytics agent computes step conversion from a clean export — overall, and cut by device and segment. Then, for the two or three worst steps, walkthrough agents open the real product in a browser, at desktop and phone widths: they capture the screens, read the copy, trigger the error states, and note what loads slowly. Everything comes back as observations with evidence, not explanations. The number tells you where to look. The walkthrough tells you what you are actually looking at.

Slide 10 of 13 · 16:9

From drop-off to ranked, testable hypotheses

The output of diagnosis is not a cause. It is a ranked list of hypotheses, each linked to evidence and tagged with the cheapest way to verify or kill it.

Every hypothesis cites at least one number and at least one walkthrough observation
Hypotheses are ranked by evidence strength, not by how fixable they sound
Each carries a verification method: session replays, a five-user usability test, or an experiment
Some hypotheses are about the tracking, not the users — keep those on the board
No hypothesis is stated as a confirmed cause; correlational funnel data cannot carry that weight

This is where the experiment workflow gets its because clause: a drop-off plus a screen-level observation is exactly the evidence an honest hypothesis needs.

Slide notes

The merge stage is where the discipline about causation is enforced. Funnel data is correlational. A drop-off plus a broken-looking error state is a strong hypothesis, not a demonstrated cause — and the output format makes that impossible to forget, because every hypothesis must cite at least one number and at least one walkthrough observation, and must name the method that would verify or falsify it. Hypotheses that restate the chart in words, or leap to a confirmed cause, or prescribe a redesign with no evidence link, get rejected in the merge.

The verification routes are worth spelling out because they are also a cost ladder. Session replays are the fastest check for behavioural hypotheses — do users actually hit the error state, scroll past the field, abandon at the load stall. A five-user usability test is the right tool for comprehension and expectation hypotheses, where the question is whether the copy means what the team thinks it means. An experiment is reserved for hypotheses that survived a cheaper check and justify a build — and that is the connection to the previous slides: this workflow produces the evidence-backed because clause the experiment pre-registration asks for. The fourth route is an instrumentation fix, for the cases where the walkthrough suggests the number itself is wrong.
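The merge rules (at least one number, at least one observation, a named verification route, ranking by evidence strength) can be written down as a small validator. This is a sketch, not the workflow's merge stage: the `Hypothesis` fields and the strength scores are hypothetical, while the first example's evidence is taken from the worked example in this module.

```python
from dataclasses import dataclass

VERIFICATION_ROUTES = {
    "session replays",
    "five-user usability test",
    "experiment",
    "instrumentation fix",
}


@dataclass
class Hypothesis:
    statement: str
    numbers: list          # at least one computed number
    observations: list     # at least one walkthrough observation
    verification: str      # one of VERIFICATION_ROUTES
    evidence_strength: int  # reviewer-assigned score; the scale is an assumption


def merge(hypotheses):
    """Reject hypotheses without an evidence link, then rank the rest."""
    accepted, rejected = [], []
    for h in hypotheses:
        if h.numbers and h.observations and h.verification in VERIFICATION_ROUTES:
            accepted.append(h)
        else:
            rejected.append(h)
    accepted.sort(key=lambda h: h.evidence_strength, reverse=True)
    return accepted, rejected


board = [
    Hypothesis(
        "Users may not expect a delay before the email arrives",
        ["60% drop at verification overall"],
        ["interstitial copy does not mention delay"],
        "experiment",
        evidence_strength=2,
    ),
    Hypothesis(
        "Mobile users may be hitting the expired-link dead end",
        ["71% drop on mobile vs 47% on desktop"],
        ["expired link landed on a dead-end error page"],
        "session replays",
        evidence_strength=3,
    ),
    Hypothesis("Redesign the verification screen", [], [], "experiment", 1),
]

accepted, rejected = merge(board)
print([h.statement for h in accepted])
print([h.statement for h in rejected])
```

Ranking uses only the strength score, so a hypothesis cannot move up the board because it sounds easy to fix.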

The prioritisation at the end — which hypothesis is worth a five-user test versus an engineering quarter — is a business judgment about cost and risk that belongs to the team, with the evidence board in front of it rather than in place of it. One more habit worth keeping: walk the worst step yourself once, on your own phone. Ten minutes of first-hand experience is the cheapest calibration available for judging which hypotheses ring true.

Narration for this slide

The output of a funnel diagnosis is not an answer — it is a ranked list of hypotheses, and the format does the discipline for you. Every hypothesis has to cite a number and a screen-level observation, and has to name the cheapest way to test it: session replays for behaviour, a five-user usability test for comprehension, an experiment only when a hypothesis has survived a cheaper check and justifies a build. Some hypotheses will be about the tracking itself — keep those, because broken instrumentation looks exactly like user behaviour. And nothing on the board is stated as a confirmed cause, because funnel data cannot prove causation. This board is also where your next experiment gets its because clause.

Slide 11 of 13 · 16:9

Worked example: one funnel diagnosed, two hypotheses tested

A trial signup funnel losing 60 per cent of users at email verification, traced from the segment cut to the screen to the verified fix.

Stage	What it produced
Analytics cut	60% drop at verification overall — 47% on desktop, 71% on mobile; a motivation story cannot explain a device gap
Walkthrough observations	Email took up to 4 minutes to arrive; link expired after 10; resend control below the fold at 390px; expired link landed on a dead-end error page
Hypothesis 1 → session replays	Mobile users hit the expired-link dead end; replays confirmed it within a week
Hypothesis 2 → experiment	Missing delay expectation in the interstitial copy; tested as a copy-and-resend change
Outcome	Longer expiry, resend above the fold, delay expectation set — verification completion up 19 points over the following month

The workflow did not prove causation — the team's verification work did. What it changed was pointing that verification at the right screen on the first try.

Slide notes

This traced run comes from the funnel diagnosis workflow's case study, and it is worth walking slowly because it shows every discipline from the module in one place. The team's standing theory was motivational: people did not want the product enough to open an email. The segment cut undermined that immediately — 47 per cent drop on desktop versus 71 per cent on mobile is not a pattern motivation explains — and that single cut redirected the whole investigation.
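A segment cut of this kind is a few lines of script against the export, which is why the number should never come from the model's estimate. The sketch below is illustrative only: the row layout is an assumption about what an export might look like, and the counts are invented so that they reproduce the percentages on the slide. They are not the case study's real data.

```python
from collections import defaultdict

# Illustrative rows standing in for an analytics export:
# (device, users who reached verification, users who completed it).
# The counts are invented to reproduce the slide's percentages.
rows = [
    ("desktop", 1100, 583),
    ("mobile", 1300, 377),
]


def drop_rates(rows):
    """Return {segment: (base, drop-off percentage)} including an overall row."""
    reached = defaultdict(int)
    completed = defaultdict(int)
    for device, n_reached, n_completed in rows:
        for segment in (device, "overall"):
            reached[segment] += n_reached
            completed[segment] += n_completed
    return {
        segment: (reached[segment],
                  round(100 * (reached[segment] - completed[segment]) / reached[segment]))
        for segment in reached
    }


for segment, (base, drop) in drop_rates(rows).items():
    # Every figure is printed with its base and its segment.
    print(f"{segment}: {drop}% drop (n={base})")
```

Reporting the base next to each percentage keeps the device gap honest: a 24-point difference means little if one segment holds a few dozen users.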

The walkthrough agents supplied the candidate explanations: a verification email that took up to four minutes to arrive in the test runs, an interstitial that said "check your email" with no mention of delay, a resend control below the fold on a phone, a link that expired after ten minutes (shorter than the delivery delay plus a normal distraction), and an expired link landing on a generic error page with no path back. Note what the merge stage did with this: it produced two ranked hypotheses, each tied to a number and an observation, each tagged with a verification method matched to its nature. The dead-end hypothesis was behavioural, so it went to session replays, the cheapest check, and was confirmed within a week. The expectation-setting hypothesis was about copy, so it was tested as a copy-and-resend change in an experiment.
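The expiry observation is simple arithmetic, and writing it out shows how little slack a mobile user had. This sketch assumes the expiry clock starts when the email is sent, which the walkthrough notes do not state. The distraction lengths in the usage lines are hypothetical.

```python
from datetime import timedelta

LINK_EXPIRY = timedelta(minutes=10)     # from the walkthrough
DELIVERY_DELAY = timedelta(minutes=4)   # worst case seen in the test runs


def usable_window(expiry=LINK_EXPIRY, delay=DELIVERY_DELAY):
    """Time left to click once the email arrives (assumes expiry counts from send time)."""
    return max(expiry - delay, timedelta(0))


def link_still_valid(distraction, expiry=LINK_EXPIRY, delay=DELIVERY_DELAY):
    """True if a user who waits `distraction` after delivery still beats the expiry."""
    return delay + distraction <= expiry


print(usable_window())                         # 0:06:00
print(link_still_valid(timedelta(minutes=5)))  # True
print(link_still_valid(timedelta(minutes=7)))  # False
```

Under that assumption, a worst-case delivery leaves six minutes, so switching apps for seven is enough to land on the dead-end error page.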

The outcome — verification completion up 19 points over the following month — belongs to the team's fixes and their verification, not to the diagnosis itself, and the readout said so. That is the standard to hold: the workflow ranks and routes hypotheses; humans and their cheaper checks establish what is actually true.
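"Points" here means percentage points, which is not the same as a relative percentage change, and a readout should say which one it is quoting. The sketch below assumes a baseline completion rate of 40 per cent, taken as the complement of the 60 per cent overall drop. The source states only the 19-point lift, so the baseline and the resulting 59 per cent are assumptions made for illustration.

```python
def describe_change(before_pct: int, after_pct: int) -> str:
    """Report a change in both percentage points and relative terms."""
    points = after_pct - before_pct
    relative = 100 * points / before_pct
    return (f"{before_pct}% -> {after_pct}%: "
            f"{points:+d} points ({relative:+.1f}% relative)")


# Assumed baseline of 40% completion; only the 19-point lift is from the source.
print(describe_change(40, 59))
```

Quoting the lift as "19 points" keeps the claim tied to the two completion rates, where a relative figure alone would hide the base it was computed from.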

Narration for this slide

Let's trace one run end to end. A trial signup funnel was losing sixty per cent of users at email verification, and the team's theory was that people just were not motivated. The segment cut killed that theory in one table: forty-seven per cent drop on desktop, seventy-one on mobile. The walkthroughs explained why: the email took up to four minutes to arrive, the screen never mentioned a delay, the resend button was below the fold on a phone, and the link expired into a dead-end error page. Two hypotheses came out, each routed to the cheapest check — replays confirmed the dead end within a week, and the copy change went to an experiment. The fixes lifted verification by nineteen points. The diagnosis did not prove anything. It pointed the proof at the right screen, first try.

Slide 12 of 13 · 16:9

Exercise: critique an existing survey with an agent

Take a survey your team has already run — or is about to run — and put it through the critique loop. Do not collect any new data for this exercise.

Write the construct map first: the decision at stake and the constructs each question should serve
Ask the agent to critique every question: leading wording, double-barreled items, missing options, scale problems, ordering effects
Require each finding to quote the wording, name the bias, and propose a rewrite
Sort the findings yourself: defects to fix, judgment calls to decide, questions to cut because they map to no construct
If the survey already ran, note which findings would have changed how you read the results

Most people find at least one question they would no longer defend. Finding it now costs an hour; finding it after launch costs twelve hundred responses.

Slide notes

The exercise is deliberately scoped to the cheapest, highest-leverage part of the module: instrument critique. It needs no export, no analysis scripts, and no statistics — just an existing survey and an hour. Steer participants towards a survey with real stakes if they have one: a pricing or satisfaction survey that informed an actual decision is far more instructive than a throwaway pulse check, because the findings land differently when the data has already been quoted to leadership.

The construct map comes first for a reason: without it, the critique becomes a wording polish, and the most common real defect stays invisible: questions that serve no construct at all, and constructs with no question. Insist that each agent finding quotes the verbatim wording and names the bias; vague feedback like "this could be clearer" is not a finding and teaches nothing. The sorting step is where the human judgment sits, and it mirrors the workflow's own boundary: defects loop back into revision, judgment calls go to the researcher.
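Both checks in this paragraph are mechanical enough to script before the human sorting begins. The sketch below is one possible shape, not the workflow's own format: the `Finding` fields, the bias label strings, and the sample survey are all invented for illustration.

```python
from dataclasses import dataclass

# Bias labels taken from the exercise list; the exact strings are an assumption.
KNOWN_BIASES = {
    "leading wording",
    "double-barreled item",
    "missing option",
    "scale problem",
    "ordering effect",
}


@dataclass
class Finding:
    question_id: str
    quoted_wording: str
    bias: str
    proposed_rewrite: str


def is_usable(finding: Finding, survey: dict) -> bool:
    """A finding counts only if it quotes real wording, names a bias, and offers a rewrite."""
    question = survey.get(finding.question_id, "")
    quote = finding.quoted_wording.strip()
    return (
        bool(quote)
        and quote in question
        and finding.bias in KNOWN_BIASES
        and bool(finding.proposed_rewrite.strip())
    )


def coverage_gaps(construct_map: dict, survey: dict):
    """Return (questions serving no construct, constructs with no question)."""
    mapped = {qid for qids in construct_map.values() for qid in qids}
    orphan_questions = sorted(qid for qid in survey if qid not in mapped)
    empty_constructs = sorted(c for c, qids in construct_map.items() if not qids)
    return orphan_questions, empty_constructs


# Hypothetical survey and construct map, invented for illustration.
survey = {
    "q1": "How satisfied are you with our fast and reliable checkout?",
    "q2": "What is your favourite colour?",
}
construct_map = {"checkout satisfaction": ["q1"], "price sensitivity": []}

good = Finding("q1", "fast and reliable", "double-barreled item",
               "How satisfied are you with the speed of checkout?")
vague = Finding("q1", "", "could be clearer", "")

print(is_usable(good, survey), is_usable(vague, survey))  # True False
print(coverage_gaps(construct_map, survey))
```

The script only filters out findings that cannot be acted on and surfaces coverage gaps. Deciding which usable findings are defects and which are judgment calls stays with the researcher.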

The last bullet is the uncomfortable one and the most valuable. If the survey already ran, some findings will imply that specific numbers in the report rest on flawed questions. The point is not to relitigate the old report; it is to feel, once, how cheap a pre-launch hour would have been compared to the alternative, and to make the critique loop a standard step before the next launch.

Narration for this slide

Your exercise for this module: take a survey your team has already run, or is about to run, and put it through the critique loop. Start with the construct map — what decision the survey serves and which constructs each question should cover. Then have the agent critique every question: leading wording, double-barreled items, missing options, broken scales, ordering effects. Make it quote the wording and name the bias every time. Then you do the sorting: real defects, judgment calls, and questions that map to no construct and should simply go. If the survey already ran, note which findings would have changed how you read the results. Most people find at least one question they would no longer defend — better to find it this way.

Slide 13 of 13 · 16:9

Summary, and what comes next

Agents speed up quantitative work; the rigour comes from the loop around them, not from the model
Surveys: constructs first, an agent critique loop before launch, and a pilot with real people anyway
Analysis and readouts: every number from a script, every claim with its base, segment, and source
Experiments: pre-register before launch, let a skeptic attack the readout, keep evidence and decision separate
Funnels: numbers locate the drop-off, walkthroughs explain it, and the output is ranked hypotheses — never claimed causes

Module 5 turns the same product data into journey maps and service blueprints — built from what the product records, with the gaps marked honestly.

Slide notes

Recap by returning to the opening claim and showing it has now been made operational. Agents do not supply rigour, but the loop does: a critique pass before any respondent answers, sample sizes and analysis from scripts rather than estimates, a challenge or skeptic pass before any human reads a conclusion, a readout review gate where claims are compared to the pre-registration, and a decision recorded separately from the evidence. None of those steps is statistically sophisticated; they are habits, and they are the habits that stop weak data being laundered into confident claims.
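The rule that every claim carries its base, its segment, and its source can be enforced at the point where a readout is assembled. The following is a minimal sketch under stated assumptions: the `Claim` record and `render` function are names invented here, the base of 1,300 is an illustrative count, and `funnel_by_device.py` is a hypothetical script name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    base: int      # number of users or respondents behind the figure
    segment: str   # who the figure describes
    source: str    # script or export that produced the number


def render(claim: Claim) -> str:
    """Format a claim for a readout, refusing any claim with a missing attribute."""
    missing = []
    if claim.base <= 0:
        missing.append("base")
    if not claim.segment.strip():
        missing.append("segment")
    if not claim.source.strip():
        missing.append("source")
    if missing:
        raise ValueError(f"claim is missing: {', '.join(missing)}")
    return (f"{claim.text} "
            f"(n={claim.base}, segment: {claim.segment}, source: {claim.source})")


# Illustrative values: the base and the script name are hypothetical.
claim = Claim("71% drop at verification", 1300, "mobile", "funnel_by_device.py")
print(render(claim))
```

A claim that cannot be rendered never reaches the readout, which makes the habit a property of the pipeline and not a matter of remembering.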

It is worth restating the boundaries one last time, because they are what keep the module honest. Surveys describe the people who answered them, and selection bias survives every script. Stated preference is not behaviour. Experiments answer narrow causal questions about the populations and periods they ran in. Funnel data is correlational, and walkthroughs in test environments miss what only happens in production. Where significance, weighting, or contested methods decide something that matters, bring in someone who knows the methods.

The bridge to Module 5 is direct: the same event exports, session paths, and ticket data this module analysed in funnels become the raw material for journey maps and service blueprints — generated from what the product actually records rather than workshop recollection, with the gaps the data cannot see marked honestly, and kept current by re-running the generation rather than redrawing the poster.

Narration for this slide

Let's close the module. Agents make quantitative work faster; the rigour comes from the loop you put around them. For surveys: constructs first, a critique loop before launch, and a real pilot anyway. For analysis: every number from a script, every claim carrying its base, its segment, and its source. For experiments: pre-register before launch, let a skeptic attack the readout, and keep what the data shows separate from what the team decides. For funnels: the numbers locate the drop-off, the walkthroughs explain it, and the output is ranked hypotheses — never claimed causes. In Module 5 we take the same product data — events, sessions, tickets — and build journey maps and service blueprints from it, with the gaps marked honestly. See you there.

Module transcript

Module 4, narrated slide by slide

Slide 1 — Surveys, Experiments, and Funnel Diagnosis

Slide 2 — Where agents help, and where rigour stays human

Slide 3 — Survey design: constructs first, questions second

Slide 4 — The agent as survey critic

Slide 5 — Analysis: numbers come from scripts, not from the model

Slide 6 — The quantitative evidence loop

Slide 7 — Experiment readouts: decided before launch, attacked before reading

Slide 8 — The laundering temptation: weak data, confident claims

Slide 9 — Funnel diagnosis: the numbers locate the wound, the screens explain it

Slide 10 — From drop-off to ranked, testable hypotheses

Slide 11 — Worked example: one funnel diagnosed, two hypotheses tested

Slide 12 — Exercise: critique an existing survey with an agent

Slide 13 — Summary, and what comes next
