Slide 1 — Surveys, Experiments, and Funnel Diagnosis
Welcome to Module 4. So far this course has been mostly about qualitative evidence — packets, teardowns, synthesis. This module is about the quantitative side: surveys, experiments, and funnels. And it opens with the line the whole module hangs on: agents speed up this work, but they do not supply the rigour. An agent will draft a biased question or quote a number it never computed just as fluently as it does good work. So we are going to build the habits that stop that — critique loops before launch, numbers from scripts only, and readouts that say exactly what the data supports and nothing more. Let's start with surveys.
Slide 2 — Where agents help, and where rigour stays human
Here is the division of labour for everything in this module. Agents draft the survey, critique it, write and run the scripts that compute the sample size and the analysis, and even attack the draft readout as a skeptic. Humans frame the question, own the ethics and the launch, read the caveats, and make the decision. And the genuinely statistical calls (significance, weighting, contested methods) go to someone who knows the methods, not to an improvised script. The one rule that makes all of this work: if a number was not computed by code, it is an estimate the model made, and it does not go in the report.
Slide 3 — Survey design: constructs first, questions second
Surveys fail twice, and the expensive failure is the first one — before launch. A leading question, a double-barreled item, a scale with no honest middle: once twelve hundred people have answered it, no analysis can repair it. The discipline that prevents most of this is simple. Start from the decision and the constructs it implies, give each construct one or two questions, and cut anything that does not map. Then check the wording: does the question suggest its own answer, does it ask two things at once, does every honest answer have somewhere to go, and does an early question prime a later one. That checklist is exactly what we hand the agent next.
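The construct check described here can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the question IDs, the construct names, and the limit of two questions per construct are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical construct map: question id -> construct it measures.
# None marks a question that maps to no construct.
QUESTION_TO_CONSTRUCT = {
    "Q1": "price_sensitivity",
    "Q2": "price_sensitivity",
    "Q3": "switching_intent",
    "Q4": None,
    "Q5": "price_sensitivity",
}
# Hypothetical constructs implied by the decision the survey serves.
CONSTRUCTS = ["price_sensitivity", "switching_intent", "feature_value"]


def audit_construct_map(question_to_construct, constructs, max_per_construct=2):
    """List questions to cut, constructs with no question, and constructs with too many."""
    counts = Counter(c for c in question_to_construct.values() if c in constructs)
    return {
        "unmapped_questions": sorted(
            q for q, c in question_to_construct.items() if c not in constructs
        ),
        "uncovered_constructs": [c for c in constructs if counts[c] == 0],
        "overloaded_constructs": [c for c in constructs if counts[c] > max_per_construct],
    }


report = audit_construct_map(QUESTION_TO_CONSTRUCT, CONSTRUCTS)
for issue, items in report.items():
    print(f"{issue}: {items}")
```

The audit only checks the mapping. Whether a question actually measures its construct is still a wording judgment.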
Slide 4 — The agent as survey critic
Here is where the agent earns its place in survey design — as a critic, before launch. A critic agent reads the draft like a hostile methodologist: leading wording, double-barreled items, missing answer options, broken scales, ordering effects. Every finding quotes the actual wording, names the bias, and proposes a rewrite. You loop — draft, critique, revise, critique again — until only judgment calls remain, and those are yours to decide. Then a persona read-aloud pass answers the survey as a rushed mobile respondent or a non-native speaker and tells you where they get stuck. None of this certifies the survey as unbiased. It just removes the defects you would otherwise discover after twelve hundred people answered. Pilot with five real people anyway.
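One cheap guard on the critic's output is to check mechanically that each finding really quotes the draft. The sketch below assumes a hypothetical finding format with four fields; the draft questions and findings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    question_id: str
    quoted_wording: str
    bias: str
    proposed_rewrite: str


def invalid_findings(findings, draft):
    """Return (question_id, reason) for findings that break the critique format."""
    problems = []
    for f in findings:
        question_text = draft.get(f.question_id, "")
        if not f.quoted_wording or f.quoted_wording not in question_text:
            problems.append((f.question_id, "quote not found verbatim in draft"))
        elif not f.bias or not f.proposed_rewrite:
            problems.append((f.question_id, "missing bias name or rewrite"))
    return problems


# Invented draft and critic output, for illustration only.
draft = {
    "Q1": "How much do you love our fast and reliable new dashboard?",
    "Q2": "How satisfied are you with pricing and support?",
}
findings = [
    Finding("Q1", "How much do you love", "leading wording",
            "How would you rate the new dashboard?"),
    Finding("Q2", "How happy are you with pricing", "double-barreled",
            "Split into one question on pricing and one on support."),
]

problems = invalid_findings(findings, draft)
print(problems)
```

A finding that fails the check is not necessarily wrong, but a paraphrased quote is a signal to reread the question yourself before accepting the critique.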
Slide 5 — Analysis: numbers come from scripts, not from the model
Now the responses arrive, and the temptation is to paste a thousand rows into chat and ask what they say. Don't. The rows stay on disk. The agent writes an analysis script, runs it against the export, and reports only what the script printed — distributions with their bases, cross-tabs per segment, and open-text answers coded against a codebook with verbatim excerpts and counts. Then a challenge pass reads every claim and asks the awkward questions: how big is that cell, what was the response rate, who never saw the invitation. You read the challenge file before the findings, and you spot-check three numbers and three quotes yourself. The model writes the script. The script does the arithmetic. That separation is what makes the numbers defensible.
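As a rough sketch of what such an analysis script might look like, the example below uses only the Python standard library. The column names and the inline rows are made up; a real script would read the export file from disk.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up rows standing in for the export on disk.
SAMPLE_EXPORT = """respondent_id,segment,pricing_ok
1,smb,yes
2,smb,no
3,enterprise,yes
4,enterprise,yes
5,smb,
6,enterprise,no
"""


def distribution(rows, column):
    """Return (base, {answer: (count, share)}), counting only non-blank answers."""
    answers = [r[column] for r in rows if r[column].strip()]
    base = len(answers)
    if base == 0:
        return 0, {}
    return base, {a: (n, n / base) for a, n in Counter(answers).items()}


def crosstab(rows, segment_column, column):
    """Return {segment: (base, distribution)} so every cell carries its own base."""
    by_segment = defaultdict(list)
    for r in rows:
        by_segment[r[segment_column]].append(r)
    return {seg: distribution(seg_rows, column) for seg, seg_rows in by_segment.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))

base, dist = distribution(rows, "pricing_ok")
print(f"pricing_ok, all respondents (base n={base})")
for answer, (count, share) in sorted(dist.items()):
    print(f"  {answer}: {count} ({share:.0%})")

table = crosstab(rows, "segment", "pricing_ok")
for segment, (seg_base, seg_dist) in sorted(table.items()):
    print(f"pricing_ok, segment={segment} (base n={seg_base})")
    for answer, (count, share) in sorted(seg_dist.items()):
        print(f"  {answer}: {count} ({share:.0%})")
```

Printing the base next to every percentage is the design choice that matters: blank answers drop out of the base, so the report cannot quietly present a small cell as if it were the full sample.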
Slide 6 — The quantitative evidence loop
Here is the loop that carries everything in this module. You frame the question and the decision it informs. The agent helps design the instrument or the experiment — drafting, critiquing, computing the sample size with a script. Your team collects the data with real tooling; the agent never owns the launch. Then the agent runs the analysis: scripts only, assumptions stated, and a skeptic pass that attacks the draft before you see it. You hold the readout gate — spot-check the numbers, compare the claims to what was pre-registered, strip what the sample cannot carry. And then a human makes the decision, recorded separately from the evidence. The dashed line is the part people forget: the decision raises the next question, and the loop starts again.
Slide 7 — Experiment readouts: decided before launch, attacked before reading
Experiments go wrong in predictable ways: peeking at the dashboard and stopping on a good day, rewriting the hypothesis after the results are in, slicing segments until something looks significant. The protection is two documents. Before launch, a pre-registration: the hypothesis with its because clause, one primary metric, the guardrails, a sample size computed by a script, and the ship and kill rules — signed by a human. After the run, a readout where every number came from code and where a skeptic agent has already attacked the draft: did the run stop early, did the hypothesis drift, how many segments were fished, is the effect big enough to matter. The objections get published. And the readout keeps two sections separate: what the data shows, and what the team decides.
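A sample size "computed by a script" could look something like the sketch below. It uses a common normal-approximation formula for comparing two proportions; the baseline rate, expected rate, significance level, and power are made-up planning inputs, and the choice of method is exactly the kind of call to confirm with someone who knows the statistics.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided test of two proportions.

    Normal approximation with unpooled variances:
    n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    if baseline_rate == expected_rate:
        raise ValueError("baseline and expected rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Made-up planning inputs: 10% baseline conversion, 12% hoped-for conversion.
n = sample_size_per_arm(baseline_rate=0.10, expected_rate=0.12)
print(f"Planned sample per arm: {n}")
```

The value the script prints is what goes into the pre-registration, together with the inputs that produced it, so the readout can later state how much of the planned sample was actually collected.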
Slide 8 — The laundering temptation: weak data, confident claims
Here is the temptation this module exists to resist: laundering weak data into confident claims. Nobody lies. They just round up, drop the base, promote one segment into the headline, and let the word "most" do work the numbers never earned. Look at the difference. "Most customers are fine with the new pricing" versus "sixty-four per cent of twelve hundred respondents, with forty-one per cent of the open text coded as conditional acceptance." "A nine per cent lift" versus "a test stopped at forty per cent of its planned sample with a decaying trend." The honest versions are longer, and they are the ones you can defend a year later. The test is simple: every claim names its base, its segment, and its source, or it goes.
Slide 9 — Funnel diagnosis: the numbers locate the wound, the screens explain it
Funnel diagnosis fails for a structural reason: the funnel lives in an analytics tool and the screens live somewhere else, and the conversation between them is a meeting where someone guesses. The workflow puts both on the same desk. An analytics agent computes step conversion from a clean export — overall, and cut by device and segment. Then, for the two or three worst steps, walkthrough agents open the real product in a browser, at desktop and phone widths: they capture the screens, read the copy, trigger the error states, and note what loads slowly. Everything comes back as observations with evidence, not explanations. The number tells you where to look. The walkthrough tells you what you are actually looking at.
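The step-conversion computation can be sketched as follows. The step names and counts are invented for illustration; a real run would aggregate them from the clean export.

```python
# Invented funnel: users reaching each step, cut by device.
STEPS = ["landing", "signup_form", "email_verified"]
COUNTS = {
    "desktop": {"landing": 1000, "signup_form": 400, "email_verified": 200},
    "mobile": {"landing": 1000, "signup_form": 300, "email_verified": 90},
}


def step_conversion(counts, steps):
    """Return (from_step, to_step, entered, continued, rate) for each adjacent pair."""
    rows = []
    for prev_step, next_step in zip(steps, steps[1:]):
        entered = counts[prev_step]
        continued = counts[next_step]
        rate = continued / entered if entered else None
        rows.append((prev_step, next_step, entered, continued, rate))
    return rows


def overall_counts(counts_by_device, steps):
    """Sum the per-device counts into one overall funnel."""
    return {s: sum(c[s] for c in counts_by_device.values()) for s in steps}


def print_funnel(label, counts, steps):
    print(label)
    for prev_step, next_step, entered, continued, rate in step_conversion(counts, steps):
        shown = "n/a" if rate is None else f"{rate:.0%}"
        print(f"  {prev_step} -> {next_step}: {continued}/{entered} = {shown}")


print_funnel("overall", overall_counts(COUNTS, STEPS), STEPS)
for device, counts in COUNTS.items():
    print_funnel(device, counts, STEPS)
```

Keeping the raw counts next to each rate makes it easy to see when a dramatic-looking drop sits on a small cell.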
Slide 10 — From drop-off to ranked, testable hypotheses
The output of a funnel diagnosis is not an answer — it is a ranked list of hypotheses, and the format does the discipline for you. Every hypothesis has to cite a number and a screen-level observation, and has to name the cheapest way to test it: session replays for behaviour, a five-user usability test for comprehension, an experiment only when a hypothesis has survived a cheaper check and justifies a build. Some hypotheses will be about the tracking itself — keep those, because broken instrumentation looks exactly like user behaviour. And nothing on the board is stated as a confirmed cause, because funnel data cannot prove causation. This board is also where your next experiment gets its because clause.
Slide 11 — Worked example: one funnel diagnosed, two hypotheses tested
Let's trace one run end to end. A trial signup funnel was losing sixty per cent of users at email verification, and the team's theory was that people just were not motivated. The segment cut killed that theory in one table: a forty-seven per cent drop on desktop, seventy-one per cent on mobile. The walkthroughs explained why: the email took up to four minutes to arrive, the screen never mentioned a delay, the resend button was below the fold on a phone, and the link expired into a dead-end error page. Two hypotheses came out, each routed to the cheapest check: replays confirmed the dead end within a week, and the copy change went to an experiment. The fixes lifted verification by nineteen percentage points. The diagnosis did not prove anything. It pointed the proof at the right screen, first try.
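A skeptic pass might sanity-check that the three drop-off figures in this example hang together. The sketch below assumes desktop and mobile are the only two device segments and that all three figures describe the same step; under those assumptions the overall rate is a weighted blend of the two, which implies a traffic mix.

```python
def implied_share(overall, rate_a, rate_b):
    """Share of segment B needed for a two-segment blend to equal the overall rate.

    Solves overall = (1 - share) * rate_a + share * rate_b for share.
    """
    if rate_a == rate_b:
        raise ValueError("segment rates must differ to solve for a mix")
    return (overall - rate_a) / (rate_b - rate_a)


# Figures from the worked example: 60% overall, 47% desktop, 71% mobile.
mobile_share = implied_share(overall=0.60, rate_a=0.47, rate_b=0.71)
print(f"Implied mobile share of users at this step: {mobile_share:.0%}")
```

A result between zero and one means the figures are mutually consistent. A result outside that range would mean a third segment, a tracking problem, or a number that did not come from the script.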
Slide 12 — Exercise: critique an existing survey with an agent
Your exercise for this module: take a survey your team has already run, or is about to run, and put it through the critique loop. Start with the construct map — what decision the survey serves and which constructs each question should cover. Then have the agent critique every question: leading wording, double-barreled items, missing options, broken scales, ordering effects. Make it quote the wording and name the bias every time. Then you do the sorting: real defects, judgment calls, and questions that map to no construct and should simply go. If the survey already ran, note which findings would have changed how you read the results. Most people find at least one question they would no longer defend — better to find it this way.
Slide 13 — Summary, and what comes next
Let's close the module. Agents make quantitative work faster; the rigour comes from the loop you put around them. For surveys: constructs first, a critique loop before launch, and a real pilot anyway. For analysis: every number from a script, every claim carrying its base, its segment, and its source. For experiments: pre-register before launch, let a skeptic attack the readout, and keep what the data shows separate from what the team decides. For funnels: the numbers locate the drop-off, the walkthroughs explain it, and the output is ranked hypotheses — never claimed causes. In Module 5 we take the same product data — events, sessions, tickets — and build journey maps and service blueprints from it, with the gaps marked honestly. See you there.