Section 1
Why funnels and screens get diagnosed by different people
In most teams, the funnel lives in an analytics tool owned by a data person and the screens live in a design tool owned by a designer. The conversation between them happens in a meeting where someone says step three looks bad and someone else guesses why. The number and the interface that produced it are rarely on the same desk at the same time.
That gap is where weak fixes come from. The analytics show a 60 percent drop at email verification, so the team shortens the email copy, but nobody opened the verification screen on a phone, saw the link expire after ten minutes, or noticed that the resend button fails silently. The number located the wound; only looking at the screen explains it.
This workflow puts both on the same desk. Analytics agents compute the step conversion and the segment differences with scripts run against the export. Walkthrough agents then drive the actual product at the worst steps — capturing screenshots, reading the copy, triggering the error states, watching what loads slowly — and the two evidence streams merge into ranked hypotheses, each tagged with how to verify it. The workflow proposes explanations; it never claims to have proven one, because correlational funnel data cannot carry that weight.
Section 2
When to reach for this workflow
Use it when a funnel matters and nobody can say precisely why it leaks: trial signup, onboarding, checkout, upgrade. Use it before commissioning a redesign of a step that the analytics merely correlate with the problem, and before running an experiment whose hypothesis is currently a guess: this workflow produces the evidence-backed "because" clause that the experiment workflow asks for.
Half a day buys one funnel done properly: the export, the scripts, the walkthroughs at two or three steps, and the hypothesis board. Trying to diagnose four funnels in the same half day produces four sets of guesses.
- A signup, onboarding, or checkout funnel with an unexplained drop-off.
- A mobile vs desktop gap nobody has traced to a specific screen.
- Preparation for an experiment: turning a drop-off into a testable hypothesis.
- A quarterly funnel health review you want to be reproducible next quarter.
Section 3
The orchestration pattern: analytics first, walkthroughs where the numbers point
This runs as a Claude Code dynamic workflow: a JavaScript script Claude writes and runs in the background, orchestrating subagents while intermediate results stay in script variables rather than Claude's context. The pattern is dynamic in the useful sense — the workflow does not know in advance which steps deserve walkthroughs; it computes the funnel first, then dispatches walkthrough agents only to the two or three steps where the drop-off and the segment gaps are largest.
Workflows can run up to 16 agents concurrently and up to 1,000 per run, and runs are resumable, which helps when walkthroughs across several viewports take a while. You trigger it by including the word workflow in the prompt or via /effort ultracode, save the stable prompt to .claude/workflows/ in the project (or ~/.claude/workflows/ personally) as a /diagnose-funnel command, and keep the analytics and walkthrough agent definitions in .claude/agents/. The walkthrough agents drive a real browser through Playwright or Playwright MCP, the same capture discipline as the visual QA workflow on this site.
1. Export and clean funnel data
2. Compute step conversion
3. Cut by segment and device
4. Walk through the worst steps
5. Capture screens, copy, error states
6. Rank evidence-linked hypotheses
7. Tag each with a verification method
The numbers decide where the walkthroughs go; the walkthroughs explain what the numbers cannot.
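The selection step can also be made deterministic rather than left to an agent's reading of the tables. The sketch below is one possible way to do it, not the workflow's actual logic: `dropOffs`, `pickWorstSteps`, the input shape, and the interleaving rule are all assumptions, and the counts are made up for illustration.

```javascript
// Hypothetical input: users reaching each step, in funnel order (made-up counts).
const overall = [
  { step: "signup", reached: 1000 }, { step: "verify_email", reached: 400 },
  { step: "create_workspace", reached: 360 }, { step: "invite", reached: 200 },
  { step: "first_action", reached: 180 },
]
const bySegment = {
  desktop: [
    { step: "signup", reached: 500 }, { step: "verify_email", reached: 265 },
    { step: "create_workspace", reached: 240 }, { step: "invite", reached: 130 },
    { step: "first_action", reached: 120 },
  ],
  mobile: [
    { step: "signup", reached: 500 }, { step: "verify_email", reached: 135 },
    { step: "create_workspace", reached: 120 }, { step: "invite", reached: 70 },
    { step: "first_action", reached: 60 },
  ],
}

// Users lost and step-to-step rate for every step after the entry step.
function dropOffs(counts) {
  const out = []
  for (let i = 1; i < counts.length; i++) {
    const prev = counts[i - 1].reached
    out.push({
      step: counts[i].step,
      lost: prev - counts[i].reached,
      stepRate: prev ? counts[i].reached / prev : 0,
    })
  }
  return out
}

// Rank steps by absolute loss and by segment gap, then interleave both rankings.
function pickWorstSteps(overallCounts, segmentCounts, limit = 3) {
  const scores = new Map()
  for (const d of dropOffs(overallCounts)) scores.set(d.step, { step: d.step, lost: d.lost, gap: 0 })
  const rates = new Map()
  for (const counts of Object.values(segmentCounts)) {
    for (const d of dropOffs(counts)) {
      if (!rates.has(d.step)) rates.set(d.step, [])
      rates.get(d.step).push(d.stepRate)
    }
  }
  // Segment gap = spread between the best and worst segment step rate.
  for (const [step, list] of rates) {
    if (scores.has(step)) scores.get(step).gap = Math.max(...list) - Math.min(...list)
  }
  const all = [...scores.values()]
  const byLost = [...all].sort((a, b) => b.lost - a.lost)
  const byGap = [...all].sort((a, b) => b.gap - a.gap)
  const picked = []
  for (let i = 0; i < all.length && picked.length < limit; i++) {
    for (const candidate of [byLost[i], byGap[i]]) {
      if (picked.length < limit && !picked.includes(candidate.step)) picked.push(candidate.step)
    }
  }
  return picked
}

console.log(pickWorstSteps(overall, bySegment)) // [ 'verify_email', 'invite', 'first_action' ]
```

Interleaving the two rankings is a design choice: a step with a modest overall loss but a wide device gap still earns a walkthrough, which matches the "largest drop-off or largest segment gap" rule.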
Section 4
Start from a clean export, not from the dashboard
Work from a flat export — GA4, Amplitude, Mixpanel, or your warehouse — with one row per user or session and one column per funnel step, plus the segment columns you care about: device, plan, acquisition source, country. Dashboards are fine for noticing a problem; they are a poor substrate for diagnosis because nobody can rerun a dashboard glance, and the workflow's first rule is the same as everywhere else on this site: every number in the output was computed by a script against the export, never estimated by the model.
Spend ten minutes on definitions before any agent runs. What counts as reaching a step, what the time window is, whether users can skip steps, and whether the funnel is strictly ordered are decisions that change every downstream number, and they belong to the researcher, written into the brief.
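To see why these definitions matter, here is a minimal sketch of how a single choice, strict ordering, changes a count. The `brief` fields, the `hit` and `reachedCount` helpers, and the three sample rows are illustrative assumptions, not a required schema.

```javascript
// Hypothetical brief fields; the names are illustrative only.
const brief = {
  steps: ["signup", "verify_email", "create_workspace"],
  strictOrder: true, // a step only counts if every earlier step was also reached
}

// Same "reached" test the funnel script uses: "1" or any positive number.
const hit = (row, step) => row[step] === "1" || Number(row[step]) > 0

// Count users reaching steps[index], with or without the strict-order rule.
function reachedCount(rows, steps, index, strictOrder) {
  return rows.filter((row) =>
    strictOrder
      ? steps.slice(0, index + 1).every((s) => hit(row, s))
      : hit(row, steps[index])
  ).length
}

// Made-up rows: the second user created a workspace without verifying.
const sampleRows = [
  { signup: "1", verify_email: "1", create_workspace: "1" },
  { signup: "1", verify_email: "0", create_workspace: "1" },
  { signup: "1", verify_email: "0", create_workspace: "0" },
]

console.log(reachedCount(sampleRows, brief.steps, 2, true))  // 1
console.log(reachedCount(sampleRows, brief.steps, 2, false)) // 2
```

Neither answer is wrong; they answer different questions, which is why the choice belongs in the brief rather than in an agent's judgment.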
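import { readFile } from "node:fs/promises"
// Usage: node analyse-funnel.mjs export.csv --steps signup,verify_email,create_workspace,invite,first_action --segment device
// Minimal CSV parser: assumes a header row and no quoted commas; swap in a CSV library for messier exports.
function parseCsv(text) {
  const [header, ...lines] = text.trim().split(/\r?\n/)
  const columns = header.split(",").map((c) => c.trim())
  return lines.map((line) => {
    const cells = line.split(",")
    return Object.fromEntries(columns.map((c, i) => [c, (cells[i] ?? "").trim()]))
  })
}
// Value following a command-line flag, or undefined when the flag is absent.
function getArg(flag) {
  const i = process.argv.indexOf(flag)
  return i === -1 ? undefined : process.argv[i + 1]
}
// Distinct values of one column, in order of first appearance.
const distinct = (rows, col) => [...new Set(rows.map((r) => r[col]))]
const rows = parseCsv(await readFile(process.argv[2], "utf8"))
const steps = (getArg("--steps") ?? "").split(",").filter(Boolean)
const segmentCol = getArg("--segment")
// Print one funnel table: users reaching each step, the step-to-step rate, and the rate from entry.
function funnelFor(subset, label) {
  console.log("\n## " + label + " (entered: " + subset.length + ")")
  let previous = subset.length
  for (const step of steps) {
    const reached = subset.filter((r) => r[step] === "1" || Number(r[step]) > 0).length
    const stepRate = previous ? ((reached / previous) * 100).toFixed(1) : "0.0"
    const overall = subset.length ? ((reached / subset.length) * 100).toFixed(1) : "0.0"
    console.log("- " + step + ": " + reached + " (step " + stepRate + "%, overall " + overall + "%)")
    previous = reached
  }
}
funnelFor(rows, "All users")
if (segmentCol) {
  for (const value of distinct(rows, segmentCol)) {
    funnelFor(rows.filter((r) => r[segmentCol] === value), segmentCol + " = " + value)
  }
}
// The workflow reads this output to decide which 2-3 steps get walkthroughs.
Section 5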
The walkthrough agents inspect the screens the numbers indict
For each of the worst steps, a walkthrough agent opens the actual product in a browser at desktop and phone widths and does what a careful researcher would do with a notebook: capture the screen, read every piece of copy on it, attempt the step, trigger the obvious error states (wrong password, expired link, invalid address, declined card in a test environment), note what loads slowly or shifts, and check what happens on resume after the user leaves and comes back.
Each observation is recorded with its evidence: a screenshot path, the verbatim copy, the console or network note. Observations are facts about the interface; they are not yet explanations, and the agent is told to keep that distinction or its output gets discarded in the merge.
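One way to hold that line mechanically is to check each record before the merge. The sketch below assumes a hypothetical record shape (`step`, `viewport`, `note`, `screenshot`, `verbatimCopy`, `consoleNote`) and uses a deliberately crude wording heuristic; it illustrates the rule and is not a substitute for reading the notes.

```javascript
// Returns a list of problems; an empty list means the record can enter the merge.
function checkObservation(obs) {
  const problems = []
  // At least one piece of evidence must be attached.
  const hasEvidence = Boolean(obs.screenshot || obs.verbatimCopy || obs.consoleNote)
  if (!hasEvidence) problems.push("no evidence attached")
  if (!obs.step || !obs.viewport) problems.push("missing step or viewport")
  // Crude heuristic: causal wording suggests an explanation, not an observation.
  if (/\b(because|so users|causes?|which is why)\b/i.test(obs.note || "")) {
    problems.push("reads as an explanation")
  }
  return problems
}

const observation = {
  step: "verify_email",
  viewport: 390,
  note: "Resend control is below the fold",
  screenshot: "./evidence/verify_email/390-initial.png",
}
const explanation = {
  step: "verify_email",
  viewport: 390,
  note: "Users leave because the email is slow",
}

console.log(checkObservation(observation)) // []
console.log(checkObservation(explanation)) // [ 'no evidence attached', 'reads as an explanation' ]
```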
---
name: funnel-walkthrough
description: Walks through one funnel step in a real browser, capturing screenshots, copy, error states, and load behavior at desktop and phone widths. Use after the analytics pass has identified the worst drop-off steps.
tools: Read, Bash, mcp__playwright
---
You are inspecting one step of a funnel in a real browser (test environment). For the step you are given:
1. Capture the screen at 1440px and 390px before any interaction.
2. Read and record the verbatim copy: headings, field labels, helper text, button labels, legal text, and anything that sets expectations.
3. Attempt the step the way a first-time user would. Note every required field, every decision, and anything that demands information the user may not have at hand.
4. Trigger the plausible error states (invalid input, expired link, declined test card) and capture how each is communicated and whether recovery is obvious.
5. Note load behavior: spinners over 2 seconds, layout shift, anything that appears broken before it finishes loading.
6. Leave and return: does the step preserve progress?
Report observations only, each with its evidence (screenshot path, verbatim copy, console note). Do not explain why users drop off; that happens in the merge stage with the analytics in view.
1. Open step in browser
2. Capture at 1440 and 390
3. Read the verbatim copy
4. Attempt the step
5. Trigger error states
6. Note load and layout shift
7. Leave and return
Each indicted step gets the same inspection at desktop and phone widths, and everything is recorded as an observation with evidence rather than an explanation.
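For the initial capture at both widths, a minimal Playwright sketch might look like the following. It assumes the `playwright` package is installed; the viewport heights, the file-naming pattern, and the staging URL are assumptions for illustration, and the error-state and resume checks are left out.

```javascript
// Pure helper: one capture job per viewport width for a given step.
function capturePlan(step, url, widths = [1440, 390]) {
  return widths.map((width) => ({
    url,
    // Heights are an assumption; only the widths come from the agent brief.
    viewport: { width, height: width >= 1024 ? 900 : 844 },
    path: "./evidence/" + step + "/" + width + "-initial.png",
  }))
}

// Runs the plan in a headless browser. Playwright is loaded lazily so the
// plan can be built and inspected on a machine without a browser installed.
async function captureAll(plan) {
  const { chromium } = await import("playwright")
  const browser = await chromium.launch()
  try {
    for (const job of plan) {
      const context = await browser.newContext({ viewport: job.viewport })
      const page = await context.newPage()
      await page.goto(job.url)
      await page.screenshot({ path: job.path, fullPage: true })
      await context.close()
    }
  } finally {
    await browser.close()
  }
}

// Hypothetical test-environment URL; never point this at production.
const plan = capturePlan("verify_email", "https://staging.example.test/verify")
console.log(plan.map((job) => job.path))
// To capture for real: captureAll(plan)
```

A fresh browser context per width keeps cookies and storage from one capture out of the next, so each viewport sees the screen the way a first-time visitor would.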
Section 6
The workflow prompt
Paste the prompt once the export, the step definitions, and a test-environment URL are in the project. The merge stage is where the rule about causation is enforced: hypotheses are ranked by how much evidence supports them, and every one carries the method that would verify it.
Run this as a workflow.
Input: ./data/funnel-export.csv (one row per user, step columns and segment columns documented in ./brief.md), and ./brief.md (step definitions, time window, test environment URL and credentials, segments to cut by).
Stage 1 - Analytics: write and run analyse-funnel.mjs against the export. Compute step conversion overall and for each segment in the brief. Report only numbers the script printed. Identify the 2-3 steps with the largest absolute drop-off or the largest segment gap.
Stage 2 - Walkthroughs: for each identified step, launch a funnel-walkthrough agent against the test environment at 1440px and 390px. Collect screenshots, verbatim copy, error-state behavior, and load notes into ./evidence/<step>/.
Stage 3 - Merge: combine the analytics and walkthrough evidence into hypotheses for why users leave at each step. Rank them by evidence strength. Each hypothesis must cite at least one number and at least one walkthrough observation, and must name a verification method: session replays, a usability test, or an experiment. Do not state any hypothesis as a confirmed cause.
Output: ./output/funnel-tables.md, ./output/walkthrough-notes.md, and ./output/hypotheses.md (ranked, evidence-linked, each tagged with its verification method and an owner field left blank for the team).
Section 7
What the orchestration script roughly does
Claude writes the orchestration script when you trigger the workflow; the sketch below shows the shape with an agent() pseudo-API. The dynamic part is visible in the middle: which walkthrough agents get launched depends on what the analytics stage just computed, which is exactly the kind of branching a fixed pipeline cannot do and a script can. It is illustrative, not the literal generated code.
// ES module imports (save as .mjs) so the top-level await below is valid.
import fs from "node:fs"
import { execSync } from "node:child_process"
// agent() is a pseudo-API standing in for however the workflow launches a subagent.
// Stage 1: run the funnel script directly; keep its output in a variable.
const funnelTables = execSync(
  "node analyse-funnel.mjs ./data/funnel-export.csv --steps signup,verify_email,create_workspace,invite,first_action --segment device"
).toString()
// Decide where to look: the workflow picks the worst steps from the output.
const worstSteps = await agent(
  "From these funnel tables, return the 2-3 steps with the largest absolute " +
    "drop-off or the largest segment gap, as a JSON array of step names.\n\n" + funnelTables,
  { model: "sonnet" }
)
// Stage 2: one walkthrough agent per step and viewport, in parallel.
const walkthroughs = await Promise.all(
  JSON.parse(worstSteps).flatMap((step) =>
    [1440, 390].map((width) =>
      agent(
        "Act as the funnel-walkthrough agent for step " + step + " at " + width +
          "px against the test environment in ./brief.md. Save screenshots to " +
          "./evidence/" + step + "/ and report observations with evidence only.",
        { model: "sonnet" }
      )
    )
  )
)
// Stage 3: merge into ranked, evidence-linked hypotheses.
const hypotheses = await agent(
  "Merge the funnel tables and walkthrough observations into ranked hypotheses " +
    "for each drop-off. Every hypothesis cites a number and an observation, and " +
    "names its verification method (session replays, usability test, experiment). " +
    "Never present a hypothesis as a confirmed cause.\n\nTABLES:\n" + funnelTables +
    "\n\nWALKTHROUGHS:\n" + JSON.stringify(walkthroughs),
  { model: "opus" }
)
// Make sure the output directory exists before writing the hypothesis board.
fs.mkdirSync("./output", { recursive: true })
fs.writeFileSync("./output/hypotheses.md", hypotheses)
Section 8
Step by step through one diagnosis
Write the brief: step definitions, time window, segments, and a test environment the walkthrough agents may safely use — never production accounts with real user data. Export the funnel, run the workflow, and expect the analytics stage to finish in minutes and the walkthroughs to take the bulk of the run.
Read the hypothesis board with the funnel tables beside it, and walk the worst step yourself once on your own phone; ten minutes of first-hand experience is the cheapest calibration available for judging which hypotheses ring true. Then assign verification: which hypotheses go to session replays this week, which earn a five-user usability test, which become an experiment via the experiment design workflow.
The half-day estimate covers the brief, the run, your own walkthrough, and the prioritization conversation. The artifacts — tables, screenshots, hypotheses with owners — are the input the next quarter's review compares against.
- Session replays: fastest check for behavioral hypotheses. Do users actually hit the error state, scroll past the field, or abandon at the load stall?
- Usability test: best for comprehension and expectation hypotheses. Five users attempting the step reveal whether the copy means what the team thinks it means.
- Experiment: reserved for hypotheses that survived a cheaper check and justify a build, designed and read out via the experiment workflow.
- Instrumentation check: for when the walkthrough suggests the number itself is wrong, such as a step event firing twice or not firing on mobile.
Every hypothesis leaves the workflow tagged with the cheapest method that could falsify it.
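The tagging rule can be sketched as a small lookup. The `kind`, `survivedCheaperCheck`, and `needsBuild` fields are hypothetical names for this illustration, and the mapping only restates the guidance above; it is not a rule the workflow is known to hard-code.

```javascript
// Cheapest method that could falsify each kind of hypothesis.
const CHEAPEST_METHOD = {
  behavioral: "session replays",
  comprehension: "usability test",
  tracking: "instrumentation check",
}

function verificationMethod(hypothesis) {
  // An experiment is only earned after a cheaper check, and only when a build is at stake.
  if (hypothesis.survivedCheaperCheck && hypothesis.needsBuild) return "experiment"
  return CHEAPEST_METHOD[hypothesis.kind] || "unassigned"
}

console.log(verificationMethod({ kind: "behavioral" }))    // session replays
console.log(verificationMethod({ kind: "comprehension" })) // usability test
console.log(
  verificationMethod({ kind: "behavioral", survivedCheaperCheck: true, needsBuild: true })
) // experiment
```

Returning "unassigned" for an unrecognized kind leaves the decision visibly open for the team instead of defaulting to a method nobody chose.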
Section 9
Case study: 60 percent dropping at email verification
A trial signup funnel lost 60 percent of users between account creation and verified email, the worst step by a wide margin, and the team's standing theory was that people simply did not want the product enough to open an email. The segment cut undermined that immediately: the drop was 47 percent on desktop and 71 percent on mobile, which a motivation story cannot explain.
The walkthroughs supplied the candidates. The verification email took up to four minutes to arrive in the test runs, while the interstitial screen said "check your email" with no mention of delay and no resend option visible without scrolling on a phone. The link expired after ten minutes, which is shorter than the observed delivery delay plus a normal distraction, and an expired link landed on a generic error page with no path back. The top-ranked hypothesis tied those observations to the mobile gap and was tagged for session replays; the second, about the missing delay expectation, was tagged for a copy-and-resend experiment.
Replays confirmed the expired-link dead end within a week. Lengthening the expiry, adding a resend button above the fold, and setting the delay expectation in the interstitial copy lifted verification completion by 19 points over the following month. The workflow did not prove causation — the team's verification work did — but it pointed the verification at the right screen on the first try.
Section 10
Case study: a mobile checkout drop-off traced to address validation
An e-commerce funnel showed checkout completion of 71 percent on desktop and 44 percent on mobile, with the gap concentrated at the address step. The walkthrough agent, attempting the step at 390 pixels with deliberately imperfect input, found a candidate cause quickly: the address validator rejected apartment-style addresses unless they matched a strict format, the error message said only "please enter a valid address", and the error summary rendered above the form. On a phone that put it off-screen, so the user saw nothing change and a button that appeared to do nothing.
The analytics could never have named that. The hypothesis linked the 27-point device gap to the captured error state and the off-screen summary, and was tagged for session replays filtered to mobile address-step abandons; replays showed users retrying the same address two and three times before leaving. The fix — accept the input and confirm rather than block, and move errors inline next to the field — was shipped behind an experiment that showed an 11 percent relative improvement in mobile checkout completion.
Section 11
Case study: the analytics blamed step 3, the walkthrough blamed step 1
An onboarding funnel for a B2B tool showed its largest numerical drop at step three, where users were asked to invite teammates, and the team was preparing to make invitations skippable. The walkthrough pass read the whole flow rather than only the indicted step, and the merge stage surfaced a different story: step one promised "set up your workspace in two minutes", step two delivered a workspace that was empty and inert until data was connected, and by step three users were being asked to invite colleagues into something that did not yet do anything.
The top hypothesis was that the damage was done in step one's expectation-setting, and that step three was merely where the disappointment became measurable — supported by the numbers (users who connected data before the invite step completed it at three times the rate) and by the captured copy. It was tagged for a usability test rather than an experiment, because the question was about expectations, not button placement.
Five usability sessions later the pattern was unmistakable, and the team reordered onboarding so that connecting data came before the invite ask, which left the two-minute promise attached to a flow that could keep it. Making step three skippable, the original plan, would have moved the drop-off downstream and called it a win.
Section 12
Good vs bad diagnosis output
A useful hypothesis names a number, names a screen-level observation, and names the method that would falsify it. A weak one restates the funnel chart in words, or leaps to a confirmed cause that correlational data cannot support, or prescribes a redesign with no evidence link at all.
- Weak: Users drop at email verification because they are not motivated enough
- Useful: Verification drop-off is 47% on desktop vs 71% on mobile; the link expires in 10 minutes, delivery took up to 4 minutes, and the resend control is below the fold at 390px; verify with session replays of mobile abandons
- Weak: The address step has a 27-point mobile gap (restates the table, explains nothing)
- Useful: The validator rejects apartment formats and the error summary renders off-screen at 390px (screenshot evidence/address/390-error.png); verify with replays filtered to repeated address submissions
- Weak: Step 3 is broken and invitations should be skippable
- Useful: Users who connect data before the invite step complete it at 3x the rate; step 1 promises a two-minute setup the empty workspace cannot keep; verify with a 5-user usability test of the first session
Evidence-linked hypotheses can be verified or killed; restated charts and asserted causes can only be argued about.
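The three requirements, plus the ban on asserted causes, are simple enough to check before a hypothesis reaches the board. This is a sketch under assumed field names (`statement`, `numbers`, `observations`, `verification`); the wording check is a rough heuristic and will miss subtler overclaims.

```javascript
const ALLOWED_METHODS = ["session replays", "usability test", "experiment"]

// Returns the reasons a hypothesis is not yet board-ready; empty means it passes.
function reviewHypothesis(h) {
  const issues = []
  if (!h.numbers || h.numbers.length === 0) issues.push("cites no number")
  if (!h.observations || h.observations.length === 0) issues.push("cites no walkthrough observation")
  if (!ALLOWED_METHODS.includes(h.verification)) issues.push("names no verification method")
  // Rough check for language that claims more than correlational data can carry.
  if (/\b(proves?|proven|confirmed|the cause is)\b/i.test(h.statement || "")) {
    issues.push("asserts a confirmed cause")
  }
  return issues
}

const useful = {
  statement: "The validator rejects apartment formats and the error summary renders off-screen at 390px",
  numbers: ["27-point mobile gap at the address step"],
  observations: ["evidence/address/390-error.png"],
  verification: "session replays",
}
const weak = {
  statement: "Users drop at email verification because they are not motivated enough",
}

console.log(reviewHypothesis(useful)) // []
console.log(reviewHypothesis(weak).length) // 3
```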
Section 13
Limits: what this workflow cannot prove
Funnel data is correlational, and the workflow's outputs are written to respect that: a drop-off plus a broken-looking error state is a strong hypothesis, not a demonstrated cause. Causation comes from the verification methods the hypotheses are tagged with — replays, usability tests, experiments — and from the humans who run them. The workflow also cannot see what the instrumentation does not record; if the mobile event fires unreliably, the most important hypothesis may be about the tracking, and the workflow should be allowed to say so.
Walkthrough agents inspect a test environment with test data, which means they miss whatever only happens with real payment providers, real email deliverability, real network conditions, and real user accounts mid-history. And the prioritization at the end — which hypothesis is worth a five-user test versus an engineering quarter — is a business judgment about cost, risk, and strategy that belongs to the team, with the evidence board in front of it rather than in place of it.
- Cannot establish causation from funnel data; it can only rank hypotheses for verification.
- Cannot see beyond the instrumentation; broken tracking masquerades as user behavior.
- Cannot reproduce real-world conditions a test environment lacks.
- Cannot decide which fix is worth its cost; that is the team's call.
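Some tracking problems do leave fingerprints in the export itself. The sketch below shows two cheap checks, both assumptions about what the data might look like: a step that more users "reached" than the step before it, and a step column holding a count above one. Neither proves the instrumentation is broken; each is a prompt to look.

```javascript
// Steps that more users reached than the preceding step: suspicious in an ordered funnel.
function impossibleSteps(counts) {
  const flagged = []
  for (let i = 1; i < counts.length; i++) {
    if (counts[i].reached > counts[i - 1].reached) flagged.push(counts[i].step)
  }
  return flagged
}

// Rows where a step column holds a count above 1, which may indicate a double-firing event.
// Assumes the export stores event counts rather than 0/1 flags.
function doubleFired(rows, step) {
  return rows.filter((row) => Number(row[step]) > 1).length
}

// Made-up mobile counts: more workspaces created than emails verified.
const mobileCounts = [
  { step: "signup", reached: 500 },
  { step: "verify_email", reached: 135 },
  { step: "create_workspace", reached: 150 },
]
const exportRows = [
  { verify_email: "1" },
  { verify_email: "2" },
  { verify_email: "0" },
]

console.log(impossibleSteps(mobileCounts))            // [ 'create_workspace' ]
console.log(doubleFired(exportRows, "verify_email"))  // 1
```

If the funnel is not strictly ordered by definition, the first check will flag legitimate skips, so it should be read against the step definitions in the brief.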
Section 14
The reusable funnel diagnosis workflow
Save the prompt to .claude/workflows/ as /diagnose-funnel, keep the walkthrough agent definition in .claude/agents/, and keep analyse-funnel.mjs and the brief template in the repo. Rerun it on the same funnel each quarter with the same step definitions, and the comparison across quarters becomes evidence in its own right.
1. Write the brief: step definitions, time window, segments, and a safe test environment.
2. Export the funnel as a flat file; never diagnose from a dashboard glance.
3. Run the analytics stage: step conversion and segment cuts computed by analyse-funnel.mjs.
4. Let the workflow pick the 2-3 worst steps and dispatch walkthrough agents at 1440px and 390px.
5. Collect screenshots, verbatim copy, error states, and load notes as observations with evidence.
6. Merge into ranked hypotheses; every one cites a number, an observation, and a verification method.
7. Walk the worst step yourself once, then assign each hypothesis to replays, a usability test, or an experiment.
8. Archive the tables, evidence, and decisions; rerun next quarter against the same definitions.
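The quarter-over-quarter comparison is only meaningful when both runs used the same step definitions, and then it reduces to a small diff. A minimal sketch, assuming each archived run is reduced to a hypothetical `{ step: stepRatePercent }` object; the sample figures echo the email-verification case study (40 percent completion before, a 19-point lift after) and are otherwise made up.

```javascript
// Change in step rate, in percentage points, for every step present in both runs.
function compareQuarters(previous, current) {
  return Object.keys(current)
    .filter((step) => step in previous)
    .map((step) => ({
      step,
      previous: previous[step],
      current: current[step],
      changePoints: Number((current[step] - previous[step]).toFixed(1)),
    }))
}

const lastQuarter = { verify_email: 40, create_workspace: 90 }
const thisQuarter = { verify_email: 59, create_workspace: 89.5 }

for (const row of compareQuarters(lastQuarter, thisQuarter)) {
  console.log(row.step + ": " + row.previous + "% -> " + row.current + "% (" + row.changePoints + " points)")
}
```

Steps that exist in only one run are skipped rather than compared, since a renamed or redefined step is exactly the case where a difference would mislead.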