Section 1
Why most experiment programs decay into theater
Teams rarely fail at running experiments; they fail at being honest about them. The hypothesis gets written after the variant is built, the test runs until the dashboard turns green, the readout quotes whichever segment moved, and six months later nobody can say which of the shipped winners actually changed anything. The experiment platform did its job; the discipline around it did not.
The failures have names. Peeking — checking significance daily and stopping on a good day. HARKing — hypothesizing after the results are known. Segment fishing — slicing until something clears the bar. Novelty effects mistaken for durable lift. None of these require bad intent; they only require a deadline and an unguarded readout.
This workflow puts the discipline into a repeatable shape: the design stage forces the hypothesis, the metrics, the minimum detectable effect, and the ship criteria to exist before launch, with the sample size computed by a script rather than negotiated in a meeting. The readout stage runs the statistics with code, then hands the draft to a skeptic agent whose only job is to attack it. What survives goes into a readout that separates what the data shows from what the team decides.
Section 2
When to reach for this workflow
Use it for experiments that will change what ships: checkout and pricing-page tests, navigation and onboarding changes, anything where a wrong call costs real money or real users. Use it especially when stakeholders are emotionally invested in one arm winning, because that is exactly when the readout needs an adversary who is not in the room.
Do not use it to launder a decision that has already been made. If the team will ship the redesign regardless of the result, say so and skip the experiment; a test you will not act on is theater with extra latency.
- Conversion-critical surfaces: checkout, pricing, signup, onboarding.
- Experiments with guardrail risks: support volume, refunds, latency, accessibility.
- Tests where traffic is limited and the sample-size math decides feasibility.
- Readouts that will be presented to leadership and quoted later.
Section 3
The orchestration pattern: staged, with an adversary built in
The workflow runs as a Claude Code dynamic workflow: a JavaScript orchestration script that Claude writes and runs in the background, calling subagents while intermediate results — the metrics export, the per-arm tables, the draft readout — stay in script variables instead of Claude's context. Up to 16 agents can run concurrently and up to 1,000 per run, runs are resumable, and the finished prompt can be saved to .claude/workflows/ in the project (or ~/.claude/workflows/ personally) as a /design-experiment or /readout command. You trigger it by including the word workflow in the prompt or via /effort ultracode; the analyst and skeptic agent definitions live in .claude/agents/.
The stages are sequential on purpose. Design must finish — and be signed by a human — before launch. Analysis must finish before the skeptic sees it, and the skeptic must finish before the team sees anything. The adversarial review is not a politeness pass; the skeptic agent is prompted to assume the analysis is wrong and to find out how, and its objections are published in the readout whether or not they change the conclusion.
Diagram, stages in order: 1. Evidence and hypothesis; 2. Metrics and MDE; 3. Sample size script; 4. Ship criteria locked; 5. Run and monitor; 6. Analysis with code; 7. Skeptic attack and readout.
Design is locked before launch; after the run, the analysis is attacked by a skeptic agent before any human reads the conclusion.
Section 4
Design stage: the hypothesis comes from evidence
A hypothesis is a causal claim with a reason: because we observed X (in research, support tickets, funnel data), we believe change Y will move metric Z for population W. The design agent's first job is to check that the because clause exists and points at real evidence — a finding from a usability study, a coded theme from tickets, a drop-off the funnel diagnosis workflow surfaced. A variant without an evidence-backed hypothesis is a guess, and guesses are fine, but they should be labeled as such in the backlog, not dressed up after the fact.
The second job is metric discipline: one primary metric the hypothesis is about, and explicit guardrails the change must not damage — support contact rate, refund rate, page latency, accessibility checks. Guardrails are where redesigns quietly do harm, and they are also where unexpected wins hide, as the checkout case below shows.
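One way to keep the harm directions explicit is to write the metric plan down as data and check it mechanically. The sketch below is illustrative only: the metric names, the `harmDirection` field, and `validateMetricPlan` are assumptions made for this example, not part of any experiment platform's API.

```javascript
// Hypothetical metric plan: one primary metric, plus guardrails that each
// state the direction that counts as harm. All names are illustrative.
const metricPlan = {
  primary: { name: "completed_checkout_rate", goodDirection: "up" },
  guardrails: [
    { name: "support_contact_rate", harmDirection: "up" },
    { name: "refund_rate", harmDirection: "up" },
    { name: "page_latency_p95_ms", harmDirection: "up" },
  ],
}

// Returns a list of problems; an empty list means the plan is well-formed.
function validateMetricPlan(plan) {
  const problems = []
  if (!plan.primary || !plan.primary.name) problems.push("missing primary metric")
  const guardrails = plan.guardrails ?? []
  if (guardrails.length < 2 || guardrails.length > 4) {
    problems.push("expected 2-4 guardrails, got " + guardrails.length)
  }
  for (const g of guardrails) {
    if (g.harmDirection !== "up" && g.harmDirection !== "down") {
      problems.push("guardrail " + g.name + " has no harm direction")
    }
  }
  return problems
}

console.log(validateMetricPlan(metricPlan)) // prints []
```

A check like this does not judge whether the guardrails are the right ones; it only refuses a plan that leaves the harm direction unstated.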
Section 5
Sample size is computed, never negotiated
The minimum detectable effect and the sample size come from a script, run before launch, with the inputs written down: baseline conversion, the smallest lift worth shipping for, significance level, and power. If the script says the test needs nine weeks of traffic and the team has three, that is a design decision to make now — test a bolder change, accept a larger MDE, or do not run the test — not a surprise to discover after an inconclusive readout.
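The "accept a larger MDE" option can be made concrete by inverting the calculation: given the traffic you actually have, what is the smallest relative lift the test could detect? The following is a minimal sketch under stated assumptions: it fixes alpha at 0.05 (two-sided) and power at 0.80 through hard-coded z-scores, uses the same two-proportion normal approximation, assumes a baseline rate of at most 0.5, and the function names and the 60,000-per-arm figure are hypothetical.

```javascript
// z-scores for two-sided alpha = 0.05 and power = 0.80 (fixed for this sketch).
const Z_ALPHA = 1.959964
const Z_BETA = 0.841621

// Required n per arm for a baseline rate and a relative lift
// (two-proportion test, normal approximation).
function requiredPerArm(baseline, relativeLift) {
  const p1 = baseline
  const p2 = baseline * (1 + relativeLift)
  const pBar = (p1 + p2) / 2
  const numerator =
    (Z_ALPHA * Math.sqrt(2 * pBar * (1 - pBar)) +
      Z_BETA * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
  return Math.ceil(numerator / (p2 - p1) ** 2)
}

// Smallest relative lift detectable with the available sample, found by
// bisection: the required n shrinks as the lift grows.
function detectableLift(baseline, perArmAvailable) {
  let lo = 0.0001 // lifts this small need enormous samples
  let hi = 1.0 // cap the search at a +100% relative lift
  if (requiredPerArm(baseline, hi) > perArmAvailable) return null // infeasible
  for (let i = 0; i < 60; i++) {
    const mid = (lo + hi) / 2
    if (requiredPerArm(baseline, mid) > perArmAvailable) lo = mid
    else hi = mid
  }
  return hi
}

// Hypothetical: three weeks at 20,000 visitors per arm per week.
const threeWeekLift = detectableLift(0.034, 60000)
console.log("Detectable relative lift: " + (threeWeekLift * 100).toFixed(1) + "%")
```

If the printed lift is larger than any effect the change could plausibly produce, the honest options are the ones named above: a bolder change or no test.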
The model never estimates these numbers. It writes the script, runs it, and reports what it printed, and the script stays in the repo so the next experiment uses the same math.
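// Two-proportion sample size per arm, normal approximation.
// Usage: node sample-size.mjs <baselineRate> <minDetectableLift> [alpha] [power]
// Example: node sample-size.mjs 0.034 0.10   (lift is relative: 0.10 means +10%)
const [, , baseArg, liftArg, alphaArg, powerArg] = process.argv
const p1 = Number(baseArg)
const p2 = p1 * (1 + Number(liftArg))
const alpha = Number(alphaArg ?? 0.05)
const power = Number(powerArg ?? 0.8)

// Refuse to print a number when the inputs are missing or out of range.
if (!(p1 > 0 && p1 < 1 && p2 > p1 && p2 < 1)) {
  console.error("Usage: node sample-size.mjs <baselineRate> <minDetectableLift> [alpha] [power]")
  process.exit(1)
}

// Inverse standard normal CDF, Abramowitz & Stegun formula 26.2.23.
// Absolute error is below 4.5e-4, which is ample for planning purposes.
function inverseNormal(p) {
  const q = p < 0.5 ? p : 1 - p
  const t = Math.sqrt(-2 * Math.log(q))
  const z = t - (2.515517 + 0.802853 * t + 0.010328 * t * t) /
    (1 + 1.432788 * t + 0.189269 * t * t + 0.001308 * t * t * t)
  return p < 0.5 ? -z : z
}

const zAlpha = inverseNormal(1 - alpha / 2)
const zBeta = inverseNormal(power)
const pBar = (p1 + p2) / 2
const n = Math.ceil(
  ((zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) + zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) /
    (p2 - p1) ** 2
)
console.log("Baseline rate: " + p1)
console.log("Target rate: " + p2.toFixed(4))
console.log("Alpha / power: " + alpha + " / " + power)
console.log("Required n per arm: " + n)
console.log("At 40,000 eligible visitors/week split 50/50, that is roughly " +
  Math.ceil(n / 20000) + " weeks. Decide feasibility before launch.")
Section 6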
Ship criteria are agreed before launch
The design stage ends with a one-page pre-registration: the hypothesis, the primary and guardrail metrics, the MDE and computed sample size, the run length, the segments that will be reported (decided now, not discovered later), and the decision rule — what result ships the variant, what result kills it, and what counts as inconclusive. A human signs it. The workflow stores it next to the readout so the two can be compared line by line after the run.
This document is what the skeptic agent will later hold the readout against. Most of the dishonesty it catches is not fabrication; it is drift between what the team said it would do and what the readout quietly does instead.
Run this as a workflow. Input: ./experiment/evidence.md (the research, ticket themes, or funnel findings motivating this test) and ./experiment/context.md (traffic, baseline rates, constraints).
Stage 1 - Hypothesis: draft the hypothesis in the form: because [evidence], we believe [change] will move [primary metric] for [population]. Flag it if the because clause does not point at a specific piece of evidence in evidence.md.
Stage 2 - Metrics: propose one primary metric and 2-4 guardrail metrics with the direction that counts as harm. Justify each in one sentence.
Stage 3 - Power: write and run sample-size.mjs with the baseline rate from context.md and the smallest lift worth shipping for. Report n per arm and run length at current traffic. Never estimate these numbers without running the script.
Stage 4 - Pre-registration: assemble ./experiment/pre-registration.md with hypothesis, metrics, MDE, sample size, run length, the segments that will be reported, and the ship / kill / inconclusive decision rules.
Stop there. A human signs this document before anything launches, and the experiment platform configuration is done by the team, not by this workflow.
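The contents the pre-registration must carry can be sketched as a completeness check that runs before anyone is asked to sign. This is an illustration, not a prescribed schema: the field names, the `missingFields` helper, and every value in the draft (including the sample size and the decision rules) are hypothetical.

```javascript
// Fields the pre-registration needs before launch; key names are illustrative.
const REQUIRED_FIELDS = [
  "hypothesis", "primaryMetric", "guardrailMetrics", "mde",
  "sampleSizePerArm", "runLengthDays", "reportedSegments",
  "shipRule", "killRule", "inconclusiveRule", "signedBy",
]

// Returns the names of required fields that are absent or empty strings.
function missingFields(preReg) {
  return REQUIRED_FIELDS.filter((key) => {
    const value = preReg[key]
    return value === undefined || value === null || value === ""
  })
}

// Hypothetical draft: complete except for the human signature.
const draft = {
  hypothesis: "Because address errors dominate checkout tickets, we believe " +
    "a single-page checkout will raise completed checkout for all visitors.",
  primaryMetric: "completed_checkout_rate",
  guardrailMetrics: ["support_contact_rate", "refund_rate"],
  mde: 0.10,
  sampleSizePerArm: 46800,
  runLengthDays: 42,
  reportedSegments: ["new_users", "mobile"],
  shipRule: "primary interval excludes zero in favor of variant; no guardrail harmed",
  killRule: "primary interval excludes zero in favor of control, or a guardrail is harmed",
  inconclusiveRule: "anything else",
}

console.log(missingFields(draft)) // prints [ 'signedBy' ]
```

The point of the check is the last field: the document is not a pre-registration until a named human has signed it.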
Section 7
Readout stage: the analysis agent runs the stats with code
After the run completes — at the pre-registered sample size, not on a good-looking Tuesday — export the per-arm results and let the analysis agent work from the file. It writes and runs the statistics: conversion per arm with confidence intervals, the difference and its uncertainty, guardrail metrics with the same treatment, and the pre-registered segments only. Every number in the readout traces to script output.
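As a rough illustration of what such an analysis script computes, here is a minimal sketch of per-arm rates and their difference with 95% intervals. It assumes the normal (Wald) approximation with an unpooled standard error for the difference; the function names and the counts are hypothetical, not real results.

```javascript
const Z_95 = 1.959964 // two-sided 95% normal quantile

// Wald interval for one arm's conversion rate. Reasonable at large counts;
// a Wilson interval is a common alternative when counts are small.
function rateWithInterval(conversions, visitors) {
  const rate = conversions / visitors
  const se = Math.sqrt((rate * (1 - rate)) / visitors)
  return { rate, low: rate - Z_95 * se, high: rate + Z_95 * se }
}

// Difference (variant minus control) with an unpooled standard error.
function differenceWithInterval(control, variant) {
  const pc = control.conversions / control.visitors
  const pv = variant.conversions / variant.visitors
  const se = Math.sqrt(
    (pc * (1 - pc)) / control.visitors + (pv * (1 - pv)) / variant.visitors
  )
  const diff = pv - pc
  return { diff, low: diff - Z_95 * se, high: diff + Z_95 * se }
}

// Hypothetical counts for illustration only.
const control = { conversions: 4092, visitors: 120000 }
const variant = { conversions: 4152, visitors: 120000 }

console.log(rateWithInterval(control.conversions, control.visitors))
console.log(rateWithInterval(variant.conversions, variant.visitors))
console.log(differenceWithInterval(control, variant))
```

With these made-up counts the interval on the difference runs from below zero to above it, which is what "no detectable effect at this sample size" looks like in numbers.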
The analysis agent also reconstructs the timeline: when the test started, whether the assignment ratio held, whether anything shipped mid-test, and whether the result drifted across the run in the way novelty effects do. Those facts go into the readout as context, because the skeptic will ask for them anyway.
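Whether the assignment ratio held is usually checked with a sample ratio mismatch test. The sketch below assumes a two-arm test and uses a chi-square goodness-of-fit statistic with one degree of freedom against a strict p = 0.001 threshold; the function name, the threshold choice, and the example counts are assumptions for illustration.

```javascript
// Chi-square critical value for p = 0.001 with 1 degree of freedom.
// A strict threshold is a common convention for ratio checks, because a
// mismatch signals a broken assignment mechanism rather than a small effect.
const CHI2_CRITICAL_P001 = 10.828

// Goodness-of-fit test of observed arm sizes against the planned split.
function sampleRatioCheck(controlN, variantN, expectedControlShare = 0.5) {
  const total = controlN + variantN
  const expectedControl = total * expectedControlShare
  const expectedVariant = total - expectedControl
  const chi2 =
    (controlN - expectedControl) ** 2 / expectedControl +
    (variantN - expectedVariant) ** 2 / expectedVariant
  return { chi2, mismatch: chi2 > CHI2_CRITICAL_P001 }
}

// Hypothetical counts: a 300-visitor gap is noise, a 2,000-visitor gap is not.
console.log(sampleRatioCheck(120000, 120300))
console.log(sampleRatioCheck(120000, 122000))
```

A flagged mismatch does not say which arm is wrong; it says the comparison between arms can no longer be trusted until the cause is found.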
Section 8
The skeptic agent attacks the readout
The skeptic agent receives the draft readout, the pre-registration, and the script outputs, and is prompted to assume the conclusion is wrong. It checks for peeking and early stopping against the pre-registered run length, for HARKing by diffing the readout's hypothesis against the signed one, for segment fishing by counting how many cuts were examined versus pre-registered, for novelty effects in the weekly trend, and for guardrail damage the headline ignores. It writes its objections as a numbered list, each with the evidence it rests on, and states which objections are fatal, which require caveats, and which it raises and withdraws.
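The segment-fishing count is the most mechanical of these checks, and can be sketched as a set comparison. The example below also reports a Bonferroni-adjusted significance level, which is one conservative way to express a multiple-comparisons caveat and is a choice made here, not something the workflow prescribes; the function and segment names are hypothetical.

```javascript
// Compares the segments a readout examined against the signed list.
function segmentAudit(preRegistered, examined, alpha = 0.05) {
  const allowed = new Set(preRegistered)
  const unregistered = examined.filter((segment) => !allowed.has(segment))
  // Bonferroni: divide alpha by the number of cuts actually examined.
  const adjustedAlpha = examined.length > 0 ? alpha / examined.length : alpha
  return { examinedCount: examined.length, unregistered, adjustedAlpha }
}

// Hypothetical readout that looked at five cuts when two were pre-registered.
const audit = segmentAudit(
  ["new_users", "mobile"],
  ["new_users", "mobile", "returning_users", "tablet", "ios"]
)
console.log(audit.unregistered) // prints [ 'returning_users', 'tablet', 'ios' ]
console.log(audit.adjustedAlpha)
```

Any win that appears only in an unregistered segment is a lead for a follow-up test, not a finding.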
The objections are published in the readout. A readout that says the skeptic raised the early-stopping concern and here is why it does not apply is more credible than one that never mentions the risk — and on the occasions the objection is fatal, it is far cheaper to hear it from an agent before the meeting than from a competitor's data after the rollout.
---
name: experiment-skeptic
description: Adversarial reviewer for experiment readouts. Assumes the conclusion is wrong and hunts for peeking, HARKing, segment fishing, novelty effects, and guardrail damage. Use after analysis, before any human reads the readout.
tools: Read, Bash
---
You are reviewing an experiment readout as a hostile statistician. Assume the conclusion is wrong and try to demonstrate how. Check, in order:
1. Stopping rule: did the run reach the pre-registered sample size and duration? Quote both documents. Any early stop is a finding.
2. HARKing: does the readout's hypothesis match pre-registration.md word for word in substance? Note any drift.
3. Segment fishing: how many segments were examined vs pre-registered? Any segment-only win must carry a multiple-comparisons caveat.
4. Novelty and seasonality: does the weekly trend decay? Did the run span a sale, a holiday, or a release that contaminates it?
5. Guardrails: did any guardrail move adversely, even if the primary won?
6. Practical significance: is the confidence interval consistent with an effect too small to matter, even if it excludes zero?
Output a numbered list of objections. For each: the evidence it rests on, whether it is fatal / requires a caveat / raised and withdrawn. Do not soften objections to be agreeable, and do not invent objections the evidence does not support.
Diagram, skeptic checks in order: 1. Stopping rule vs pre-registration; 2. Hypothesis drift check; 3. Segment fishing count; 4. Novelty and seasonality; 5. Guardrail damage; 6. Practical significance; 7. Objections published.
The skeptic works through the same checks on every readout, and each objection is published with its disposition rather than negotiated away.
Section 9
What the orchestration script roughly does
As elsewhere on this site, Claude writes the orchestration script itself when you trigger the workflow. The sketch shows the shape with an agent() pseudo-API: analysis and timeline reconstruction can run in parallel, the skeptic runs strictly after them, and the readout assembly runs last with everything in front of it. It is illustrative, not the literal generated code.
// Illustrative sketch, written as an ES module so top-level await is valid.
// agent() is the pseudo-API named in the text; it is not defined here.
import fs from "node:fs"

const preReg = fs.readFileSync("./experiment/pre-registration.md", "utf8")

// Stage 1: analysis and timeline run in parallel; both work from files and
// return only computed tables and dated facts.
const [analysis, timeline] = await Promise.all([
  agent(
    "Write and run a Node script against ./data/results.csv. Compute per-arm " +
      "conversion with 95% confidence intervals, the difference, guardrail " +
      "metrics, and the pre-registered segments only (see below). Report only " +
      "what the script printed.\n\n" + preReg,
    { model: "sonnet" }
  ),
  agent(
    "From ./data/assignment-log.csv and ./experiment/changelog.md, reconstruct " +
      "the run timeline: start/end dates, assignment ratio over time, weekly " +
      "trend of the primary metric, and anything that shipped mid-test.",
    { model: "sonnet" }
  ),
])

// Stage 2: the skeptic attacks the draft before any human reads it.
const objections = await agent(
  "Act as the experiment-skeptic agent. Attack this analysis against the " +
    "pre-registration. List objections with evidence; mark each fatal, caveat, " +
    "or withdrawn.\n\nPRE-REGISTRATION:\n" + preReg +
    "\n\nANALYSIS:\n" + analysis + "\n\nTIMELINE:\n" + timeline,
  { model: "opus" }
)

// Stage 3: assemble the readout with evidence and decision kept separate.
// The earlier outputs are passed in so this agent has everything in front of it.
const readout = await agent(
  "Assemble readout.md using the template in ./templates/readout-template.md. " +
    "Section 1: what the data shows (script outputs only). Section 2: skeptic " +
    "objections and their dispositions. Section 3: what the team decides: " +
    "leave this section as questions for the humans, do not answer them." +
    "\n\nANALYSIS:\n" + analysis + "\n\nTIMELINE:\n" + timeline +
    "\n\nOBJECTIONS:\n" + objections,
  { model: "opus" }
)

fs.mkdirSync("./output", { recursive: true }) // make sure the target folder exists
fs.writeFileSync("./output/readout.md", readout)
Section 10
Step by step through one experiment
Gather the evidence and run the design workflow; expect about an hour to a signed pre-registration, including the conversation about whether the sample-size math makes the test feasible at all. Configure and launch the test on your experiment platform yourself — the workflow designs and reads out, it does not touch production traffic.
While the test runs, resist the dashboard. The pre-registration is your permission to not look. When the run completes, export the results, run the readout workflow, and read the skeptic's objections before the headline. Spot-check by re-running the analysis script yourself on the same export.
Then hold the decision meeting with the readout's third section — what the team decides — still blank. The data section says what happened; the humans own what to do about it, and the readout records both, separately, for the next person who asks why this shipped.
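The advice to resist the dashboard has a quantitative basis, which a small A/A simulation can show: both arms share the same true rate, so every "significant" result is a false positive. This is a sketch under stated assumptions: the traffic, rate, number of simulated tests, and seed are all hypothetical, the test is a pooled two-proportion z-test at the 0.05 level, and `mulberry32` is a small seeded generator used only to make the run repeatable.

```javascript
// Small seeded pseudo-random generator so the simulation is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0
  return function () {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Pooled two-proportion z-score for equal-sized arms.
function zScore(convA, convB, nPerArm) {
  const pooled = (convA + convB) / (2 * nPerArm)
  const se = Math.sqrt((2 * pooled * (1 - pooled)) / nPerArm)
  return se === 0 ? 0 : (convB - convA) / nPerArm / se
}

// Runs many A/A tests; counts how often a daily peeker would have declared
// a winner versus someone who looks only once, at the planned end.
function simulate({ sims, days, perDay, rate, seed }) {
  const rand = mulberry32(seed)
  let peekingWins = 0
  let fixedWins = 0
  for (let s = 0; s < sims; s++) {
    let a = 0
    let b = 0
    let crossedOnSomeDay = false
    for (let d = 1; d <= days; d++) {
      for (let i = 0; i < perDay; i++) {
        if (rand() < rate) a++
        if (rand() < rate) b++
      }
      if (Math.abs(zScore(a, b, d * perDay)) > 1.96) crossedOnSomeDay = true
    }
    if (crossedOnSomeDay) peekingWins++
    if (Math.abs(zScore(a, b, days * perDay)) > 1.96) fixedWins++
  }
  return { peekingRate: peekingWins / sims, fixedRate: fixedWins / sims }
}

const result = simulate({ sims: 2000, days: 14, perDay: 200, rate: 0.05, seed: 42 })
console.log("False positives, daily peeking:   " + (result.peekingRate * 100).toFixed(1) + "%")
console.log("False positives, single readout:  " + (result.fixedRate * 100).toFixed(1) + "%")
```

The single-readout rate should sit near the nominal 5 percent, while the peeking rate comes out several times higher, even though nothing differs between the arms.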
Section 11
Case study: a checkout redesign that was a wash on conversion
A team tested a single-page checkout against their three-step flow, pre-registered around a 0.034 baseline and a 10 percent relative MDE, and ran the full six weeks the script demanded. The primary metric was a wash: completed checkout moved from 3.41 percent to 3.46 percent, with a confidence interval comfortably spanning zero.
The guardrails were not a wash. Support tickets tagged checkout fell 22 percent in the variant arm over the same period, driven almost entirely by address and payment-error contacts, and the timeline reconstruction showed the gap was stable across all six weeks rather than a novelty dip. The skeptic agent's main objection — that ticket volume is noisy and was not the primary metric — was published as a caveat, with the recommendation to treat the ticket reduction as a strong secondary signal rather than a proven effect.
The team shipped the single-page checkout anyway, on the explicit grounds that conversion was not harmed and the support cost reduction was worth having even at secondary-evidence confidence. That reasoning is in the readout's decision section, signed, which is exactly where it belongs: the data did not make the call, the team did, and nobody has to reconstruct the logic from memory next year.
Section 12
Case study: a pricing page test stopped early by a stakeholder
Twelve days into a planned 28-day pricing page test, the variant showed a 9 percent lift on trial starts and a stakeholder asked to ship it that week. The analysis was run on the truncated data anyway — sometimes the organization simply does that — and the workflow's job became making the cost of the early stop visible rather than pretending it had not happened.
The skeptic agent flagged three things. The run had reached barely 40 percent of the pre-registered sample size, so the apparent lift carried an interval wide enough to include effects too small to matter. The variant's daily advantage was shrinking across the second week, a pattern consistent with a novelty effect. And the lift was concentrated in trial starts while the downstream guardrail — trial-to-paid conversion — had not had time to mature for most of the variant cohort.
The readout said all of this plainly, and the decision section recorded the compromise: the variant shipped, framed as a decision under uncertainty rather than a validated win, with a holdback of 10 percent of traffic kept on the old page for four more weeks. The holdback later showed the true lift was closer to 3 percent, and the team's next pre-registration added an explicit clause about who can stop a test and what the readout must say when they do.
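The first objection, that a truncated run carries a wide interval, follows from simple arithmetic: under the usual normal approximation the half-width of a confidence interval scales with one over the square root of the sample size. A minimal sketch of that relationship, with a hypothetical helper name:

```javascript
// Under the normal approximation, interval half-width is proportional to
// 1 / sqrt(n). Stopping at a fraction of the planned sample therefore widens
// the interval by 1 / sqrt(fraction) relative to the planned readout.
function intervalInflation(fractionOfPlannedSample) {
  return 1 / Math.sqrt(fractionOfPlannedSample)
}

console.log(intervalInflation(0.4).toFixed(2)) // prints 1.58
```

So a test stopped at 40 percent of its planned sample reports an interval roughly 1.6 times wider than the one the pre-registration was designed to produce.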
Section 14
Honest readout vs over-claimed readout
The difference between the two is rarely the arithmetic; it is what the words around the numbers permit the reader to believe. The honest readout names its base, its interval, its pre-registration, and its objections. The over-claimed readout quotes the best number it found and lets the reader assume the rest.
Over-claimed: The new checkout improves the experience and reduces support load
Honest: Completed checkout: 3.41% control vs 3.46% variant, CI spans zero at the pre-registered n; checkout-tagged tickets down 22% as a secondary signal with a published noise caveat
Over-claimed: Pricing variant drives a 9% lift in trial starts
Honest: Test stopped at 40% of pre-registered sample; observed 9% lift with a wide interval and a decaying daily trend; shipped as a decision under uncertainty with a 10% holdback
Over-claimed: New-user navigation success up 14%
Honest: No overall effect across 84,000 sessions; an underpowered positive signal in the new-user segment (n≈6,000), flagged for a dedicated follow-up test
Over-claimed: The data clearly supports shipping
Honest: Section 1 reports the data; section 3 records that the team chose to ship and why. The data informed the call, the team made it
An honest readout separates what the data shows from what the team decides; an over-claimed readout collapses the two.
Section 15
Limits: what this workflow cannot prove
The workflow enforces process honesty; it does not supply statistical judgment. Choosing between frequentist and Bayesian framings, deciding how to handle interference between concurrent tests, or weighting a guardrail against a primary metric are calls for someone who understands the methods and the business — the scripts make the inputs visible, they do not make the call. When the stakes are high enough, have a statistician review the pre-registration and the readout; the skeptic agent is a rehearsal for that review, not a replacement.
Experiments answer narrow causal questions about the populations and periods they ran in. They do not tell you whether the change is right for users you did not expose, whether it remains true after the novelty fades for good, or whether optimizing this metric is good for the product at all. And the ethical boundaries — what is acceptable to test on real users, what requires consent or exclusion, what crosses into dark patterns — are owned by humans before the design stage starts.
- Cannot substitute for statistical expertise on contested or high-stakes calls.
- Cannot generalize beyond the exposed population and the run period.
- Cannot decide whether a statistically real effect is practically worth shipping.
- Cannot rule on the ethics of what gets tested; that decision precedes the workflow.
Section 16
The reusable experiment workflow
Save the design and readout prompts to .claude/workflows/ as /design-experiment and /readout, keep the skeptic and analyst definitions in .claude/agents/, and keep sample-size.mjs and the readout template in the repo. The discipline survives staff changes because it lives in files, not in whoever ran the last test.
1. Collect the evidence motivating the test; write the hypothesis with its because clause.
2. Choose one primary metric and the guardrails, with harm directions stated.
3. Run sample-size.mjs; decide feasibility, MDE, and run length from its output.
4. Write and sign the pre-registration: segments, decision rules, stopping rule.
5. Launch on your platform; do not peek before the pre-registered end.
6. Export results; run the analysis agent so every number comes from script output.
7. Run the skeptic agent; publish its objections and their dispositions in the readout.
8. Hold the decision meeting; record what the team decides separately from what the data shows.