I'm planning an A/B test. Help me write a rigorous hypothesis. Use this exact structure:
1. Hypothesis: stated as a directional prediction with a magnitude. Format: 'Doing X will change metric Y by at least Z%, because [mechanism].'
2. Experiment design: audience, split, duration (or sample-size threshold), exclusions.
3. Success criteria:
- Primary: the metric and threshold.
- Secondary: a sanity-check metric.
- Guardrails: 1-2 metrics that must NOT degrade (downstream funnel, refund rate, support load).
4. Expected lift: a range with reasoning. Why this number, not a different one?
5. What would falsify: the specific result that would tell us we were wrong - not just 'no effect' but the direction and threshold.
The test I'm planning:
{{test_description}}
A/B test hypothesis
Example output
Hypothesis: Replacing the 'Add to cart' button label with 'Buy now' will increase the cart-add rate by at least 8% on the product detail page, because the 'Buy now' framing reduces the perceived commitment cost (no separate checkout step implied).
Experiment design:
- Audience: 50/50 split of US visitors to product detail pages.
- Duration: 14 days minimum, or until 4,000 visitors per arm (whichever comes later).
- Exclusions: returning logged-in customers (already past the perceived-commitment-cost barrier).
Success criteria:
- Primary: cart-add rate increase >= 8%, p < 0.05.
- Secondary: no decrease in checkout-completion rate (sanity check).
- Guardrail: refund rate within 1% of baseline (lower-friction click could mean lower-intent buyers).
Expected lift: 8-12%. Reasoning: similar label changes in published e-commerce A/B test corpora show 5-15% lifts; our audience is closer to the higher end because our 'Add to cart' copy is unusually formal.
What would falsify: cart-add rate change < 3% (no detectable effect), OR checkout-completion drops 5%+ (we shifted clicks but lost intent), OR refund rate climbs 2%+ (we're winning bad customers).
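The sample-size threshold in an experiment design like the one above can be sanity-checked before launch. What follows is a minimal sketch, not a definitive power analysis: it uses the standard normal-approximation formula for a two-sided two-proportion test, and the 10% baseline cart-add rate, the 5% significance level, and the 80% power are assumptions for illustration (the example does not state a baseline). The function name `visitors_per_arm` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visitors_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate                        # control conversion rate
    p2 = baseline_rate * (1 + relative_lift)  # variant rate if the hypothesis holds
    p_bar = (p1 + p2) / 2                     # average rate under equal allocation
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Assumed baseline: 10% cart-add rate (hypothetical; replace with your own).
needed = visitors_per_arm(baseline_rate=0.10, relative_lift=0.08)
print(f"Visitors needed per arm: {needed:,}")
```

Under these assumed inputs the estimate comes out well above 4,000 visitors per arm, so a threshold like the one in the example is worth recomputing from the real baseline rate before committing to it. Smaller baselines or smaller expected lifts push the requirement higher.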
Common mistakes
- Don't write a hypothesis that says 'we'll see if X has any effect.' That's not a hypothesis; it's a survey. Force a directional prediction with a magnitude.
- Don't skip the guardrail metrics. Many A/B tests 'win' on the primary metric while breaking something downstream that the team only notices three weeks later. The guardrail is the cheap insurance.
- Don't give an 'expected lift' without a reasoning trail. If you can't say why you expect X%, you're guessing, and you'll have no way to update your priors when the test result comes back.
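One way to keep guardrails from being skipped is to write the decision rule down before the test starts. The sketch below is illustrative only: it encodes the thresholds from the example output (8% lift at p < 0.05 to support, under 3% lift, a 5%+ checkout-completion drop, or a 2%+ refund-rate rise to falsify), and it assumes every percentage is a relative change against control, which the example does not state explicitly. The function names are hypothetical, and the p-value comes from a plain pooled two-proportion z-test.

```python
from math import sqrt
from statistics import NormalDist


def relative_change(control_rate, variant_rate):
    """Relative change of the variant rate against the control rate."""
    return (variant_rate - control_rate) / control_rate


def two_sided_p_value(conv_control, n_control, conv_variant, n_variant):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = (conv_variant / n_variant - conv_control / n_control) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def verdict(cart_lift, p_value, checkout_change, refund_change):
    """Apply the pre-registered thresholds; guardrails are checked first."""
    if checkout_change <= -0.05 or refund_change >= 0.02:
        return "falsified: guardrail breached"
    if cart_lift < 0.03:
        return "falsified: no meaningful lift"
    if cart_lift >= 0.08 and p_value < 0.05:
        return "supported"
    return "inconclusive"


# Made-up counts purely to show the call pattern.
lift = relative_change(1000 / 10000, 1200 / 10000)
p = two_sided_p_value(1000, 10000, 1200, 10000)
print(verdict(lift, p, checkout_change=-0.01, refund_change=0.005))
```

Checking guardrails before the primary metric is a deliberate choice here: a variant that wins on cart-adds but breaches a guardrail is reported as falsified rather than as a win.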
Why it works
A/B test hypotheses are where most PMs accidentally do bad statistics. The mistake is writing 'we'll test a new button color' and skipping the falsifiability question. This prompt forces five things: a hypothesis stated as a directional prediction (not just 'we'll see if'), the experiment design including audience and duration, the success criteria with thresholds (not vague 'better'), the expected lift with reasoning, and what would falsify the hypothesis. The falsifiability section is what separates an experiment from a guess. PMs who run this prompt before kicking off any A/B test find their tests sharper, their results easier to act on, and their disagreements with growth/data teams shorter. Tested cleanest on Claude Opus 4.7.