Interpreting Statistical Output

After running Calculate Statistical Significance, you'll have statistical results to interpret. This guide helps translate the numbers into decisions.

Key Metrics Explained

P-value: The probability of seeing a result this extreme (or more extreme) if there's actually no difference between variants. Lower values mean stronger evidence that the difference is real.

| P-value | Interpretation |
|---------|----------------|
| < 0.01 | Very strong evidence of a real difference |
| 0.01–0.05 | Strong evidence (standard threshold) |
| 0.05–0.10 | Weak evidence; consider extending the test |
| > 0.10 | Insufficient evidence to conclude |
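
To see where a p-value comes from in a typical conversion test, here is a minimal stdlib-only sketch of a pooled two-proportion z-test; the function name and the counts are made up for illustration:

```python
# Minimal sketch: two-sided p-value for a difference in conversion rates,
# using a pooled two-proportion z-test (normal approximation, stdlib only).
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Control: 200/2000 converted; Variant: 250/2000 converted.
print(round(two_proportion_p_value(200, 2000, 250, 2000), 4))  # ~0.0123
```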

Confidence Interval (CI): The range where the true effect likely falls. A 95% CI means if we ran this test 100 times, ~95 of those CIs would contain the true effect.

  • CI excludes zero → statistically significant
  • CI includes zero → not statistically significant
  • Narrow CI → precise estimate
  • Wide CI → uncertain estimate

Lift: The relative change from control. +10% lift means the variant performed 10% better than control on that metric.
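
To make the CI and lift concrete, here is a stdlib-only sketch under the same assumptions as above (conversion-rate metric, unpooled normal approximation); the function name and counts are illustrative:

```python
# Minimal sketch: 95% CI for the absolute difference in conversion rates,
# plus relative lift vs control.
from math import sqrt
from statistics import NormalDist

def diff_ci_and_lift(conv_c, n_c, conv_v, n_v, confidence=0.95):
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # 1.96 for 95%
    lift = diff / p_c                               # relative change vs control
    return (diff - z * se, diff + z * se), lift

(lo, hi), lift = diff_ci_and_lift(200, 2000, 250, 2000)
print(f"diff CI: [{lo:.4f}, {hi:.4f}]  lift: {lift:+.1%}")
# If the CI excludes zero, the result is statistically significant at that level.
```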

Statistical vs Practical Significance

A result can be statistically significant but not practically significant (or vice versa).

| Scenario | Action |
|----------|--------|
| Stat sig + large lift | Ship confidently |
| Stat sig + tiny lift | Evaluate whether the lift is worth the complexity |
| Not stat sig + large lift | Extend the test for more data |
| Not stat sig + small lift | Likely no real effect; don't ship |

Rule of thumb: For most product changes, aim for at least a 5% relative lift to justify shipping. Smaller lifts may not cover the ongoing maintenance cost.

Sample Size Considerations

Small samples produce unreliable results. Use these rough bands to judge reliability:

| Sample per variant | Reliability |
|--------------------|-------------|
| < 30 | Unreliable; do not trust results |
| 30–100 | Low power; can detect large effects only |
| 100–1,000 | Moderate power; standard tests |
| > 1,000 | Good power; can detect small effects |

Minimum detectable effect (MDE): With small samples, you can only detect large differences. If your expected lift is 5% but your sample only supports detecting effects of 20% or more, a "no significance" result is meaningless: the test was underpowered to see the effect you care about.
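
As a rough check on whether your sample can see the effect you expect, here is a sketch using the standard normal-approximation power formula; the baseline rate and sample size are illustrative:

```python
# Minimal sketch: approximate relative MDE for a conversion-rate test.
# Assumes equal sample sizes per variant and a two-sided test.
from math import sqrt
from statistics import NormalDist

def relative_mde(baseline, n_per_variant, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96
    z_power = NormalDist().inv_cdf(power)          # 0.84
    se = sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    return (z_alpha + z_power) * se / baseline

# With a 10% baseline and 1,000 users per variant:
print(f"{relative_mde(0.10, 1000):.0%}")  # ~38%: only large effects detectable
```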

Decision Framework

Use this framework to translate results into action:

IF p-value < 0.05 AND lift > 0:
  → SHIP if practical significance is clear
  → Consider rollout % if lift is small

IF p-value < 0.05 AND lift < 0:
  → DO NOT SHIP
  → Investigate why hypothesis was wrong

IF p-value 0.05–0.10:
  → EXTEND test if sample allows
  → Check segment effects—might be significant for some

IF p-value > 0.10:
  → No evidence of effect
  → Ship only if simpler/cheaper than control
  → Consider the "no harm" standard for cleanup tests
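
If you want this framework in code (for a results script, say), a minimal translation might look like the following; the function name and inputs are illustrative, and the thresholds mirror the text above:

```python
# Minimal sketch translating the decision framework above into code.
def recommend(p_value: float, lift: float) -> str:
    if p_value < 0.05:
        if lift > 0:
            return "SHIP if practical significance is clear (consider partial rollout for small lifts)"
        return "DO NOT SHIP; investigate why the hypothesis was wrong"
    if p_value <= 0.10:
        return "EXTEND the test if the sample allows; check segment effects"
    return "No evidence of effect; ship only if simpler/cheaper than control"

print(recommend(p_value=0.012, lift=0.25))  # falls in the first branch
```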

Segment Analysis

Aggregate results can hide important heterogeneous effects. Always check:

  • Plan tiers: Enterprise vs SMB often respond differently
  • User tenure: New users vs veterans
  • Geography: If applicable
  • Platform: Mobile vs desktop

Simpson's Paradox: The overall effect can be opposite to every segment's effect. If segments show contradictory results, investigate before deciding.
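
Here is a small made-up example of the paradox in action: the variant wins within each segment yet loses overall, because traffic is unevenly split across segments with different base rates:

```python
# Simpson's paradox with illustrative numbers.
segments = {
    # segment: (control_conversions, control_n, variant_conversions, variant_n)
    "mobile":  (10, 100, 120, 1000),   # control 10.0% vs variant 12.0%
    "desktop": (300, 1000, 32, 100),   # control 30.0% vs variant 32.0%
}

ctl_conv = ctl_n = var_conv = var_n = 0
for name, (cc, cn, vc, vn) in segments.items():
    print(f"{name}: control {cc/cn:.1%} vs variant {vc/vn:.1%}")
    ctl_conv += cc; ctl_n += cn; var_conv += vc; var_n += vn

print(f"overall: control {ctl_conv/ctl_n:.1%} vs variant {var_conv/var_n:.1%}")
# Variant wins in both segments, yet overall: control 28.2% vs variant 13.8%.
```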

Common Pitfalls

Peeking: Looking at results repeatedly and stopping as soon as they look significant inflates the false-positive rate. Decide the test duration upfront.

Multiple comparisons: At α = 0.05, testing 10 independent metrics gives roughly a 40% chance (1 − 0.95¹⁰) that at least one comes up "significant" by chance. Focus on the primary metric declared before the test.
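
One common guard, sketched below with made-up metric names and p-values, is a Bonferroni correction: divide α by the number of comparisons:

```python
# Minimal sketch of a Bonferroni correction for secondary metrics.
alpha = 0.05
p_values = {"ctr": 0.031, "revenue": 0.004, "retention": 0.048, "latency": 0.20}

threshold = alpha / len(p_values)  # 0.0125 for four metrics
for metric, p in p_values.items():
    verdict = "significant" if p < threshold else "not significant"
    print(f"{metric}: p={p} -> {verdict} at corrected threshold {threshold}")
```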

Novelty effect: Early variant wins may fade as users adapt. Consider longer tests for UI changes.

Selection bias: If the variant causes dropoff, the remaining users are self-selected. Check funnel metrics alongside the primary metric.

Survivorship: Revenue per user might increase if the variant pushes away low-value users. Check total revenue, not just per-user figures.
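
A tiny worked example of the trap, with made-up numbers:

```python
# Survivorship trap: the variant raises revenue per user by shedding
# low-value users, while total revenue falls.
control = {"users": 1000, "revenue": 10_000}   # $10.00 per user
variant = {"users": 700,  "revenue": 8_400}    # $12.00 per user

for name, d in (("control", control), ("variant", variant)):
    print(f"{name}: total ${d['revenue']:,}  per-user ${d['revenue']/d['users']:.2f}")
# Per-user revenue is up 20%, but total revenue is down 16%; check both.
```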

A/B Test Results Template

Present results using this structure:

# A/B Test Results: [Test Name]

## Summary
**Result:** [SHIP / DO NOT SHIP / INCONCLUSIVE]
**Primary metric:** [Metric] — [Variant value] vs [Control value]
**Lift:** [+/-X%] (95% CI: [lower] to [upper])
**Statistical significance:** [Yes/No] (p = [value])

## Detailed Results

| Variant | Sample Size | [Primary Metric] | Lift vs Control | Significant? |
|---------|-------------|------------------|-----------------|--------------|
| Control | [N] | [Value] | — | — |
| [Variant] | [N] | [Value] | [+/-X%] | [Yes/No] |

## Sample Size Assessment
[Adequate / Marginal / Insufficient]
[Any warnings from the analysis]

## Secondary Metrics

| Metric | Control | Variant | Change | Concern? |
|--------|---------|---------|--------|----------|
| [Metric 1] | [Value] | [Value] | [+/-X%] | [Yes/No] |

## Segment Analysis
[How results varied by segment, if applicable]

| Segment | Lift | Significant? | Notes |
|---------|------|--------------|-------|
| [Segment 1] | [+/-X%] | [Yes/No] | [Context] |

## Interpretation

### Statistical Interpretation
[What the numbers tell us]

### Practical Interpretation
[What this means for the product/business]

### Confidence Assessment
[How confident are we in these results? Any caveats?]

## Recommendation

**Decision:** [Ship to 100% / Don't ship / Iterate / Extend test]

**Rationale:**
- [Reason 1]
- [Reason 2]

**Risks if shipping:**
- [Risk 1]
- [Risk 2]

## Next Steps
1. [Action item]
2. [Follow-up experiment to consider]
3. [Monitoring plan post-launch]

When Results Are Inconclusive

Not every test produces a clear winner. Options when inconclusive:

  1. Extend the test — If trending toward significance, more data may clarify
  2. Ship simpler option — If no difference, prefer less complexity
  3. Check segments — Might be winning for a specific cohort
  4. Redesign and retest — If the change was too subtle, try a bolder variant
  5. Accept the learning — "No effect" is valuable information for roadmap prioritization