Interpreting Statistical Output
After running Calculate Statistical Significance, you'll have statistical results to interpret. This guide helps you translate those numbers into decisions.
Key Metrics Explained
P-value: The probability of seeing a result at least this extreme if there were actually no difference between variants. Lower values mean stronger evidence that the difference is real.
| P-value | Interpretation |
|---|---|
| < 0.01 | Very strong evidence of real difference |
| 0.01–0.05 | Strong evidence (standard threshold) |
| 0.05–0.10 | Weak evidence, consider extending test |
| > 0.10 | Insufficient evidence to conclude |
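For concreteness, here is a minimal sketch of how a p-value for a conversion-rate comparison could be computed with a two-proportion z-test. The counts and the choice of `statsmodels` are assumptions for illustration; your analysis tooling may use a different test.

```python
# Two-proportion z-test: p-value for a difference in conversion rates.
# The counts below are made-up example data, not real results.
from statsmodels.stats.proportion import proportions_ztest

conversions = [230, 270]   # control, variant conversions
samples = [2000, 2000]     # users exposed per variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=samples)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# p < 0.05: strong evidence the rates differ; p > 0.10: insufficient evidence.
```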
Confidence Interval (CI): The range where the true effect likely falls. A 95% CI means if we ran this test 100 times, ~95 of those CIs would contain the true effect.
- CI excludes zero → statistically significant
- CI includes zero → not statistically significant
- Narrow CI → precise estimate
- Wide CI → uncertain estimate
Lift: The relative change from control, i.e. (variant - control) / control. A +10% lift means the variant performed 10% better than control on that metric.
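The lift and its confidence interval can be computed directly from the raw counts. This sketch uses a normal approximation for the CI of the absolute difference in rates; the numbers are invented and this is one common approach, not the only one.

```python
# Relative lift and a 95% CI for the absolute difference in conversion rates
# (normal approximation; counts are illustrative only).
import math

control_conv, control_n = 230, 2000
variant_conv, variant_n = 270, 2000

p_c = control_conv / control_n
p_v = variant_conv / variant_n

lift = (p_v - p_c) / p_c   # relative lift vs control
diff = p_v - p_c
se = math.sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"Lift: {lift:+.1%}")
print(f"Absolute difference: {diff:+.4f} (95% CI: {ci_low:+.4f} to {ci_high:+.4f})")
# A CI that excludes zero corresponds to statistical significance at the 5% level.
```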
Statistical vs Practical Significance
A result can be statistically significant but not practically significant (or vice versa).
| Scenario | Action |
|---|---|
| Stat sig + large lift | Ship confidently |
| Stat sig + tiny lift | Evaluate if lift worth the complexity |
| Not stat sig + large lift | Extend test for more data |
| Not stat sig + small lift | Likely no real effect; don't ship |
Rule of thumb: For most product changes, aim for at least 5% relative lift to be worth shipping. Smaller lifts may not justify the maintenance cost.
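One way to make this table operational is a small helper that combines both thresholds. The 0.05 significance level and 5% lift cutoff below mirror this guide's defaults and should be treated as assumptions to tune per product.

```python
# Combine statistical and practical significance into one recommendation.
# alpha and min_lift mirror this guide's defaults; tune them per product.
def recommend(p_value: float, lift: float,
              alpha: float = 0.05, min_lift: float = 0.05) -> str:
    significant = p_value < alpha
    large = abs(lift) >= min_lift
    if significant and lift < 0:
        return "Do not ship; investigate why the hypothesis was wrong"
    if significant and large:
        return "Ship confidently"
    if significant:
        return "Evaluate whether the lift is worth the complexity"
    if large:
        return "Extend the test for more data"
    return "Likely no real effect; don't ship"

print(recommend(p_value=0.03, lift=0.12))   # -> Ship confidently
print(recommend(p_value=0.30, lift=0.01))   # -> Likely no real effect; don't ship
```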
Sample Size Considerations
Small samples produce unreliable results. Watch for these warnings:
| Sample per variant | Reliability |
|---|---|
| < 30 | Unreliable—do not trust results |
| 30–100 | Low power—can detect large effects only |
| 100–1000 | Moderate power—standard tests |
| > 1000 | Good power—can detect small effects |
Minimum detectable effect (MDE): With small samples, you can only detect large differences. If your expected lift is 5% but your sample only supports detecting 20%+ effects, a "not significant" result is uninformative: the test was underpowered to detect the effect you care about.
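To check whether your test was adequately powered, a standard power calculation estimates the sample needed per variant for a given relative MDE. This sketch assumes a 10% baseline conversion rate, 80% power, and a two-sided 5% significance level, and uses `statsmodels`; adjust those inputs to your situation.

```python
# Sample size per variant needed to detect a given relative lift
# (two-sided test, alpha = 0.05, power = 0.80; baseline rate is an assumption).
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed control conversion rate
relative_mde = 0.05    # smallest relative lift worth detecting (5%)

effect = proportion_effectsize(baseline_rate * (1 + relative_mde), baseline_rate)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
)
print(f"~{math.ceil(n_per_variant):,} users per variant")
# If your actual sample is far below this, a "not significant" result
# mostly reflects low power rather than a true absence of effect.
```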
Decision Framework
Use this framework to translate results into action:
IF p-value < 0.05 AND lift > 0:
→ SHIP if practical significance is clear
→ Consider rollout % if lift is small
IF p-value < 0.05 AND lift < 0:
→ DO NOT SHIP
→ Investigate why hypothesis was wrong
IF p-value 0.05–0.10:
→ EXTEND test if sample allows
→ Check segment effects—might be significant for some
IF p-value > 0.10:
→ No evidence of effect
→ Ship only if simpler/cheaper than control
→ Consider the "no harm" standard for cleanup tests
Segment Analysis
Aggregate results can hide important heterogeneous effects. Always check:
- Plan tiers: Enterprise vs SMB often respond differently
- User tenure: New users vs veterans
- Geography: If applicable
- Platform: Mobile vs desktop
Simpson's Paradox: The overall effect can be opposite to every segment's effect. If segments show contradictory results, investigate before deciding.
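A tiny numeric illustration (all counts invented) shows how this can happen: the variant wins in both segments yet loses in aggregate because its traffic skews toward the low-converting segment.

```python
# Simpson's paradox: the variant beats control in every segment but loses overall.
# All counts are invented for illustration.
segments = {
    #            (control_conv, control_n, variant_conv, variant_n)
    "new users": (200, 1000, 50, 200),   # control 20.0% vs variant 25.0%
    "veterans":  (10, 200, 60, 1000),    # control  5.0% vs variant  6.0%
}

totals = [0, 0, 0, 0]
for name, counts in segments.items():
    cc, cn, vc, vn = counts
    print(f"{name:10s}  control {cc / cn:6.1%}   variant {vc / vn:6.1%}")
    totals = [t + x for t, x in zip(totals, counts)]

cc, cn, vc, vn = totals
print(f"{'overall':10s}  control {cc / cn:6.1%}   variant {vc / vn:6.1%}")
# Overall the control looks better (17.5% vs ~9.2%) even though the variant
# wins within each segment, because variant traffic skews toward the
# low-converting veteran segment.
```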
Common Pitfalls
Peeking: Looking at results multiple times and stopping when significant inflates false positives. Decide duration upfront.
Multiple comparisons: Testing 10 metrics means ~1 will be "significant" by chance. Focus on the primary metric declared before the test.
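If you do examine several secondary metrics, applying a multiple-comparison correction keeps the overall false-positive rate in check. This sketch applies a Holm correction from `statsmodels`; the p-values are made up.

```python
# Adjust p-values across several metrics to control the family-wise error rate.
# The p-values below are made-up examples, one per metric tested.
from statsmodels.stats.multitest import multipletests

p_values = [0.03, 0.20, 0.04, 0.55, 0.01]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={sig}")
```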
Novelty effect: Early variant wins may fade as users adapt. Consider longer tests for UI changes.
Selection bias: If the variant causes drop-off, the remaining users are self-selected. Check funnel metrics alongside the primary metric.
Survivorship: Revenue per user might increase if the variant pushes away low-value users. Check total revenue, not just per-user averages.
A/B Test Results Template
Present results using this structure:
# A/B Test Results: [Test Name]
## Summary
**Result:** [SHIP / DO NOT SHIP / INCONCLUSIVE]
**Primary metric:** [Metric] — [Variant value] vs [Control value]
**Lift:** [+/-X%] (95% CI: [lower] to [upper])
**Statistical significance:** [Yes/No] (p = [value])
## Detailed Results
| Variant | Sample Size | [Primary Metric] | Lift vs Control | Significant? |
|---------|-------------|------------------|-----------------|--------------|
| Control | [N] | [Value] | — | — |
| [Variant] | [N] | [Value] | [+/-X%] | [Yes/No] |
## Sample Size Assessment
[Adequate / Marginal / Insufficient]
[Any warnings from the analysis]
## Secondary Metrics
| Metric | Control | Variant | Change | Concern? |
|--------|---------|---------|--------|----------|
| [Metric 1] | [Value] | [Value] | [+/-X%] | [Yes/No] |
## Segment Analysis
[How results varied by segment, if applicable]
| Segment | Lift | Significant? | Notes |
|---------|------|--------------|-------|
| [Segment 1] | [+/-X%] | [Yes/No] | [Context] |
## Interpretation
### Statistical Interpretation
[What the numbers tell us]
### Practical Interpretation
[What this means for the product/business]
### Confidence Assessment
[How confident are we in these results? Any caveats?]
## Recommendation
**Decision:** [Ship to 100% / Don't ship / Iterate / Extend test]
**Rationale:**
- [Reason 1]
- [Reason 2]
**Risks if shipping:**
- [Risk 1]
- [Risk 2]
## Next Steps
1. [Action item]
2. [Follow-up experiment to consider]
3. [Monitoring plan post-launch]
When Results Are Inconclusive
Not every test produces a clear winner. Options when inconclusive:
- Extend the test — If trending toward significance, more data may clarify
- Ship simpler option — If no difference, prefer less complexity
- Check segments — Might be winning for a specific cohort
- Redesign and retest — If the change was too subtle, try a bolder variant
- Accept the learning — "No effect" is valuable information for roadmap prioritization