Experimentation · Causal Inference

I make A/B testing trustworthy.

You ship "wins" that are noise. You kill tests that were real. I'm the fractional senior owner who makes the difference obvious — and tells you which numbers are safe to ship on.

Fractional · 2–3 days/week · remote, full US-timezone overlap
Free diagnostic · no email required

Are your tests lying to you?

Five questions. I'll score your pipeline's trustworthiness out of 100, surface the top 3 risks specific to your setup, and tell you exactly which of your "wins" are most likely noise.

5 questions · ~90 seconds · 0 sign-up
01 Tool · Which A/B testing tool do you use?
02 Tests / month · How many A/B tests do you ship per month?
03 Peeking · Do you stop tests early when results look "significant"?
04 Multiple-comparison correction · Do you correct for multiple comparisons (multiple metrics, segments, variants)?
05 Review · Who reviews experiment results before they ship?
Your trustworthiness score
Top 3 risks for your setup
Want me to find and fix these in your actual data?
The Audit is $4,000, 14 days, fixed fee. At least one finding that changes a shipping decision, or you don't pay.
Book a 20-min call
Selected work · five systems in production

The fix isn't a smarter analyst — it's the layer above the default tool.

Each card has two sides. Front: the one-line idea. Click to flip: full case, interactive demos, real metrics.

Case 01 · Experimentation
Sequential testing — answers faster.
30–40% faster reads, more on clear winners
Same confidence, not less. Calibrated on real metric distributions.
Default tools stop tests at the wrong moment. A sequential layer above them stops at the right one.

Sequential testing

Default A/B tooling stops at the wrong moment. The fix is a sequential layer above the default: O'Brien-Fleming (OBF) boundaries calibrated through Monte Carlo on real metric distributions.
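To make the calibration idea concrete, here is a minimal sketch that estimates the constant c in an O'Brien-Fleming-shaped boundary z_k = c * sqrt(K / k) by simulating test paths under the null. It is an illustration under stated assumptions, not the production method: it draws standard-normal increments for equally sized looks, whereas calibrating on real metric distributions would mean resampling actual metric data in place of `rng.gauss`. The function names, the five-look schedule, and the simulation count are all hypothetical.

```python
import math
import random


def calibrate_obf_constant(n_looks=5, alpha=0.05, n_sims=20000, seed=0):
    """Monte Carlo estimate of c in the boundary z_k = c * sqrt(K / k)."""
    rng = random.Random(seed)
    max_stats = []
    for _ in range(n_sims):
        cum, worst = 0.0, 0.0
        for _ in range(n_looks):
            cum += rng.gauss(0.0, 1.0)  # one equal-sized block of data under H0 (assumed normal)
            worst = max(worst, abs(cum))
        # |Z_k| >= c * sqrt(K / k) is equivalent to |S_k| / sqrt(K) >= c
        max_stats.append(worst / math.sqrt(n_looks))
    max_stats.sort()
    # Empirical (1 - alpha) quantile of the path maximum
    return max_stats[math.ceil((1 - alpha) * n_sims) - 1]


def obf_boundaries(c, n_looks):
    """Critical |z| at each look: strict early, close to fixed-horizon at the end."""
    return [c * math.sqrt(n_looks / k) for k in range(1, n_looks + 1)]


c = calibrate_obf_constant()
bounds = obf_boundaries(c, 5)
print([round(b, 2) for b in bounds])
```

The boundary is very strict at the first look and relaxes toward the final one. That shape is what allows an early stop on a clear winner without inflating the overall false-positive rate.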

30–40%: Faster decisions
47→72%: Realised power
26→14d: Avg duration
+42%: Tests / year
Sequential planner · live demo
Daily traffic: 12k · Baseline CR: 3.8% · MDE (rel): 2.0%
Chart: sample / arm, fixed vs. sequential · Z-path against OBF boundary
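For reference, the fixed-horizon side of a planner like this can be approximated with the standard two-proportion sample-size formula. The sketch below plugs in the inputs shown above (12k daily traffic, 3.8% baseline conversion rate, 2.0% relative MDE). The 5% two-sided alpha, 80% power, and even two-arm split are my assumptions, and the function names are hypothetical. They are not taken from the demo itself.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_cr, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided, two-proportion z-test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_fill(n_per_arm, daily_traffic, n_arms=2):
    """Days needed if daily traffic is split evenly across arms (assumption)."""
    return math.ceil(n_per_arm * n_arms / daily_traffic)


n = sample_size_per_arm(baseline_cr=0.038, relative_mde=0.02)
days = days_to_fill(n, daily_traffic=12_000)
print(f"{n:,} users per arm, about {days} days at 12k users/day")
```

Small relative MDEs on low baseline rates get expensive fast, which is why the choice of MDE deserves as much scrutiny as the stopping rule.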
Case 02 · Prioritisation
Caliper — which test do we run next?
Stop arguing. Start ranking.
Engineering routed to highest-ROI tests · Calibrated on 121 past experiments
Backlogs decided by the loudest voice in the room. This one's decided by 121 past experiments and one comparable Index.

Caliper

Classifies hypotheses by feature type, calibrates per-category coefficients with Bayesian shrinkage on 121 past experiments, and collapses uplift × dev cost × historical performance into one comparable Index.
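A rough sketch of what shrinkage toward a pooled estimate can look like. The source does not give the exact Index formula, so everything here is hypothetical: the toy uplift numbers are invented for illustration, `prior_strength` is an assumed tuning knob, and `priority_index` is one plausible way to combine a shrunk category weight, classifier confidence, and dev cost into a single number.

```python
def shrink_category_means(observed, prior_strength=5.0):
    """Pull each category's mean uplift toward the grand mean.

    observed: {category: [uplift, ...]} from past experiments.
    Categories with few experiments are shrunk harder.
    """
    all_vals = [u for vals in observed.values() for u in vals]
    grand = sum(all_vals) / len(all_vals)
    shrunk = {}
    for cat, vals in observed.items():
        n = len(vals)
        mean = sum(vals) / n
        shrunk[cat] = (n * mean + prior_strength * grand) / (n + prior_strength)
    return shrunk


def priority_index(category_weight, classification_confidence, dev_cost_days):
    """Hypothetical index: expected value per unit of engineering cost."""
    return category_weight * classification_confidence / dev_cost_days


# Invented toy data, not results from the 121 experiments
history = {"Messaging": [0.02, 0.01, 0.03], "Flow": [0.08]}
weights = shrink_category_means(history)
index = priority_index(weights["Flow"], classification_confidence=0.9, dev_cost_days=4)
print(weights, round(index, 4))
```

The single lucky "Flow" result gets pulled most of the way back toward the pooled mean, so one outlier experiment cannot dominate the ranking.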

121: Experiments calibrated
60+: Mixpanel configs
4 types: Feature classifier
1: Number to argue about
🧪 Type your hypothesis · live demo
Feature types: Messaging · Visual UI · Flow · Personalisation
Prioritisation Index: feature weight × classification confidence
Case 03 · Multi-agent AI
Smart Analytics — agents for the grunt work.
Scale without hiring
~10h/wk per analyst back to real analysis · 3 CrewAI crews · Slack-native
Half the analyst's week is intake docs and summaries. Three AI crews now handle it. Slack-native, in production.

Smart Analytics

CrewAI system with three crews — post-analysis summary, hypothesis generation, experiment planning. AI removes the grunt work, not the judgement.

3 crews: Summary · hypothesis · planner
CrewAI: Multi-agent framework
Slack: Native interface
~10h/wk: Analyst time recovered
#abtest-results · CrewAI · Slack
user · 2:14 PM
summarize EXP-2024-148
Analyst Bot · 2:14 PM

checkout-cta-color · 16 days

Test wins +2.1% (95% CI: 0.8–3.4%)

No interaction with traffic source

Mobile only — desktop variant flat

Recommend: ship to mobile; rerun desktop Q3.

Case 04 · Causal inference
When there's no control group.
Know what your big launches did
Defensible attribution where A/B is impossible · ITS · DiD · CITS
Some features can't be A/B tested. Interrupted time series and difference-in-differences measure them anyway.

Causal inference

Quasi-experimental measurement framework. Interrupted Time Series + Difference-in-Differences applied to feature rollouts that never had an A/B — subscriber discount, wallet adoption.
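The difference-in-differences half of that framework reduces to one subtraction of subtractions. A minimal sketch, assuming a simple two-group, two-period setup with parallel trends. The function name and toy numbers are hypothetical, and a real analysis would add standard errors and a pre-trend check.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Classic 2x2 DiD: change in the treated group minus change in the control group."""
    def mean(values):
        return sum(values) / len(values)

    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Invented toy data: weekly orders per user before and after a rollout
effect = diff_in_diff(
    treated_pre=[10.0, 10.0], treated_post=[15.0, 15.0],
    control_pre=[8.0, 8.0], control_post=[10.0, 10.0],
)
print(effect)
```

The control group's change stands in for what would have happened anyway, so only the extra movement in the treated group is credited to the launch.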

ITS: Interrupted time series
DiD: Difference-in-differences
CITS: Controlled ITS
2 rollouts: Production case studies
ITS · Subscriber discount · PA-1095
Chart: actual vs. counterfactual series, intervention marked
Treatment effect: +8.4% (95% CI [5.2, 11.6])
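An actual-versus-counterfactual readout like this one can be sketched with the simplest form of interrupted time series: fit a linear trend to the pre-intervention period, extrapolate it forward as the counterfactual, and compare. This is a simplified illustration, not the analysis behind the figure above. It assumes a linear pre-trend with no seasonality or autocorrelation, and the series is synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def its_effect(series, intervention_index):
    """Extrapolate the pre-period trend and return (counterfactual, relative lift)."""
    pre = series[:intervention_index]
    post = series[intervention_index:]
    intercept, slope = fit_line(list(range(len(pre))), pre)
    counterfactual = [intercept + slope * t
                      for t in range(intervention_index, len(series))]
    lift = sum(post) / sum(counterfactual) - 1
    return counterfactual, lift


# Synthetic series: steady growth, then a 10% step up at t = 10
pre_period = [100 + 2 * t for t in range(10)]
post_period = [(100 + 2 * t) * 1.10 for t in range(10, 15)]
counterfactual, lift = its_effect(pre_period + post_period, intervention_index=10)
print(f"estimated lift: {lift:.1%}")
```

A controlled ITS adds an unaffected comparison series, so a shock that hits everything at once is not mistaken for a treatment effect.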
Case 05 · Predictive systems
Churn — predictive systems in production.
Save churners before they leave
Right intervention to right cohort · 0.81 top-decile precision · 4-segment routing
Survival curves are pretty. Predictive routing of cohorts to interventions is what actually saves customers.

Churn — predictive routing

XGBoost + SHAP-driven feature framing. The predictive layer routes cohorts to interventions — and the causal-inference rigour validates whether they actually worked.
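The top-decile precision quoted for this case is a ranking metric: of the 10% of customers the model scores as most likely to churn, what share actually churned. A minimal sketch of that calculation follows. The function name and toy data are hypothetical, and the scores could come from any model, XGBoost included.

```python
def top_decile_precision(scores, churned):
    """Share of true churners among the highest-scoring 10% of customers."""
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    k = max(1, len(ranked) // 10)  # size of the top decile
    top = ranked[:k]
    return sum(1 for _, did_churn in top if did_churn) / k


# Invented toy data: 20 customers, so the top decile is 2 customers
scores = [i / 20 for i in range(20)]
churned = [False] * 18 + [True, False]
precision = top_decile_precision(scores, churned)
print(precision)
```

Precision at the top of the ranking is the number that matters for routing, because interventions are only sent to the highest-risk cohorts.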

XGBoost: Model
SHAP: Feature framing
4 segments: Routed to treatments
DiD: Post-treatment attribution
Churn model · SHAP · XGBoost
recency_days: 0.42
avg_orders_mo: 0.33
first_promo_used: 0.26
subscription_tier: 0.20
avg_returns_yr: 0.16
Two ways to work together
The Audit  $4,000 · 2 weeks
I score exactly where your tests are lying, and hand you the roadmap.
Fractional Lead  from $8k/mo
I run the fix — 2–3 days a week — and level up your team.
Receipts, not adjectives
59×: the factor by which a revenue metric was inflating results. It looked exactly like real signal. Caught before the team shipped on it.
2 rates ↑: churned and retained both rose in one test, a polarization artifact. Killed a "winner" that would have hurt.
30–40%: how much earlier experiments could have stopped with sequential testing. Faster decisions, same confidence.
How we work together

Start with a fixed-fee audit. Keep me if it's worth it.

No open-ended retainers to sign blind. Two weeks, a clear scorecard, then you decide.

The wedge · 2 weeks

Experimentation Pipeline Audit

$4,000 fixed
14 days · 50% upfront
  • Statistical rigor — peeking, power, triggering, run-duration, multiple comparisons
  • Metric design — proxy metrics, guardrails, the "both rates rose" traps
  • Tooling — Amplitude / Statsig / GrowthBook / Eppo, configured for honest reads
  • Deliverables — a trustworthiness scorecard + a prioritized test roadmap + a team walkthrough
The guarantee: At least one finding that changes a shipping decision, or you don't pay. Not a count of nitpicks. One call you'd otherwise have gotten wrong.
The fee credits toward your first month if we continue.
The engagement · ongoing

Fractional Experimentation Lead

from $8,000 / month
2–3 days/week · 3-month minimum
  • I run the roadmap I built in the audit — and own the experiment review
  • Install the rigor your platform was bought for but never got
  • Causal measurement for launches you can't A/B test — interrupted time series, diff-in-diff
  • Level up the team so the practice outlasts me
No lock-in: 3-month start, then month-to-month. You own every playbook and dashboard I build; leave the day the practice runs without me.
You get the senior operator directly — not an agency junior.
The honest comparison

The real question isn't my fee. It's what the alternative costs you.

Three ways to fix an experimentation pipeline you can't trust — in time, in cash, and in bad ships.

Hire it in-house: a senior experimentation lead, full-time.
  • Time to first insight: 3–5 months to hire + ramp
  • Year-1 cash cost: $180k–$240k fully loaded
  • Who owns the rigor: Whoever you can hire and retain
  • Time to productive: 3–6 months to full speed
  • What you commit to: Salary, equity, severance
  • Your false "wins": Found once they ramp
Do nothing: keep shipping on the current pipeline.
  • Time to first insight: Never
  • Year-1 cash cost: $0 up front
  • Who owns the rigor: No one
  • Your false "wins": Keep shipping

The audit costs less than two weeks of a senior hire's loaded salary — and tells you whether you even need one.

Not sure how much of your test data you can trust?

Book a 20-min call
Who this is for
  • Seed to Series B, with a product and real traffic
  • A data team of 1–5 that can't yet trust its experiment reads
  • Just hired (or about to) for experimentation or growth analytics
  • Bought the platform — but no senior person owns the rigor
Probably not a fit

Enterprises with a dedicated experimentation platform team, or pre-product companies with no traffic to test on yet. I'd tell you so on the call rather than take the fee.

Who you'd be working with

I'm David Arzumanian. Eight years making numbers tell the truth in product analytics.

I've spent my career on the unglamorous half of experimentation: the part where a "win" turns out to be a logging bug, a proxy metric quietly disagrees with revenue, and someone has to say "we can't ship on this yet." Most recently I built the experimentation rigor at a subscription e-commerce company at scale. I work these traps out loud on LinkedIn and in long-form on the writing page. Every role and how I think are public; verify all of it before we talk.
Common questions

Questions I get from product teams

Why did my A/B test "winner" stop working after we shipped it?
Usually one of three things: you stopped the test early the moment it looked significant (peeking inflates false positives two to five times), the lift was regression to the mean, or it was a novelty effect that faded. The number was real on screen and noise in reality. A pre-set run length and a sequential boundary tell the difference before you ship.
Is it bad to stop a test early when it hits significance?
Yes, if you stop the instant a fixed-horizon p-value crosses 0.05. Checking repeatedly and stopping on the first "win" is peeking, and it can turn a 5% false-positive rate into 20 to 30%. The fix is not "never look" — it is sequential testing, which lets you stop early safely with boundaries calibrated for continuous monitoring.
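You can see the inflation in a few lines of simulation. The sketch below runs A/A tests (no true effect) and compares two policies: look once at the end, or look after every block of data and stop at the first |z| > 1.96. It assumes normally distributed z-paths and 14 equally sized looks, roughly one look per day over two weeks. Both are my assumptions, and the function name is hypothetical.

```python
import math
import random


def false_positive_rate(n_looks, peek, n_sims=5000, z_crit=1.96, seed=1):
    """Share of null (A/A) tests declared significant under a given look policy."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        cum = 0.0
        rejected = False
        for k in range(1, n_looks + 1):
            cum += rng.gauss(0.0, 1.0)  # one more block of data, no true effect
            z = cum / math.sqrt(k)
            if peek and abs(z) > z_crit:
                rejected = True  # stop on the first "win"
                break
        if not peek:
            rejected = abs(z) > z_crit  # single look at the planned horizon
        hits += rejected
    return hits / n_sims


fixed = false_positive_rate(n_looks=14, peek=False)
peeking = false_positive_rate(n_looks=14, peek=True)
print(f"fixed horizon: {fixed:.1%}, peeking daily: {peeking:.1%}")
```

The threshold is identical in both runs. Only the stopping rule changes, and that alone is enough to multiply the false-positive rate.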
Do I need to correct for multiple comparisons?
If you read several metrics, segments, or variants off one test and celebrate whichever crosses the line, then yes. Every extra comparison is another lottery ticket for a false positive. Without correction a meaningful share of your "wins" are noise; with it, you ship on signal.
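The lottery-ticket arithmetic is easy to check. For m independent comparisons at level alpha, the chance of at least one false positive is 1 - (1 - alpha)^m. The sketch below computes that and applies the Holm-Bonferroni step-down correction, one standard fix among several. The p-values are invented for illustration, and real metrics are usually correlated, so the independence formula is only a rough upper-end guide.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni: test sorted p-values against alpha / (m - rank), stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value fails too
    return reject


fwer = familywise_error(alpha=0.05, m=10)
decisions = holm_reject([0.001, 0.04, 0.03, 0.20])  # invented p-values
print(f"{fwer:.1%}", decisions)
```

Three of those four p-values would pass an uncorrected 0.05 cutoff, but only the first survives correction.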
When should I hire a full-time experimentation lead vs. a fractional one?
Below roughly 8 to 10 tests a month, a full-time hire is hard to justify and the real problem is usually underpowered tests, not headcount. A fractional lead installs the rigor, owns the review, and levels up your team two to three days a week. When test volume and stakes outgrow that, you hire in-house and I hand it off.
What does the $4,000 audit actually include?
Two weeks, fixed fee. I review your statistical rigor, metric design, and tooling, then hand you a trustworthiness scorecard, a prioritized roadmap, and a team walkthrough. The guarantee: at least one finding that changes a shipping decision, or you don't pay. The fee credits toward your first month if we continue.
Which A/B testing tools do you work with?
The methodology is tool-agnostic — I audit whatever you run. In practice I am deepest in Statsig, Eppo, GrowthBook, and Amplitude. The point is not the tool; it is whether you are using its stats engine in a way you can actually trust.

Which of your numbers are safe to ship on?

Book a 20-minute call