Experimentation · Causal Inference

I make A/B testing trustworthy.

You ship "wins" that are noise. You kill tests that were real. I'm the fractional senior owner who makes the difference obvious — and tells you which numbers are safe to ship on.

Take the 90-second diagnostic ↓ · Book a 20-min call

Fractional · 2–3 days/week · remote, full US-timezone overlap

Free diagnostic · no email required

Are your tests lying to you?

Five questions. I'll score your pipeline's trustworthiness out of 100, surface the top 3 risks specific to your setup, and tell you exactly which of your "wins" are most likely noise.

5 questions · ~90 seconds · 0 sign-up

01 Which A/B testing tool do you use?
02 How many A/B tests do you ship per month?
03 Do you stop tests early when results look "significant"?
04 Do you correct for multiple comparisons (multiple metrics, segments, variants)?
05 Who reviews experiment results before they ship?

Your result: a trustworthiness score out of 100 and the top 3 risks for your setup.

Want me to find and fix these in your actual data?

The Audit is $4,000, 14 days, fixed fee. At least one finding that changes a shipping decision, or you don't pay.

Book a 20-min call →

Selected work · five systems in production

The fix isn't a smarter analyst — it's the layer above the default tool.

Case 01 · Experimentation

Sequential testing — answers faster.

30–40% faster reads

Same confidence, not less. Calibrated on real metric distributions · more on clear winners

Default tools stop tests at the wrong moment. A sequential layer above them stops at the right one.

Sequential testing

Default A/B tooling stops at the wrong moment. The fix is a sequential layer above the default: O'Brien-Fleming (OBF) boundaries calibrated through Monte Carlo on real metric distributions.

30–40%

Faster decisions

47→72%

Realised power

26→14d

Avg duration

+42%

Tests / year

Case study on request →

Sequential planner · live demo

Inputs: daily traffic 12k · baseline CR 3.8% · MDE (rel) 2.0%

Outputs: sample / arm · fixed · sequential · Z-path · OBF boundary
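The OBF calibration described in Case 01 can be illustrated with a minimal sketch. It assumes equal-sized batches and normally distributed z increments under the null, which is a simplification: the production layer is described as calibrating on real metric distributions, and those are not reproduced here. The function names `calibrate_obf_constant` and `obf_boundaries` are hypothetical.

```python
import math
import random

def calibrate_obf_constant(n_looks, alpha=0.05, n_sims=20000, seed=0):
    """Find c so that P(any |Z_k| > c * sqrt(K / k)) is about alpha under the null."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        total, worst = 0.0, 0.0
        for look in range(1, n_looks + 1):
            total += rng.gauss(0.0, 1.0)      # one equal-sized batch of null data (assumption)
            z = total / math.sqrt(look)       # z statistic at this look
            # Rescale so that crossing the OBF boundary means exceeding a constant c.
            worst = max(worst, abs(z) * math.sqrt(look / n_looks))
        maxima.append(worst)
    maxima.sort()
    # Empirical (1 - alpha) quantile of the per-path maximum.
    return maxima[math.ceil((1 - alpha) * n_sims) - 1]

def obf_boundaries(c, n_looks):
    """O'Brien-Fleming-shaped thresholds: strict early, close to c at the final look."""
    return [c * math.sqrt(n_looks / look) for look in range(1, n_looks + 1)]

c = calibrate_obf_constant(n_looks=5)
bounds = obf_boundaries(c, 5)
print("calibrated constant:", round(c, 3))
print("z thresholds per look:", [round(b, 2) for b in bounds])
```

The thresholds start high and fall toward the calibrated constant, which is why an early look needs a very large effect to stop the test while the final look costs little power relative to a fixed-horizon read.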

Case 02 · Prioritisation

Caliper — which test do we run next?

Stop arguing. Start ranking.

Engineering routed to highest-ROI tests · Calibrated on 121 past experiments

Backlogs decided by the loudest voice in the room. This one's decided by 121 past experiments and one comparable Index.

Caliper

Classifies hypotheses by feature type, calibrates per-category coefficients with Bayesian shrinkage on 121 past experiments, and collapses uplift × dev cost × historical performance into one comparable Index.

121

Experiments calibrated

60+

Mixpanel configs

4 types

Feature classifier

Number to argue about

Case study on request →

🧪 Type your hypothesis · live demo

Feature types: Messaging · Visual UI · Flow · Personalisation

Prioritisation Index = feature weight × classification confidence
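The per-category shrinkage mentioned in Case 02 can be sketched as follows. This is not Caliper's actual model: it is the textbook n / (n + k) shrinkage estimator, the lift numbers are invented for illustration, and `prior_strength` is a hypothetical tuning parameter.

```python
from statistics import mean

def shrink_category_means(lifts_by_category, prior_strength=10.0):
    """Pull each category's mean lift toward the grand mean.

    Categories with few past experiments are shrunk harder, so a couple of
    lucky results cannot dominate the ranking.
    """
    all_lifts = [x for lifts in lifts_by_category.values() for x in lifts]
    grand_mean = mean(all_lifts)
    shrunk = {}
    for category, lifts in lifts_by_category.items():
        n = len(lifts)
        weight = n / (n + prior_strength)     # trust in the category's own data
        shrunk[category] = weight * mean(lifts) + (1 - weight) * grand_mean
    return grand_mean, shrunk

# Hypothetical history: observed lifts in percent, per feature type.
history = {
    "messaging": [0.8, 1.2, 0.5, 1.0, 0.9, 1.1, 0.7, 1.3],
    "flow": [4.0, 5.0],
}
grand_mean, shrunk = shrink_category_means(history)
print("grand mean:", round(grand_mean, 3))
for category, value in shrunk.items():
    print(category, "raw", round(mean(history[category]), 3), "shrunk", round(value, 3))
```

In this toy history the "flow" category has only two experiments, so its estimate moves much further toward the grand mean than "messaging" does.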

Case 03 · Multi-agent AI

Smart Analytics — agents for the grunt work.

Scale without hiring

~10h/wk per analyst back to real analysis · 3 CrewAI crews · Slack-native

Half the analyst's week is intake docs and summaries. Three AI crews now handle it. Slack-native, in production.

Smart Analytics

CrewAI system with three crews — post-analysis summary, hypothesis generation, experiment planning. AI removes the grunt work, not the judgement.

3 crews

Summary · hypothesis · planner

CrewAI

Multi-agent framework

Slack

Native interface

~10h/wk

Analyst time recovered

Draft on request →

#abtest-results · CrewAI · Slack

user · 2:14 PM
summarize EXP-2024-148

Analyst Bot · 2:14 PM
checkout-cta-color · 16 days
✓ Test wins +2.1% (95% CI: 0.8–3.4%)
✓ No interaction with traffic source
⚠ Mobile only; desktop variant flat
Recommend: ship to mobile; rerun desktop Q3.

Case 04 · Causal inference

When there's no control group.

Know what your big launches did

Defensible attribution where A/B is impossible · ITS · DiD · CITS

Some features can't be A/B tested. Interrupted time series and difference-in-differences measure them anyway.

Causal inference

Quasi-experimental measurement framework. Interrupted Time Series + Difference-in-Differences applied to feature rollouts that never had an A/B — subscriber discount, wallet adoption.

ITS

Interrupted time series

DiD

Difference-in-differences

CITS

Controlled ITS

2 rollouts

Production case studies

Draft on request →

ITS · Subscriber discount · PA-1095

Chart: actual vs. counterfactual

Treatment effect +8.4% · 95% CI [5.2, 11.6]
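The difference-in-differences idea in Case 04 reduces to simple arithmetic in its basic two-group, two-period form. The sketch below uses invented weekly values and a hypothetical `diff_in_diff` helper. It assumes parallel trends (the control group shows what the treated group would have done without the launch), and it leaves out the standard errors a real analysis would need.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical weekly metric values before and after a rollout.
treated_pre, treated_post = [100, 102, 98], [112, 110, 114]
control_pre, control_post = [90, 91, 89], [94, 95, 93]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print("estimated effect:", effect)
```

Here the treated group rose by 12 and the control group by 4, so 8 of the 12 is attributed to the rollout and the rest to whatever moved both groups.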

Case 05 · Predictive systems

Churn — predictive systems in production.

Save churners before they leave

Right intervention to right cohort · 0.81 top-decile precision · 4-segment routing

Survival curves are pretty. Predictive routing of cohorts to interventions is what actually saves customers.

Churn — predictive routing

XGBoost + SHAP-driven feature framing. The predictive layer routes cohorts to interventions — and the causal-inference rigour validates whether they actually worked.

XGBoost

Model

SHAP

Feature framing

4 segments

Routed to treatments

DiD

Post-treatment attribution

Draft on request →

Churn model · SHAP · XGBoost

recency_days .42
avg_orders_mo .33
first_promo_used .26
subscription_tier .20
avg_returns_yr .16

Two ways to work together

The Audit $4,000 · 2 weeks

I score exactly where your tests are lying, and hand you the roadmap.

Fractional Lead from $8k/mo

I run the fix — 2–3 days a week — and level up your team.

Receipts, not adjectives

59×

a revenue metric was inflating results — it looked exactly like real signal. Caught before the team shipped on it.

2 rates ↑

churned and retained both rose in one test — a polarization artifact. Killed a "winner" that would have hurt.

30–40%

earlier that experiments could have stopped, with sequential testing — faster decisions, same confidence.

How we work together

Start with a fixed-fee audit. Keep me if it's worth it.

No open-ended retainers to sign blind. Two weeks, a clear scorecard, then you decide.

The wedge · 2 weeks

Experimentation Pipeline Audit

$4,000 fixed

14 days · 50% upfront

Statistical rigor — peeking, power, triggering, run-duration, multiple comparisons
Metric design — proxy metrics, guardrails, the "both rates rose" traps
Tooling — Amplitude / Statsig / GrowthBook / Eppo, configured for honest reads
Deliverables — a trustworthiness scorecard + a prioritized test roadmap + a team walkthrough

✓ The guarantee: At least one finding that changes a shipping decision, or you don't pay. Not a count of nitpicks. One call you'd otherwise have gotten wrong.

The fee credits toward your first month if we continue.

The engagement · ongoing

Fractional Experimentation Lead

from $8,000 / month

2–3 days/week · 3-month minimum

I run the roadmap I built in the audit — and own the experiment review
Install the rigor your platform was bought for but never got
Causal measurement for launches you can't A/B test — interrupted time series, diff-in-diff
Level up the team so the practice outlasts me

↻ No lock-in: 3-month start, then month-to-month. You own every playbook and dashboard I build; leave the day the practice runs without me.

You get the senior operator directly — not an agency junior.

The honest comparison

The real question isn't my fee. It's what the alternative costs you.

Three ways to fix an experimentation pipeline you can't trust — in time, in cash, and in bad ships.

Recommended: Audit → Fractional. Start at $4k. Scale only if it works.
Hire it in-house: A senior experimentation lead, full-time.
Do nothing: Keep shipping on the current pipeline.

Time to first insight
Audit → Fractional: 2 weeks · In-house: 3–5 months to hire + ramp · Do nothing: Never

Year-1 cash cost
Audit → Fractional: $4k audit, then from $8k/mo only if you continue · In-house: $180k–$240k fully loaded · Do nothing: $0 up front

Who owns the rigor
Audit → Fractional: A senior operator, week one · In-house: Whoever you can hire and retain · Do nothing: No one

Time to productive
Audit → Fractional: Immediately · In-house: 3–6 months to full speed · Do nothing: n/a

What you commit to
Audit → Fractional: Cancel after the audit · In-house: Salary, equity, severance · Do nothing: n/a

Your false "wins"
Audit → Fractional: Found & quantified in 14 days · In-house: Found once they ramp · Do nothing: Keep shipping

The audit costs less than two weeks of a senior hire's loaded salary — and tells you whether you even need one.

Not sure how much of your test data you can trust?

Book a 20-min call

Who this is for

Seed to Series B, with a product and real traffic
A data team of 1–5 that can't yet trust its experiment reads
Just hired (or about to) for experimentation or growth analytics
Bought the platform — but no senior person owns the rigor

Probably not a fit

Enterprises with a dedicated experimentation platform team, or pre-product companies with no traffic to test on yet. I'd tell you so on the call rather than take the fee.

Who you'd be working with

I'm David Arzumanian. Eight years making numbers tell the truth in product analytics.

I've spent my career on the unglamorous half of experimentation: the part where a "win" turns out to be a logging bug, a proxy metric quietly disagrees with revenue, and someone has to say "we can't ship on this yet." Most recently I built the experimentation rigor at a subscription e-commerce company at scale. I work these traps out loud on LinkedIn and in long-form on the writing page. Every role and how I think is public — verify all of it before we talk.

Common questions

Questions I get from product teams

Why did my A/B test "winner" stop working after we shipped it?

Usually one of three things: you stopped the test early the moment it looked significant (peeking inflates false positives two to five times), the lift was regression to the mean, or it was a novelty effect that faded. The number was real on screen and noise in reality. A pre-set run length and a sequential boundary tell the difference before you ship.

Is it bad to stop a test early when it hits significance?

Yes, if you stop the instant a fixed-horizon p-value crosses 0.05. Checking repeatedly and stopping on the first "win" is peeking, and it can turn a 5% false-positive rate into 20 to 30%. The fix is not "never look" — it is sequential testing, which lets you stop early safely with boundaries calibrated for continuous monitoring.
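To see the inflation for yourself, here is a minimal A/A simulation: both arms are identical, so every "significant" result is a false positive. It assumes 20 equally spaced looks and a normal approximation for the z statistic. The function name and parameters are illustrative and not tied to any particular tool.

```python
import math
import random

def false_positive_rate(n_sims, n_looks, peek, seed=0):
    """Share of A/A tests declared significant at |z| > 1.96.

    peek=True  stops at the first look that crosses the threshold.
    peek=False only evaluates the final look (fixed horizon).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for look in range(1, n_looks + 1):
            total += rng.gauss(0.0, 1.0)      # one batch of null data (no true effect)
            z = total / math.sqrt(look)       # z statistic after this many batches
            is_last = look == n_looks
            if abs(z) > 1.96 and (peek or is_last):
                hits += 1
                break
    return hits / n_sims

fixed = false_positive_rate(4000, 20, peek=False)
peeking = false_positive_rate(4000, 20, peek=True)
print("fixed horizon:", fixed)
print("stop at first significant look:", peeking)
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate comes out several times higher. The exact figure depends on how often you look.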

Do I need to correct for multiple comparisons?

If you read several metrics, segments, or variants off one test and celebrate whichever crosses the line, then yes. Every extra comparison is another lottery ticket for a false positive. Without correction a meaningful share of your "wins" are noise; with it, you ship on signal.
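The lottery-ticket arithmetic, and one standard correction, fit in a few lines. This sketch assumes independent comparisons for the family-wise error formula and uses the Holm step-down procedure as an example. The p-values are made up, and Holm is one option among several rather than a description of what any specific tool does.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive across m independent null comparisons."""
    return 1 - (1 - alpha) ** m

def holm(p_values, alpha=0.05):
    """Holm step-down correction: returns a reject flag per p-value, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value faces alpha / m, the next alpha / (m - 1), and so on.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                              # stop at the first failure
    return reject

print("P(any false win) with 10 comparisons:", round(familywise_error(0.05, 10), 3))

p_values = [0.004, 0.03, 0.04, 0.20, 0.012]    # hypothetical metric reads from one test
print("uncorrected wins:", [p < 0.05 for p in p_values])
print("Holm-corrected wins:", holm(p_values))
```

With ten comparisons at 5% each, the chance of at least one false win is about 40%. In the made-up example, four of five reads look significant uncorrected, and two survive the correction.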

When should I hire a full-time experimentation lead vs. a fractional one?

Below roughly 8 to 10 tests a month, a full-time hire is hard to justify and the real problem is usually underpowered tests, not headcount. A fractional lead installs the rigor, owns the review, and levels up your team two to three days a week. When test volume and stakes outgrow that, you hire in-house and I hand it off.

What does the $4,000 audit actually include?

Two weeks, fixed fee. I review your statistical rigor, metric design, and tooling, then hand you a trustworthiness scorecard, a prioritized roadmap, and a team walkthrough. The guarantee: at least one finding that changes a shipping decision, or you don't pay. The fee credits toward your first month if we continue.

Which A/B testing tools do you work with?

The methodology is tool-agnostic — I audit whatever you run. In practice I am deepest in Statsig, Eppo, GrowthBook, and Amplitude. The point is not the tool; it is whether you are using its stats engine in a way you can actually trust.

Which of your numbers are safe to ship on?

Book a 20-minute call