Context
Fortune 100 retailer, tens of millions of shoppers, dozens of concurrent experiments. The challenge was to scale from 3 to 8+ tests/week while maintaining statistical rigor.
The Problem
Legacy system: 15+ sequential queries per experiment = ~15 min runtime. As metrics increased, latency scaled linearly toward 25-30 minutes.
Bottleneck: the business needed 8+ concurrent experiments per week, which was untenable at 15+ minutes per readout.
LEGACY ARCHITECTURE (Sequential)
Runtime: ~15 mins / experiment
[ Transactions Query ]
⬇
[ Engagement Query ]
⬇
[ Margin Query ]
⬇
[ ... x12 ]
⬇
[ Hypothesis Test ]
🔴 Table scans scale O(N) with the metric count
🔴 High I/O overhead & redundant reads
The Solution
Part 1: Parallel Metric Bundles
We re-architected the system to treat metrics not as individual queries but as bundles grouped by their source table. By computing all transaction-based metrics in a single pass and running the independent bundles in parallel, we reduced runtime to ~4 minutes (a 75% reduction); the pattern is sketched in code after the diagram.
OPTIMIZED ARCHITECTURE (Parallel Bundles)
Runtime: ~4 mins / experiment
[ Txn Bundle ] [ Eng Bundle ] [ Other ]
│ │ │
└─────────┬───────┴───────────────┘
▼
[ Unified Merge ]
▼
[ CUPED Variance- ]
[ Reduction ]
▼
[ Dashboard output ]
🟢 Single pass per massive table
🟢 Parallel execution & pre-computation
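As a concrete illustration, here is a minimal sketch of the bundle-and-parallelize pattern in Python. The bundle SQL, the read_sql_df stub, and the (user_id, arm) keys are illustrative assumptions, not the production code:

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Each bundle is ONE query that computes every metric drawn from that
# source table in a single scan, keyed by (user_id, arm).
BUNDLES = {
    "transactions": "SELECT user_id, arm, SUM(spend) AS spend FROM txns GROUP BY 1, 2",
    "engagement": "SELECT user_id, arm, COUNT(*) AS visits FROM events GROUP BY 1, 2",
}

def read_sql_df(sql: str) -> pd.DataFrame:
    """Stub for illustration: swap in your warehouse client
    (e.g. pd.read_sql against a live connection)."""
    return pd.DataFrame(
        {"user_id": [1, 2], "arm": ["control", "treatment"], "value": [10.0, 12.5]}
    )

def run_all_bundles() -> pd.DataFrame:
    # The bundles scan different tables, so they can run concurrently.
    # Threads suffice here: the work is I/O-bound warehouse waits, not CPU.
    with ThreadPoolExecutor(max_workers=len(BUNDLES)) as pool:
        frames = list(pool.map(read_sql_df, BUNDLES.values()))
    # Unified merge: align every bundle on (user_id, arm) before testing.
    merged = frames[0]
    for df in frames[1:]:
        merged = merged.merge(df, on=["user_id", "arm"], how="outer", suffixes=("", "_dup"))
    return merged

if __name__ == "__main__":
    print(run_all_bundles())
```

The key design choice: the number of table scans is fixed by the number of source tables, not the number of metrics, so adding a metric to a bundle is nearly free.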
Part 2: CUPED Variance Reduction
Implemented CUPED (Controlled-Experiment Using Pre-Experiment Data), using pre-experiment user behavior as covariates to strip predictable variance out of each metric. This cut the Minimum Detectable Effect (MDE) by 30%; because MDE scales as σ/√n, a 30% lower MDE at fixed duration corresponds to roughly half the variance, and at a fixed MDE target that halves the required sample, letting us shorten test duration from 6-8 weeks to 3-4 weeks.
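The core adjustment is only a few lines. A minimal NumPy sketch on toy data (not the production pipeline):

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED: subtract the part of the in-experiment metric y that is
    predictable from the pre-experiment covariate x.
    theta is the OLS slope of y on x; the adjusted metric keeps the same
    mean but its variance shrinks by a factor of (1 - rho**2)."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Toy demo: pre-period spend predicts in-period spend, so CUPED
# removes most of the between-user noise.
rng = np.random.default_rng(0)
x = rng.gamma(2.0, 50.0, 100_000)             # pre-experiment spend
y = 0.8 * x + rng.normal(0.0, 30.0, 100_000)  # in-experiment spend
print(np.var(y), np.var(cuped_adjust(y, x)))  # variance drops sharply
```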
The "Shadow" Implementation
While this architectural shift drove the performance gains, we didn't wait for a full-stack engineering team to build the platform before we started reaping the rewards. I needed a way to execute these parallel bundles now, not in six months.
To do that, I built a "shadow" application using nothing but Jupyter Notebooks, Excel config files, and a driver script. It was the scrappy prototype that validated the architecture.
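A sketch of what such a driver can look like, using papermill to execute a parameterized notebook per row of the Excel config. Papermill and every file and column name here are assumptions for illustration; the actual stack was simply notebooks, Excel configs, and a driver script:

```python
import pandas as pd
import papermill as pm

# One row per experiment readout in the Excel config (hypothetical schema:
# experiment_id, start_date, end_date).
config = pd.read_excel("experiments_config.xlsx")

for _, row in config.iterrows():
    # Execute the analysis notebook with this experiment's parameters,
    # writing a fully rendered per-experiment notebook as the artifact.
    pm.execute_notebook(
        "readout_template.ipynb",
        f"runs/readout_{row['experiment_id']}.ipynb",
        parameters={
            "experiment_id": row["experiment_id"],
            "start_date": str(row["start_date"]),
            "end_date": str(row["end_date"]),
        },
    )
```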
The "Sandbox" Strategy: How I Turned a Jupyter Notebook into an Enterprise App
Results
- ➜ Inference Time: Reduced from 15 min to 4 min.
- ➜ Velocity: Scaled from 3 to 8+ tests/week.
- ➜ Sensitivity: Can detect 0.5% lifts (vs 1.5% before).
- ➜ Impact: $150M+ in attributed revenue enabled by velocity + sensitivity.
Technologies
Lessons
- Holding table scans to O(1) in the metric count forces good architecture. Constraints drive better design than unlimited compute.
- Variance reduction compounds with velocity. Lower noise means faster tests, which means more tests, which means more learning.
- Infrastructure quality enables organizational velocity. You can't change culture if the tools are slow.