ARCHITECTURE · JAN 2024 · 8 MIN READ

From Sequential To Parallel: Scaling Experimentation Inference

A technical retrospective on re-architecting a data pipeline to reduce runtime by 75% and unlock $150M+ in revenue attribution through variance reduction.
Read the strategy behind this architecture →

Context

Fortune 100 retailer, tens of millions of shoppers, dozens of concurrent experiments. The challenge was to scale from 3 to 8+ tests/week while maintaining statistical rigor.

The Problem

Legacy system: 15+ sequential queries per experiment, roughly 15 minutes of runtime. As the metric count grew, latency scaled linearly toward 25-30 minutes.

Bottleneck: the business needed 8+ concurrent experiments per week, which was untenable at ~15 minutes per test.

LEGACY ARCHITECTURE (Sequential)
Runtime: ~15 mins / experiment

[ Transactions Query ] 
        ⬇
[  Engagement Query  ] 
        ⬇
[     Margin Query   ] 
        ⬇
[      ... x12       ]
        ⬇
[   Hypothesis Test  ]

🔴 O(N) table scans: one scan per metric
🔴 High I/O overhead and redundant reads
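
For reference, the legacy readout loop looked roughly like the sketch below. This is illustrative only; run_query and the metric SQL are hypothetical stand-ins, but the shape is the point: one full query, and one table scan, per metric.

# Illustrative sketch of the legacy sequential readout (not the production code).
# Each metric issued its own query, so runtime grew linearly with metric count.

METRIC_QUERIES = {
    "transactions": "SELECT user_id, SUM(sales) AS sales FROM txns GROUP BY user_id",
    "engagement":   "SELECT user_id, COUNT(*) AS visits FROM events GROUP BY user_id",
    "margin":       "SELECT user_id, SUM(margin) AS margin FROM txns GROUP BY user_id",
    # ... x12 more, each re-scanning one of the same few source tables
}

def run_legacy_readout(run_query):
    """run_query(sql) -> DataFrame is a placeholder for the warehouse client."""
    results = {}
    for name, sql in METRIC_QUERIES.items():  # strictly one query at a time
        results[name] = run_query(sql)        # each call re-scans its source table
    return results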

The Solution

Part 1: Parallel Metric Bundles

We re-architected the system to treat metrics not as individual queries, but as bundles based on their source. By grouping all transaction-based metrics into a single pass and parallelizing independent streams, we reduced runtime to ~4 minutes (75% reduction).

OPTIMIZED ARCHITECTURE (Parallel Bundles)
Runtime: ~4 mins / experiment

[ Txn Bundle ]    [ Eng Bundle ]    [ Other ]
      │                 │               │
      └─────────┬───────┴───────────────┘
                ▼
      [   Unified Merge   ]
                ▼
      [  CUPED Variance-  ]
      [     Reduction     ]
                ▼
      [  Dashboard Output ]

🟢 Single pass per massive table
🟢 Parallel execution & Pre-computation
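
In practice, the fan-out can be as simple as one query per source table, submitted concurrently and merged on user_id. The sketch below shows the idea with BigQuery and a thread pool; the bundle SQL, table names, and parameters are assumptions for illustration, not the production implementation.

# Sketch of the bundled, parallel readout (illustrative; schema and SQL are assumptions).
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

# One query per *source table*, computing every metric from that table in a single pass.
BUNDLE_QUERIES = {
    "transactions": """
        SELECT user_id,
               SUM(sales)  AS sales,
               SUM(margin) AS margin,
               COUNT(DISTINCT order_id) AS orders
        FROM `project.dataset.transactions`
        WHERE dt BETWEEN @start AND @end
        GROUP BY user_id""",
    "engagement": """
        SELECT user_id,
               COUNT(*) AS visits,
               COUNTIF(event = 'search') AS searches
        FROM `project.dataset.events`
        WHERE dt BETWEEN @start AND @end
        GROUP BY user_id""",
}

def run_bundle(client, sql, start, end):
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("start", "DATE", start),
        bigquery.ScalarQueryParameter("end", "DATE", end),
    ])
    return client.query(sql, job_config=job_config).to_dataframe()

def run_readout(start, end):
    client = bigquery.Client()
    with ThreadPoolExecutor(max_workers=len(BUNDLE_QUERIES)) as pool:
        futures = {name: pool.submit(run_bundle, client, sql, start, end)
                   for name, sql in BUNDLE_QUERIES.items()}
        frames = [f.result() for f in futures.values()]
    # Unified merge on user_id before CUPED adjustment and hypothesis testing.
    merged = frames[0]
    for df in frames[1:]:
        merged = merged.merge(df, on="user_id", how="outer")
    return merged

Because each bundle scans its source table exactly once, adding a new metric to an existing bundle is nearly free, which is what keeps runtime flat as the metric count grows.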

Part 2: CUPED Variance Reduction

We implemented CUPED (Controlled-experiment Using Pre-Experiment Data), using pre-experiment user behavior as covariates to reduce metric variance. This lowered the Minimum Detectable Effect (MDE) by 30%, allowing us to shorten test duration from 6-8 weeks to 3-4 weeks.
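
Conceptually, CUPED subtracts the part of each user's in-experiment metric that is predictable from their pre-experiment behavior. A minimal sketch of the adjustment; the column names are illustrative assumptions, not the production schema.

# Minimal CUPED sketch: adjust metric Y using the same user's pre-experiment value X.
#   theta   = cov(X, Y) / var(X)
#   Y_cuped = Y - theta * (X - mean(X))
import numpy as np
import pandas as pd

def cuped_adjust(df: pd.DataFrame, metric: str, pre_metric: str) -> pd.Series:
    x = df[pre_metric].fillna(0.0)   # pre-experiment covariate (e.g. prior-period sales)
    y = df[metric]                   # in-experiment metric
    theta = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
    return y - theta * (x - x.mean())

# e.g. df["sales_cuped"] = cuped_adjust(df, "sales", "pre_period_sales")
# Variance drops by roughly corr(X, Y)^2, which is what lowers the MDE
# and lets tests conclude in 3-4 weeks instead of 6-8.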

The "Shadow" Implementation

While this architectural shift drove the performance gains, we didn't wait for a full-stack engineering team to build the platform before we started reaping the rewards. I needed a way to execute these parallel bundles now, not in six months.

To do that, I built a "shadow" application using nothing but Jupyter Notebooks, Excel config files, and a driver script. It was the scrappy prototype that validated the architecture.
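
The driver really was that simple. Below is a hypothetical sketch of the pattern, assuming papermill for parameterized notebook execution; the file names and config columns are illustrative, since the actual stack was just Jupyter Notebooks, Excel config files, and a driver script.

# Hypothetical driver-script sketch: read experiment configs from an Excel file
# and execute a parameterized analysis notebook per experiment.
# papermill and the file/column names here are assumptions for illustration.
import pandas as pd
import papermill as pm

configs = pd.read_excel("experiments.xlsx")  # one row per experiment

for _, cfg in configs.iterrows():
    pm.execute_notebook(
        "readout_template.ipynb",                      # the parallel-bundle readout notebook
        f"runs/{cfg['experiment_id']}_readout.ipynb",  # executed copy, kept for audit
        parameters={
            "experiment_id": cfg["experiment_id"],
            "start_date": str(cfg["start_date"]),
            "end_date": str(cfg["end_date"]),
        },
    )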

READ THE IMPLEMENTATION STORY

The "Sandbox" Strategy: How I Turned a Jupyter Notebook into an Enterprise App

Results

  • Inference Time: Reduced from 15 min to 4 min.
  • Velocity: Scaled from 3 to 8+ tests/week.
  • Sensitivity: Can detect 0.5% lifts (vs 1.5% before).
  • Impact: $150M+ in attributed revenue enabled by velocity + sensitivity.

Technologies

Python · PySpark · GCP BigQuery · Bayesian Statistics · CUPED

Lessons

  1. Aiming for O(1) table scans per source forces good architecture. Constraints drive better design than unlimited compute.
  2. Variance reduction compounds with velocity. Lower noise means faster tests, which means more tests, which means more learning.
  3. Infrastructure quality enables organizational velocity. You can't change culture if the tools are slow.