AI research experimentation design
27-08-2025
- https://www.amplifypartners.com/blog-posts/the-ai-research-experimentation-problem
- difficulties in building genai applications
- high dimensionality and stochastic nature, hard to spot biases and minimize random errors
- human preferences shift
- report uncertainty via confidence intervals, using hypothesis tests to compare models, and explicitly planning for statistical power so experiments can reliably detect meaningful effects
- principles
- designing better experiments
- data contamination
- test data (benchmarks) gets added to sites across the internetincluded in training data
- dataset scrubbing is difficult / impossibls, some are watermarking becnhmark datasets so models can recognise and skip
- data contamination
- running better experiments
- evaluations are often deffered until the end of the training run (instead of being used as a feedback mechanism applied throughout)
- making evals faster and easier to integrate into training workflows is one high-impact way to accelerate AI progress
- evals should start small and expand as the system improves, offering valuable signal
- writing good software requires a good toolchain to test and verify code quickly, fast feedback cycles are core
- elimating bugs
- keep the debug loop tight
- continuosuly validate with training invariants inferred through traces (rules that must hold true throughout training)
- use logging (hindsight logging with model checkpoints)
- analysing experiments better
- mechanistic interpretability (activation patching, attribution patching, and probing)
- circuit tracing to idenitfy interpretable features
- mechanistic interpretability (activation patching, attribution patching, and probing)
- designing better experiments
We talk a lot about scaling models. Maybe it's time we scaled good science.