AI research experimentation design

27-08-2025

  • https://www.amplifypartners.com/blog-posts/the-ai-research-experimentation-problem
  • difficulties in building genai applications
    • high dimensionality and stochastic behaviour make it hard to spot biases and minimize random errors
    • human preferences shift
    • report uncertainty via confidence intervals, use hypothesis tests to compare models, and explicitly plan for statistical power so experiments can reliably detect meaningful effects (see the bootstrap sketch after this list)
  • principles
    • designing better experiments
      • data contamination
        • test data (benchmarks) gets posted to sites across the internet and ends up included in training data
        • dataset scrubbing is difficult or impossible; some teams are watermarking benchmark datasets so models can recognise and skip them (see the canary-scan sketch after this list)
    • running better experiments
      • evaluations are often deferred until the end of the training run (instead of being used as a feedback mechanism applied throughout)
      • making evals faster and easier to integrate into training workflows is one high-impact way to accelerate AI progress (see the interleaved-eval sketch after this list)
      • evals should start small and expand as the system improves, offering valuable signal
      • writing good software requires a good toolchain to test and verify code quickly; fast feedback cycles are core
      • eliminating bugs
        • keep the debug loop tight
        • continuously validate with training invariants inferred from traces (rules that must hold true throughout training; see the invariant-check sketch after this list)
        • use logging, including hindsight logging with model checkpoints (replaying from a checkpoint to recover values that were not logged at the time)
    • analysing experiments better
      • mechanistic interpretability (activation patching, attribution patching, and probing; see the linear-probe sketch after this list)
        • circuit tracing to identify interpretable features
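
A minimal sketch of the statistics point above, not from the article: a paired bootstrap over test items turns the accuracy gap between two models into a confidence interval rather than a single number. The data, sizes, and seeds are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder per-example correctness (1 = correct) for two models
# evaluated on the same 500 test items
scores_a = rng.integers(0, 2, size=500)
scores_b = rng.integers(0, 2, size=500)

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=1):
    """Point estimate and (1 - alpha) CI for the accuracy gap a - b,
    resampling test items with replacement."""
    a, b = np.asarray(a), np.asarray(b)
    idx = np.random.default_rng(seed).integers(0, len(a), size=(n_boot, len(a)))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)

gap, (lo, hi) = paired_bootstrap_ci(scores_a, scores_b)
print(f"accuracy gap = {gap:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# an interval that straddles zero means the eval set is too small to
# distinguish the models at this effect size (an underpowered experiment)
```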
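
A sketch of the canary/watermark idea, assuming the benchmark ships with a known marker string: scan raw training shards for it before training starts. The canary value, directory layout, and helper name are made up for illustration.

```python
from pathlib import Path

# hypothetical marker planted in the eval set; real schemes publish a
# GUID-style canary string alongside the benchmark
CANARY = "BENCHMARK-CANARY-7f3a9c"

def contaminated_shards(data_dir: str) -> list[Path]:
    """Return training shards whose raw text contains the benchmark canary."""
    return [
        shard
        for shard in Path(data_dir).glob("*.txt")
        if CANARY in shard.read_text(errors="ignore")
    ]

if __name__ == "__main__":
    for shard in contaminated_shards("training_data/"):
        print(f"possible test-set leakage: {shard}")
```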
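
A sketch of interleaving cheap evals into the training loop, per the feedback-mechanism point above. The callables and cadence are assumptions; any framework's step and eval functions could be dropped in.

```python
from typing import Callable, Iterable, List, Tuple

def train_with_interleaved_evals(
    train_step: Callable[[object], float],   # one optimizer step, returns loss
    run_small_eval: Callable[[], float],     # cheap eval on a small fixed subset
    batches: Iterable[object],
    eval_every: int = 500,
) -> List[Tuple[int, float]]:
    """Run a small eval every `eval_every` steps so evaluation acts as a
    feedback signal during training, not a one-off measurement at the end."""
    history: List[Tuple[int, float]] = []
    for step, batch in enumerate(batches):
        loss = train_step(batch)
        if step % eval_every == 0:
            score = run_small_eval()
            history.append((step, score))
            print(f"step {step}: loss={loss:.4f}  small-eval={score:.4f}")
    return history
```

The eval set can start tiny (a handful of hand-picked prompts) and expand as the system improves, keeping the loop fast while still giving signal.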
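
A sketch of continuously checked training invariants. The specific rules and thresholds (finite loss, bounded gradient norm, non-zero learning rate) are illustrative assumptions, not from the post.

```python
import math

def check_step_invariants(step: int, loss: float, grad_norm: float, lr: float,
                          max_grad_norm: float = 1e3) -> None:
    """Fail fast if a per-step trace violates a rule that should hold all run."""
    assert math.isfinite(loss), f"step {step}: loss is NaN/inf ({loss})"
    assert loss >= 0.0, f"step {step}: negative loss {loss}"
    assert math.isfinite(grad_norm) and grad_norm < max_grad_norm, \
        f"step {step}: exploding gradient norm {grad_norm}"
    assert lr > 0.0, f"step {step}: learning-rate schedule hit zero early"

# called once per step with values pulled from the training trace,
# so a violation surfaces close to the step that caused it
check_step_invariants(step=1200, loss=2.31, grad_norm=4.7, lr=3e-4)
```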
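
A sketch of a linear probe, one of the interpretability tools listed above: fit a simple classifier on cached activations to test whether a concept is linearly decodable from a given layer. The activations and labels here are synthetic stand-ins; in practice they would come from forward hooks on the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 768))    # [examples, hidden_dim]
labels = (activations[:, 0] > 0).astype(int)  # stand-in for a real concept label

X_tr, X_te, y_tr, y_te = train_test_split(activations, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
# high accuracy at a layer suggests the concept is linearly decodable there;
# comparing layers and checkpoints shows where and when it emerges
```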

We talk a lot about scaling models. Maybe it's time we scaled good science.