AI research experimentation design

27-08-2025

  • https://www.amplifypartners.com/blog-posts/the-ai-research-experimentation-problem
  • difficulties in building genai applications
    • high dimensionality and stochastic behaviour make it hard to spot biases and minimize random errors
    • human preferences shift
    • report uncertainty via confidence intervals, use hypothesis tests to compare models, and explicitly plan for statistical power so experiments can reliably detect meaningful effects (see the bootstrap sketch after this list)
  • principles
    • designing better experiments
      • data contamination
        • test data (benchmarks) gets posted to sites across the internet and ends up included in training data
        • dataset scrubbing is difficult or impossible; some teams are watermarking benchmark datasets so models can recognise and skip them (see the canary-scan sketch after this list)
    • running better experiments
      • evaluations are often deferred until the end of the training run (instead of being used as a feedback mechanism applied throughout)
      • making evals faster and easier to integrate into training workflows is one high-impact way to accelerate AI progress (see the interleaved-eval sketch after this list)
      • evals should start small and expand as the system improves, offering valuable signal
      • writing good software requires a good toolchain to test and verify code quickly; fast feedback cycles are core
      • eliminating bugs
        • keep the debug loop tight
        • continuously validate with training invariants inferred from traces (rules that must hold true throughout training; see the invariant-check sketch after this list)
        • use logging, including hindsight logging with model checkpoints (replaying from a checkpoint to recover values that were not logged at the time)
    • analysing experiments better
      • mechanistic interpretability (activation patching, attribution patching, and probing; see the linear-probe sketch after this list)
        • circuit tracing to identify interpretable features
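
A minimal sketch of the statistics point above, not from the article: a paired bootstrap over test items turns the accuracy gap between two models into a confidence interval rather than a single number. The data, sizes, and seeds are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder per-example correctness (1 = correct) for two models
# evaluated on the same 500 test items
scores_a = rng.integers(0, 2, size=500)
scores_b = rng.integers(0, 2, size=500)

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=1):
    """Point estimate and (1 - alpha) CI for the accuracy gap a - b,
    resampling test items with replacement."""
    a, b = np.asarray(a), np.asarray(b)
    idx = np.random.default_rng(seed).integers(0, len(a), size=(n_boot, len(a)))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)

gap, (lo, hi) = paired_bootstrap_ci(scores_a, scores_b)
print(f"accuracy gap = {gap:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# an interval that straddles zero means the eval set is too small to
# distinguish the models at this effect size (an underpowered experiment)
```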
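
A sketch of the canary/watermark idea, assuming the benchmark ships with a known marker string: scan raw training shards for it before training starts. The canary value, directory layout, and helper name are made up for illustration.

```python
from pathlib import Path

# hypothetical marker planted in the eval set; real schemes publish a
# GUID-style canary string alongside the benchmark
CANARY = "BENCHMARK-CANARY-7f3a9c"

def contaminated_shards(data_dir: str) -> list[Path]:
    """Return training shards whose raw text contains the benchmark canary."""
    return [
        shard
        for shard in Path(data_dir).glob("*.txt")
        if CANARY in shard.read_text(errors="ignore")
    ]

if __name__ == "__main__":
    for shard in contaminated_shards("training_data/"):
        print(f"possible test-set leakage: {shard}")
```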
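
A sketch of interleaving cheap evals into the training loop, per the feedback-mechanism point above. The callables and cadence are assumptions; any framework's step and eval functions could be dropped in.

```python
from typing import Callable, Iterable, List, Tuple

def train_with_interleaved_evals(
    train_step: Callable[[object], float],   # one optimizer step, returns loss
    run_small_eval: Callable[[], float],     # cheap eval on a small fixed subset
    batches: Iterable[object],
    eval_every: int = 500,
) -> List[Tuple[int, float]]:
    """Run a small eval every `eval_every` steps so evaluation acts as a
    feedback signal during training, not a one-off measurement at the end."""
    history: List[Tuple[int, float]] = []
    for step, batch in enumerate(batches):
        loss = train_step(batch)
        if step % eval_every == 0:
            score = run_small_eval()
            history.append((step, score))
            print(f"step {step}: loss={loss:.4f}  small-eval={score:.4f}")
    return history
```

The eval set can start tiny (a handful of hand-picked prompts) and expand as the system improves, keeping the loop fast while still giving signal.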
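
A sketch of continuously checked training invariants. The specific rules and thresholds (finite loss, bounded gradient norm, non-zero learning rate) are illustrative assumptions, not from the post.

```python
import math

def check_step_invariants(step: int, loss: float, grad_norm: float, lr: float,
                          max_grad_norm: float = 1e3) -> None:
    """Fail fast if a per-step trace violates a rule that should hold all run."""
    assert math.isfinite(loss), f"step {step}: loss is NaN/inf ({loss})"
    assert loss >= 0.0, f"step {step}: negative loss {loss}"
    assert math.isfinite(grad_norm) and grad_norm < max_grad_norm, \
        f"step {step}: exploding gradient norm {grad_norm}"
    assert lr > 0.0, f"step {step}: learning-rate schedule hit zero early"

# called once per step with values pulled from the training trace,
# so a violation surfaces close to the step that caused it
check_step_invariants(step=1200, loss=2.31, grad_norm=4.7, lr=3e-4)
```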
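
A sketch of a linear probe, one of the interpretability tools listed above: fit a simple classifier on cached activations to test whether a concept is linearly decodable from a given layer. The activations and labels here are synthetic stand-ins; in practice they would come from forward hooks on the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 768))    # [examples, hidden_dim]
labels = (activations[:, 0] > 0).astype(int)  # stand-in for a real concept label

X_tr, X_te, y_tr, y_te = train_test_split(activations, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
# high accuracy at a layer suggests the concept is linearly decodable there;
# comparing layers and checkpoints shows where and when it emerges
```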

We talk a lot about scaling models. Maybe it's time we scaled good science.