The second half of AI shifts from solving problems to defining problems

27-08-2025

https://ysymyth.github.io/The-Second-Half/

  • ai progress
    • current ai progress is focussed on building new models and methods to improve existing benchmarks
    • future ai progress should be focussed on evaluating real-world progress (evaluations become more important than training, with a shift from solving problems to defining problems and measuring real-world value add)
  • contrast to reinforcement learning
    • RL requires algorithms, environments, priors
      • without priors, reinforcement learning cannot generalise well (for example, it cannot match the zero-shot performance of humans on new games)
      • priors can be obtained in ways completely unrelated to RL, for example by using language pre-training to distill general commonsense knowledge (some domains lie outside the distribution of internet text, so not all tasks benefit equally, like controlling computers or playing video games)
    • how can we make agents generalise across tasks?
      • couple the right priors (language pre-training) with the right environments (language reasoning as actions), and the choice of algorithm becomes the most trivial part; recent progress with agents reflects this
  • the utility problem
    • existing systems match or surpass human experts on many benchmarks, but this doesn't seem to translate into real-world value (not much has changed)
    • the current working recipe has worked fine because, when intelligence is low, improving intelligence generally improves utility
    • as intelligence increases, the way in which further progress translates into real-world value becomes less clear
  • thinking about evaluation setups
    • current evaluation setups are different from real-world setups in many ways
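      • current evaluations run automatically (the agent receives a task and works alone, whereas real-world tasks involve interacting with people along the way)
      • current evaluations assume tasks are independent and identically distributed (IID): each task is solved in isolation and the scores are averaged, whereas real-world tasks are solved sequentially and experience carries over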
  • new evaluation setups
    • working recipe (1): build models which beat benchmarks -> create harder benchmarks -> repeat
    • working recipe (2): build evaluation setups for real-world utility -> use working recipe (1) -> repeat
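
To illustrate the IID point under "thinking about evaluation setups", here is a minimal Python sketch. Everything in it is a made-up assumption for illustration: `ToyAgent`, its "solves a task family only after seeing it once" rule, and the task list are not from the blog post. It only shows how the same agent can score differently depending on whether experience is reset between tasks (IID-style) or carried over (sequential, closer to real-world use).

```python
class ToyAgent:
    """Hypothetical agent: solves a task family only if it has seen that family before."""

    def __init__(self):
        self.seen = set()

    def reset(self):
        # Forget all prior experience.
        self.seen.clear()

    def solve(self, task_family: str) -> bool:
        solved = task_family in self.seen
        self.seen.add(task_family)
        return solved


def evaluate_iid(agent, tasks):
    """IID-style evaluation: reset before every task, then average the scores."""
    scores = []
    for task in tasks:
        agent.reset()  # no experience carried over between tasks
        scores.append(1.0 if agent.solve(task) else 0.0)
    return sum(scores) / len(scores)


def evaluate_sequential(agent, tasks):
    """Sequential evaluation: tasks arrive in order and experience accumulates."""
    agent.reset()
    scores = [1.0 if agent.solve(task) else 0.0 for task in tasks]
    return sum(scores) / len(scores)


# Made-up task stream with repeated task families.
tasks = ["refactor", "refactor", "debug", "refactor", "debug"]
iid_score = evaluate_iid(ToyAgent(), tasks)
seq_score = evaluate_sequential(ToyAgent(), tasks)
print(f"IID: {iid_score:.1f}, sequential: {seq_score:.1f}")
```

In this toy setup the IID-style score hides the agent's ability to improve with experience, which is the kind of gap between benchmark setup and real-world setup the notes describe.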