Consolidation of model architecture

October 1, 2023

Consolidation of model architecture:

  • in 2000s
    • completely independent architectures per field (e.g. vision, speech, NLP, RL), some not even ML-based
    • little collaboration
  • in 2010s
    • diverse architectures, but a transition to ML and specifically neural networks
    • easier to collaborate
  • in 2020s
    • convergence on the transformer as underlying architecture
    • extremely simple/flexible modelling framework (e.g. train on sequences of words, image patches, or state / action / reward transitions)
    • research ideas are easily shared and relevant across domains
      • reinforcing cycle of progress
      • concentrates software, hardware and infrastructure effort on one architecture
    • parallel in biology
      • neocortex has a highly uniform architecture across input modalities, indicating that a unified architecture might be an efficient design principle
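
To make the "train on sequences" point concrete, here is a minimal sketch of how three different modalities can each be reduced to a list of vectors. The function names and toy encodings (one-hot words, flattened grayscale patches, concatenated transitions) are assumptions chosen for illustration, not a method taken from these notes; the patch splitter also assumes the image height and width are divisible by the patch size.

```python
# Illustrative only: every name below is hypothetical, not a library API.

def text_to_sequence(text, vocab):
    """Map whitespace-separated words to one-hot vectors over a toy vocabulary."""
    seq = []
    for word in text.split():
        vec = [0.0] * len(vocab)
        vec[vocab[word]] = 1.0
        seq.append(vec)
    return seq


def image_to_patches(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened
    patch vectors, in row-major patch order.
    Assumes H and W are divisible by `patch`."""
    height, width = len(image), len(image[0])
    seq = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            seq.append([float(image[r][c])
                        for r in range(top, top + patch)
                        for c in range(left, left + patch)])
    return seq


def trajectory_to_sequence(transitions):
    """Turn (state, action, reward) transitions into one vector per step."""
    return [list(state) + [float(action), float(reward)]
            for state, action, reward in transitions]


words = text_to_sequence("the cat sat", {"the": 0, "cat": 1, "sat": 2})
patches = image_to_patches([[0, 1, 2, 3],
                            [4, 5, 6, 7],
                            [8, 9, 10, 11],
                            [12, 13, 14, 15]], patch=2)
steps = trajectory_to_sequence([((0.0, 1.0), 1, 0.5)])
```

Each result is a plain sequence of vectors, so the same sequence model could in principle consume any of them. The vector widths differ per modality here; a per-modality projection to a shared width would typically sit in front of the model.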

Distinguishing features between transformers:

  • the data
  • the input / output specification that maps problem into and out of a sequence of vectors
  • the type of positional encoder and problem-specific structured sparsity pattern in the attention mask
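
The last item can be sketched in a few lines. The code below shows one common positional encoding (the sinusoidal scheme) and two example sparsity patterns for the attention mask, a causal mask and a banded local mask. These particular choices and all function names are assumptions for illustration; the notes do not say which encoder or mask any given model uses.

```python
import math


def sinusoidal_positions(n, dim):
    """Sinusoidal positional encodings: sin/cos pairs at geometrically
    spaced frequencies. Assumes `dim` is even."""
    table = []
    for pos in range(n):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row += [math.sin(angle), math.cos(angle)]
        table.append(row)
    return table


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_mask(n, window):
    """Banded mask: position i attends only to positions within `window` steps."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax where disallowed positions get zero weight.
    Assumes every row has at least one allowed position."""
    out = []
    for row, allowed in zip(scores, mask):
        top = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - top) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# With equal scores, a causal mask spreads weight evenly over the visible prefix.
weights = masked_softmax([[0.0] * 3 for _ in range(3)], causal_mask(3))
```

Swapping `causal_mask` for `local_mask` (or any other boolean pattern) changes which positions can exchange information without touching the rest of the attention computation, which is why the mask is a natural place to encode problem-specific structure.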

In-context learning:

  • special-purpose computers
    • previous neural network architectures
  • general-purpose computers
    • two ingredients for general-purpose computers
      • appropriate architecture
      • training objective hard enough that optimisation is forced to find a general-purpose solution in the network's weight space
    • transformer architecture
      • language modelling (next-word prediction) is a great objective: simple to define, easy to collect data for at scale, and implicitly multi-task across domains
      • ability to learn at runtime via activations rather than via changes to the model's weights
        • reconfigurable at runtime to run natural language programs
      • emergent attributes that are observed only at scale
    • the core unlock was achieving a general-purpose computer as a neural net, via simple, scalable objectives that have a strong training signal
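
As a rough illustration of why next-word prediction is simple to define, the sketch below builds training examples by pairing every prefix with the token that follows it, and scores a model by average negative log-likelihood. It also shows a few-shot prompt, where the "program" is supplied in the context rather than in the weights. All names are hypothetical, the uniform model is a stand-in for a real network, and the `->` prompt format is an arbitrary choice.

```python
import math


def next_token_examples(tokens):
    """Every prefix of the sequence becomes an input; the token that
    follows it is the target. No labels beyond the raw text are needed."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def average_nll(examples, predict):
    """Mean negative log-likelihood of the targets, where `predict(prefix)`
    returns a dict mapping each candidate next token to a probability."""
    total = 0.0
    for prefix, target in examples:
        total -= math.log(predict(prefix)[target])
    return total / len(examples)


def uniform_model(vocab):
    """Stand-in model (an assumption): ignores the prefix entirely."""
    prob = 1.0 / len(vocab)
    return lambda prefix: {token: prob for token in vocab}


def few_shot_prompt(demos, query):
    """Pack demonstrations and a query into one context string.
    The task is specified by the context; no weights change."""
    lines = [f"{x} -> {y}" for x, y in demos]
    lines.append(f"{query} ->")
    return "\n".join(lines)


examples = next_token_examples(["the", "cat", "sat"])
loss = average_nll(examples, uniform_model(["the", "cat", "sat", "mat"]))
prompt = few_shot_prompt([("cat", "chat"), ("dog", "chien")], "bird")
```

A trained model would lower the loss below the uniform baseline by using the prefix. The few-shot prompt relies on that same prefix-conditioning at runtime, which is the sense in which learning happens in activations.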

References: