Transformers United - DL Models that have revolutionized NLP, CV, RL

09-03-2023

  • url: https://www.youtube.com/watch?v=P127jhj-8-Y&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=3
  • early attention mechanisms
    • RNNs, Seq2Seq, LSTMs, GRUs
      • good for encoding history
      • bad for long sequences, context
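
As a quick illustration of that sequential bottleneck, here is a minimal NumPy sketch of a vanilla RNN cell (toy sizes and random weights of my own choosing, not code from the lecture): the whole history has to squeeze through one hidden state, one step at a time.

```python
import numpy as np

# Minimal vanilla RNN cell unrolled over a toy sequence.
# The hidden state h is the only channel carrying history forward, which is
# why long sequences lose early context and cannot be parallelized over time.
rng = np.random.default_rng(0)
d_in, d_h, seq_len = 8, 16, 5
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

x = rng.normal(size=(seq_len, d_in))   # one toy input sequence
h = np.zeros(d_h)                      # initial hidden state
for t in range(seq_len):               # strictly sequential: step t needs h from step t-1
    h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
print(h.shape)                         # (16,) -- the whole history compressed into one vector
```
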
  • current attention mechanisms
    • soft attention (see the sketch after this list)
      • learn attention weights in [0, 1] - closed interval / inclusive
        • e.g. image of a cat, slightly blurred (a little weight on every region)
      • differentiable, but expensive to compute
    • hard attention
      • learn attention weights in {0, 1} - a discrete set
        • e.g. image of a cat with the edges cut off (a single cropped region)
      • non-differentiable (hard selection cannot be trained by plain backpropagation)
    • local attention
      • a blend of the two: hard selection of a window, soft attention within it
    • global attention
      • soft attention over the full input
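
A minimal NumPy sketch of the soft/hard contrast above (toy sizes; soft_attention is a made-up helper name for illustration, not a library call):

```python
import numpy as np

def soft_attention(query, features):
    """Soft attention: every feature gets a weight in [0, 1] via a softmax, and
    the output is the weighted average -- differentiable, but touches every feature."""
    scores = features @ query                      # one score per feature vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # weights in [0, 1], summing to 1
    return weights @ features, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))                    # e.g. 6 image regions, 4-d each
q = rng.normal(size=4)
context, w = soft_attention(q, feats)
print(np.round(w, 3))                              # "blurry": a little of every region

# Hard attention would instead select a single region (weight in {0, 1}):
# cheaper to evaluate, but not trainable by plain backpropagation.
hard_context = feats[np.argmax(w)]
```
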
  • transformers
    • self-attention
      • Attention is All You Need
      • given a query, score every key by its similarity to the query and return the similarity-weighted combination of the corresponding values
        • self-attention is performed several times in parallel (multi-head attention) to provide multiple representation subspaces for each layer
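
A NumPy sketch of the query-key-value computation and the multi-head split described above (single toy example; no biases or output projection, and the function names are just illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query is scored against all keys; the softmax-normalized
    similarities weight the corresponding values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4                      # toy sizes
X = rng.normal(size=(n, d_model))                   # token embeddings
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

# Multi-head: split the projections into n_heads subspaces, attend in each,
# then concatenate -- the "multiple representation subspaces" per layer.
d_head = d_model // n_heads
heads = []
for h in range(n_heads):
    sl = slice(h * d_head, (h + 1) * d_head)
    heads.append(scaled_dot_product_attention((X @ W_q)[:, sl], (X @ W_k)[:, sl], (X @ W_v)[:, sl]))
out = np.concatenate(heads, axis=-1)                # (n, d_model)
```
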
    • power of transformers
      • positional embeddings
        • impart a notion of ordering; without them self-attention is order-agnostic, since every token attends to every other token symmetrically
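
One common way to impart that ordering is the fixed sinusoidal scheme from "Attention Is All You Need"; a NumPy sketch with toy sizes (learned position embeddings are an equally common alternative):

```python
import numpy as np

def sinusoidal_positional_embeddings(seq_len, d_model):
    """Fixed sin/cos position codes, one d_model-dimensional vector per position."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

# Added to the token embeddings so the otherwise order-blind attention sees position.
X = np.zeros((5, 16)) + sinusoidal_positional_embeddings(5, 16)
```
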
      • nonlinearities
        • implemented as a simple feed-forward network
        • allows for complex mappings between inputs and outputs
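
A NumPy sketch of that position-wise feed-forward network: two linear maps with a ReLU in between, applied independently at every position (toy dimensions, no claim about the lecture's exact sizes):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Applied to each position independently; this is the nonlinearity that
    lets the block learn complex input-output mappings."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                               # toy sizes
W1, b1 = rng.normal(scale=0.1, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.1, size=(d_ff, d_model)), np.zeros(d_model)
H = position_wise_ffn(rng.normal(size=(5, d_model)), W1, b1, W2, b2)   # (5, 16)
```
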
      • masking
        • parallelise operations while not looking at the future
        • keep information about the future from leaking into the past
        • used in decoder blocks
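
A NumPy sketch of causal masking: future positions are set to -inf before the softmax, so their weights come out exactly zero while every row is still computed in one parallel batch.

```python
import numpy as np

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))     # raw query-key scores

# Causal mask: position i may only attend to positions <= i, so nothing from
# the future leaks into the past, yet all rows are computed at once.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle is exactly 0: no attention to the future
```
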
    • encoder-decoder architecture
      • encoder block, two sub-layers (each followed by layer norm)
        • self-attention layer (captures linear relationships)
          • layer norm
        • feedforward layer (captures nonlinear relationships)
          • layer norm
      • decoder block, three sub-layers (each followed by layer norm)
        • masked self-attention layer (captures linear relationships)
          • masking: cannot look into the future
          • layer norm
        • multi-head attention layer over the encoder output (cross-attention)
          • layer norm
        • feedforward layer (captures nonlinear relationships)
          • layer norm
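
To tie the two blocks together, a single-head NumPy sketch of the wiring above (toy sizes, no biases, post-layer-norm residual connections; a minimal sketch of the standard architecture, not the lecture's exact code):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attend(Q, K, V, causal=False):
    """Single-head scaled dot-product attention; the causal flag hides the future."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        scores = np.where(np.triu(np.ones_like(scores, dtype=bool), k=1), -np.inf, scores)
    return softmax(scores) @ V

def encoder_block(X, P):
    """Self-attention sub-layer, then feed-forward sub-layer,
    each wrapped in a residual connection and layer norm."""
    X = layer_norm(X + attend(X @ P["Wq"], X @ P["Wk"], X @ P["Wv"]))
    return layer_norm(X + np.maximum(0, X @ P["W1"]) @ P["W2"])

def decoder_block(Y, enc_out, P):
    """Masked self-attention over the target, cross-attention over the encoder
    output, then feed-forward; each followed by residual + layer norm."""
    Y = layer_norm(Y + attend(Y @ P["Wq1"], Y @ P["Wk1"], Y @ P["Wv1"], causal=True))
    Y = layer_norm(Y + attend(Y @ P["Wq2"], enc_out @ P["Wk2"], enc_out @ P["Wv2"]))
    return layer_norm(Y + np.maximum(0, Y @ P["W1"]) @ P["W2"])

rng = np.random.default_rng(0)
d, d_ff = 16, 64
def w(*shape): return rng.normal(scale=0.1, size=shape)
enc_P = dict(Wq=w(d, d), Wk=w(d, d), Wv=w(d, d), W1=w(d, d_ff), W2=w(d_ff, d))
dec_P = dict(Wq1=w(d, d), Wk1=w(d, d), Wv1=w(d, d),
             Wq2=w(d, d), Wk2=w(d, d), Wv2=w(d, d), W1=w(d, d_ff), W2=w(d_ff, d))

src = rng.normal(size=(6, d))          # source token embeddings (positions omitted)
tgt = rng.normal(size=(4, d))          # target tokens generated so far
enc_out = encoder_block(src, enc_P)    # in practice several blocks are stacked
out = decoder_block(tgt, enc_out, dec_P)
print(out.shape)                       # (4, 16)
```
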
    • advantages
      • constant path length between any two positions in a sequence
        • every token is talking to every other token
      • good parallelization (no sequential computation within a layer)
    • disadvantages
      • self-attention is quadratic in sequence length (time and memory), which causes scaling issues for long sequences (see the sketch below)
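
A quick NumPy illustration of that quadratic cost: the attention score matrix alone is n x n, so doubling the sequence length quadruples its size.

```python
import numpy as np

# Time and memory for self-attention grow quadratically with sequence length,
# since every position is scored against every other position.
d = 64
for n in (128, 512, 2048):
    Q = K = np.zeros((n, d), dtype=np.float32)
    scores = Q @ K.T                                   # shape (n, n)
    print(n, scores.shape, f"{scores.nbytes / 1e6:.1f} MB")
```
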