Transformers United - DL Models that have revolutionized NLP, CV, RL

09-03-2023

  • url: https://www.youtube.com/watch?v=P127jhj-8-Y&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=3
  • early attention mechanisms
    • RNNs, Seq2Seq, LSTMs, GRUs
      • good for encoding history
      • bad for long sequences, context
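
As a quick illustration of that sequential bottleneck, here is a minimal NumPy sketch of a vanilla RNN cell (toy sizes and random weights of my own choosing, not code from the lecture): the whole history has to squeeze through one hidden state, one step at a time.

```python
import numpy as np

# Minimal vanilla RNN cell unrolled over a toy sequence.
# The hidden state h is the only channel carrying history forward, which is
# why long sequences lose early context and cannot be parallelized over time.
rng = np.random.default_rng(0)
d_in, d_h, seq_len = 8, 16, 5
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

x = rng.normal(size=(seq_len, d_in))   # one toy input sequence
h = np.zeros(d_h)                      # initial hidden state
for t in range(seq_len):               # strictly sequential: step t needs h from step t-1
    h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
print(h.shape)                         # (16,) -- the whole history compressed into one vector
```
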
  • current attention mechanisms
    • soft attention (see the sketch after this list)
      • learn attention weights in [0, 1] - closed interval / inclusive
        • e.g. image of a cat, slightly blurred (a little weight on every region)
      • differentiable, but expensive to compute
    • hard attention
      • learn attention weights in {0, 1} - a discrete set
        • e.g. image of a cat with the edges cut off (a single cropped region)
      • non-differentiable (hard selection cannot be trained by plain backpropagation)
    • local attention
      • a blend of the two: hard selection of a window, soft attention within it
    • global attention
      • soft attention over the full input
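
A minimal NumPy sketch of the soft/hard contrast above (toy sizes; soft_attention is a made-up helper name for illustration, not a library call):

```python
import numpy as np

def soft_attention(query, features):
    """Soft attention: every feature gets a weight in [0, 1] via a softmax, and
    the output is the weighted average -- differentiable, but touches every feature."""
    scores = features @ query                      # one score per feature vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # weights in [0, 1], summing to 1
    return weights @ features, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))                    # e.g. 6 image regions, 4-d each
q = rng.normal(size=4)
context, w = soft_attention(q, feats)
print(np.round(w, 3))                              # "blurry": a little of every region

# Hard attention would instead select a single region (weight in {0, 1}):
# cheaper to evaluate, but not trainable by plain backpropagation.
hard_context = feats[np.argmax(w)]
```
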
  • transformers
    • self-attention
      • Attention is All You Need
      • given a query, score every key by its similarity to the query and return the similarity-weighted combination of the corresponding values
        • self-attention is performed several times in parallel (multi-head attention) to provide multiple representation subspaces for each layer
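
A NumPy sketch of the query-key-value computation and the multi-head split described above (single toy example; no biases or output projection, and the function names are just illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query is scored against all keys; the softmax-normalized
    similarities weight the corresponding values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4                      # toy sizes
X = rng.normal(size=(n, d_model))                   # token embeddings
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

# Multi-head: split the projections into n_heads subspaces, attend in each,
# then concatenate -- the "multiple representation subspaces" per layer.
d_head = d_model // n_heads
heads = []
for h in range(n_heads):
    sl = slice(h * d_head, (h + 1) * d_head)
    heads.append(scaled_dot_product_attention((X @ W_q)[:, sl], (X @ W_k)[:, sl], (X @ W_v)[:, sl]))
out = np.concatenate(heads, axis=-1)                # (n, d_model)
```
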
    • power of transformers
      • positional embeddings
        • impart a notion of ordering; without them self-attention is order-agnostic, since every token attends to every other token symmetrically
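
One common way to impart that ordering is the fixed sinusoidal scheme from "Attention Is All You Need"; a NumPy sketch with toy sizes (learned position embeddings are an equally common alternative):

```python
import numpy as np

def sinusoidal_positional_embeddings(seq_len, d_model):
    """Fixed sin/cos position codes, one d_model-dimensional vector per position."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

# Added to the token embeddings so the otherwise order-blind attention sees position.
X = np.zeros((5, 16)) + sinusoidal_positional_embeddings(5, 16)
```
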
      • nonlinearities
        • implemented as a simple feed-forward network
        • allows for complex mappings between inputs and outputs
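
A NumPy sketch of that position-wise feed-forward network: two linear maps with a ReLU in between, applied independently at every position (toy dimensions, no claim about the lecture's exact sizes):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Applied to each position independently; this is the nonlinearity that
    lets the block learn complex input-output mappings."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                               # toy sizes
W1, b1 = rng.normal(scale=0.1, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.1, size=(d_ff, d_model)), np.zeros(d_model)
H = position_wise_ffn(rng.normal(size=(5, d_model)), W1, b1, W2, b2)   # (5, 16)
```
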
      • masking
        • parallelise operations while not looking at the future
        • keep information about the future from leaking into the past
        • used in decoder blocks
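
A NumPy sketch of causal masking: future positions are set to -inf before the softmax, so their weights come out exactly zero while every row is still computed in one parallel batch.

```python
import numpy as np

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))     # raw query-key scores

# Causal mask: position i may only attend to positions <= i, so nothing from
# the future leaks into the past, yet all rows are computed at once.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle is exactly 0: no attention to the future
```
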
    • encoder-decoder architecture
      • encoder block, two sub-layers (each followed by layer norm)
        • self-attention layer (captures linear relationships)
          • layer norm
        • feedforward layer (captures nonlinear relationships)
          • layer norm
      • decoder block, three sub-layers (each followed by layer norm)
        • masked self-attention layer (captures linear relationships)
          • masking: cannot look into the future
          • layer norm
        • multi-head attention layer over the encoder output (cross-attention)
          • layer norm
        • feedforward layer (captures nonlinear relationships)
          • layer norm
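
To tie the two blocks together, a single-head NumPy sketch of the wiring above (toy sizes, no biases, post-layer-norm residual connections; a minimal sketch of the standard architecture, not the lecture's exact code):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attend(Q, K, V, causal=False):
    """Single-head scaled dot-product attention; the causal flag hides the future."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        scores = np.where(np.triu(np.ones_like(scores, dtype=bool), k=1), -np.inf, scores)
    return softmax(scores) @ V

def encoder_block(X, P):
    """Self-attention sub-layer, then feed-forward sub-layer,
    each wrapped in a residual connection and layer norm."""
    X = layer_norm(X + attend(X @ P["Wq"], X @ P["Wk"], X @ P["Wv"]))
    return layer_norm(X + np.maximum(0, X @ P["W1"]) @ P["W2"])

def decoder_block(Y, enc_out, P):
    """Masked self-attention over the target, cross-attention over the encoder
    output, then feed-forward; each followed by residual + layer norm."""
    Y = layer_norm(Y + attend(Y @ P["Wq1"], Y @ P["Wk1"], Y @ P["Wv1"], causal=True))
    Y = layer_norm(Y + attend(Y @ P["Wq2"], enc_out @ P["Wk2"], enc_out @ P["Wv2"]))
    return layer_norm(Y + np.maximum(0, Y @ P["W1"]) @ P["W2"])

rng = np.random.default_rng(0)
d, d_ff = 16, 64
def w(*shape): return rng.normal(scale=0.1, size=shape)
enc_P = dict(Wq=w(d, d), Wk=w(d, d), Wv=w(d, d), W1=w(d, d_ff), W2=w(d_ff, d))
dec_P = dict(Wq1=w(d, d), Wk1=w(d, d), Wv1=w(d, d),
             Wq2=w(d, d), Wk2=w(d, d), Wv2=w(d, d), W1=w(d, d_ff), W2=w(d_ff, d))

src = rng.normal(size=(6, d))          # source token embeddings (positions omitted)
tgt = rng.normal(size=(4, d))          # target tokens generated so far
enc_out = encoder_block(src, enc_P)    # in practice several blocks are stacked
out = decoder_block(tgt, enc_out, dec_P)
print(out.shape)                       # (4, 16)
```
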
    • advantages
      • constant path length between any two positions in a sequence
        • every token is talking to every other token
      • good parallelization (no sequential computation within a layer)
    • disadvantages
      • self-attention is quadratic in sequence length (time and memory), which causes scaling issues for long sequences (see the sketch below)
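
A quick NumPy illustration of that quadratic cost: the attention score matrix alone is n x n, so doubling the sequence length quadruples its size.

```python
import numpy as np

# Time and memory for self-attention grow quadratically with sequence length,
# since every position is scored against every other position.
d = 64
for n in (128, 512, 2048):
    Q = K = np.zeros((n, d), dtype=np.float32)
    scores = Q @ K.T                                   # shape (n, n)
    print(n, scores.shape, f"{scores.nbytes / 1e6:.1f} MB")
```
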