Transformers United - DL Models that have revolutionized NLP, CV, RL
09-03-2023
- url: https://www.youtube.com/watch?v=P127jhj-8-Y&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=3
- early attention mechanisms
- RNNs, Seq2Seq, LSTMs, GRUs
- good for encoding history
- bad for long sequences and long-range context (information must pass through every intermediate step)
- current attention mechanisms
- soft attention (contrasted with hard attention in the code sketch after this list)
- learn attention weights in [0, 1] - a closed interval, values between 0 and 1 inclusive
- example image: a cat, slightly blurred (every location contributes a little)
- differentiable, but expensive compute
- hard attention
- learn attention weights in {0, 1} - a discrete set, each location is either attended or ignored
- example image: a cat with the edges cut off (only one region is kept)
- less compute, but non-differentiable
- local attention
- a blend of the two: hard attention picks a local window, soft attention is applied within it
- global attention: soft attention applied over the entire input (every position can be attended)
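A minimal sketch (not from the lecture) contrasting the two on toy data: soft attention takes a softmax-weighted average over every location, hard attention keeps exactly one. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))   # 5 locations (e.g. image regions), 8-dim features
query = rng.normal(size=(8,))        # what we are looking for

scores = features @ query            # similarity of the query to each location

# Soft attention: weights in [0, 1] via a softmax; differentiable, but every
# location is touched, hence the "blurred" picture and the extra compute.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft_context = weights @ features    # weighted average over all locations

# Hard attention: weights in {0, 1}; keep a single location. Cheaper, but the
# argmax/sampling step is not differentiable.
hard_weights = np.zeros_like(weights)
hard_weights[np.argmax(scores)] = 1.0
hard_context = hard_weights @ features

print(soft_context.shape, hard_context.shape)  # (8,) (8,)
```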
- transformers
- self-attention
- Attention is All You Need
- given a query, compare it against a set of keys and return a combination of the corresponding values, weighted by how similar each key is to the query (sketched in code below)
- multi-head attention: self-attention is performed multiple times in parallel, providing multiple representation subspaces for each layer
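A minimal sketch of the query-key-value idea above, assuming the scaled dot-product formulation from "Attention Is All You Need"; the random projection matrices, the sizes, and the two-head loop are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Each token builds a query, compares it against every token's key, and
    # returns a softmax-weighted combination of the corresponding values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len) similarities
    return softmax(scores) @ v                # (seq_len, d_head)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 16, 8
x = rng.normal(size=(seq_len, d_model))       # toy token representations

# Multiple representation subspaces: run the same attention with different
# projections ("heads", random here) and concatenate the results.
heads = []
for _ in range(2):
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(self_attention(x, w_q, w_k, w_v))
print(np.concatenate(heads, axis=-1).shape)   # (4, 16)
```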
- power of transformers
- positional embeddings
- impart a notion of ordering - needed because self-attention itself treats the input as an unordered set: every token attends to every other token (a sinusoidal version, together with the causal mask, is sketched in code after this list)
- nonlinearities
- implemented as a simple feed-forward network
- allows for complex mappings between inputs and outputs (attention on its own only produces weighted averages)
- masking
- lets the decoder parallelise operations during training while not looking at the future
- keep information about the future from leaking into the past
- used in decoder blocks
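Two of these ingredients as a code sketch: the sinusoidal positional encoding from "Attention Is All You Need" (one common choice of positional embedding) and a causal mask; sizes are illustrative.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Fixed embeddings that impart ordering: each position gets a unique
    # pattern of sines and cosines, added to the token embeddings.
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_mask(seq_len):
    # -inf above the diagonal: position i may not attend to positions j > i,
    # so information about the future cannot leak into the past.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

print(sinusoidal_positions(6, 8).shape)  # (6, 8)
print(causal_mask(4))                    # 0 on/below the diagonal, -inf above
```

In the decoder's self-attention the mask is added to the (seq_len, seq_len) score matrix before the softmax, which is what allows training over all positions in parallel without looking ahead.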
- encoder-decoder architecture (both blocks are sketched in code after this list)
- encoder block, three components
- self-attention layer (captures linear relationships)
- layer norm
- feedforward layer (captures nonlinear relationships)
- layer norm
- decoder block
- self-attention layer (captures linear relationships)
- layer norm
- feedforward layer (captures nonlinear relationships)
- layer norm
- multi-head cross-attention layer (attends over the encoder outputs)
- masking (the decoder's self-attention cannot look into the future)
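A minimal sketch of one encoder block and one decoder block with the components listed above, written with PyTorch (the lecture does not prescribe a framework) and with the standard residual connections around each sub-layer, which the notes do not list explicitly; sizes are illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])  # self-attention + layer norm
        return self.norm2(x + self.ff(x))          # feed-forward + layer norm

class DecoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out):
        t = x.size(1)
        future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # hide the future
        x = self.norm1(x + self.self_attn(x, x, x, attn_mask=future)[0])     # masked self-attention
        x = self.norm2(x + self.cross_attn(x, enc_out, enc_out)[0])          # attend to encoder output
        return self.norm3(x + self.ff(x))                                    # feed-forward + layer norm

src = torch.randn(2, 5, 64)   # (batch, source length, d_model)
tgt = torch.randn(2, 7, 64)   # (batch, target length, d_model)
enc_out = EncoderBlock()(src)
print(DecoderBlock()(tgt, enc_out).shape)  # torch.Size([2, 7, 64])
```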
- advantages
- constant path length between any two positions in a sequence (in an RNN it grows with their distance)
- every token is talking to every other token
- good parallelization (no sequential computation within a layer)
- disadvantages
- self-attention is quadratic in sequence length, so scaling to long sequences is expensive (see the sketch below)
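A small back-of-the-envelope sketch of the quadratic cost: the attention score matrix compares every token with every other token, so doubling the sequence length quadruples its size. The numbers are illustrative (one head, one layer, float32).

```python
import numpy as np

for seq_len in (512, 1024, 2048):
    scores = np.zeros((seq_len, seq_len), dtype=np.float32)  # one attention score matrix
    print(f"seq_len={seq_len:5d}  entries={scores.size:>10,}  "
          f"memory={scores.nbytes / 2**20:6.1f} MiB")
```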