Embeddings
27-08-2024
General architecture
- https://www.lesswrong.com/posts/pHPmMGEMYefk9jLeh/llm-basics-embedding-spaces-transformer-token-vectors-are
- tokeniser (turns text into tokens, and tokens back into text)
- transformer (turn N tokens into a prediction for the (N+1)th token)
- input embeddings
- turn tokens into vectors and add information about the position of each token
- encoder layers
- self-attention: each token's representation uses information from all input tokens
- decoder layers
- cross-attention: uses information from all encoder states
- masked self-attention: uses information from the current and preceding decoder states
- output embeddings
- turns prediction vectors into token probabilities (see the toy sketch after this list)
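A toy sketch of this flow (not a real model): the tokeniser is a hand-built vocabulary, the transformer stack is reduced to a single averaging step, and all weights are random, so only the shapes and stages are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy tokeniser: a one-to-one map between token strings and token IDs
vocab = ["the", "cat", "sat", "on", "mat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

d_model = 8
embedding_matrix = rng.normal(size=(len(vocab), d_model))  # input embeddings
positional = rng.normal(size=(32, d_model))                # positional information
w_out = rng.normal(size=(d_model, len(vocab)))             # output embeddings / projection

def predict_next(tokens):
    ids = [token_to_id[t] for t in tokens]
    x = embedding_matrix[ids] + positional[: len(ids)]     # tokens -> vectors (+ position)
    h = x.mean(axis=0)                                      # stand-in for the transformer layers
    logits = h @ w_out                                      # prediction vector -> vocabulary scores
    probs = np.exp(logits) / np.exp(logits).sum()           # token probabilities
    return id_to_token[int(probs.argmax())], probs

next_token, probs = predict_next(["the", "cat", "sat"])    # predict the 4th token from the first 3
print(next_token, probs.round(3))
```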
Tokens
- convert unstructured text into a structured format, and vice versa
- tokenization constructs a one-to-one map of token strings to token IDs
- i.e. a sequence of characters mapped to a number
- the exact tokenization process varies between encoding models
Types of tokenisers:
- https://huggingface.co/docs/transformers/tokenizer_summary
- byte pair encoding (BPE)
- it's reversible and lossless
- it works on arbitrary text
- it compresses the text; the token sequence is shorter than the bytes corresponding to the original text (~ each token corresponds to about 4 bytes)
- it attempts to let the model see common subwords (see the toy merge sketch below)
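A toy sketch of the BPE training idea (not any particular tokeniser's implementation): count adjacent symbol pairs across a small corpus and repeatedly merge the most frequent pair, so common subwords become single tokens.

```python
from collections import Counter

def bpe_merges(words, num_merges=5):
    # start from individual characters and repeatedly merge the most frequent adjacent pair
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])   # merge the pair into one symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "newer", "newest"])
print(merges)   # frequent pairs such as ('w', 'e') are merged first
print(corpus)   # words are now sequences of learned subword tokens
```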
Tools:
tiktoken
- https://github.com/openai/tiktoken (BPE tokeniser used by OpenAI models)
- five encodings (fetched from blob storage):
- r50k_base
- p50k_base
- p50k_edit
- cl100k_base
- gpt2
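Minimal tiktoken usage, assuming the package is installed; cl100k_base is the encoding used by the GPT-3.5/GPT-4 family.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Embeddings map tokens to vectors."
tokens = enc.encode(text)

print(tokens)                   # list of integer token IDs
print(len(text), len(tokens))   # the token sequence is shorter than the text
print(enc.decode(tokens))       # reversible and lossless: round-trips to the original string
```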
Vectors and embeddings
- a vector is a one-dimensional array of numbers (having both magnitude and direction)
- an embedding is a vector representation of a token
- vector embeddings encode the semantic context of a token and its relations to other tokens
- an embedding model places each token into a high-dimensional vector space; the learned token vectors are stored as rows of an embedding matrix (see the sketch below)
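A sketch of what the lookup looks like, with random values standing in for a trained embedding matrix: the matrix has one row per token in the vocabulary, and a token's embedding is the row selected by its token ID.

```python
import numpy as np

vocab_size, dims = 50_000, 384          # e.g. a 384-dimensional embedding space
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, dims))

token_id = 1234
embedding = embedding_matrix[token_id]  # the vector representation of this token
print(embedding.shape)                  # (384,)
```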
Vector databases
- embedding model
- vector embeddings
- indexing (map vectors to data structures that support fast similarity search)
- random projection, product quantisation (PQ), locality-sensitive hashing (LSH), HNSW
- querying
- post-processing (see the end-to-end sketch after this list)
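A brute-force sketch of those stages; the embedding function here is a made-up bag-of-characters hash standing in for a real embedding model, and the "index" is just a dense matrix rather than an ANN structure.

```python
import numpy as np

def embed(text, dims=64):
    # stand-in embedding function: bag of characters hashed into a fixed-size unit vector
    vec = np.zeros(dims)
    for ch in text.lower():
        vec[ord(ch) % dims] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

documents = ["the cat sat on the mat", "dogs chase cats", "stock prices fell today"]

# indexing: store the vector embeddings alongside the raw documents
index = np.stack([embed(d) for d in documents])

# querying: embed the query and score every document
query = embed("a cat on a mat")
scores = index @ query                   # unit vectors, so this is cosine similarity

# post-processing: sort, keep the top k, return the raw documents
top_k = np.argsort(scores)[::-1][:2]
for i in top_k:
    print(round(float(scores[i]), 3), documents[i])
```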
Embedding function:
- CRUD operations done with data in its raw form
- the vector database needs to know how to convert your data into embeddings
There are standalone vector indices like FAISS that improve the search and retrieval of vector embeddings.
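A minimal FAISS sketch, assuming the faiss-cpu package and random vectors in place of real embeddings; IndexFlatL2 is exact, IndexHNSWFlat is an approximate nearest neighbour index (HNSW, as listed under indexing above).

```python
import faiss
import numpy as np

d = 384                                          # embedding dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, d)).astype("float32")

flat = faiss.IndexFlatL2(d)                      # exact (brute-force) index
flat.add(embeddings)

hnsw = faiss.IndexHNSWFlat(d, 32)                # approximate nearest neighbour index
hnsw.add(embeddings)

query = rng.normal(size=(1, d)).astype("float32")
distances, ids = flat.search(query, 5)           # 5 nearest neighbours (exact)
_, approx_ids = hnsw.search(query, 5)            # 5 nearest neighbours (approximate)
print(ids[0], approx_ids[0])
```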
Filtering:
- pre-filtering (apply the metadata filter before the vector search)
- post-filtering (apply the metadata filter to the search results; see the sketch below)
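A sketch of the difference, assuming each stored vector carries a metadata dict: pre-filtering restricts the candidate set before the similarity search, post-filtering searches everything and drops non-matching results afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))
metadata = [{"lang": "en"}, {"lang": "de"}, {"lang": "en"}, {"lang": "fr"}, {"lang": "en"}]
query = rng.normal(size=8)

def top_k(candidates, k=2):
    scores = {i: float(vectors[i] @ query) for i in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# pre-filtering: apply the metadata condition first, then search the survivors
pre = top_k([i for i, m in enumerate(metadata) if m["lang"] == "en"])

# post-filtering: search everything, then drop results that fail the condition
post = [i for i in top_k(range(len(vectors)), k=len(vectors)) if metadata[i]["lang"] == "en"][:2]

print(pre, post)
```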
Indexing:
- indexing algorithms to group similar embeddings together
- approximate nearest neighbor search
Metadata:
- helps give context and makes query results more precise
Similarity metrics
Similarity metrics:
- cosine similarity
- direction only
- dot product
- magnitude and direction
- Euclidean distance
- magnitude and direction (see the numpy comparison after this list)
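The three metrics side by side in numpy; the second vector points in the same direction as the first but has twice the magnitude, which only the dot product and Euclidean distance notice.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])                            # same direction, twice the magnitude

dot = a @ b                                              # magnitude and direction -> 28.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # direction only -> 1.0
euclidean = np.linalg.norm(a - b)                        # magnitude and direction -> ~3.74

print(dot, cosine, euclidean)
```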
Cosine Similarity
- compute cosine similarity by taking the cosine of the angle between two vectors
- this is a normalised form of the dot product (which would otherwise be difficult to interpret in absolute terms)
- disregards the magnitude of both vectors
- vectors with large or small values will have the same cosine similarity as long as they point in the same direction
- not suitable when you have data where the magnitude of the vectors is important and should be taken into account (for example, image embeddings based on pixel intensities)
Multiplying vectors:
- we can use the cross product (which returns a vector) or the dot product (which returns a scalar)
- to see why, introduce the cosine function
- it makes sense to multiply the two lengths together, but only to the extent that the vectors point in the same direction
- we account for the difference in direction by multiplying by COS(theta), giving a · b = |a||b| COS(theta)
Cosine similarity:
- the cosine function has a period of 360 degrees and a range of -1 to 1
- COS(0) = 1
- COS(90) = 0
- COS(180) = -1
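A quick check of those values: unit vectors at 0, 90 and 180 degrees to a reference vector give cosine similarities of 1, 0 and -1.

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the lengths, i.e. COS(theta)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = np.array([1.0, 0.0])
for degrees in (0, 90, 180):
    theta = np.radians(degrees)
    other = np.array([np.cos(theta), np.sin(theta)])
    print(degrees, round(cosine_similarity(reference, other), 3))
# 0 -> 1.0, 90 -> 0.0, 180 -> -1.0
```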
Models
HuggingFace hub:
- local cache
C:\Users\lmiloszewski\.cache\huggingface\hub\models--sentence-transformers--all-MiniLM-L6-v2
Models:
- model = all-MiniLM-L6-v2 (usage sketch at the end of this section)
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- pre-trained model = https://huggingface.co/nreimers/MiniLM-L6-H384-uncased
- max sequence length = 256
- dimensions = 384
- score functions = dot-product, cosine-similarity, euclidean distance
- model = text-embedding-3-small (usage sketch at the end of this section)
- https://platform.openai.com/docs/guides/embeddings/embedding-models
- max sequence length = 8191
- dimensions = 1536
- score functions = cosine-similarity
- other
- native support for shortening embeddings
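Minimal usage of all-MiniLM-L6-v2 via the sentence-transformers package (a sketch; inputs longer than the 256-token limit are truncated):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["The cat sat on the mat.", "A feline rested on a rug."]
embeddings = model.encode(sentences, normalize_embeddings=True)   # shape (2, 384)

print(embeddings.shape)
print(float(embeddings[0] @ embeddings[1]))   # cosine similarity (vectors are normalised)
```

And minimal usage of text-embedding-3-small through the OpenAI Python SDK, assuming OPENAI_API_KEY is set; the dimensions parameter is the native embedding-shortening support mentioned above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The cat sat on the mat.", "A feline rested on a rug."],
    dimensions=256,                      # optional: shorten from the default 1536 dimensions
)

vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))     # 2 256
```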