Embeddings
27-08-2024
General architecture
- https://www.lesswrong.com/posts/pHPmMGEMYefk9jLeh/llm-basics-embedding-spaces-transformer-token-vectors-are
- tokeniser (turns text into tokens, and tokens back into text)
- transformer (turn N tokens into a prediction for the (N+1)th token)
- input embeddings
- turn tokens into vectors and add information about the position of each token
- encoder layers
- self-attention: each token's representation uses information from all input tokens
- decoder layers
- cross-attention: uses information from all encoder states
- masked self-attention: uses information from the current and preceding decoder states
- output embeddings
- turns prediction vectors into token probabilities (see the toy sketch after this list)
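A toy sketch of this flow (not a real model): the tokeniser is a hand-built vocabulary, the transformer stack is reduced to a single averaging step, and all weights are random, so only the shapes and stages are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy tokeniser: a one-to-one map between token strings and token IDs
vocab = ["the", "cat", "sat", "on", "mat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

d_model = 8
embedding_matrix = rng.normal(size=(len(vocab), d_model))  # input embeddings
positional = rng.normal(size=(32, d_model))                # positional information
w_out = rng.normal(size=(d_model, len(vocab)))             # output embeddings / projection

def predict_next(tokens):
    ids = [token_to_id[t] for t in tokens]
    x = embedding_matrix[ids] + positional[: len(ids)]     # tokens -> vectors (+ position)
    h = x.mean(axis=0)                                      # stand-in for the transformer layers
    logits = h @ w_out                                      # prediction vector -> vocabulary scores
    probs = np.exp(logits) / np.exp(logits).sum()           # token probabilities
    return id_to_token[int(probs.argmax())], probs

next_token, probs = predict_next(["the", "cat", "sat"])    # predict the 4th token from the first 3
print(next_token, probs.round(3))
```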
Tokens
- convert unstructured text into a structured format, and vice versa
- tokenization constructs a one-to-one map of token strings to token IDs
- i.e. a sequence of characters mapped to a number
- the exact tokenization process varies between encoding models
Types of tokenisers:
- https://huggingface.co/docs/transformers/tokenizer_summary
- byte pair encoding (BPE)
- it's reversible and lossless
- it works on arbitrary text
- it compresses the text; the token sequence is shorter than the bytes corresponding to the original text (~ each token corresponds to about 4 bytes)
- it attempts to let the model see common subwords (see the toy merge sketch below)
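A toy sketch of the BPE training idea (not any particular tokeniser's implementation): count adjacent symbol pairs across a small corpus and repeatedly merge the most frequent pair, so common subwords become single tokens.

```python
from collections import Counter

def bpe_merges(words, num_merges=5):
    # start from individual characters and repeatedly merge the most frequent adjacent pair
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])   # merge the pair into one symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "newer", "newest"])
print(merges)   # frequent pairs such as ('w', 'e') are merged first
print(corpus)   # words are now sequences of learned subword tokens
```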
Tools:
tiktoken
- https://github.com/openai/tiktoken (BPE tokeniser used by OpenAI models)
- five encodings (fetched from blob storage):
- r50k_base
- p50k_base
- p50k_edit
- cl100k_base
- gpt2
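Minimal tiktoken usage, assuming the package is installed; cl100k_base is the encoding used by the GPT-3.5/GPT-4 family.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Embeddings map tokens to vectors."
tokens = enc.encode(text)

print(tokens)                   # list of integer token IDs
print(len(text), len(tokens))   # the token sequence is shorter than the text
print(enc.decode(tokens))       # reversible and lossless: round-trips to the original string
```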
Vectors and embeddings
- a vector is a one-dimensional array of numbers (having both magnitude and direction)
- an embedding is a vector representation of a token
- vector embeddings encode the semantic context of a token and its relations to other tokens
- an embedding model places each token into a high-dimensional vector space; the learned token vectors are stored as rows of an embedding matrix (see the sketch below)
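A sketch of what the lookup looks like, with random values standing in for a trained embedding matrix: the matrix has one row per token in the vocabulary, and a token's embedding is the row selected by its token ID.

```python
import numpy as np

vocab_size, dims = 50_000, 384          # e.g. a 384-dimensional embedding space
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, dims))

token_id = 1234
embedding = embedding_matrix[token_id]  # the vector representation of this token
print(embedding.shape)                  # (384,)
```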
Vector databases
- embedding model
- vector embeddings
- indexing (map vectors to data structures that support fast similarity search)
- random projection, product quantisation (PQ), locality-sensitive hashing (LSH), HNSW
- querying
- post-processing (see the end-to-end sketch after this list)
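A brute-force sketch of those stages; the embedding function here is a made-up bag-of-characters hash standing in for a real embedding model, and the "index" is just a dense matrix rather than an ANN structure.

```python
import numpy as np

def embed(text, dims=64):
    # stand-in embedding function: bag of characters hashed into a fixed-size unit vector
    vec = np.zeros(dims)
    for ch in text.lower():
        vec[ord(ch) % dims] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

documents = ["the cat sat on the mat", "dogs chase cats", "stock prices fell today"]

# indexing: store the vector embeddings alongside the raw documents
index = np.stack([embed(d) for d in documents])

# querying: embed the query and score every document
query = embed("a cat on a mat")
scores = index @ query                   # unit vectors, so this is cosine similarity

# post-processing: sort, keep the top k, return the raw documents
top_k = np.argsort(scores)[::-1][:2]
for i in top_k:
    print(round(float(scores[i]), 3), documents[i])
```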
Embedding function:
- CRUD operations done with data in its raw form
- the vector database needs to know how to convert your data into embeddings
There are standalone vector indices like FAISS that improve the search and retrieval of vector embeddings.
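A minimal FAISS sketch, assuming the faiss-cpu package and random vectors in place of real embeddings; IndexFlatL2 is exact, IndexHNSWFlat is an approximate nearest neighbour index (HNSW, as listed under indexing above).

```python
import faiss
import numpy as np

d = 384                                          # embedding dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, d)).astype("float32")

flat = faiss.IndexFlatL2(d)                      # exact (brute-force) index
flat.add(embeddings)

hnsw = faiss.IndexHNSWFlat(d, 32)                # approximate nearest neighbour index
hnsw.add(embeddings)

query = rng.normal(size=(1, d)).astype("float32")
distances, ids = flat.search(query, 5)           # 5 nearest neighbours (exact)
_, approx_ids = hnsw.search(query, 5)            # 5 nearest neighbours (approximate)
print(ids[0], approx_ids[0])
```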
Filtering:
- pre-filtering (apply the metadata filter before the vector search)
- post-filtering (apply the metadata filter to the search results; see the sketch below)
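A sketch of the difference, assuming each stored vector carries a metadata dict: pre-filtering restricts the candidate set before the similarity search, post-filtering searches everything and drops non-matching results afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))
metadata = [{"lang": "en"}, {"lang": "de"}, {"lang": "en"}, {"lang": "fr"}, {"lang": "en"}]
query = rng.normal(size=8)

def top_k(candidates, k=2):
    scores = {i: float(vectors[i] @ query) for i in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# pre-filtering: apply the metadata condition first, then search the survivors
pre = top_k([i for i, m in enumerate(metadata) if m["lang"] == "en"])

# post-filtering: search everything, then drop results that fail the condition
post = [i for i in top_k(range(len(vectors)), k=len(vectors)) if metadata[i]["lang"] == "en"][:2]

print(pre, post)
```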
Indexing:
- indexing algorithms to group similar embeddings together
- approximate nearest neighbor search
Metadata:
- helps give context and makes query results more precise
Similarity metrics
Similarity metrics:
- cosine similarity
- direction only
- dot product
- magnitude and direction
- Euclidean distance
- magnitude and direction (see the numpy comparison after this list)
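The three metrics side by side in numpy; the second vector points in the same direction as the first but has twice the magnitude, which only the dot product and Euclidean distance notice.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])                            # same direction, twice the magnitude

dot = a @ b                                              # magnitude and direction -> 28.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # direction only -> 1.0
euclidean = np.linalg.norm(a - b)                        # magnitude and direction -> ~3.74

print(dot, cosine, euclidean)
```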
Cosine Similarity
- compute cosine similarity by taking the cosine of the angle between two vectors
- this is a normalised form of the dot product (which would otherwise be difficult to interpret in absolute terms)
- disregards the magnitude of both vectors
- vectors with large or small values will have the same cosine similarity as long as they point in the same direction
- not suitable when you have data where the magnitude of the vectors is important and should be taken into account (for example, image embeddings based on pixel intensities)
Multiplying vectors:
- we can use the cross product (which returns a vector) or the dot product (which returns a scalar)
- to see why, introduce the cosine function
- it makes sense to multiply the two lengths together, but only to the extent that the vectors point in the same direction
- we account for the difference in direction by multiplying by COS(theta), giving a · b = |a||b| COS(theta)
Cosine similarity:
- the cosine function has a period of 360 degrees and a range of -1 to 1
- COS(0) = 1
- COS(90) = 0
- COS(180) = -1
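A quick check of those values: unit vectors at 0, 90 and 180 degrees to a reference vector give cosine similarities of 1, 0 and -1.

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the lengths, i.e. COS(theta)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = np.array([1.0, 0.0])
for degrees in (0, 90, 180):
    theta = np.radians(degrees)
    other = np.array([np.cos(theta), np.sin(theta)])
    print(degrees, round(cosine_similarity(reference, other), 3))
# 0 -> 1.0, 90 -> 0.0, 180 -> -1.0
```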
Models
HuggingFace hub:
- local cache
C:\Users\lmiloszewski\.cache\huggingface\hub\models--sentence-transformers--all-MiniLM-L6-v2
Models:
- model = all-MiniLM-L6-v2 (usage sketch at the end of this section)
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- pre-trained model = https://huggingface.co/nreimers/MiniLM-L6-H384-uncased
- max sequence length = 256
- dimensions = 384
- score functions = dot-product, cosine-similarity, euclidean distance
- model = text-embedding-3-small (usage sketch at the end of this section)
- https://platform.openai.com/docs/guides/embeddings/embedding-models
- max sequence length = 8191
- dimensions = 1536
- score functions = cosine-similarity
- other
- native support for shortening embeddings
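Minimal usage of all-MiniLM-L6-v2 via the sentence-transformers package (a sketch; inputs longer than the 256-token limit are truncated):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["The cat sat on the mat.", "A feline rested on a rug."]
embeddings = model.encode(sentences, normalize_embeddings=True)   # shape (2, 384)

print(embeddings.shape)
print(float(embeddings[0] @ embeddings[1]))   # cosine similarity (vectors are normalised)
```

And minimal usage of text-embedding-3-small through the OpenAI Python SDK, assuming OPENAI_API_KEY is set; the dimensions parameter is the native embedding-shortening support mentioned above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The cat sat on the mat.", "A feline rested on a rug."],
    dimensions=256,                      # optional: shorten from the default 1536 dimensions
)

vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))     # 2 256
```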