Embeddings

27-08-2024

General architecture

Tokens

  • convert unstructured text into a structured format, and vice versa
  • tokenization constructs a one-to-one map of token strings to token IDs
    • i.e. a sequence of characters is mapped to a number
  • the exact tokenization process varies between encoding models

Types of tokenisers:

  • https://huggingface.co/docs/transformers/tokenizer_summary
  • byte pair encoding (BPE)
    • it's reversible and lossless
    • it works on arbitrary text
    • it compresses the text; the token sequence is shorter than the bytes of the original text (each token corresponds to roughly 4 bytes)
    • it attempts to let the model see common subwords (see the round-trip sketch below)
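A minimal BPE round-trip sketch using the tiktoken library's cl100k_base encoding (the sample text is made up):

    import tiktoken

    # load a BPE encoding; cl100k_base is used by recent OpenAI models
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Tokenization converts unstructured text into token IDs"
    ids = enc.encode(text)          # a list of integer token IDs

    # reversible and lossless: decoding recovers the original string exactly
    assert enc.decode(ids) == text

    # compression: fewer tokens than UTF-8 bytes in the original text
    print(len(ids), len(text.encode("utf-8")))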

Tools:

Vectors and embeddings

  • a vector is a one-dimensional array of numbers (having both magnitude and direction)
  • an embedding is a vector representation of a token
    • vector embeddings encode the semantic context of a token and its relation to other tokens
    • an embedding model maps each token into a high-dimensional vector space; the lookup table of per-token vectors is known as the embedding matrix (sketched below)
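A sketch of the embedding-matrix lookup; the matrix here is random and the sizes are illustrative (a trained model would have learned rows):

    import numpy as np

    vocab_size, dim = 50_000, 384          # illustrative sizes
    rng = np.random.default_rng(0)

    # one row per token ID; learned during training, random here
    embedding_matrix = rng.standard_normal((vocab_size, dim)).astype(np.float32)

    token_ids = [15, 42, 7]                      # output of a tokenizer
    token_vectors = embedding_matrix[token_ids]  # lookup is plain row indexing
    print(token_vectors.shape)                   # (3, 384)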

Vector databases

  • embedding model
  • vector embeddings
  • indexing (map vectors to data structures that support fast search)
    • random projection, product quantization (PQ), locality-sensitive hashing (LSH), hierarchical navigable small world (HNSW)
  • querying
  • post-processing

Embedding function:

  • CRUD operations are done with data in its raw form
  • the vector database needs to know how to convert your data to embeddings (see the sketch below)
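A minimal sketch with the Chroma client (the collection name, documents, and metadata are made up); Chroma applies the collection's embedding function, by default a local all-MiniLM-L6-v2 model, to raw text on both add and query:

    import chromadb

    client = chromadb.Client()                      # in-memory instance
    collection = client.create_collection("notes")  # hypothetical name

    # add raw text; the embedding function converts it to vectors internally
    collection.add(
        ids=["doc1", "doc2"],
        documents=["Vectors encode semantic context",
                   "BPE compresses text into tokens"],
        metadatas=[{"topic": "embeddings"}, {"topic": "tokenization"}],
    )

    # the query text is embedded with the same function, then matched by similarity
    results = collection.query(query_texts=["what do embeddings capture?"], n_results=1)
    print(results["documents"])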

There exist standalone vector indices like Faiss, which improve the search and retrieval of vector embeddings.
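A minimal Faiss sketch with random stand-in embeddings and an exact (flat) L2 index:

    import faiss
    import numpy as np

    dim = 384
    rng = np.random.default_rng(0)
    vectors = rng.random((10_000, dim)).astype("float32")  # stand-in embeddings

    index = faiss.IndexFlatL2(dim)   # exact L2 search, no training required
    index.add(vectors)

    distances, ids = index.search(vectors[:1], 5)
    print(ids[0])                    # indices of the 5 nearest stored vectors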

Filtering:

  • pre-filtering: narrow the candidate set (typically via metadata) before the vector search runs
  • post-filtering: run the vector search first, then discard results that fail the filter (both sketched below)
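A sketch of both, reusing the `collection` from the Chroma example above (the `topic` field is made up); Chroma's `where` clause filters by metadata before the search:

    # pre-filtering: the where clause restricts candidates by metadata
    pre = collection.query(
        query_texts=["subword compression"],
        n_results=2,
        where={"topic": "tokenization"},
    )

    # post-filtering: search first, then drop non-matching results in Python
    raw = collection.query(query_texts=["subword compression"], n_results=2)
    post = [(doc, meta)
            for doc, meta in zip(raw["documents"][0], raw["metadatas"][0])
            if meta["topic"] == "tokenization"]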

Indexing:

  • indexing algorithms group similar embeddings together so a query only visits a small candidate set
  • search is typically approximate nearest neighbour (ANN) rather than exhaustive (HNSW sketch below)
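A minimal ANN sketch with the hnswlib library and random stand-in embeddings:

    import hnswlib
    import numpy as np

    dim = 384
    rng = np.random.default_rng(0)
    vectors = rng.random((10_000, dim)).astype("float32")

    # HNSW graph index: similar vectors cluster into neighbourhoods,
    # so a query walks the graph instead of scanning every vector
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=10_000, ef_construction=200, M=16)
    index.add_items(vectors, np.arange(10_000))

    index.set_ef(50)  # search-time accuracy/speed trade-off
    labels, distances = index.knn_query(vectors[:1], k=5)
    print(labels[0])  # approximate neighbours, not guaranteed exact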

Metadata:

  • metadata (eg source, author, timestamp) gives context and makes query results more precise; filters like the where clause above operate on it

Similarity metrics

Similarity metrics (all three sketched below):

  • cosine similarity
    • direction only
  • dot product
    • magnitude and direction
  • Euclidean distance
    • magnitude and direction
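All three metrics in numpy (the vectors are made up; note b is a scaled copy of a):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

    dot = np.dot(a, b)                                      # 28.0
    cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 (direction only)
    euclidean = np.linalg.norm(a - b)                       # ~3.74

    print(dot, cosine, euclidean)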

Cosine similarity

  • compute cosine similarity by taking the cosine of the angle between two vectors
  • this is a normalised form of the dot product (which would otherwise be difficult to interpret in absolute terms)
    • disregards the magnitude of both vectors
    • vectors with large or small values will have the same cosine similarity as long as they point in the same direction
  • not suitable when you have data where the magnitude of the vectors is important and should be taken into account (for example, image embeddings based on pixel intensities)
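A quick check of the magnitude claim (vectors made up): scaling a vector leaves its cosine similarity unchanged, while the dot product scales with it:

    import numpy as np

    def cos_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    v = np.array([1.0, 2.0, 3.0])
    w = 10 * v                             # same direction, 10x the magnitude

    print(cos_sim(v, v), cos_sim(v, w))    # ~1.0 and ~1.0
    print(np.dot(v, v), np.dot(v, w))      # 14.0 vs 140.0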

Multiplying vectors:

  • we can use the cross product (which returns a vector) or the dot product (which returns a scalar)
  • we introduce the cosine function
    • it makes sense to multiply their lengths together, but only to the extent that they point in the same direction
    • multiplying by COS(theta) projects one vector onto the direction of the other: a . b = |a| |b| COS(theta) (worked below)
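The same identity rearranged recovers the angle between two (made-up) vectors:

    import numpy as np

    a = np.array([1.0, 0.0])
    b = np.array([1.0, 1.0])

    # a . b = |a| |b| COS(theta), so theta = arccos(a . b / (|a| |b|))
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(np.degrees(np.arccos(cos_theta)))  # ~45.0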

Cosine function:

  • the cosine function has a period of 360 degrees and a range of -1 to 1
  • COS(0) = 1
  • COS(90) = 0
  • COS(180) = -1
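A one-line numpy check of those values (angles in degrees, converted to radians first):

    import numpy as np

    print(np.round(np.cos(np.deg2rad([0, 90, 180])), 6))  # [ 1.  0. -1.]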

Models

HuggingFace hub:

  • local cache
    • C:\Users\lmiloszewski\.cache\huggingface\hub\models--sentence-transformers--all-MiniLM-L6-v2
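A sketch of loading the cached model with the sentence-transformers library (the sample sentences are made up); the first call downloads into the hub cache above, later calls read from disk:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    embeddings = model.encode(["The cat sat on the mat",
                               "A feline rested on a rug"])
    print(embeddings.shape)                            # (2, 384)
    print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity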

Models: