Text Embeddings Deep Dive: How AI Understands Meaning
What Are Embeddings?
An embedding is a numerical representation of data (text, image, audio) as a dense vector in high-dimensional space. The key property: semantically similar items are geometrically close — their vectors point in similar directions.
Semantic Space Visualization (2D projection of 1536-D space)
"dog" • • "puppy" "cat" • • "kitten"
• "pet"
"car" • • "automobile" "bicycle" •
• "vehicle"
Similar concepts cluster together in embedding space.
Dissimilar concepts are far apart.
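To make "geometrically close" concrete, here is a minimal sketch of cosine similarity on hand-made 3-D vectors. The vectors and their values are invented for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy, hand-made vectors (not produced by any model).
toy_vectors = {
    "dog":   [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

dog_puppy = cosine(toy_vectors["dog"], toy_vectors["puppy"])
dog_car = cosine(toy_vectors["dog"], toy_vectors["car"])
print(f"dog vs puppy: {dog_puppy:.3f}")
print(f"dog vs car:   {dog_car:.3f}")
```

The "dog" and "puppy" vectors point in nearly the same direction, so their cosine is close to 1, while "dog" and "car" score much lower.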
How Embedding Models Work
Modern embedding models are typically transformer networks trained with contrastive learning. Given a pair of semantically related sentences, the model learns to pull their embeddings together; given unrelated sentences, it pushes them apart.
```python
import torch
import torch.nn.functional as F

# The training objective (simplified)
def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """
    anchor:    embedding of query sentence, shape [batch, dim]
    positive:  embedding of semantically similar sentence, shape [batch, dim]
    negatives: embeddings of unrelated sentences, shape [batch, num_negatives, dim]
    """
    # Cosine similarities
    sim_positive = F.cosine_similarity(anchor, positive, dim=-1)  # shape: [batch]
    sim_negatives = F.cosine_similarity(
        anchor.unsqueeze(1), negatives, dim=-1
    )  # shape: [batch, num_negatives]

    # InfoNCE loss: the positive sits in column 0, so the target class is 0
    logits = torch.cat([sim_positive.unsqueeze(1), sim_negatives], dim=1)
    logits = logits / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```
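Written out for a single anchor, the loss computed by contrastive_loss is the InfoNCE objective. The symbols are chosen here to mirror the function's arguments and are not from the original text: a is the anchor embedding, p the positive, n_1 ... n_K the negatives, τ the temperature, and sim is cosine similarity. The function returns the mean of this quantity over the batch.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\log
    \frac{\exp\!\big(\operatorname{sim}(a, p)/\tau\big)}
         {\exp\!\big(\operatorname{sim}(a, p)/\tau\big)
          + \sum_{k=1}^{K} \exp\!\big(\operatorname{sim}(a, n_k)/\tau\big)}
```

A lower temperature sharpens the softmax, so the hardest negatives dominate the gradient.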
Embedding Models Comparison
┌────────────────────────────────────────────────────────────────┐
│             Embedding Model Comparison (2024-2025)             │
├──────────────────────────┬────────┬──────────┬─────────────────┤
│ Model                    │ Dims   │ MTEB↑    │ Cost/1M tokens  │
├──────────────────────────┼────────┼──────────┼─────────────────┤
│ text-embedding-3-large   │ 3072   │ 64.6     │ $0.13           │
│ text-embedding-3-small   │ 1536   │ 62.3     │ $0.02           │
│ text-embedding-ada-002   │ 1536   │ 61.0     │ $0.10           │
│ voyage-large-2           │ 1536   │ 67.1     │ $0.12           │
│ cohere-embed-v3          │ 1024   │ 64.5     │ $0.10           │
│ bge-large-en-v1.5 (OSS)  │ 1024   │ 63.5     │ Free            │
│ e5-mistral-7b (OSS)      │ 4096   │ 66.6     │ Self-hosted     │
└──────────────────────────┴────────┴──────────┴─────────────────┘
MTEB = Massive Text Embedding Benchmark (higher is better)
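Dimension count and price translate directly into storage and budget. The sketch below does that arithmetic for three rows of the table. It assumes float32 storage (4 bytes per dimension) and a hypothetical corpus of 1 million documents averaging 500 tokens each. Both corpus figures are placeholders, not measurements.

```python
# (dims, USD per 1M tokens) copied from the table above.
MODELS = {
    "text-embedding-3-large": (3072, 0.13),
    "text-embedding-3-small": (1536, 0.02),
    "cohere-embed-v3": (1024, 0.10),
}

NUM_DOCS = 1_000_000        # hypothetical corpus size
AVG_TOKENS_PER_DOC = 500    # hypothetical average document length
BYTES_PER_FLOAT32 = 4

def index_size_gb(dims: int, num_vectors: int) -> float:
    """Raw vector storage in GB (10^9 bytes), ignoring index overhead."""
    return dims * BYTES_PER_FLOAT32 * num_vectors / 1e9

def embedding_cost_usd(price_per_million: float, total_tokens: int) -> float:
    """One-off cost of embedding the whole corpus once."""
    return price_per_million * total_tokens / 1_000_000

total_tokens = NUM_DOCS * AVG_TOKENS_PER_DOC
for name, (dims, price) in MODELS.items():
    size = index_size_gb(dims, NUM_DOCS)
    cost = embedding_cost_usd(price, total_tokens)
    print(f"{name:<24} {size:6.2f} GB  ${cost:7.2f}")
```

Under these assumptions the embedding bill is a one-off cost, while the storage footprint recurs for as long as the index is kept, so the dimension count often matters more than the per-token price.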
Generating Embeddings in Practice
```python
from typing import Union

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text: Union[str, list[str]], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed one string or a list of strings in a single request.

    Always returns a 2-D array of shape [num_texts, dim].
    """
    if isinstance(text, str):
        text = [text]

    # Replace newlines with spaces, as OpenAI's earlier guidance recommended
    text = [t.replace("\n", " ") for t in text]

    response = client.embeddings.create(input=text, model=model)
    embeddings = np.array([item.embedding for item in response.data])

    # L2 normalize for cosine similarity via dot product
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two normalized vectors is just dot product."""
    return float(np.dot(a, b))


# Example: semantic similarity
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above the sleepy hound",  # paraphrase
    "Python is a programming language",                # unrelated
]

embeddings = get_embedding(sentences)
# The scores in the comments are illustrative; exact values depend on the model.
print(cosine_similarity(embeddings[0], embeddings[1]))  # ~0.92 (similar)
print(cosine_similarity(embeddings[0], embeddings[2]))  # ~0.21 (different)
```
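Once vectors are normalized, ranking a small corpus against a query is a dot product per document followed by a sort. The sketch below shows that exact (brute-force) ranking on toy 2-D vectors whose values are invented for illustration. It uses plain Python lists so that it runs without dependencies. With the NumPy arrays returned by get_embedding, the same scores come from a single matrix-vector product.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query: list[float], corpus: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k corpus vectors most similar to query.

    Assumes every vector is already unit length.
    """
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in corpus]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy 2-D stand-ins for real embeddings.
corpus = [normalize(v) for v in ([1.0, 0.1], [0.2, 1.0], [0.9, 0.3])]
query = normalize([1.0, 0.2])

results = top_k(query, corpus, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

Brute force like this is exact and is often adequate for a few thousand documents. Approximate indexes become worthwhile as the collection grows.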
Dimensionality Reduction for Visualization
```python
import matplotlib.pyplot as plt
import numpy as np
import umap  # provided by the umap-learn package
from sklearn.preprocessing import LabelEncoder

def visualize_embeddings(texts: list[str], labels: list[str], embeddings: np.ndarray):
    """Reduce 1536-D embeddings to 2-D for visualization using UMAP."""
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
    reduced = reducer.fit_transform(embeddings)

    # Map string labels to integer ids so they can drive the colormap
    le = LabelEncoder()
    label_ids = le.fit_transform(labels)

    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(reduced[:, 0], reduced[:, 1],
                          c=label_ids, cmap="tab10", alpha=0.7, s=100)

    # Annotate each point with the first 30 characters of its text
    for i, text in enumerate(texts):
        plt.annotate(text[:30], (reduced[i, 0], reduced[i, 1]),
                     fontsize=7, alpha=0.8)

    plt.colorbar(scatter)
    plt.title("Semantic Embedding Space")
    plt.savefig("embeddings_viz.png", dpi=150, bbox_inches="tight")
    plt.close()
```
Building a Semantic Search Engine from Scratch
```python
import pickle
from dataclasses import dataclass
from typing import Optional

import faiss
import numpy as np

# Relies on get_embedding() defined earlier, which returns L2-normalized vectors.

@dataclass
class SearchResult:
    text: str
    score: float
    metadata: dict

class SemanticSearchEngine:
    """
    In-memory semantic search using FAISS (Facebook AI Similarity Search).
    HNSW gives fast approximate nearest-neighbor search over large collections.
    """
    def __init__(self, dimension: int = 1536):
        self.dimension = dimension
        # HNSW index: fast approximate nearest neighbor (L2 metric by default)
        self.index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M parameter
        self.index.hnsw.efConstruction = 200
        self.index.hnsw.efSearch = 128
        self.documents: list[str] = []
        self.metadata: list[dict] = []

    def add_documents(self, texts: list[str], metadata: Optional[list[dict]] = None):
        """Embed and index documents."""
        embeddings = get_embedding(texts).astype(np.float32)

        self.index.add(embeddings)
        self.documents.extend(texts)
        self.metadata.extend(metadata or [{} for _ in texts])

        print(f"Index now contains {self.index.ntotal} vectors")

    def search(self, query: str, top_k: int = 5) -> list[SearchResult]:
        query_embedding = get_embedding(query).astype(np.float32)

        # FAISS returns squared L2 distances here (smaller = closer).
        # On L2-normalized vectors, cosine = 1 - L2²/2.
        distances, indices = self.index.search(query_embedding, top_k)

        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx != -1:  # -1 means no result
                results.append(SearchResult(
                    text=self.documents[idx],
                    score=1.0 - float(dist) / 2.0,  # convert to cosine similarity
                    metadata=self.metadata[idx],
                ))
        return results

    def save(self, path: str):
        faiss.write_index(self.index, f"{path}.faiss")
        with open(f"{path}.pkl", "wb") as f:
            pickle.dump({"documents": self.documents, "metadata": self.metadata}, f)

    @classmethod
    def load(cls, path: str) -> "SemanticSearchEngine":
        index = faiss.read_index(f"{path}.faiss")
        engine = cls(dimension=index.d)  # take the dimension from the saved index
        engine.index = index
        # Only unpickle files you created yourself; pickle can execute arbitrary code.
        with open(f"{path}.pkl", "rb") as f:
            data = pickle.load(f)
        engine.documents = data["documents"]
        engine.metadata = data["metadata"]
        return engine
```
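The comment in search relies on a short identity linking squared L2 distance and cosine similarity for unit vectors. A sketch of the derivation, for vectors a and b with unit norm:

```latex
\begin{aligned}
\lVert a - b \rVert^2
  &= \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b \\
  &= 2 - 2\cos(a, b)
  \qquad \text{since } \lVert a \rVert = \lVert b \rVert = 1, \\[4pt]
\cos(a, b) &= 1 - \tfrac{1}{2}\,\lVert a - b \rVert^2 .
\end{aligned}
```

Because FAISS L2 indexes report the squared distance, the conversion needs no square root, and ranking by smallest L2 distance gives the same order as ranking by largest cosine similarity.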
Fine-Tuning Embeddings for Your Domain
Pre-trained embeddings may underperform on specialized domains (medical, legal, code). Fine-tune with your data:
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Training data: pairs of (query, relevant_document)
train_examples = [
    InputExample(texts=["What is the refund policy?", "We offer 30-day money-back guarantee"]),
    InputExample(texts=["How do I reset my password?", "Click forgot password on the login page"]),
    # ... thousands more pairs
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# Treats the other documents in each batch as negatives,
# so a larger batch size gives each query more negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
    show_progress_bar=True,
)
```
Matryoshka Embeddings (MRL)
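OpenAI's text-embedding-3 models are trained so that a vector can be shortened and still work well. You can request fewer dimensions through the API's dimensions parameter, or keep only the first N dimensions of a vector you have already stored and re-normalize it, which avoids re-embedding: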
```python
from openai import OpenAI

client = OpenAI()

def get_matryoshka_embedding(text: str, dimensions: int = 256) -> list[float]:
    """Get reduced-dimension embedding. Trade quality for speed/storage."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dimensions,  # Can be any value up to 3072
    )
    return response.data[0].embedding

# A 256-dim embedding uses 12x less storage than a 3072-dim one (3072 / 256 = 12)
# Quality on MTEB as reported by OpenAI: 3072-dim=64.6, 256-dim=62.0 (about a 4% relative drop)
```
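When the full-length vector is already stored, the shortening can be done client-side. The sketch below keeps the first N values and rescales the result to unit length, since a truncated vector is no longer normalized. The helper name truncate_and_normalize and the toy 8-D vector are illustrative assumptions, and the approach only makes sense for models trained to support truncation.

```python
import math

def truncate_and_normalize(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

# Toy 8-D unit vector standing in for a stored full-length embedding.
full = [0.5, 0.5, 0.5, 0.3, 0.2, 0.2, 0.2, 0.2]
full_norm = math.sqrt(sum(x * x for x in full))
full = [x / full_norm for x in full]

short = truncate_and_normalize(full, 4)
print(len(short), round(math.sqrt(sum(x * x for x in short)), 6))
```

Skipping the re-normalization step would make dot products between shortened vectors systematically smaller than their cosine similarity.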
Key Concepts Summary
- Cosine similarity: The usual default metric for semantic similarity; range [-1, 1]
- Dot product: Equivalent to cosine on L2-normalized vectors; faster
- Euclidean distance: Use when magnitude matters (rare in NLP)
- Context window: Embedding models have token limits (8191 for ada-002)
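- Batch size: Batch embedding calls; sending many texts per request cuts round trips and rate-limit pressure (tokens are billed the same either way)
- Caching: Cache embeddings; the same text and model produce the same vector up to small numerical differences, so there is no need to re-embed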
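The batching and caching points above combine naturally. The following is a minimal in-memory sketch, not a production design: the EmbeddingCache class, its parameters, and the fake_embed stand-in are all hypothetical names introduced for illustration. In real use, embed_fn would wrap an actual embedding call.

```python
import hashlib
from typing import Callable

EmbedFn = Callable[[list[str]], list[list[float]]]

class EmbeddingCache:
    """In-memory cache keyed by a hash of (model name, text)."""

    def __init__(self, embed_fn: EmbedFn, model: str, batch_size: int = 100):
        self.embed_fn = embed_fn
        self.model = model
        self.batch_size = batch_size
        self._store: dict[str, list[float]] = {}
        self.calls = 0  # number of batched requests sent to embed_fn

    def _key(self, text: str) -> str:
        # Include the model name: vectors from different models are not interchangeable.
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def get_many(self, texts: list[str]) -> list[list[float]]:
        # Collect cache misses, deduplicated, in first-seen order.
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self._store))
        # Send the misses in batches instead of one request per text.
        for start in range(0, len(missing), self.batch_size):
            batch = missing[start:start + self.batch_size]
            vectors = self.embed_fn(batch)
            self.calls += 1
            for text, vector in zip(batch, vectors):
                self._store[self._key(text)] = vector
        return [self._store[self._key(t)] for t in texts]


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Stand-in for a real API call: maps each text to a 2-D toy vector."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]

cache = EmbeddingCache(fake_embed, model="toy-model", batch_size=2)
first = cache.get_many(["a b", "hello", "a b", "x y z"])
calls_after_first = cache.calls
second = cache.get_many(["hello", "a b"])
print(calls_after_first, cache.calls)
```

Keying on the model name as well as the text means that switching models cannot silently return stale vectors. A persistent store could replace the dictionary without changing the interface.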