Text Embeddings Deep Dive: How AI Understands Meaning

What Are Embeddings?

An embedding is a numerical representation of data (text, images, audio) as a dense vector in a high-dimensional space. The key property is that semantically similar items are geometrically close: their vectors point in similar directions.

  Semantic Space Visualization (2D projection of 1536-D space)

  "dog" •         • "puppy"       "cat" •   • "kitten"
                                          
         • "pet"
                        
  "car" •  • "automobile"    "bicycle" •
         • "vehicle"

  Similar concepts cluster together in embedding space.
  Dissimilar concepts are far apart.
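
To make "point in similar directions" concrete, here is a minimal sketch that measures the angle between vectors with cosine similarity. The three 3-D vectors are hand-picked, hypothetical stand-ins for real embeddings, which would have hundreds or thousands of dimensions.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy "embeddings", chosen by hand for illustration only
dog = np.array([0.9, 0.8, 0.1])
puppy = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

sim_dog_puppy = cosine(dog, puppy)
sim_dog_car = cosine(dog, car)

print(f"dog vs puppy: {sim_dog_puppy:.2f}")
print(f"dog vs car:   {sim_dog_car:.2f}")
print(sim_dog_puppy > sim_dog_car)  # True
```

The related pair scores close to 1, while the unrelated pair scores far lower, which is the pattern the diagram above depicts.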

How Embedding Models Work

Modern embedding models are transformer networks trained with contrastive learning. Given a pair of semantically related sentences, the model learns to pull their embeddings together; given unrelated sentences, it pushes them apart.

python
import torch
import torch.nn.functional as F


# The training objective (simplified)
def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """
    InfoNCE loss with one positive and several negatives per anchor.

    anchor:    [batch, dim] embeddings of the query sentences
    positive:  [batch, dim] embeddings of semantically similar sentences
    negatives: [batch, num_negatives, dim] embeddings of unrelated sentences
    """
    # Cosine similarities
    sim_positive = F.cosine_similarity(anchor, positive, dim=-1)  # shape: [batch]
    sim_negatives = F.cosine_similarity(
        anchor.unsqueeze(1), negatives, dim=-1
    )  # shape: [batch, num_negatives]

    # InfoNCE loss: the positive sits in column 0, so the target class is always 0
    logits = torch.cat([sim_positive.unsqueeze(1), sim_negatives], dim=1)
    logits = logits / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
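
For readers who prefer the math, the per-anchor quantity that the code above computes can be written out as follows. The symbols are notation chosen here rather than taken from any specific paper: a is the anchor, p the positive, n_1 … n_K the negatives, s(·,·) cosine similarity, and τ the temperature. The function returns the mean of this value over the batch.

```latex
\mathcal{L}(a, p, \{n_j\})
  = -\log
    \frac{\exp\big(s(a, p)/\tau\big)}
         {\exp\big(s(a, p)/\tau\big) + \sum_{j=1}^{K} \exp\big(s(a, n_j)/\tau\big)}
```

A small τ (such as the 0.07 default above) sharpens the softmax, so negatives that score almost as high as the positive dominate the loss.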

Embedding Models Comparison

┌─────────────────────────────────────────────────────────────────┐
│              Embedding Model Comparison (2024-2025)             │
├──────────────────────────┬────────┬──────────┬─────────────────┤
│ Model                    │  Dims  │  MTEB↑   │  Cost/1M tokens │
├──────────────────────────┼────────┼──────────┼─────────────────┤
│ text-embedding-3-large   │  3072  │  64.6    │  $0.13          │
│ text-embedding-3-small   │  1536  │  62.3    │  $0.02          │
│ text-embedding-ada-002   │  1536  │  61.0    │  $0.10          │
│ voyage-large-2           │  1536  │  67.1    │  $0.12          │
│ cohere-embed-v3          │  1024  │  64.5    │  $0.10          │
│ bge-large-en-v1.5 (OSS)  │  1024  │  63.5    │  Free           │
│ e5-mistral-7b (OSS)      │  4096  │  66.6    │  Self-hosted    │
└──────────────────────────┴────────┴──────────┴─────────────────┘
  MTEB = Massive Text Embedding Benchmark (higher is better)
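
The cost column is easier to judge against a concrete workload. The sketch below is plain arithmetic on the per-million-token prices listed in the table; the corpus size (50,000 documents averaging 400 tokens) is a made-up assumption for illustration.

```python
def embedding_cost_usd(num_tokens: int, price_per_million: float) -> float:
    """Cost of embedding `num_tokens` tokens at a given price per 1M tokens."""
    return num_tokens / 1_000_000 * price_per_million


# Hypothetical corpus: 50,000 documents averaging 400 tokens each
total_tokens = 50_000 * 400  # 20,000,000 tokens

# Prices per 1M tokens, copied from the table above
prices = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
}

for model, price in prices.items():
    print(f"{model}: ${embedding_cost_usd(total_tokens, price):.2f}")
# text-embedding-3-large: $2.60
# text-embedding-3-small: $0.40
```

At this scale the one-off embedding cost is small either way, so storage and search latency (both driven by the Dims column) are often the more important factors.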

Generating Embeddings in Practice

python
from typing import Union

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_embedding(text: Union[str, list[str]], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed one string or a list of strings in a single API request.

    Always returns a 2-D array of shape [num_texts, dims], L2-normalized row by row.
    """
    if isinstance(text, str):
        text = [text]

    # Earlier OpenAI guidance recommended replacing newlines with spaces
    text = [t.replace("\n", " ") for t in text]

    response = client.embeddings.create(input=text, model=model)
    embeddings = np.array([item.embedding for item in response.data])

    # L2 normalize for cosine similarity via dot product
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two normalized vectors is just dot product."""
    return float(np.dot(a, b))


# Example: semantic similarity
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above the sleepy hound",  # paraphrase
    "Python is a programming language",               # unrelated
]

embeddings = get_embedding(sentences)
# Exact values depend on the model; the paraphrase should score clearly higher
print(cosine_similarity(embeddings[0], embeddings[1]))  # higher (similar)
print(cosine_similarity(embeddings[0], embeddings[2]))  # lower (different)
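
The function above sends every text in one request, which stops working once a corpus outgrows a single call. Here is a minimal sketch of splitting the work into fixed-size batches. `embed_in_batches` and `fake_embed` are hypothetical names; `fake_embed` stands in for an API-backed function such as `get_embedding` so the example runs offline, and the batch size of 100 is an arbitrary assumption rather than a documented limit.

```python
from typing import Callable, Iterator

import numpy as np


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], np.ndarray],
    batch_size: int = 100,
) -> np.ndarray:
    """Call `embed_fn` once per batch and stack the results row-wise."""
    if not texts:
        return np.empty((0, 0), dtype=np.float32)
    parts = [embed_fn(batch) for batch in chunked(texts, batch_size)]
    return np.vstack(parts)


# Stand-in embedder (NOT a real model): two made-up features per text
def fake_embed(batch: list[str]) -> np.ndarray:
    return np.array([[float(len(t)), 1.0] for t in batch], dtype=np.float32)


texts = [f"doc {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(vectors.shape)  # (250, 2): three calls covering 100 + 100 + 50 texts
```

In real use you would pass `get_embedding` as `embed_fn` and add retry handling around each call.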

Dimensionality Reduction for Visualization

python
import matplotlib.pyplot as plt
import numpy as np
import umap  # provided by the umap-learn package
from sklearn.preprocessing import LabelEncoder


def visualize_embeddings(texts: list[str], labels: list[str], embeddings: np.ndarray):
    """Reduce high-dimensional (e.g. 1536-D) embeddings to 2-D with UMAP and plot them."""
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
    reduced = reducer.fit_transform(embeddings)

    # Map string labels to integer ids so they can drive the colormap
    le = LabelEncoder()
    label_ids = le.fit_transform(labels)

    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(reduced[:, 0], reduced[:, 1],
                          c=label_ids, cmap="tab10", alpha=0.7, s=100)

    # Annotate each point with the first 30 characters of its text
    for i, text in enumerate(texts):
        plt.annotate(text[:30], (reduced[i, 0], reduced[i, 1]),
                     fontsize=7, alpha=0.8)

    plt.colorbar(scatter)
    plt.title("Semantic Embedding Space")
    plt.savefig("embeddings_viz.png", dpi=150, bbox_inches="tight")
    plt.close()

Building a Semantic Search Engine from Scratch

python
import pickle
from dataclasses import dataclass
from typing import Optional

import faiss
import numpy as np


@dataclass
class SearchResult:
    text: str
    score: float  # cosine similarity, higher is better
    metadata: dict


class SemanticSearchEngine:
    """
    In-memory semantic search using FAISS (Facebook AI Similarity Search).
    The HNSW index is approximate and designed to scale to millions of vectors.
    Relies on get_embedding() defined earlier, which returns L2-normalized vectors.
    """
    def __init__(self, dimension: int = 1536):
        self.dimension = dimension
        # HNSW index: fast approximate nearest neighbor (L2 metric by default)
        self.index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M parameter
        self.index.hnsw.efConstruction = 200
        self.index.hnsw.efSearch = 128
        self.documents: list[str] = []
        self.metadata: list[dict] = []

    def add_documents(self, texts: list[str], metadata: Optional[list[dict]] = None):
        """Embed and index documents."""
        if metadata is not None and len(metadata) != len(texts):
            raise ValueError("metadata must have one entry per text")

        embeddings = get_embedding(texts).astype(np.float32)

        self.index.add(embeddings)
        self.documents.extend(texts)
        self.metadata.extend(metadata or [{} for _ in texts])

        print(f"Index now contains {self.index.ntotal} vectors")

    def search(self, query: str, top_k: int = 5) -> list[SearchResult]:
        query_embedding = get_embedding(query).astype(np.float32)

        # The index returns squared L2 distances (smaller = closer).
        # On L2-normalized vectors, cosine = 1 - L2²/2, so convert before returning.
        distances, indices = self.index.search(query_embedding, top_k)

        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx != -1:  # -1 means no result
                results.append(SearchResult(
                    text=self.documents[idx],
                    score=float(1.0 - dist / 2.0),
                    metadata=self.metadata[idx]
                ))
        return results

    def save(self, path: str):
        faiss.write_index(self.index, f"{path}.faiss")
        with open(f"{path}.pkl", "wb") as f:
            pickle.dump({"documents": self.documents, "metadata": self.metadata}, f)

    @classmethod
    def load(cls, path: str) -> "SemanticSearchEngine":
        engine = cls()
        engine.index = faiss.read_index(f"{path}.faiss")
        engine.dimension = engine.index.d  # keep in sync with the loaded index
        # pickle can execute arbitrary code: only load files you created yourself
        with open(f"{path}.pkl", "rb") as f:
            data = pickle.load(f)
        engine.documents = data["documents"]
        engine.metadata = data["metadata"]
        return engine
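
Because HNSW is approximate, it helps to have an exact baseline to compare against on a small sample. The sketch below is a brute-force search in plain NumPy, together with a numerical check of the identity mentioned in the search comment (for unit vectors, squared L2 distance equals 2 − 2·cosine). The data is random and only stands in for real embeddings; `exact_top_k` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in "embeddings", L2-normalized like the output of get_embedding()
docs = rng.normal(size=(1000, 64)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of document 42
query = docs[42] + 0.05 * rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)


def exact_top_k(query: np.ndarray, docs: np.ndarray, k: int = 5):
    """Exact cosine top-k by scoring every document (O(N) per query)."""
    scores = docs @ query            # dot product = cosine on unit vectors
    top = np.argsort(-scores)[:k]    # indices of the k highest scores
    return top, scores[top]


ids, scores = exact_top_k(query, docs)

# Check the identity: for unit vectors, ||q - d||² = 2 - 2·cos(q, d)
sq_l2 = np.sum((docs[ids] - query) ** 2, axis=1)
identity_holds = np.allclose(scores, 1.0 - sq_l2 / 2.0, atol=1e-4)

print(ids[0], identity_holds)
```

Comparing these exact ids with what the HNSW index returns for the same queries gives a quick recall estimate when tuning `efSearch`.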

Fine-Tuning Embeddings for Your Domain

Pre-trained embeddings may underperform on specialized domains (medical, legal, code). Fine-tune with your data:

python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Training data: pairs of (query, relevant_document)
train_examples = [
    InputExample(texts=["What is the refund policy?", "We offer 30-day money-back guarantee"]),
    InputExample(texts=["How do I reset my password?", "Click forgot password on the login page"]),
    # ... thousands more pairs
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# Uses the other documents in each batch as negatives for every query
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
    show_progress_bar=True
)

Matryoshka Embeddings (MRL)

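OpenAI's text-embedding-3 models are trained so that a prefix of the vector is itself a usable embedding. You can request fewer dimensions through the API's dimensions parameter, or keep only the first N dimensions of a stored vector and re-normalize it, and still get good quality: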

python
def get_matryoshka_embedding(text: str, dimensions: int = 256) -> list[float]:
    """Get reduced-dimension embedding. Trade quality for speed/storage."""
    # Uses the `client = OpenAI()` instance created earlier
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dimensions  # Can be any value up to 3072
    )
    return response.data[0].embedding

# 256-dim embedding uses 12x less storage than 3072-dim (3072 / 256 = 12)
# Quality on MTEB (as reported by OpenAI): 3072-dim=64.6, 256-dim=62.0 (about a 4% relative drop)
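
If full-size vectors are already stored, the same reduction can be done locally: keep the leading components and rescale to unit length. The sketch below shows only the mechanics. `truncate_embedding` is a hypothetical helper, and the input is a random vector standing in for a real embedding, so it says nothing about retrieval quality, which depends on the model having been trained for truncation.

```python
import numpy as np


def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and rescale the result to unit length."""
    if dims > vec.shape[-1]:
        raise ValueError("dims exceeds the embedding size")
    head = vec[..., :dims]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)


# Random stand-in for a stored 3072-D embedding (not real model output)
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 256)

print(short.shape)                  # (256,)
print(full.nbytes // short.nbytes)  # 12
```

Re-normalizing matters because the truncated prefix no longer has unit length, and dot-product search assumes it does.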

Key Concepts Summary

Cosine similarity: Standard metric for semantic similarity; range [-1, 1]
Dot product: Equivalent to cosine on L2-normalized vectors; faster
Euclidean distance: Use when magnitude matters (rare in NLP)
Context window: Embedding models have token limits (8191 for ada-002)
Batch size: Batch embedding calls; one request carrying many texts avoids per-request overhead, although the per-token price stays the same
Caching: Cache embeddings; the same text and model yield effectively the same vector, so there is no reason to pay for it twice
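
The caching advice can be sketched as a small wrapper that embeds only texts it has not seen before. `EmbeddingCache` is a hypothetical class, the store is a plain in-memory dict (a real deployment would more likely use a database or key-value store), and `fake_embed` stands in for an API-backed embedder so the example runs offline.

```python
import hashlib
from typing import Callable

import numpy as np


class EmbeddingCache:
    """In-memory embedding cache keyed by a hash of (model, text)."""

    def __init__(self, embed_fn: Callable[[list[str]], np.ndarray], model: str):
        self.embed_fn = embed_fn
        self.model = model
        self._store: dict[str, np.ndarray] = {}
        self.misses = 0  # number of texts actually sent to embed_fn

    def _key(self, text: str) -> str:
        # Include the model name: vectors from different models are not interchangeable
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, texts: list[str]) -> np.ndarray:
        # De-duplicate while preserving order, then embed the unseen texts in one call
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self._store]
        if missing:
            self.misses += len(missing)
            for text, vec in zip(missing, self.embed_fn(missing)):
                self._store[self._key(text)] = vec
        return np.stack([self._store[self._key(t)] for t in texts])


# Stand-in embedder (NOT a real model): two made-up features per text
def fake_embed(batch: list[str]) -> np.ndarray:
    return np.array([[float(len(t)), 1.0] for t in batch])


cache = EmbeddingCache(fake_embed, model="demo-model")
first = cache.get(["alpha", "beta", "alpha"])  # embeds "alpha" and "beta" once each
second = cache.get(["beta", "gamma"])          # only "gamma" is new
print(cache.misses)  # 3
```

Only three of the five requested texts reach the embedder; the rest are served from the cache.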

Sumit Kumar Pandey
