Text Embeddings Deep Dive: How AI Understands Meaning

A comprehensive guide to text embeddings — from theory to production. Learn how vector representations capture semantic meaning, how to choose the right model, and how to build similarity search at scale.

What Are Embeddings?

An embedding is a numerical representation of data (text, image, audio) as a dense vector in high-dimensional space. The key property: semantically similar items are geometrically close — their vectors point in similar directions.

  Semantic Space Visualization (2D projection of 1536-D space)

  "dog" •         • "puppy"       "cat" •   • "kitten"
                                          
         • "pet"
                        
  "car" •  • "automobile"    "bicycle" •
         • "vehicle"

  Similar concepts cluster together in embedding space.
  Dissimilar concepts are far apart.
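
To make "geometrically close" concrete, here is a minimal sketch that computes cosine similarity by hand. The three 3-D vectors are made up for illustration and do not come from any model; the helper name `cosine` is likewise just a placeholder.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, purely illustrative (not produced by any model).
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_dog_puppy = cosine(toy_vectors["dog"], toy_vectors["puppy"])
sim_dog_car = cosine(toy_vectors["dog"], toy_vectors["car"])

print(f"dog vs puppy: {sim_dog_puppy:.3f}")  # close to 1: nearly the same direction
print(f"dog vs car:   {sim_dog_car:.3f}")    # much lower: different direction
```

Real embeddings behave the same way, only with hundreds or thousands of dimensions instead of three.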

How Embedding Models Work

Modern embedding models are transformer networks trained with contrastive learning. Given a pair of semantically related sentences, the model learns to pull their embeddings together; given unrelated sentences, it pushes them apart.

python
# The training objective (simplified)
import torch
import torch.nn.functional as F


def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """
    anchor:    embedding of query sentence, shape [batch, dim]
    positive:  embedding of semantically similar sentence, shape [batch, dim]
    negatives: embeddings of unrelated sentences, shape [batch, num_negatives, dim]
    """
    # Cosine similarities
    sim_positive = F.cosine_similarity(anchor, positive, dim=-1)  # shape: [batch]
    sim_negatives = F.cosine_similarity(
        anchor.unsqueeze(1), negatives, dim=-1
    )  # shape: [batch, num_negatives]

    # InfoNCE loss: the positive sits in column 0, so the target class is 0
    logits = torch.cat([sim_positive.unsqueeze(1), sim_negatives], dim=1)
    logits = logits / temperature
    # Create the labels on the same device as the logits (matters on GPU)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
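
To get a feel for the numbers, the same objective can be evaluated for a single anchor in plain Python. This is a sketch, not the training code of any particular model: `info_nce_single` is a hypothetical helper, and the similarity values are invented for illustration.

```python
import math


def info_nce_single(sim_positive: float, sim_negatives: list[float],
                    temperature: float = 0.07) -> float:
    """InfoNCE loss for one anchor, given precomputed cosine similarities."""
    logits = [sim_positive / temperature] + [s / temperature for s in sim_negatives]
    # Numerically stable log-sum-exp over all candidates
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    # Cross-entropy with the positive at index 0
    return log_sum_exp - logits[0]


# Illustrative similarity values, not measurements from a real model.
well_separated = info_nce_single(0.9, [0.1, 0.0, -0.2])
poorly_separated = info_nce_single(0.3, [0.25, 0.2, 0.35])

print(f"well separated:   {well_separated:.5f}")    # near zero
print(f"poorly separated: {poorly_separated:.5f}")  # clearly larger
```

When the positive is far more similar than every negative, the loss is close to zero. When a negative scores as high as the positive, the loss grows, and the low temperature amplifies even small gaps in similarity.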

Embedding Models Comparison

┌─────────────────────────────────────────────────────────────────┐
│              Embedding Model Comparison (2024-2025)             │
├──────────────────────────┬────────┬──────────┬─────────────────┤
│ Model                    │  Dims  │  MTEB↑   │  Cost/1M tokens │
├──────────────────────────┼────────┼──────────┼─────────────────┤
│ text-embedding-3-large   │  3072  │  64.6    │  $0.13          │
│ text-embedding-3-small   │  1536  │  62.3    │  $0.02          │
│ text-embedding-ada-002   │  1536  │  61.0    │  $0.10          │
│ voyage-large-2           │  1536  │  67.1    │  $0.12          │
│ cohere-embed-v3          │  1024  │  64.5    │  $0.10          │
│ bge-large-en-v1.5 (OSS)  │  1024  │  63.5    │  Free           │
│ e5-mistral-7b (OSS)      │  4096  │  66.6    │  Self-hosted    │
└──────────────────────────┴────────┴──────────┴─────────────────┘
  MTEB = Massive Text Embedding Benchmark (higher is better)
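
The price and dimension columns translate directly into a budget. The sketch below uses the table's figures for two of the models; the corpus size (one million chunks of about 500 tokens each) and the function names are assumptions chosen only to make the arithmetic concrete.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """One-off API cost of embedding `total_tokens` tokens."""
    return total_tokens / 1_000_000 * price_per_million


def float32_storage_gb(num_vectors: int, dims: int) -> float:
    """Raw storage for float32 vectors (4 bytes per dimension), in decimal GB."""
    return num_vectors * dims * 4 / 1e9


# Hypothetical corpus: 1M chunks of ~500 tokens each.
num_chunks = 1_000_000
total_tokens = num_chunks * 500

# (dimensions, USD per 1M tokens), taken from the table above.
models = {
    "text-embedding-3-large": (3072, 0.13),
    "text-embedding-3-small": (1536, 0.02),
}

for name, (dims, price) in models.items():
    cost = embedding_cost_usd(total_tokens, price)
    storage = float32_storage_gb(num_chunks, dims)
    print(f"{name}: ${cost:.2f} to embed, {storage:.3f} GB of raw vectors")
```

Index overhead (for example HNSW graph links) comes on top of the raw vector size, so treat the storage figure as a lower bound.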

Generating Embeddings in Practice

python
from typing import Union

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_embedding(text: Union[str, list[str]], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed one string or a list of strings in a single request.

    Always returns a 2-D array of shape [num_texts, dims].
    """
    if isinstance(text, str):
        text = [text]

    # Replace newlines with spaces (recommended in earlier OpenAI guidance)
    text = [t.replace("\n", " ") for t in text]

    response = client.embeddings.create(input=text, model=model)
    embeddings = np.array([item.embedding for item in response.data])

    # L2 normalize for cosine similarity via dot product
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two normalized vectors is just dot product."""
    return float(np.dot(a, b))


# Example: semantic similarity
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above the sleepy hound",  # paraphrase
    "Python is a programming language",                # unrelated
]

embeddings = get_embedding(sentences)
print(cosine_similarity(embeddings[0], embeddings[1]))  # ~0.92 (similar)
print(cosine_similarity(embeddings[0], embeddings[2]))  # ~0.21 (different)
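
`get_embedding` sends whatever list it receives in one request, so a large corpus still has to be split into chunks by the caller. A minimal sketch of that splitting follows. The helper names, the batch size of 100, and the offline stand-in embedder are all assumptions; check your provider's per-request limits before picking a batch size.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts: list[str],
                     embed_fn: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 100) -> list[list[float]]:
    """Call `embed_fn` once per batch and concatenate the results in order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedder so the sketch runs offline; a real one would call the API.
def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]


texts = [f"document {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=100)
print(len(vectors))  # 250 vectors from 3 requests (100 + 100 + 50)
```

In the article's setup, `embed_fn` could be a thin wrapper around `get_embedding`, since it already accepts a list of strings.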

Dimensionality Reduction for Visualization

python
import matplotlib.pyplot as plt
import numpy as np
import umap
from sklearn.preprocessing import LabelEncoder


def visualize_embeddings(texts: list[str], labels: list[str], embeddings: np.ndarray):
    """Reduce 1536-D embeddings to 2-D for visualization using UMAP."""
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
    reduced = reducer.fit_transform(embeddings)

    # Map string labels to integer ids so they can drive the colormap
    le = LabelEncoder()
    label_ids = le.fit_transform(labels)

    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(reduced[:, 0], reduced[:, 1],
                          c=label_ids, cmap="tab10", alpha=0.7, s=100)

    # Annotate each point with the first 30 characters of its text
    for i, text in enumerate(texts):
        plt.annotate(text[:30], (reduced[i, 0], reduced[i, 1]),
                     fontsize=7, alpha=0.8)

    plt.colorbar(scatter)
    plt.title("Semantic Embedding Space")
    plt.savefig("embeddings_viz.png", dpi=150, bbox_inches="tight")
    plt.close()  # release the figure when called repeatedly

Building a Semantic Search Engine from Scratch

python
import pickle
from dataclasses import dataclass
from typing import Optional

import faiss
import numpy as np


@dataclass
class SearchResult:
    text: str
    score: float  # cosine similarity, higher is better
    metadata: dict


class SemanticSearchEngine:
    """
    In-memory semantic search using FAISS (Facebook AI Similarity Search).
    The HNSW index scales to millions of vectors. Relies on get_embedding()
    from the earlier example, which returns L2-normalized vectors.
    """
    def __init__(self, dimension: int = 1536):
        self.dimension = dimension
        # HNSW index: fast approximate nearest neighbor
        self.index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M parameter
        self.index.hnsw.efConstruction = 200
        self.index.hnsw.efSearch = 128
        self.documents: list[str] = []
        self.metadata: list[dict] = []

    def add_documents(self, texts: list[str], metadata: Optional[list[dict]] = None):
        """Embed and index documents."""
        if metadata is not None and len(metadata) != len(texts):
            raise ValueError("metadata must have one entry per text")

        embeddings = get_embedding(texts)
        embeddings = embeddings.astype(np.float32)

        self.index.add(embeddings)
        self.documents.extend(texts)
        self.metadata.extend(metadata or [{} for _ in texts])

        print(f"Index now contains {self.index.ntotal} vectors")

    def search(self, query: str, top_k: int = 5) -> list[SearchResult]:
        query_embedding = get_embedding(query).astype(np.float32)

        # IndexHNSWFlat uses the L2 metric by default, so FAISS returns
        # squared L2 distances (smaller is closer), not similarities.
        distances, indices = self.index.search(query_embedding, top_k)

        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx != -1:  # -1 means no result
                results.append(SearchResult(
                    text=self.documents[idx],
                    # On L2-normalized vectors, cosine = 1 - L2²/2
                    score=1.0 - float(dist) / 2.0,
                    metadata=self.metadata[idx]
                ))
        return results

    def save(self, path: str):
        faiss.write_index(self.index, f"{path}.faiss")
        with open(f"{path}.pkl", "wb") as f:
            pickle.dump({"documents": self.documents, "metadata": self.metadata}, f)

    @classmethod
    def load(cls, path: str) -> "SemanticSearchEngine":
        index = faiss.read_index(f"{path}.faiss")
        engine = cls(dimension=index.d)  # take the dimension from the saved index
        engine.index = index
        # pickle can execute arbitrary code: only load files you wrote yourself
        with open(f"{path}.pkl", "rb") as f:
            data = pickle.load(f)
        engine.documents = data["documents"]
        engine.metadata = data["metadata"]
        return engine
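
HNSW is approximate, so it helps to have an exact baseline for spot-checking results on a small sample. The sketch below is a brute-force search in plain Python, and it also verifies the identity quoted in the code comment above (cosine = 1 - L2²/2 on unit vectors). The toy vectors and all helper names are illustrative assumptions.

```python
import heapq
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (index, cosine) for the k most similar unit vectors, by brute force."""
    scored = ((dot(query, v), i) for i, v in enumerate(vectors))
    return [(i, score) for score, i in heapq.nlargest(k, scored)]


# Toy corpus of unit vectors, made up for illustration.
corpus = [normalize(v) for v in (
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.3],
    [0.0, 1.0, 0.0],
    [-1.0, 0.2, 0.0],
)]
query = normalize([1.0, 0.05, 0.25])

top = exact_top_k(query, corpus, k=2)
for idx, cos in top:
    # The same score recovered from the squared L2 distance
    from_l2 = 1.0 - squared_l2(query, corpus[idx]) / 2.0
    print(idx, round(cos, 4), round(from_l2, 4))
```

Comparing the ids returned by the HNSW index against this exact ranking for a few hundred sample queries gives a rough recall estimate, which is a reasonable basis for tuning `efSearch`.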

Fine-Tuning Embeddings for Your Domain

Pre-trained embeddings may underperform on specialized domains (medical, legal, code). Fine-tune with your data:

python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Training data: pairs of (query, relevant_document)
train_examples = [
    InputExample(texts=["What is the refund policy?", "We offer 30-day money-back guarantee"]),
    InputExample(texts=["How do I reset my password?", "Click forgot password on the login page"]),
    # ... thousands more pairs
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# Treats the other documents in each batch as negatives for a given query,
# so larger batches provide more negatives per step
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
    show_progress_bar=True
)
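
Whether fine-tuning actually helped is best judged on held-out query/document pairs. One common retrieval metric is recall@k, sketched below in plain Python. The document ids and rankings are invented placeholders; in practice the rankings would come from searching with the base model and again with the fine-tuned one.

```python
def recall_at_k(ranked_ids: list[list[str]], relevant_ids: list[str], k: int) -> float:
    """Fraction of queries whose relevant document appears in the top k results."""
    if not ranked_ids or len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one relevant id per query")
    hits = sum(
        1 for ranked, relevant in zip(ranked_ids, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(ranked_ids)


# Hypothetical results for three held-out queries (ids are placeholders).
relevant = ["doc_refund", "doc_password", "doc_shipping"]
baseline_rankings = [
    ["doc_shipping", "doc_refund", "doc_faq"],
    ["doc_faq", "doc_login", "doc_password"],
    ["doc_faq", "doc_refund", "doc_login"],
]

recall_1 = recall_at_k(baseline_rankings, relevant, k=1)
recall_3 = recall_at_k(baseline_rankings, relevant, k=3)
print(f"recall@1 = {recall_1:.2f}")  # 0.00: no query has its document ranked first
print(f"recall@3 = {recall_3:.2f}")  # 0.67: two of three queries hit within the top 3
```

Running the same evaluation before and after fine-tuning, on pairs that were not part of the training set, shows whether the domain adaptation paid off.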

Matryoshka Embeddings (MRL)

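OpenAI's text-embedding-3 models support dimension reduction without re-embedding. They are trained so that the leading dimensions carry most of the information: you can request a shorter vector with the `dimensions` parameter, or keep only the first N dimensions of a stored embedding (re-normalizing afterwards), and still get good quality: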

python
def get_matryoshka_embedding(text: str, dimensions: int = 256) -> list[float]:
    """Get reduced-dimension embedding. Trade quality for speed/storage."""
    # Uses the `client` created in the earlier example
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dimensions,  # Can be any value up to 3072
    )
    return response.data[0].embedding

# 256-dim embedding uses 12x less storage/compute than 3072-dim (3072 / 256 = 12)
# Quality on MTEB: 3072-dim=64.6, 256-dim=62.0 (about a 4% relative drop)
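
Embeddings that are already stored at full size can also be shortened locally instead of calling the API again. A minimal sketch, assuming the stored vector came from a Matryoshka-style model: keep the first N values, then re-normalize so that dot products remain valid cosine similarities. The function name and the toy vector are illustrative.

```python
import math


def truncate_and_normalize(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale to unit L2 norm."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in truncated]


# Toy unit-length vector standing in for a real embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_normalize(full, 2)

print(short)
print(math.sqrt(sum(x * x for x in short)))  # back to (approximately) 1.0
```

Skipping the re-normalization step leaves vectors with norms below 1, which skews dot-product scores.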

Key Concepts Summary

  • Cosine similarity: The usual default metric for semantic similarity; range [-1, 1]
  • Dot product: Equivalent to cosine on L2-normalized vectors; faster, since no norms are computed at query time
  • Euclidean distance: Use when magnitude matters (rare in NLP)
  • Context window: Embedding models have token limits (8191 for ada-002)
  • Batch size: Always batch embedding calls; the price per token is the same, but far fewer requests means much higher throughput and fewer rate-limit errors
  • Caching: Cache embeddings; for a given model, the same text produces effectively the same vector
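
The caching point can be sketched in a few lines. This is an in-memory illustration only: `EmbeddingCache` and the stand-in embedder are hypothetical names, and a real deployment would more likely persist vectors in a database or key-value store. The cache key includes the model name because vectors from different models are not interchangeable.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache keyed by a hash of (model name, text)."""

    def __init__(self, embed_fn: Callable[[str], list[float]], model: str):
        self._embed_fn = embed_fn
        self._model = model
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying embedder was called

    def _key(self, text: str) -> str:
        raw = f"{self._model}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


# Stand-in embedder so the sketch runs offline; a real one would call the API.
def fake_embed_one(text: str) -> list[float]:
    return [float(len(text))]


cache = EmbeddingCache(fake_embed_one, model="text-embedding-3-small")
cache.get("hello")
cache.get("hello")  # served from the cache, no second call
cache.get("world")
print(cache.misses)  # 2
```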

Sumit Kumar Pandey
