Data Modeling for AI Apps: Embeddings and Vectors
Architecture Patterns — Part 7 of 30
The Demo That Lied to You
Here's a story I've watched play out a dozen times in the last two years.
A team builds a killer internal knowledge base. They chunk up the company's Confluence docs, run them through text-embedding-3-small, dump the vectors into a local ChromaDB instance, wire it to GPT-4o, and demo it to leadership on a Tuesday. It's magic. Questions get answered. Executives are delighted. Jira tickets are created.
By Thursday, the thing is in production. By the following Monday, the first complaints roll in.
"It gave me the wrong policy." "It said our refund window is 30 days. It's 14." "It quoted a pricing tier we discontinued two years ago."
The team scrambles. They tweak the prompt. They add more context. They bump k from 3 to 8. Nothing fundamentally changes because the problem isn't the LLM—it's the data model. Specifically, it's that they never made deliberate architectural decisions about how their vectors would be stored, versioned, retrieved, and kept current.
This is Part 7 of the Architecture Patterns series, and we're going deep on the decisions that separate a RAG demo from a RAG system. Not just the how—the why behind every fork in the road.
What You're Actually Building
Before picking a database or an embedding model, understand the data structure you're working with.
An embedding is a dense numerical vector—an array of floating-point numbers—that encodes semantic meaning. When you embed the sentence "Cancel my subscription," you get a vector that sits geometrically close to "I want to stop my plan" and far from "Tell me your pricing." The embedding model learned this geometry by training on massive text corpora.
Here's what that looks like in practice:
```python
from openai import OpenAI

client = OpenAI()

def embed_text(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Returns a list of 1536 floats
vec = embed_text("Cancel my subscription")
print(f"Dimensions: {len(vec)}")  # 1536
print(f"First 5 values: {vec[:5]}")
```
The vector database stores these embeddings alongside their source content and metadata, then serves approximate nearest-neighbor (ANN) queries to find semantically similar records at query time.
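"Nearest" here means nearest under a distance metric, almost always cosine distance for text embeddings. A minimal pure-Python sketch of the measurement itself — the toy 3-dimensional vectors below are illustrative stand-ins, not real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|): 1.0 means same direction, 0.0 means orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for real 1536-dimensional embeddings
cancel = [0.9, 0.1, 0.0]   # "Cancel my subscription"
stop   = [0.8, 0.2, 0.1]   # "I want to stop my plan"
price  = [0.1, 0.2, 0.9]   # "Tell me your pricing"

print(cosine_similarity(cancel, stop))   # high: semantically close
print(cosine_similarity(cancel, price))  # low: semantically distant
```

ANN indexes like HNSW approximate this search so you don't have to compare the query against every stored vector.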
Your first two architectural decisions are where those embeddings live and which model produces them.
Decision 1: Choosing Your Vector Store
This is where I see the most cargo-culting. Engineers default to whatever they saw in the last tutorial. In 2024, that was usually Pinecone or Chroma. In 2026, the conversation has gotten more nuanced—and the performance landscape has shifted dramatically.
The pgvector Revolution
For most of 2023 and 2024, the conventional wisdom was: use pgvector for prototypes under a million vectors, then graduate to Pinecone or Weaviate. That advice is now outdated.
As of early 2026, pgvector + pgvectorscale (Timescale's extension) benchmarks at 471 queries per second at 99% recall on 50 million vectors—11.4x better throughput than Qdrant and 28x lower p95 latency than Pinecone's storage-optimized tier, at roughly 75% lower infrastructure cost. The performance objection to PostgreSQL as a vector store is largely dead at moderate scale.
Here's how you set it up:
```sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column
CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    metadata    JSONB,
    tenant_id   UUID NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    embedding   vector(1536)
);

-- HNSW index for fast ANN search
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```
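The index-build parameters above fix the graph structure; recall at query time is governed separately by pgvector's `hnsw.ef_search` setting (default 40). Raising it trades latency for recall per session:

```sql
-- Per-session recall/latency knob for HNSW queries (pgvector default: 40)
SET hnsw.ef_search = 100;
```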
And querying with a SQL JOIN—something no dedicated vector database can do natively:
```python
import psycopg2
from pgvector.psycopg2 import register_vector

def semantic_search(conn, query_embedding, tenant_id, limit=5):
    register_vector(conn)  # register the vector type adapter for this connection
    cur = conn.cursor()
    cur.execute("""
        SELECT d.id, d.content, d.metadata,
               1 - (d.embedding <=> %s) AS similarity
        FROM documents d
        JOIN tenants t ON t.id = d.tenant_id
        WHERE d.tenant_id = %s
          AND t.subscription_active = true
        ORDER BY d.embedding <=> %s
        LIMIT %s
    """, (query_embedding, tenant_id, query_embedding, limit))
    return cur.fetchall()
```
That JOIN on tenants is handling your authorization layer inside the retrieval query. You can't do that in Pinecone without a second round-trip.
The Decision Framework
Here's how I think about it:
| Scenario | Recommendation |
|---|---|
| Already running Postgres, <10M vectors | pgvector — zero new infra, ACID, SQL JOINs |
| Multi-tenant app needing row-level security | pgvector — combine vector + relational permissions |
| Need zero ops, unpredictable scale, compliance (SOC 2/HIPAA) | Pinecone — managed, auto-scales |
| Hybrid search (semantic + BM25 keyword) required | Weaviate — native hybrid, proven at scale |
| >50M vectors, dedicated vector team | Pinecone or self-hosted Weaviate |
A mid-size SaaS team (around $15M ARR) recently built their internal document search on pgvector with 2 million documents. Result: $0 in new infrastructure, sub-100ms p95 latency at 95% recall, and their authorization rules expressed as ordinary SQL — not a custom metadata filter scheme bolted onto a vector API.
Decision 2: Picking Your Embedding Model
The embedding model determines the quality of your semantic space. Swap the model, and you must re-embed everything—so this decision deserves more than a default.
As of early 2026, benchmarks across 10 embedding models on real-world tasks show the landscape has matured:
| Model | MTEB Score | Dimensions | Cost / 1M tokens | Latency (100 tokens) |
|---|---|---|---|---|
| Voyage-2 | 67.8% | 1024 | $1,000 | 100–180ms |
| OpenAI text-embedding-3-large | 64.6% | 3072 | $1,300 | 150–250ms |
| BGE-large-en-v1.5 (self-hosted) | 63.9% | 1024 | $5–20 (infra) | 5–15ms |
| Cohere embed-english-v3.0 | 63.1% | 1024 | $1,000 | 80–150ms |
| OpenAI text-embedding-3-small | 62.3% | 1536 | $200 | 100–200ms |
The practical takeaway: text-embedding-3-small is an excellent default for general-purpose RAG. It's cheap, fast, widely supported, and good enough. The gap between it and Voyage-2 on most production tasks is smaller than the gap between good chunking strategy and bad chunking strategy.
Cohere's embed models stand out in enterprise contexts for one reason: the input_type parameter. You tell the model whether it's embedding a search query or a document, and it optimizes accordingly:
```python
import cohere

co = cohere.Client('your-api-key')

# Embed documents at index time
doc_embeddings = co.embed(
    texts=["Our refund policy allows 14-day returns..."],
    model="embed-english-v3.0",
    input_type="search_document"  # <-- optimized for storage
).embeddings

# Embed the query at retrieval time
query_embedding = co.embed(
    texts=["What is your return policy?"],
    model="embed-english-v3.0",
    input_type="search_query"  # <-- optimized for retrieval
).embeddings[0]
```
This asymmetric embedding approach—different representations for documents and queries—consistently improves retrieval precision. Most teams don't bother with it. You should.
One firm rule: never mix embedding models in the same index. If you switch from text-embedding-3-small to Voyage-2, you must re-embed every document. Build your pipeline to make this straightforward from day one:
```sql
-- Track your embedding model version in your schema
ALTER TABLE documents ADD COLUMN embedding_model TEXT;
ALTER TABLE documents ADD COLUMN embedding_version INT DEFAULT 1;
```

```shell
# Re-embedding script
python scripts/reembed.py \
    --model voyage-2 \
    --batch-size 100 \
    --tenant all
```
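The core loop of such a script is simple. Here's a hypothetical sketch of its shape — `rows` is an in-memory stand-in for the documents table, and `embed_batch` is a stub standing in for the new provider's embedding API call:

```python
def embed_batch(texts: list[str]) -> list[list[float]]:
    # Stub: a real implementation calls the embedding API here
    return [[float(len(t))] for t in texts]

def reembed(rows: list[dict], new_model: str, batch_size: int = 100) -> int:
    """Re-embed every row whose embedding_model differs from new_model."""
    updated = 0
    for i in range(0, len(rows), batch_size):
        batch = [r for r in rows[i:i + batch_size] if r["embedding_model"] != new_model]
        if not batch:
            continue
        vectors = embed_batch([r["content"] for r in batch])
        for row, vec in zip(batch, vectors):
            row["embedding"] = vec
            row["embedding_model"] = new_model
            row["embedding_version"] += 1
            updated += 1
    return updated

rows = [
    {"content": "14-day refunds", "embedding": None,
     "embedding_model": "text-embedding-3-small", "embedding_version": 1},
    {"content": "Pricing tiers", "embedding": None,
     "embedding_model": "voyage-2", "embedding_version": 2},
]
print(reembed(rows, "voyage-2"))  # only the first row needs re-embedding
```

Because the filter keys on `embedding_model`, the script is safe to re-run after a partial failure: rows already migrated are skipped.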
Decision 3: Chunking is Your Hidden Lever
I've watched teams spend weeks switching embedding models when the real issue was chunking. This is the most undervalued variable in RAG quality.
Chunk size directly determines what the model can return. Chunk too small (< 128 tokens), and you lose context—the retrieved passage won't contain enough information to answer the question. Chunk too large (> 1024 tokens), and your similarity scores become noisy because the vector averages over too much content.
The right strategy for most document-heavy apps:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # characters per chunk (use .from_tiktoken_encoder for true token counts)
    chunk_overlap=64,   # overlap prevents context loss at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
```
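To see what the overlap is buying you, here's a deliberately simplified word-window chunker — whitespace words as a rough token proxy, for illustration only, not a replacement for the splitter above:

```python
def window_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    # Slide a fixed-size window, stepping by (chunk_size - overlap) so each
    # chunk repeats the last `overlap` words of its predecessor.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

text = " ".join(f"w{i}" for i in range(20))  # w0 w1 ... w19
for chunk in window_chunks(text):
    print(chunk)
# Boundary words (e.g. w6 and w7) appear in two consecutive chunks, so a
# sentence that straddles a chunk boundary is still retrievable whole.
```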
For structured documents (PDFs with headers, policies, API docs), add the parent document's title and section heading to each chunk before embedding. The embedding model can't infer context it doesn't see:
```python
def prepare_chunk(title: str, section: str, chunk_text: str) -> str:
    return f"{title}\n{section}\n\n{chunk_text}"
```
This simple prefix dramatically improves recall on section-specific queries.
Decision 4: Keeping Your Vectors Current
This is where production systems fall apart quietly. A 2025 post-mortem on RAG production failures identified stale embeddings as a primary failure mode. Your source documents change—new policies, updated pricing, deprecated features—but the vectors in your database remain static until you actively update them.
Build freshness into your data model from the start:
```sql
CREATE TABLE documents (
    id                 BIGSERIAL PRIMARY KEY,
    source_id          TEXT NOT NULL,       -- external document ID
    source_hash        TEXT NOT NULL,       -- MD5 of source content
    content            TEXT NOT NULL,
    embedding          vector(1536),
    embedding_model    TEXT NOT NULL,
    indexed_at         TIMESTAMPTZ DEFAULT NOW(),
    source_updated_at  TIMESTAMPTZ          -- when the source changed
);

CREATE INDEX ON documents (source_id, source_hash);
```
With source_hash, your indexing pipeline can skip unchanged documents and only re-embed what's actually changed:
```python
import hashlib

def should_reindex(conn, source_id: str, content: str) -> bool:
    content_hash = hashlib.md5(content.encode()).hexdigest()
    cur = conn.cursor()
    cur.execute(
        "SELECT source_hash FROM documents WHERE source_id = %s "
        "ORDER BY indexed_at DESC LIMIT 1",
        (source_id,)
    )
    row = cur.fetchone()
    return row is None or row[0] != content_hash
```
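Putting the hash check to work, the indexing loop becomes idempotent. A hypothetical in-memory sketch — `store` is a dict standing in for the documents table, and the embedding call is elided:

```python
import hashlib

def index_documents(sources: dict[str, str], store: dict[str, dict]) -> int:
    """Re-index a source only when its content hash changed. Returns count re-indexed."""
    reindexed = 0
    for source_id, content in sources.items():
        content_hash = hashlib.md5(content.encode()).hexdigest()
        row = store.get(source_id)
        if row is not None and row["source_hash"] == content_hash:
            continue  # unchanged: skip the embedding call entirely
        store[source_id] = {
            "source_hash": content_hash,
            "content": content,
            # embedding call would go here
        }
        reindexed += 1
    return reindexed

store: dict[str, dict] = {}
docs = {"policy-1": "Refund window is 14 days."}
print(index_documents(docs, store))  # first run: everything is new
print(index_documents(docs, store))  # second run: nothing changed, nothing re-embedded
```

Run this on a schedule matched to how often your sources actually change, and stale embeddings stop accumulating silently.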
The Architecture That Actually Works
Here's the minimal production RAG architecture that survives contact with real users:
```text
┌─────────────────────────────────────────────────┐
│              INDEXING PIPELINE                  │
│                                                 │
│  Source Docs → Hash Check → Chunker → Embedder  │
│                     ↓                           │
│      PostgreSQL + pgvector (+ metadata)         │
└─────────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────────┐
│               QUERY PIPELINE                    │
│                                                 │
│  User Query → Embed Query → ANN Search (top-k)  │
│             → Rerank (optional)                 │
│             → LLM + Context                     │
│             → Response                          │
└─────────────────────────────────────────────────┘
```
The reranking step—running retrieved chunks through a cross-encoder model like Cohere Rerank before passing them to the LLM—adds 100–200ms, so skip it under hard sub-100ms latency budgets. For knowledge-base applications where queries are async or batch, the precision gain is worth it.
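To show where the step slots into the pipeline, here's a toy stand-in: it scores by word overlap with the query, purely for illustration — a production system would call a real cross-encoder (e.g. Cohere's rerank endpoint) for the scoring:

```python
def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    # Toy relevance score: word overlap with the query. A production system
    # would replace this with a cross-encoder model's relevance scores.
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:top_n]

retrieved = [
    "Shipping takes 5 business days.",
    "Our refund window is 14 days.",
    "Refund requests go to support.",
]
print(rerank("what is the refund window", retrieved, top_n=2))
```

The pattern matters more than the scorer: retrieve a generous top-k with cheap ANN search, then let a more expensive, more precise model pick the few chunks the LLM actually sees.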
As of 2026, only 31% of AI initiatives reach full production according to ISG research. The gap isn't the LLM—it's the data infrastructure around it.
Architecture Checklist
Before you ship your vector-backed feature, verify:
- Vector store chosen deliberately — not just the first tutorial you read. If you already run Postgres and have <10M vectors, pgvector is the right answer for most teams.
- Embedding model locked in schema — `embedding_model` column exists; re-embedding path is documented and tested.
- Source hashing in place — your indexing pipeline is idempotent and skips unchanged documents.
- Chunk size validated — you've tested retrieval quality with actual user queries, not synthetic ones.
- Asymmetric embedding considered — if using Cohere or Voyage, are you using `input_type` correctly?
- Authorization inside retrieval — tenant/permission filters happen in the vector query, not as a post-filter on results.
- Staleness strategy defined — how often does your source data change? Match your re-indexing frequency to that cadence.
- Retrieval quality measured — you have a golden eval set, not just vibes-based QA.
- Index parameters tuned — HNSW `m` and `ef_construction` are set for your recall/latency tradeoff, not left at defaults.
- Monitoring in place — you're tracking per-query latency, retrieval quality, and embedding freshness in production.
Ask The Guild
This week's community prompt:
What's the most painful data modeling mistake you've made (or seen) in a RAG or vector search system—and what did it cost you to fix it? Chunking strategy, stale embeddings, wrong embedding model, authorization leaks, or something else entirely?
Share your war story in #architecture-patterns. The guild learns fastest from production failures honestly told.
Tom Hundley has been designing distributed systems since before AWS existed. He's spent the last four years helping vibe coders build AI systems that hold up under real load. Opinions are his own and are subject to revision by evidence.