Data Modeling for AI Apps: Embeddings and Vectors
Architecture Patterns — Part 7 of 30
The Demo That Lied to You
Here's a story I've watched play out a dozen times in the last two years.
A team builds a killer internal knowledge base. They chunk up the company's Confluence docs, run them through text-embedding-3-small, dump the vectors into a local ChromaDB instance, wire it to GPT-4o, and demo it to leadership on a Tuesday. It's magic. Questions get answered. Executives are delighted. Jira tickets are created.
By Thursday, the thing is in production. By the following Monday, the first complaints roll in.
"It gave me the wrong policy." "It said our refund window is 30 days. It's 14." "It quoted a pricing tier we discontinued two years ago."
The team scrambles. They tweak the prompt. They add more context. They bump k from 3 to 8. Nothing fundamentally changes because the problem isn't the LLM—it's the data model. Specifically, it's that they never made deliberate architectural decisions about how their vectors would be stored, versioned, retrieved, and kept current.
This is Part 7 of the Architecture Patterns series, and we're going deep on the decisions that separate a RAG demo from a RAG system. Not just the how—the why behind every fork in the road.
What You're Actually Building
Before picking a database or an embedding model, understand the data structure you're working with.
An embedding is a dense numerical vector—an array of floating-point numbers—that encodes semantic meaning. When you embed the sentence "Cancel my subscription," you get a vector that sits geometrically close to "I want to stop my plan" and far from "Tell me your pricing." The embedding model learned this geometry by training on massive text corpora.
Here's what that looks like in practice:
```python
from openai import OpenAI

client = OpenAI()

def embed_text(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Returns a list of 1536 floats
vec = embed_text("Cancel my subscription")
print(f"Dimensions: {len(vec)}")  # 1536
print(f"First 5 values: {vec[:5]}")
```
The vector database stores these embeddings alongside their source content and metadata, then serves approximate nearest-neighbor (ANN) queries to find semantically similar records at query time.
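"Nearest" here means nearest under a distance metric, almost always cosine distance for text embeddings. A minimal pure-Python sketch of the measurement itself — the toy 3-dimensional vectors below are illustrative stand-ins, not real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|): 1.0 means same direction, 0.0 means orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for real 1536-dimensional embeddings
cancel = [0.9, 0.1, 0.0]   # "Cancel my subscription"
stop   = [0.8, 0.2, 0.1]   # "I want to stop my plan"
price  = [0.1, 0.2, 0.9]   # "Tell me your pricing"

print(cosine_similarity(cancel, stop))   # high: semantically close
print(cosine_similarity(cancel, price))  # low: semantically distant
```

ANN indexes like HNSW approximate this search so you don't have to compare the query against every stored vector.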
Your first two architectural decisions are where those embeddings live and which model produces them.
Decision 1: Choosing Your Vector Store
This is where I see the most cargo-culting. Engineers default to whatever they saw in the last tutorial. In 2024, that was usually Pinecone or Chroma. In 2026, the conversation has gotten more nuanced—and the performance landscape has shifted dramatically.
The pgvector Revolution
For most of 2023 and 2024, the conventional wisdom was: use pgvector for prototypes under a million vectors, then graduate to Pinecone or Weaviate. That advice is now outdated.
As of early 2026, pgvector + pgvectorscale (Timescale's extension) benchmarks at 471 queries per second at 99% recall on 50 million vectors—11.4x better throughput than Qdrant and 28x lower p95 latency than Pinecone's storage-optimized tier, at roughly 75% lower infrastructure cost. The performance objection to PostgreSQL as a vector store is largely dead at moderate scale.
Here's how you set it up:
```sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column
CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    metadata    JSONB,
    tenant_id   UUID NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    embedding   vector(1536)
);

-- HNSW index for fast ANN search
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```
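The index-build parameters above fix the graph structure; recall at query time is governed separately by pgvector's `hnsw.ef_search` setting (default 40). Raising it trades latency for recall per session:

```sql
-- Per-session recall/latency knob for HNSW queries (pgvector default: 40)
SET hnsw.ef_search = 100;
```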
And querying with a SQL JOIN—something no dedicated vector database can do natively:
```python
import psycopg2
from pgvector.psycopg2 import register_vector

def semantic_search(conn, query_embedding, tenant_id, limit=5):
    register_vector(conn)  # register the vector type adapter for this connection
    cur = conn.cursor()
    cur.execute("""
        SELECT d.id, d.content, d.metadata,
               1 - (d.embedding <=> %s) AS similarity
        FROM documents d
        JOIN tenants t ON t.id = d.tenant_id
        WHERE d.tenant_id = %s
          AND t.subscription_active = true
        ORDER BY d.embedding <=> %s
        LIMIT %s
    """, (query_embedding, tenant_id, query_embedding, limit))
    return cur.fetchall()
```
That JOIN on tenants is handling your authorization layer inside the retrieval query. You can't do that in Pinecone without a second round-trip.
The Decision Framework
Here's how I think about it:
| Scenario | Recommendation |
|---|---|
| Already running Postgres, <10M vectors | pgvector — zero new infra, ACID, SQL JOINs |
| Multi-tenant app needing row-level security | pgvector — combine vector + relational permissions |
| Need zero ops, unpredictable scale, compliance (SOC 2/HIPAA) | Pinecone — managed, auto-scales |
| Hybrid search (semantic + BM25 keyword) required | Weaviate — native hybrid, proven at scale |
| >50M vectors, dedicated vector team | Pinecone or self-hosted Weaviate |
A mid-size SaaS team (around $15M ARR) recently built their internal document search on pgvector with 2 million documents. Result: $0 in new infrastructure, sub-100ms p95 latency at 95% recall, and their authorization rules expressed as ordinary SQL — not a custom metadata filter scheme bolted onto a vector API.
Decision 2: Picking Your Embedding Model
The embedding model determines the quality of your semantic space. Swap the model, and you must re-embed everything—so this decision deserves more than a default.
As of early 2026, benchmarks across 10 embedding models on real-world tasks show the landscape has matured:
| Model | MTEB Score | Dimensions | Cost / 1M tokens | Latency (100 tokens) |
|---|---|---|---|---|
| Voyage-2 | 67.8% | 1024 | $1,000 | 100–180ms |
| OpenAI text-embedding-3-large | 64.6% | 3072 | $1,300 | 150–250ms |
| BGE-large-en-v1.5 (self-hosted) | 63.9% | 1024 | $5–20 (infra) | 5–15ms |
| Cohere embed-english-v3.0 | 63.1% | 1024 | $1,000 | 80–150ms |
| OpenAI text-embedding-3-small | 62.3% | 1536 | $200 | 100–200ms |
The practical takeaway: text-embedding-3-small is an excellent default for general-purpose RAG. It's cheap, fast, widely supported, and good enough. The gap between it and Voyage-2 on most production tasks is smaller than the gap between good chunking strategy and bad chunking strategy.
Cohere's embed models stand out in enterprise contexts for one reason: the input_type parameter. You tell the model whether it's embedding a search query or a document, and it optimizes accordingly:
```python
import cohere

co = cohere.Client('your-api-key')

# Embed documents at index time
doc_embeddings = co.embed(
    texts=["Our refund policy allows 14-day returns..."],
    model="embed-english-v3.0",
    input_type="search_document"  # <-- optimized for storage
).embeddings

# Embed the query at retrieval time
query_embedding = co.embed(
    texts=["What is your return policy?"],
    model="embed-english-v3.0",
    input_type="search_query"  # <-- optimized for retrieval
).embeddings[0]
```
This asymmetric embedding approach—different representations for documents and queries—consistently improves retrieval precision. Most teams don't bother with it. You should.
One firm rule: never mix embedding models in the same index. If you switch from text-embedding-3-small to Voyage-2, you must re-embed every document. Build your pipeline to make this straightforward from day one:
```sql
-- Track your embedding model version in your schema
ALTER TABLE documents ADD COLUMN embedding_model TEXT;
ALTER TABLE documents ADD COLUMN embedding_version INT DEFAULT 1;
```

```shell
# Re-embedding script
python scripts/reembed.py \
    --model voyage-2 \
    --batch-size 100 \
    --tenant all
```
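The core loop of such a script is simple. Here's a hypothetical sketch of its shape — `rows` is an in-memory stand-in for the documents table, and `embed_batch` is a stub standing in for the new provider's embedding API call:

```python
def embed_batch(texts: list[str]) -> list[list[float]]:
    # Stub: a real implementation calls the embedding API here
    return [[float(len(t))] for t in texts]

def reembed(rows: list[dict], new_model: str, batch_size: int = 100) -> int:
    """Re-embed every row whose embedding_model differs from new_model."""
    updated = 0
    for i in range(0, len(rows), batch_size):
        batch = [r for r in rows[i:i + batch_size] if r["embedding_model"] != new_model]
        if not batch:
            continue
        vectors = embed_batch([r["content"] for r in batch])
        for row, vec in zip(batch, vectors):
            row["embedding"] = vec
            row["embedding_model"] = new_model
            row["embedding_version"] += 1
            updated += 1
    return updated

rows = [
    {"content": "14-day refunds", "embedding": None,
     "embedding_model": "text-embedding-3-small", "embedding_version": 1},
    {"content": "Pricing tiers", "embedding": None,
     "embedding_model": "voyage-2", "embedding_version": 2},
]
print(reembed(rows, "voyage-2"))  # only the first row needs re-embedding
```

Because the filter keys on `embedding_model`, the script is safe to re-run after a partial failure: rows already migrated are skipped.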
Decision 3: Chunking is Your Hidden Lever
I've watched teams spend weeks switching embedding models when the real issue was chunking. This is the most undervalued variable in RAG quality.
Chunk size directly determines what the model can return. Chunk too small (< 128 tokens), and you lose context—the retrieved passage won't contain enough information to answer the question. Chunk too large (> 1024 tokens), and your similarity scores become noisy because the vector averages over too much content.
The right strategy for most document-heavy apps:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # characters per chunk (use .from_tiktoken_encoder for true token counts)
    chunk_overlap=64,   # overlap prevents context loss at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
```
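To see what the overlap is buying you, here's a deliberately simplified word-window chunker — whitespace words as a rough token proxy, for illustration only, not a replacement for the splitter above:

```python
def window_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    # Slide a fixed-size window, stepping by (chunk_size - overlap) so each
    # chunk repeats the last `overlap` words of its predecessor.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

text = " ".join(f"w{i}" for i in range(20))  # w0 w1 ... w19
for chunk in window_chunks(text):
    print(chunk)
# Boundary words (e.g. w6 and w7) appear in two consecutive chunks, so a
# sentence that straddles a chunk boundary is still retrievable whole.
```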
For structured documents (PDFs with headers, policies, API docs), add the parent document's title and section heading to each chunk before embedding. The embedding model can't infer context it doesn't see:
```python
def prepare_chunk(title: str, section: str, chunk_text: str) -> str:
    return f"{title}\n{section}\n\n{chunk_text}"
```
This simple prefix dramatically improves recall on section-specific queries.
Decision 4: Keeping Your Vectors Current
This is where production systems fall apart quietly. A 2025 post-mortem on RAG production failures identified stale embeddings as a primary failure mode. Your source documents change—new policies, updated pricing, deprecated features—but the vectors in your database remain static until you actively update them.
Build freshness into your data model from the start:
```sql
CREATE TABLE documents (
    id                 BIGSERIAL PRIMARY KEY,
    source_id          TEXT NOT NULL,       -- external document ID
    source_hash        TEXT NOT NULL,       -- MD5 of source content
    content            TEXT NOT NULL,
    embedding          vector(1536),
    embedding_model    TEXT NOT NULL,
    indexed_at         TIMESTAMPTZ DEFAULT NOW(),
    source_updated_at  TIMESTAMPTZ          -- when the source changed
);

CREATE INDEX ON documents (source_id, source_hash);
```
With source_hash, your indexing pipeline can skip unchanged documents and only re-embed what's actually changed:
```python
import hashlib

def should_reindex(conn, source_id: str, content: str) -> bool:
    content_hash = hashlib.md5(content.encode()).hexdigest()
    cur = conn.cursor()
    cur.execute(
        "SELECT source_hash FROM documents WHERE source_id = %s "
        "ORDER BY indexed_at DESC LIMIT 1",
        (source_id,)
    )
    row = cur.fetchone()
    return row is None or row[0] != content_hash
```
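Putting the hash check to work, the indexing loop becomes idempotent. A hypothetical in-memory sketch — `store` is a dict standing in for the documents table, and the embedding call is elided:

```python
import hashlib

def index_documents(sources: dict[str, str], store: dict[str, dict]) -> int:
    """Re-index a source only when its content hash changed. Returns count re-indexed."""
    reindexed = 0
    for source_id, content in sources.items():
        content_hash = hashlib.md5(content.encode()).hexdigest()
        row = store.get(source_id)
        if row is not None and row["source_hash"] == content_hash:
            continue  # unchanged: skip the embedding call entirely
        store[source_id] = {
            "source_hash": content_hash,
            "content": content,
            # embedding call would go here
        }
        reindexed += 1
    return reindexed

store: dict[str, dict] = {}
docs = {"policy-1": "Refund window is 14 days."}
print(index_documents(docs, store))  # first run: everything is new
print(index_documents(docs, store))  # second run: nothing changed, nothing re-embedded
```

Run this on a schedule matched to how often your sources actually change, and stale embeddings stop accumulating silently.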
The Architecture That Actually Works
Here's the minimal production RAG architecture that survives contact with real users:
```text
┌─────────────────────────────────────────────────┐
│              INDEXING PIPELINE                  │
│                                                 │
│  Source Docs → Hash Check → Chunker → Embedder  │
│                     ↓                           │
│      PostgreSQL + pgvector (+ metadata)         │
└─────────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────────┐
│               QUERY PIPELINE                    │
│                                                 │
│  User Query → Embed Query → ANN Search (top-k)  │
│             → Rerank (optional)                 │
│             → LLM + Context                     │
│             → Response                          │
└─────────────────────────────────────────────────┘
```
The reranking step—running retrieved chunks through a cross-encoder model like Cohere Rerank before passing them to the LLM—adds 100–200ms, so skip it under hard sub-100ms latency budgets. For knowledge-base applications where queries are async or batch, the precision gain is worth it.
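To show where the step slots into the pipeline, here's a toy stand-in: it scores by word overlap with the query, purely for illustration — a production system would call a real cross-encoder (e.g. Cohere's rerank endpoint) for the scoring:

```python
def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    # Toy relevance score: word overlap with the query. A production system
    # would replace this with a cross-encoder model's relevance scores.
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:top_n]

retrieved = [
    "Shipping takes 5 business days.",
    "Our refund window is 14 days.",
    "Refund requests go to support.",
]
print(rerank("what is the refund window", retrieved, top_n=2))
```

The pattern matters more than the scorer: retrieve a generous top-k with cheap ANN search, then let a more expensive, more precise model pick the few chunks the LLM actually sees.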
As of 2026, only 31% of AI initiatives reach full production according to ISG research. The gap isn't the LLM—it's the data infrastructure around it.
Architecture Checklist
Before you ship your vector-backed feature, verify:
- Vector store chosen deliberately — not just the first tutorial you read. If you already run Postgres and have <10M vectors, pgvector is the right answer for most teams.
- Embedding model locked in schema — `embedding_model` column exists; re-embedding path is documented and tested.
- Source hashing in place — your indexing pipeline is idempotent and skips unchanged documents.
- Chunk size validated — you've tested retrieval quality with actual user queries, not synthetic ones.
- Asymmetric embedding considered — if using Cohere or Voyage, are you using `input_type` correctly?
- Authorization inside retrieval — tenant/permission filters happen in the vector query, not as a post-filter on results.
- Staleness strategy defined — how often does your source data change? Match your re-indexing frequency to that cadence.
- Retrieval quality measured — you have a golden eval set, not just vibes-based QA.
- Index parameters tuned — HNSW `m` and `ef_construction` are set for your recall/latency tradeoff, not left at defaults.
- Monitoring in place — you're tracking per-query latency, retrieval quality, and embedding freshness in production.
Ask The Guild
This week's community prompt:
What's the most painful data modeling mistake you've made (or seen) in a RAG or vector search system—and what did it cost you to fix it? Chunking strategy, stale embeddings, wrong embedding model, authorization leaks, or something else entirely?
Share your war story in #architecture-patterns. The guild learns fastest from production failures honestly told.
Tom Hundley has been designing distributed systems since before AWS existed. He's spent the last four years helping vibe coders build AI systems that hold up under real load. Opinions are his own and are subject to revision by evidence.