Database Indexing: The 80/20 of Performance
Architecture Patterns — Part 2 of 30
The Flash Sale That Cost $500,000
It was a Friday afternoon. The flash sale had been planned for weeks. Marketing had sent the email blast, the countdown timer was live, and traffic was pouring in. Then, thirty minutes into the event, checkout stopped working.
The culprit wasn't a DDoS attack or a deployment gone wrong. It was one database index — one that a developer had added the week before with the best of intentions.
The system: a 50-million-row orders table in production. The developer saw that a reporting query was slow and reasoned, logically, that indexing the status column would help. What they didn't account for was that status has only four or five distinct values across 50 million rows. Low cardinality. The query planner looked at the index, did the math, and kept choosing a full table scan anyway — the index was useless for reads.
But it wasn't useless for writes. With thousands of orders changing status every minute during the sale, the database was constantly maintaining an index that nobody was using. Write latency spiked. Deadlocks cascaded. The system crashed within 30 minutes, and the post-mortem estimated $500,000 in lost revenue.
This is the real problem with database indexing. It's not that engineers don't know what indexes are. It's that the decision framework for when to add one, what to index, and which type to use is genuinely non-obvious — and the cost of getting it wrong is asymmetric. A missing index is slow. A wrong index can be catastrophic.
After 25 years building and reviewing production systems, I can tell you: indexing is the single highest-leverage place in your database architecture. It's also the most commonly misunderstood. Let's fix that.
Why Indexes Matter More Than You Think
First, a quick mental model refresh. An index is a separate data structure — almost always a B-tree — that your database maintains alongside your table. It stores a sorted copy of the indexed column(s) plus a pointer to the full row. When a query filters on an indexed column, the engine does a fast binary search through the B-tree instead of scanning every row.
The math here is dramatic. A table with 10 million rows:
- Full table scan: touches ~10,000,000 rows
- B-tree index seek: touches ~23 rows (log₂(10,000,000) ≈ 23)
That's not a 10% improvement. That's a 400,000x reduction in rows examined. In practice, you see this play out as queries going from minutes to milliseconds.
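The arithmetic is easy to sanity-check in a few lines of Python:

```python
import math

# Back-of-the-envelope: rows examined by a full scan vs. a B-tree seek.
rows = 10_000_000
comparisons = math.log2(rows)   # depth of a binary search over sorted keys
speedup = rows / comparisons

print(f"full scan: {rows:,} rows; B-tree seek: ~{comparisons:.0f} comparisons "
      f"(~{speedup:,.0f}x fewer)")
```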
A real example: IBM's FileNet P8 financial application added an index on a single column and watched transaction response times drop from 7,000ms to 200ms — a 35x improvement — while CPU load fell from 60% to under 20%.
And from Sentry's engineering blog in January 2026: a pagination query on a large tickets table was taking 3.85 seconds on page 321. After adding a composite index and switching to cursor-based pagination, average query duration dropped to 13ms. That's a 300x improvement from a CREATE INDEX statement and a query rewrite.
But indexes aren't free. Every index adds:
- Storage overhead (a complete sorted copy of those columns)
- Write overhead (every INSERT, UPDATE, and DELETE must update each index on the table)
- Planning complexity (the query optimizer has more choices to evaluate)
This is the core tension in indexing architecture: you're trading write performance and storage for read performance. Getting that trade-off right is the job.
The Decision Framework
Here's the framework I use when evaluating whether to add an index. Work through these questions in order.
1. Is this query actually slow in production?
Always start with measurement. Never add indexes speculatively based on what you think will be slow. Your query planner is smarter than your intuition for most cases.
PostgreSQL — find your slowest queries:
-- Enable first (add to postgresql.conf: shared_preload_libraries = 'pg_stat_statements')
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Then find your top offenders
SELECT
query,
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
MySQL — enable slow query log:
-- In my.cnf or at runtime:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.2; -- Log anything over 200ms
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
-- Then parse it:
-- mysqldumpslow -s t -t 20 /var/log/mysql/slow.log
PostgreSQL — check for sequential scans:
SELECT
relname AS table,
seq_scan,
idx_scan,
n_live_tup AS rows
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
AND seq_scan > idx_scan
ORDER BY seq_tup_read DESC
LIMIT 20;
Any table that's being scanned sequentially more than it's using indexes, and has more than 10,000 rows, is a candidate for investigation.
2. What is the cardinality of the column?
This is where most developers go wrong, exactly like the flash sale story above.
Cardinality is the number of distinct values in a column relative to total rows. High cardinality means many distinct values (e.g., user_id, email). Low cardinality means few distinct values (e.g., status, is_active, country_code).
Rule of thumb:
- High cardinality columns → indexes are highly effective. The engine can skip most rows quickly.
- Low cardinality columns → indexes are often useless or harmful. If 40% of your rows have status = 'active', the planner knows a full scan is faster.
-- Check cardinality before indexing (PostgreSQL)
SELECT
attname AS column,
n_distinct,
correlation
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY n_distinct DESC;
An n_distinct value close to the row count means high cardinality. A small positive number (like 4 or 5) means low cardinality — think very carefully before indexing.
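To make the rule of thumb concrete, here's a small Python helper that interprets n_distinct the way PostgreSQL reports it (negative values are a fraction of the row count; -1 means every row is distinct). The cutoffs are illustrative, not canonical — the planner's own cost model makes the real decision:

```python
def distinct_values(n_distinct: float, row_count: int) -> float:
    """pg_stats.n_distinct is a positive count, or, when negative,
    a fraction of the row count (-1 means every row is distinct)."""
    return n_distinct if n_distinct >= 0 else -n_distinct * row_count

def cardinality_label(n_distinct: float, row_count: int) -> str:
    # Illustrative cutoffs only; tune for your workload.
    ratio = distinct_values(n_distinct, row_count) / row_count
    if ratio > 0.1:
        return "high"    # index freely
    if ratio > 0.001:
        return "medium"  # measure; a composite or partial index may fit
    return "low"         # partial index or skip

print(cardinality_label(5, 50_000_000))   # the flash-sale status column: 'low'
print(cardinality_label(-1, 50_000_000))  # a unique key like email: 'high'
```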
3. What is the read/write ratio for this table?
Indexes pay dividends on reads and cost overhead on writes. Before adding an index, ask yourself:
- How many SELECT queries per second hit this table?
- How many INSERT/UPDATE/DELETE operations happen?
An orders table at a high-volume retailer might take 5,000 status updates per minute during a sale. An audit_log table might be append-only with almost no reads. A products table might be read 10,000 times per second but updated only a few times per day.
For write-heavy tables, be conservative. Every index you add is a tax on every write. For read-heavy tables with infrequent writes (like a products catalog), you can be more aggressive.
4. Is this a single-column or multi-column access pattern?
If your queries almost always filter on multiple columns together, a single composite index beats multiple single-column indexes.
-- This query benefits from a composite index, not two separate ones:
SELECT id, total_amount
FROM orders
WHERE customer_id = 12345
AND status = 'pending'
ORDER BY created_at DESC;
-- Wrong approach: two separate indexes
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_status ON orders(status);
-- Right approach: composite index
-- Column order rule: equality conditions first, then range/ORDER BY
CREATE INDEX idx_orders_customer_status_date
ON orders(customer_id, status, created_at DESC);
The column order in a composite index matters enormously. Queries can use any leftmost prefix of the index: (customer_id, status, created_at) will be used for queries filtering on just customer_id, on customer_id + status, or on all three. But it won't help a query that filters only on status.
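The leftmost-prefix rule is mechanical enough to model in a few lines. This is a deliberately simplified sketch — real planners also apply a range predicate to the column just after the equality prefix — but it captures the matching behavior:

```python
def usable_prefix(index_cols: list[str], equality_cols: set[str]) -> list[str]:
    """Walk the index columns left to right; stop at the first column the
    query doesn't constrain with an equality predicate."""
    prefix = []
    for col in index_cols:
        if col not in equality_cols:
            break
        prefix.append(col)
    return prefix

idx = ["customer_id", "status", "created_at"]
print(usable_prefix(idx, {"customer_id", "status"}))  # ['customer_id', 'status']
print(usable_prefix(idx, {"status"}))                 # [] — no leftmost match
```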
The Index Types You Actually Need to Know
Most tutorials cover B-tree and stop there. Here's a practical map:
B-Tree (the default)
Use for: equality (=), range (>, <, BETWEEN), ORDER BY, LIKE 'prefix%'
This covers 90% of your use cases. If you're not sure, use B-tree.
Hash Index
Use for: equality-only lookups (=)
Do not use for: range queries, sorting
-- PostgreSQL: good for equality-heavy lookup tables
CREATE INDEX idx_users_session_token
ON users USING HASH (session_token);
Hash indexes are slightly faster than B-tree for pure equality lookups, but they can't support range queries or ORDER BY. In most applications B-tree is the right default because it's more flexible.
Partial Index
Use for: queries that always filter on a specific condition
-- Only index the rows you actually query
-- Great for "active" or "pending" rows in a large table
CREATE INDEX idx_orders_pending_created
ON orders(created_at)
WHERE status = 'pending';
This is one of the most underused features in PostgreSQL. If 95% of your reads only touch WHERE status = 'pending' rows, there's no reason to index all 50 million historical records. A partial index on only the pending rows will be smaller, faster to maintain on writes, and more effective.
Covering Index
Use for: queries where you want index-only scans (avoiding heap fetches entirely)
-- PostgreSQL INCLUDE syntax: add non-search columns to the index
-- The engine can satisfy the entire query without touching the table
CREATE INDEX idx_orders_covering
ON orders(customer_id, status)
INCLUDE (id, total_amount, created_at);
-- Now this query never touches the orders table itself:
SELECT id, total_amount, created_at
FROM orders
WHERE customer_id = 12345 AND status = 'pending';
Full-Text Index
Use for: text search on string columns
-- PostgreSQL: GIN index for full-text search
CREATE INDEX idx_products_search
ON products USING GIN (to_tsvector('english', name || ' ' || description));
-- Query:
SELECT * FROM products
WHERE to_tsvector('english', name || ' ' || description)
@@ to_tsquery('english', 'wireless & headphones');
Common Traps (And How to Avoid Them)
Trap 1: Indexing a function-wrapped column
-- This query DOES NOT use an index on email:
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- Fix: Create a functional index
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Or better: normalize on write (store emails lowercase)
Any time you wrap an indexed column in a function — LOWER(), DATE(), YEAR(), COALESCE() — the index is bypassed. The query planner can only use an index if the WHERE clause exactly matches the indexed expression.
Trap 2: Implicit type coercion
// Node.js / TypeScript: user_id is an integer in the DB,
// but you're accidentally passing a string
const userId = req.params.id; // This is the string '12345'
db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
// PostgreSQL will coerce it, but the index may not be used,
// depending on your column type

// Fix: cast explicitly before querying
const userIdNum = parseInt(req.params.id, 10);
db.query('SELECT * FROM orders WHERE user_id = $1', [userIdNum]);
Trap 3: Low-selectivity indexes (the flash sale story)
As established: do not index columns with few distinct values unless you're using a partial index to target a specific subset.
Trap 4: Index bloat from dead tuples
In PostgreSQL, UPDATE and DELETE don't remove old index entries immediately — they leave dead tuples. Over time, your indexes grow and slow down.
-- Check index bloat
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild bloated indexes without downtime
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;
Trap 5: Adding indexes to a live table without CONCURRENTLY
-- WRONG: This locks the table and blocks all writes
CREATE INDEX idx_orders_status ON orders(status);
-- RIGHT: Non-blocking, takes longer but safe for production
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);
Never run a plain CREATE INDEX on a large production table. The CONCURRENTLY option builds the index without holding a write lock, but it takes longer and can't be run inside a transaction. And if a concurrent build fails, it leaves behind an INVALID index — check pg_index.indisvalid and drop it before retrying.
Finding Unused Indexes
Index bloat isn't just from dead tuples. Over time, applications evolve, queries change, and some indexes become completely unused. They still pay the write cost and consume storage.
-- PostgreSQL: find indexes that have never been used
-- (reset on every pg_stat_reset() or server restart)
-- Note: pg_stat_user_indexes exposes relname/indexrelname,
-- not tablename/indexname
SELECT
  schemaname,
  relname AS table_name,
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan AS scans_since_reset
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE 'pg_%'
ORDER BY pg_relation_size(indexrelid) DESC;
If an index shows zero scans after a few weeks of normal traffic, it's dead weight. Drop it — but verify first that it's not needed for a rare but critical query (like a monthly billing job).
A Python Script for Index Monitoring
Here's a utility I've used on several production systems to continuously monitor index health:
import psycopg2
import json
from datetime import datetime
def audit_indexes(dsn: str) -> dict:
    """Audit PostgreSQL indexes for usage, bloat, and missing index candidates."""
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()

    # Find tables with high sequential scans (missing index candidates)
    cur.execute("""
        SELECT
            relname AS table_name,
            seq_scan,
            idx_scan,
            n_live_tup AS live_rows,
            round(seq_scan::numeric / NULLIF(seq_scan + idx_scan, 0) * 100, 1)
                AS pct_seq_scans
        FROM pg_stat_user_tables
        WHERE n_live_tup > 50000
          AND seq_scan > 100
        ORDER BY seq_tup_read DESC
        LIMIT 20;
    """)
    seq_scan_tables = cur.fetchall()

    # Find unused indexes (pg_stat_user_indexes exposes relname and
    # indexrelname, not tablename/indexname)
    cur.execute("""
        SELECT
            relname,
            indexrelname,
            pg_size_pretty(pg_relation_size(indexrelid)) AS size
        FROM pg_stat_user_indexes
        WHERE idx_scan = 0
          AND indexrelname NOT LIKE 'pg_%'
        ORDER BY pg_relation_size(indexrelid) DESC
        LIMIT 20;
    """)
    unused_indexes = cur.fetchall()

    cur.close()
    conn.close()

    return {
        "timestamp": datetime.utcnow().isoformat(),
        "high_seq_scan_tables": [
            {"table": r[0], "seq_scans": r[1], "idx_scans": r[2],
             "live_rows": r[3], "pct_seq": float(r[4] or 0)}
            for r in seq_scan_tables
        ],
        "unused_indexes": [
            {"table": r[0], "index": r[1], "size": r[2]}
            for r in unused_indexes
        ],
    }


if __name__ == "__main__":
    result = audit_indexes("postgresql://user:pass@localhost:5432/mydb")
    print(json.dumps(result, indent=2))
Run this weekly (or pipe it into your observability stack) and you'll catch index drift before it becomes a production incident.
The Architecture Decision: When NOT to Index
All of the above should make one thing clear: the question isn't "should I add an index?" The question is "do the read benefits outweigh the write costs for this specific workload?"
Here's a decision tree for production systems:
Query is slow?
└─ YES: Check EXPLAIN ANALYZE
├─ Seq Scan on small table (<100k rows)? → Skip, index won't help
├─ Seq Scan on large table?
│ ├─ Column has high cardinality? → Add B-tree index
│ ├─ Column has low cardinality?
│ │ ├─ Query targets small % of rows? → Add partial index
│ │ └─ Query scans most rows? → Rethink the query
│ └─ Query always uses multiple columns? → Add composite index
└─ Already using an index but slow?
├─ Index bloat? → REINDEX CONCURRENTLY
└─ Wrong index type? → Evaluate GIN, covering, functional index
And critically: after adding any index, run EXPLAIN ANALYZE again to confirm the planner actually uses it. The query planner sometimes has surprising opinions.
-- Always verify your index is being used
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, total_amount
FROM orders
WHERE customer_id = 12345 AND status = 'pending'
ORDER BY created_at DESC
LIMIT 50;
Look for Index Scan or Index Only Scan in the output. If you see Seq Scan, the planner chose not to use the index — which means either the index is wrong, the data distribution doesn't warrant it, or the statistics are stale (ANALYZE orders; to refresh them).
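This check is easy to automate. A naive sketch in Python — in practice you'd feed it the output of EXPLAIN fetched through psycopg2, and you'd parse EXPLAIN (FORMAT JSON) for anything serious:

```python
def plan_uses_index(plan_text: str) -> bool:
    """Naive string check over EXPLAIN output for index access nodes.
    Good enough for a CI smoke test on a known query."""
    index_nodes = ("Index Scan", "Index Only Scan", "Bitmap Index Scan")
    return any(node in plan_text for node in index_nodes)

good_plan = "Index Only Scan using idx_orders_covering on orders ..."
bad_plan = "Seq Scan on orders  (cost=0.00..180000.00 rows=50 width=24)"
print(plan_uses_index(good_plan), plan_uses_index(bad_plan))  # True False
```

Wire an assertion like this into your test suite for the handful of queries that matter most, and a regression to a sequential scan fails the build instead of paging you at 2 a.m.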
The 80/20 Summary
Eighty percent of your database performance gains will come from twenty percent of your indexing decisions. Here's the concentrated version:
- Measure first. Use pg_stat_statements and slow query logs before touching anything.
- Check cardinality. High cardinality → index freely. Low cardinality → use partial indexes or skip.
- Composite over multiple singles. For multi-column access patterns, one composite index beats two singles.
- Mind the write tax. Write-heavy tables demand conservative indexing. Read-heavy tables can sustain more.
- Always use CONCURRENTLY when adding indexes to live production tables.
- Drop unused indexes. They cost write overhead and storage with zero benefit.
- Functional indexes for function-wrapped columns. LOWER(email) in your WHERE clause needs LOWER(email) in your index.
- Verify with EXPLAIN ANALYZE. The query planner is the final arbiter.
Checklist: Before You Ship That Index
- Verified this query is actually slow using pg_stat_statements or slow query logs
- Checked column cardinality (pg_stats or SELECT COUNT(DISTINCT col) / COUNT(*))
- Evaluated read/write ratio for this table
- Chose the right index type for the access pattern
- Ordered composite index columns correctly (equality first, range/sort last)
- Used CREATE INDEX CONCURRENTLY for production tables
- Ran EXPLAIN (ANALYZE, BUFFERS) and confirmed the planner uses the new index
- Checked for duplicate or redundant indexes after adding
- Added index monitoring to catch future bloat or drift
- Documented the index with a comment explaining why it exists
-- Document your indexes — future-you will thank present-you
COMMENT ON INDEX idx_orders_customer_status_date IS
'Supports customer order history queries in OrderService.getCustomerOrders().
Added 2026-03-17 after prod perf regression. See incident #1042.';
Ask The Guild
This week's community prompt:
What's the worst index mistake you've seen or made in production? A missing index that caused an outage? An over-indexed write table that cratered throughput? Or maybe an index that was doing the wrong thing for years before anyone noticed?
Share your war story in the Guild forum under #architecture-patterns. The best ones become future articles — with your name on them.
The guild learns fastest from what actually breaks.
Tom Hundley is a software architect with 25 years of experience building and breaking production systems. He currently teaches architecture patterns to the next generation of vibe coders at the AI Coding Guild.