Guide · Engineers

Vector search for stock chart patterns — pgvector at production scale.

Chart Library indexes ~25 million chart-pattern embeddings (256 dimensions, self-supervised) in pgvector on TimescaleDB and serves nearest-neighbor queries at ~100ms median latency. This guide covers the schema, IVFFlat tuning, embedding pipeline, and the operational gotchas we’ve hit running it at scale.

If you’re evaluating pgvector for a finance vector search workload, this is the production playbook.

Why pgvector instead of a dedicated vector DB?

Three reasons we picked pgvector over Pinecone / Weaviate / Milvus / Qdrant:

  1. One database for everything. Embeddings, metadata (sector, vol regime, news features), forward returns, user accounts — all in one Postgres. No data sync between systems. Joins work.
  2. TimescaleDB hypertables for the bar data. Minute bars across 19K symbols × 10 years run well past a billion rows. TimescaleDB compression keeps the disk footprint manageable, and the same database serves both workloads.
  3. Operational simplicity. One backup story, one connection pool, one auth model. No vendor risk.

The tradeoff: pgvector’s ANN performance lags dedicated vector DBs by 2-5x at scale. For our 25M corpus, IVFFlat at lists=200 gets us ~100ms median queries — fine. At 100M+ embeddings we’d revisit.

Schema

-- Enable extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Embedding table — keyed on (symbol, date, scale)
CREATE TABLE bar_embeddings_v5 (
    symbol      TEXT      NOT NULL,
    date        DATE      NOT NULL,
    scale       TEXT      NOT NULL,  -- '5m', '15m', '30m', '1h', '1d'
    embedding   vector(256) NOT NULL,
    PRIMARY KEY (symbol, date, scale)
);

-- Approximate nearest neighbor index
-- lists=200 chosen via empirical tuning; sqrt(N) heuristic gives ~5000
-- but 200 saturates query throughput for our workload.
CREATE INDEX bar_embeddings_v5_ivfflat_idx
    ON bar_embeddings_v5
    USING ivfflat (embedding vector_l2_ops)
    WITH (lists = 200);

-- Pre-computed forward returns cache (also one of the load-bearing tables)
CREATE TABLE forward_returns_cache (
    symbol  TEXT NOT NULL,
    date    DATE NOT NULL,
    ret_1d  DOUBLE PRECISION,
    ret_5d  DOUBLE PRECISION,
    ret_10d DOUBLE PRECISION,
    PRIMARY KEY (symbol, date)
);
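
Because everything lives in one Postgres, a cohort query can join retrieval results straight to forward returns with no cross-system sync. A minimal sketch against the two tables above (the '1h' scale filter and LIMIT 300 are illustrative; they match the production defaults discussed later):

-- Nearest neighbors plus their forward returns in one round trip
SELECT e.symbol, e.date, r.ret_1d, r.ret_5d, r.ret_10d
FROM bar_embeddings_v5 e
JOIN forward_returns_cache r USING (symbol, date)
WHERE e.scale = '1h'
ORDER BY e.embedding <-> $1::vector
LIMIT 300;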

The IVFFlat tuning that matters

The two parameters: lists at index build time, probes at query time. Both control the speed/recall tradeoff.

-- Set probes per session
SET ivfflat.probes = 10;

-- Or scope it to a single query with a transaction-local setting
BEGIN;
SET LOCAL ivfflat.probes = 10;
SELECT ... FROM bar_embeddings_v5
ORDER BY embedding <-> $1::vector
LIMIT 300;
COMMIT;

The shape of the recall/latency curve we measured:

  • lists=200, probes=1: 30ms median, recall@300 ≈ 75%
  • lists=200, probes=10: 100ms median, recall@300 ≈ 94% — production setting
  • lists=200, probes=50: 250ms median, recall@300 ≈ 99%
  • lists=4096, probes=20: 400ms median, recall@300 ≈ 98% (over-indexed for our workload)

We use lists=200, probes=10 for V5. The 6% recall loss vs exhaustive search is statistically invisible in cohort outcomes — the missing analogs were marginal in the embedding space and their forward returns are within distribution of the retrieved 300.
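
One way to reproduce the recall numbers above is to compare IVFFlat results against an exact scan. A minimal sketch of that measurement; the anchor sampling and session settings are assumptions, not our exact harness:

-- Exact baseline: disable index scans to force an exhaustive (slow) scan
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT symbol, date FROM bar_embeddings_v5
WHERE scale = '1h'
ORDER BY embedding <-> $1::vector
LIMIT 300;
COMMIT;

-- Approximate: the same query through IVFFlat at probes=10
SET ivfflat.probes = 10;
SELECT symbol, date FROM bar_embeddings_v5
WHERE scale = '1h'
ORDER BY embedding <-> $1::vector
LIMIT 300;

-- recall@300 = |exact ∩ approximate| / 300, averaged over sampled anchors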

Same-symbol exclusion at retrieval time

The non-obvious gotcha: if you retrieve nearest neighbors of a query anchor, several adjacent days for the same symbol look similar (chart shape barely changes day-to-day). Including those adjacent days produces a meaninglessly tight cohort.

-- Wrong: returns adjacent same-symbol days as "analogs"
SELECT symbol, date FROM bar_embeddings_v5
WHERE scale = '1h' AND NOT (symbol = 'NVDA' AND date = '2024-08-05')
ORDER BY embedding <-> $1::vector LIMIT 300;

-- Right: exclude same-symbol within ±10 calendar days
SELECT symbol, date FROM bar_embeddings_v5
WHERE scale = '1h'
  AND NOT (symbol = 'NVDA'
           AND date BETWEEN '2024-08-05'::date - INTERVAL '10 days'
                        AND '2024-08-05'::date + INTERVAL '10 days')
ORDER BY embedding <-> $1::vector LIMIT 300;

See symbol-disjoint evaluation for the deeper discipline this is part of.

Embedding pipeline

The embedding model is the load-bearing piece. We use 256-dim self-supervised embeddings trained on minute-bar data with masked-autoencoder + contrastive losses. The pipeline:

# Nightly job (Tue-Sat after market close)
def compute_embeddings_for_date(date_str):
    """Compute V5 embeddings for all symbols on a given trading day."""
    universe = get_active_symbols(date_str)  # ~19,000 symbols
    for symbol in universe:
        for scale in ['5m', '15m', '30m', '1h']:
            window = load_bars(symbol, date_str, scale)
            if len(window) < REQUIRED_BARS[scale]:
                continue
            embedding = model.encode(window)  # 256-dim vector
            upsert_embedding(symbol, date_str, scale, embedding)

    # Refresh IVFFlat index lists periodically (not after every nightly)
    # Lists need rebalancing as new vectors arrive but the cost is high.
    # We REINDEX monthly via a maintenance window.

Tradeoffs to know about: IVFFlat lists are computed at index build time. As you insert new vectors, the cluster centroids don’t update — query quality slowly drifts. We REINDEX monthly to keep recall stable. HNSW (the alternative pgvector index type) handles this better but uses more memory.
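
The monthly rebuild itself is a single statement. One way to run it without blocking reads during the maintenance window (Postgres 12+):

-- Recompute IVFFlat cluster centroids against the current data
REINDEX INDEX CONCURRENTLY bar_embeddings_v5_ivfflat_idx;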

Operational gotchas

1. IVFFlat won’t use the index if WHERE clauses are too selective

If you filter to a tiny subset (e.g., one symbol, one date) before the ORDER BY, the planner often skips the IVFFlat index and does a sequential scan. Fix: rewrite so the index scan runs first (LIMIT inside a CTE) and filter post-hoc, as sketched below, or steer the planner away from the sequential scan (e.g., SET enable_seqscan = off for the session).
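
A minimal sketch of the CTE rewrite. The over-fetch factor (1000 here) is an assumption to tune: post-filtering can return fewer than 300 rows if most neighbors fail the filter.

WITH ann AS MATERIALIZED (
    -- ANN pass first; MATERIALIZED keeps the outer filter from being
    -- pushed down into the index scan
    SELECT symbol, date, embedding <-> $1::vector AS dist
    FROM bar_embeddings_v5
    WHERE scale = '1h'
    ORDER BY embedding <-> $1::vector
    LIMIT 1000
)
SELECT symbol, date
FROM ann
WHERE symbol = 'NVDA'   -- selective filter applied post-hoc
ORDER BY dist
LIMIT 300;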

2. Connection pool exhaustion

Each ANN query holds a connection for ~100ms, so a worker pool of 10 saturates at ~100 QPS. If your application serves agents that query cohorts in parallel, size the pool generously (we run 50 max connections) and put a connection pooler in front of Postgres if the agent traffic is bursty.
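
A quick way to check whether the pool is the bottleneck during a burst is to watch backend counts; this is a standard Postgres catalog query, not pgvector-specific:

-- Connections by state for the current database
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state;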

3. Embedding update without index rebuild = silently degraded recall

If you update the embedding values in-place (e.g., after a model retrain), the IVFFlat index will silently misroute queries because the cluster centroids no longer match the actual embeddings. Production-safe pattern: write new embeddings to a staging column, build a new index on it, then swap atomically with renames, as sketched below.
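
A hedged sketch of that swap; the column and index names are illustrative, not our exact migration, and the backfill step is elided:

-- 1. Stage the retrained embeddings alongside the live column
ALTER TABLE bar_embeddings_v5 ADD COLUMN embedding_new vector(256);
-- ... backfill embedding_new from the retrained model ...

-- 2. Build the new index without blocking traffic
CREATE INDEX CONCURRENTLY bar_embeddings_v5_ivfflat_new_idx
    ON bar_embeddings_v5
    USING ivfflat (embedding_new vector_l2_ops)
    WITH (lists = 200);

-- 3. Atomic switch: renames are metadata-only and fast
BEGIN;
ALTER TABLE bar_embeddings_v5 RENAME COLUMN embedding TO embedding_old;
ALTER TABLE bar_embeddings_v5 RENAME COLUMN embedding_new TO embedding;
ALTER INDEX bar_embeddings_v5_ivfflat_idx RENAME TO bar_embeddings_v5_ivfflat_old_idx;
ALTER INDEX bar_embeddings_v5_ivfflat_new_idx RENAME TO bar_embeddings_v5_ivfflat_idx;
COMMIT;

-- 4. Drop the old column and index once the new path is verified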

Frequently asked questions

Should I use IVFFlat or HNSW?
HNSW gives better recall at a higher memory cost and doesn't need rebalancing as data changes, but it's slower to build. For our 25M corpus the IVFFlat tradeoff (faster build, monthly REINDEX) is fine; for highly mutable corpora HNSW is better.
What dimension should the embedding be?
256 is what we settled on after experiments at 128, 256, 384. 128 lost too much shape information; 384 added query latency without recall improvement. The right answer depends on your embedding model's expressivity.
How does this scale to 100M+ embeddings?
We haven't hit that scale yet. At ~100M we'd consider a dedicated vector DB (Milvus, Weaviate) or partition the pgvector index by date range. The Chart Library production index is at 25M and likely sustainable at 50M with current hardware.
Can I use cosine distance instead of L2?
Yes — pgvector supports `<->` (L2), `<=>` (cosine distance), and `<#>` (negative inner product). For shape-similarity embeddings we use L2. Cosine is right when you want to ignore magnitude (e.g., sentence embeddings); for chart shape, magnitude carries information.
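
For completeness, a sketch of the operator difference. One caveat worth knowing: an IVFFlat index only accelerates the operator family it was built with, so the vector_l2_ops index above serves `<->` queries, and a cosine query would need its own vector_cosine_ops index to avoid a sequential scan.

-- L2 distance (what the production index is built for)
SELECT symbol, date FROM bar_embeddings_v5
ORDER BY embedding <-> $1::vector LIMIT 300;

-- Cosine distance (needs an index built with vector_cosine_ops)
SELECT symbol, date FROM bar_embeddings_v5
ORDER BY embedding <=> $1::vector LIMIT 300;
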
Try it

Skip the infrastructure — query our index directly.

The cohort_analyze API has a free Sandbox tier (200 calls/day): production-tested vector retrieval, no infra to manage.
