k-NN regression for stock returns — what works, what doesn't, and the calibration that makes it honest.

k-Nearest Neighbors (k-NN) regression is one of the simplest and arguably most robust approaches to stock return forecasting. The idea: find the K most similar historical situations, look at what came next, and return the empirical distribution of outcomes. It’s non-parametric, distribution-free, and resistant to many of the failure modes that plague gradient-boosted ensembles in finance.

But k-NN regression done naively is also wrong in three specific ways most papers don’t address. This guide covers what works, what fails, and the production-grade discipline that turns k-NN into a useful forecast primitive.

Why k-NN regression is a good fit for stock returns

Three structural reasons:

  1. Stocks are non-stationary. A model fit on 2010-2020 data has the wrong priors for the 2024 vol regime. k-NN refits implicitly with every query: no stale parameters.
  2. Outcomes are heavy-tailed. A Gaussian likelihood underestimates tail risk; k-NN’s empirical distribution preserves whatever tails exist in the cohort.
  3. Predictions need to be auditable. An ensemble model says “NVDA: +2.3%” with no way to inspect why. k-NN says “here are the 300 historical analogs my answer is based on” — fully introspectable.

The naive approach (and why it fails)

# What most tutorials show
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X = engineered_features(historical_bars)  # e.g., RSI, MACD, vol, etc.
y = forward_returns_5d(historical_bars)

# Random row split: this is the leak dissected below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

knn = KNeighborsRegressor(n_neighbors=300)
knn.fit(X_train, y_train)

prediction = knn.predict(X_test)  # one point estimate per query

This fails for three reasons that are subtle enough to slip past most papers:

1. Hand-engineered features waste the signal

RSI, MACD, vol: these are one-dimensional summaries of a two-dimensional chart. Two charts with identical RSI can have completely different shapes. The embedding needs to capture shape, not just summary statistics. Self-supervised learning gets you a useful embedding; engineered features mostly don’t.
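
To make the shape-versus-summary point concrete, here is a toy sketch (NumPy only, nothing from the production pipeline): two windows with identical net return but near-opposite shapes. A scalar summary cannot separate them; even a naive z-normalized shape vector can.

import numpy as np

def shape_vector(prices: np.ndarray) -> np.ndarray:
    """Z-normalize a log-price window so only its shape remains."""
    x = np.log(prices)
    return (x - x.mean()) / (x.std() + 1e-9)

t = np.linspace(0, 1, 64)
v_down_up = 100 * (1 + 0.05 * (np.abs(t - 0.5) * 2 - 1))  # falls 5%, recovers
v_up_down = 100 * (1 - 0.05 * (np.abs(t - 0.5) * 2 - 1))  # rallies 5%, fades

# The scalar summary (net return) is identical for both windows...
print(v_down_up[-1] / v_down_up[0] - 1, v_up_down[-1] / v_up_down[0] - 1)  # both 0.0

# ...but the shapes are near-opposites
s1, s2 = shape_vector(v_down_up), shape_vector(v_up_down)
print(float(s1 @ s2) / len(s1))  # ~ -1.0: anti-correlated shapes

The production embedding is learned rather than hand-built, but the requirement is the same: distance must be computed over shape, not over scalar summaries.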

2. The split is leaky

train_test_split(X, y, test_size=0.2) randomly assigns rows. In time-series finance, that leaks future information into the training set. Worse, it lets the same symbol on adjacent days appear in both train and test: effectively the same chart counted twice. See symbol-disjoint evaluation for the right way.
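
A minimal sketch of a cleaner split, assuming a DataFrame with symbol and a datetime date column (the function name, embargo length, and split fractions are illustrative): hold out whole symbols and a later time period, with an embargo around the cutoff so overlapping windows cannot straddle the split.

import pandas as pd

def symbol_disjoint_time_split(df: pd.DataFrame, cutoff: str,
                               test_frac: float = 0.2, embargo_days: int = 10,
                               seed: int = 0):
    """Hold out whole symbols AND a later time period, with an embargo
    around the cutoff so overlapping windows can't straddle the split."""
    symbols = pd.Series(df['symbol'].unique()).sample(frac=1.0, random_state=seed)
    test_symbols = set(symbols.iloc[:max(1, int(len(symbols) * test_frac))])

    cutoff_ts = pd.Timestamp(cutoff)
    embargo = pd.Timedelta(days=embargo_days)

    # Train: non-test symbols, strictly before the embargoed cutoff
    train = df[~df['symbol'].isin(test_symbols) & (df['date'] < cutoff_ts - embargo)]
    # Test: held-out symbols, strictly after the cutoff
    test = df[df['symbol'].isin(test_symbols) & (df['date'] >= cutoff_ts)]
    return train, test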

3. The point prediction is misleading

knn.predict(X_test) returns the mean of the K neighbors’ outcomes, a point estimate that hides the dispersion. The actually useful answer is the full distribution: p10, p90, win rate, trimmed mean. See calibrated stock forecasting for why.
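
Even staying inside scikit-learn you can sidestep the point estimate: kneighbors() exposes the neighbor indices, so the empirical distribution is one indexing step away. A sketch building on the naive block above (assumes y_train is array-like):

import numpy as np

# Pull the neighbor indices for one query and inspect the full
# empirical distribution of their outcomes, not just the mean
dist, idx = knn.kneighbors(X_test[:1], n_neighbors=300)
neighbor_returns = np.asarray(y_train)[idx[0]]  # outcomes of the 300 analogs

p10, p50, p90 = np.percentile(neighbor_returns, [10, 50, 90])
win_rate = float((neighbor_returns > 0).mean())
print(f"p10={p10:.4f}  median={p50:.4f}  p90={p90:.4f}  win_rate={win_rate:.2f}")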

The production version

The architecture that actually works in production:

# 1. Compute self-supervised embeddings on minute-bar data
# (We trained 256-dim V5 embeddings; ~25M chart patterns indexed)
embedding = model.encode(chart_pattern)

-- 2. Store in pgvector with IVFFlat for fast retrieval
CREATE TABLE bar_embeddings_v5 (
    symbol TEXT, date DATE, scale TEXT,
    embedding vector(256)
);
CREATE INDEX ON bar_embeddings_v5 USING ivfflat (embedding vector_l2_ops) WITH (lists = 200);

-- 3. For a query anchor, retrieve K nearest neighbors with discipline
SELECT symbol, date, embedding <-> %s::vector AS distance
FROM bar_embeddings_v5
WHERE scale = '1h'
  AND NOT (symbol = %s AND date BETWEEN %s::date - INTERVAL '10 days'
                                     AND %s::date + INTERVAL '10 days')
ORDER BY embedding <-> %s::vector
LIMIT 300;

-- 4. Look up each neighbor's realized forward returns
-- (pre-computed cache, indexed by (symbol, date))
SELECT symbol, date, ret_1d, ret_5d, ret_10d
FROM forward_returns_cache
WHERE (symbol, date) IN (...);

# 5. Return the empirical distribution, not the mean
returns_5d = [r.ret_5d for r in cohort_returns]
distribution = {
    'mean': mean(returns_5d),
    'median': median(returns_5d),
    'p10': percentile(returns_5d, 10),
    'p90': percentile(returns_5d, 90),
    'win_rate': sum(r > 0 for r in returns_5d) / len(returns_5d),
    'trimmed_mean': winsorized_mean(returns_5d, p=0.05),
}

# 6. Apply conformal correction to the bands
# (q_80 is an offset precomputed on a calibration set, chosen so the
#  widened [p10, p90] band achieves ~80% empirical coverage)
distribution['p10_calibrated'] = distribution['p10'] - q_80
distribution['p90_calibrated'] = distribution['p90'] + q_80

This is the architecture Chart Library’s cohort intelligence uses. The k-NN retrieval is straightforward; the discipline is in the embedding, the cohort hygiene, the eval splits, and the calibration layer.
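
For completeness, here is one way the conformal offset q_80 from step 6 can be computed; a split-conformal (CQR-style) sketch with illustrative names, not the production code. On a held-out calibration set, score how far each realized return falls outside the raw [p10, p90] band, then take a finite-sample-adjusted quantile of those scores.

import numpy as np

def conformal_offset(p10_raw, p90_raw, realized, coverage=0.80):
    """Split-conformal correction for an empirical [p10, p90] band.

    p10_raw, p90_raw: raw band edges predicted for each calibration anchor
    realized:         the realized forward return for each anchor
    Returns q such that [p10 - q, p90 + q] targets ~`coverage` on new data.
    """
    p10_raw, p90_raw, realized = map(np.asarray, (p10_raw, p90_raw, realized))
    # Conformity score: how far the realized value falls outside the raw band
    # (negative when it lands inside, so the band can even tighten)
    scores = np.maximum(p10_raw - realized, realized - p90_raw)
    n = len(scores)
    # Finite-sample-adjusted quantile level
    level = min(1.0, np.ceil((n + 1) * coverage) / n)
    return float(np.quantile(scores, level))

# q_80 = conformal_offset(cal_p10, cal_p90, cal_realized, coverage=0.80)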

Why use this if Chart Library already exposes it as an API?

Two reasons you might want to roll your own anyway:

  1. Custom universe. If you’re trading a specific subset (small caps, biotechs, ETFs only) and the production index doesn’t match, you can build a domain-specific cohort. Chart Library has 19,000+ US equities including delisted names; if your universe is “biotechs with FDA catalysts in the last 30 days” you’ll want to filter or build your own index.
  2. Custom features. If you want to condition the cohort on something Chart Library doesn’t expose (e.g., short interest, institutional flow, options skew), you can either filter post-hoc on the cohort_analyze response or build your own k-NN with your own metadata layer; a sketch of the filtering approach follows this list.
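
A sketch of the post-hoc filtering approach; the record shapes and the short_interest lookup are illustrative assumptions, not the documented cohort_analyze schema:

import numpy as np

# Hypothetical shapes: `cohort` is a list of neighbor records from a cohort
# response; `short_interest` is your own metadata keyed by (symbol, date)
cohort = [
    {'symbol': 'ABCD', 'date': '2021-03-05', 'ret_5d': 0.042},
    {'symbol': 'WXYZ', 'date': '2019-11-12', 'ret_5d': -0.018},
    # ... the rest of the 300-neighbor cohort
]
short_interest = {('ABCD', '2021-03-05'): 0.22, ('WXYZ', '2019-11-12'): 0.07}

# Keep only neighbors matching the custom condition (>15% short interest)
filtered = [n for n in cohort
            if short_interest.get((n['symbol'], n['date']), 0.0) > 0.15]

# Recompute the distribution over the filtered cohort only; require a
# minimum cohort size so the percentile estimates stay stable
rets = np.array([n['ret_5d'] for n in filtered])
if len(rets) >= 50:
    p10, p90 = np.percentile(rets, [10, 90])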

For most use cases, Chart Library’s production version is the right choice — it has the embedding pipeline, the calibration layer, the eval discipline, and the cohort hygiene already done. Free Sandbox tier covers exploratory use; paid tiers cover production agent workloads.

Frequently asked questions

Why 300 neighbors? Isn't k-NN supposed to use small k?
For prediction tasks where you want a point estimate, small k (5-20) is standard. For distribution estimation, which is what cohort intelligence does, you want a large enough sample to characterize the tails. k=300 is a balance: large enough for stable percentile estimates, small enough that the local similarity neighborhood is still meaningful.
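
A toy simulation (synthetic heavy-tailed returns, not production data) makes the tradeoff visible: the sampling noise of a p10 estimate shrinks roughly with the square root of k.

import numpy as np

rng = np.random.default_rng(0)
population = rng.standard_t(df=4, size=100_000) * 0.02  # heavy-tailed "returns"

for k in (20, 100, 300, 1000):
    # Standard deviation of the p10 estimate across many resampled cohorts
    p10s = [np.percentile(rng.choice(population, size=k), 10) for _ in range(500)]
    print(f"k={k:>4}  p10 sampling std ≈ {np.std(p10s):.4f}")
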
What about k-NN with a learned distance metric?
That's effectively what self-supervised embeddings do: the embedding space's L2 distance is a learned similarity metric, optimized over the entire chart corpus. Bolt-on metric learning (Mahalanobis distance, learned kernels) is largely subsumed by getting the embedding right.
Does k-NN scale to billions of historical patterns?
With pgvector + IVFFlat, yes — we serve ~25M embeddings at ~100ms median latency. Billions would push toward a dedicated vector DB (Milvus, Weaviate, Pinecone) but the core algorithm doesn't change.
How do you handle cold-start (new IPOs)?
Cold-start is fine because the cohort is computed from historical analogs of the pattern, not from the symbol's own history. A 3-day-old IPO can have a very meaningful cohort because the 3-day pattern matches thousands of historical 3-day patterns from other symbols.
Try it

Skip the build — call cohort_analyze.

Free Sandbox tier (200 calls/day, no auth). Production-grade k-NN regression with full distribution + calibration + feature attribution.
