Common mistakes with cohort intelligence: six failure modes

Cohort intelligence is a methodology-honest alternative to point-prediction forecasting — but it’s only useful if you read it the right way. The wrong reads make the same mistakes point forecasts make, just dressed up in distributions.

Below are the six failure modes we’ve seen most often from analysts, AI agents, and ourselves while building the product. Each comes with the trap, a worked example, and the fix.

Mistake 1: Treating the cohort median as a forecast

The trap. A cohort response shows 5d median: +0.8% and the reader concludes “the stock will rise 0.8%.” That’s a point forecast — exactly the framing cohort intelligence was designed to replace.

Why it fails. The median is one statistic of a distribution that runs from p10 ≈ -6% to p90 ≈ +7%. Half of analogs were below the median; half above. A reader who anchors on +0.8% has compressed all the uncertainty into a number that “feels precise.” That’s false precision.

The fix. Always cite the median with the win rate and the p10/p90 band. If the median is +0.8% but the win rate is 49% and the p10/p90 is -6%/+7%, the right summary is “close to coin-flip with wide tails,” not “will rise 0.8%.” The agent system prompts we recommend specifically encode this discipline — see using cohort intelligence in a Claude agent.
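A minimal sketch of that discipline in Python, assuming a dict of outcome-distribution fields (the names here are illustrative, not the exact response schema):

# Hypothetical helper: render a cohort read as a distribution, not a point.
def summarize(dist: dict) -> str:
    median, win = dist["median"], dist["win_rate"]
    p10, p90 = dist["p10"], dist["p90"]
    lean = ("close to coin-flip" if abs(win - 0.5) < 0.05
            else "bullish lean" if win > 0.5 else "bearish lean")
    tails = "wide tails" if (p90 - p10) > 10 else "contained tails"  # threshold is illustrative
    return (f"5d median {median:+.1f}%, win rate {win:.0%}, "
            f"p10/p90 {p10:+.0f}%/{p90:+.0f}% ({lean}, {tails})")

print(summarize({"median": 0.8, "win_rate": 0.49, "p10": -6, "p90": 7}))
# 5d median +0.8%, win rate 49%, p10/p90 -6%/+7% (close to coin-flip, wide tails)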

Mistake 2: Over-filtering the cohort

The trap. “I want analogs that match the live situation as closely as possible.” So you filter: macro = bullish AND vol_regime = low AND sector = tech AND days_since_earnings > 14 AND days_since_ath < 30. Cohort comes back with n=8.

Why it fails. n=8 gives a 95% confidence interval on the win rate of roughly ±35 percentage points. Any conclusion you draw from those 8 analogs is noise. The cohort lost its statistical power the moment you over-filtered.
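That figure is easy to check with the normal-approximation (Wald) interval, taken at its widest point (p = 0.5):

import math

# 95% half-width of a Wald binomial CI on the win rate, at its widest (p = 0.5).
def ci_halfwidth(n: int, p: float = 0.5, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

for n in (8, 15, 30, 300):
    print(f"n={n}: +/-{ci_halfwidth(n):.0%}")
# n=8: +/-35%   n=15: +/-25%   n=30: +/-18%   n=300: +/-6%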

The fix. Two principles:

  1. Don’t apply more than 1-2 filters at retrieval. Watch cohort_size_actual; if it drops below 30, you’ve gone too far. Below 15, abandon the filtered cohort entirely.
  2. Prefer reading regime_stratification on an unfiltered cohort. The stratification gives you the conditional read AND the contrast (you can compare regimes to each other) without burning the sample.

See cohort filtering by regime for calibrated narrowing patterns.
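One way an agent or script might encode the thresholds from principle 1, assuming the cohort_size_actual field from the response:

# Guard rails for the n-floor discipline described above.
def cohort_guard(cohort_size_actual: int) -> str:
    if cohort_size_actual >= 30:
        return "ok"
    if cohort_size_actual >= 15:
        return "caution: drop filters or read regime_stratification on the unfiltered cohort"
    return "abandon: too few analogs for any statistical read"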

Mistake 3: Ignoring regime stratification when it's screaming at you

The trap. The headline cohort shows 5d median: -1.3%. The reader writes “outlook is mildly bearish” and moves on. But the regime stratification shows:

"regime_stratification_5d": {
  "low_vol":   {"n": 84, "win_rate": 0.38, "median_return": -2.1},
  "high_vol":  {"n": 76, "win_rate": 0.51, "median_return":  0.4}
}

Low-vol analogs lose 62% of the time. High-vol analogs win 51% of the time. The headline “mildly bearish” hides a sharp regime split.

Why it fails. The whole-cohort statistic averages across regimes. If the live anchor is cleanly in one regime, the regime-conditioned statistic is the actual answer. The headline is at best uninformative and at worst misleading.

The fix. Always check regime_stratification before writing a conclusion. If the buckets differ by more than ~10pp on win rate, the regime conditioning matters. Find out which regime the live anchor is in (via anchor_metadata or a get_market_context call) and weight the regime-conditioned stat accordingly.
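A sketch of that check against the stratification shape from the example above; the ~10pp threshold is the heuristic from the text, not an API constant:

# If the regime buckets disagree by more than ~10pp on win rate, prefer
# the bucket the live anchor sits in over the headline statistic.
def regime_read(strat: dict, live_regime: str, threshold: float = 0.10):
    rates = [bucket["win_rate"] for bucket in strat.values()]
    if max(rates) - min(rates) > threshold and live_regime in strat:
        return strat[live_regime]   # regime split is real: condition on it
    return None                     # fall back to the headline statistic

strat = {
    "low_vol":  {"n": 84, "win_rate": 0.38, "median_return": -2.1},
    "high_vol": {"n": 76, "win_rate": 0.51, "median_return":  0.4},
}
print(regime_read(strat, "low_vol"))
# {'n': 84, 'win_rate': 0.38, 'median_return': -2.1} -- not the headline -1.3%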

Mistake 4: Reading feature_importance as causation

The trap. The cohort response shows credit_spread_state=tight as the top positive feature. The reader concludes “tight credit causes this pattern to work.”

Why it fails. Feature importance is correlation within the cohort, not causation. Three things could be going on:

  • Tight credit might genuinely be a driver — when credit is tight, money is cheap, risk assets rally, the pattern resolves higher.
  • Tight credit might be a marker of something else — bullish macro, late-cycle, “risk-on” mood — and the real driver is the underlying state, not credit itself.
  • It might be confounding from sample composition — the tight-credit analogs in your specific cohort happened to also be from later in 2017, a period when the pattern resolved higher for other reasons.

The fix. Treat feature importance as conditioning information, not as a causal model. The right read is: “within this 300-analog cohort, analogs with tight credit outperformed. The live anchor currently has tight credit, which makes the positively-correlated subset of the cohort more relevant. I’m weighting the read toward those analogs.” That’s defensible. “Tight credit will cause this to work” isn’t.

Mistake 5: Same-stock leakage in custom retrieval

The trap. You build a custom retrieval pipeline on top of the embeddings (or run cohort_analyze with same-symbol exclusion turned off). The cohort comes back with 30 analogs that look incredibly similar — and most of them are the same symbol on adjacent days. The cohort metrics look astonishingly tight: median +0.05%, win rate 90%, std 0.4%.

Why it fails. The “analogs” are the same anchor at slightly different times — say NVDA on 2024-08-04, 08-05, 08-06, 08-07. Of course they look similar; they’re consecutive trading days of the same stock. The forward returns are nearly identical because they’re measuring almost the same event. The tight statistics are an artifact, not a finding.

The fix. Always exclude same-symbol matches within a meaningful window. Chart Library’s cohort_analyze defaults to excluding the same symbol within ±10 calendar days. If you’re building custom retrieval, replicate that exclusion. If you turned it off and got suspiciously tight numbers — that’s why.

Same logic applies if you allow heavy concentrations of same-quarter, same-sector analogs. A cohort that’s 80% tech in 2021 isn’t a diversified cross-stock cohort; it’s a thinly disguised single bet.
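For custom retrieval, a minimal pandas sketch of the exclusion, assuming a DataFrame of candidates with symbol and date columns (the ±10-day window mirrors the cohort_analyze default described above):

import pandas as pd

# Drop candidate analogs that are the anchor's own symbol within the window.
def exclude_same_symbol(analogs: pd.DataFrame, anchor_symbol: str,
                        anchor_date: pd.Timestamp,
                        window_days: int = 10) -> pd.DataFrame:
    same_symbol = analogs["symbol"] == anchor_symbol
    near_anchor = (analogs["date"] - anchor_date).abs() <= pd.Timedelta(days=window_days)
    return analogs[~(same_symbol & near_anchor)]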

Mistake 6: Ignoring distance weighting (or applying it wrong)

The trap. A cohort_analyze response returns the top 300 analogs by similarity. The reader treats all 300 as equally weighted — analog #1 (very close) and analog #300 (just barely made the cut) contribute the same amount to the median, win rate, and feature attribution.

Why it fails. The 300 analogs aren’t equally similar. Analog #1 might have an L2 distance of 0.3 in embedding space; analog #300 might have 0.9. The closer analogs are more informative about the live anchor.

The fix. Chart Library’s server-side statistics use distance weighting where it materially changes the answer. The outcome_distribution fields are equal-weighted by default (simpler to interpret), but the weighted_outcome_distribution field gives you the inverse-distance-weighted version. Use the weighted version when:

  • cohort_score is below 0.7 (the tail of the cohort is stretched and the bottom analogs are weak matches)
  • You’re comparing two cohorts and want to control for cohort tightness
  • The distance distribution is bimodal (some very close analogs, some very far)

The wrong fix is to drop the bottom analogs entirely (e.g. only keep the top 50). That introduces its own instability — small-sample noise plus a similarity-threshold cliff that biases stats. Distance weighting is smoother.
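For custom pipelines, a hedged sketch of inverse-distance weighting; the 1/d weight form is illustrative and may differ from the server-side implementation:

import numpy as np

# Weighted win rate and median, with closer analogs counting for more.
def weighted_stats(returns: np.ndarray, distances: np.ndarray, eps: float = 1e-6):
    weights = 1.0 / (distances + eps)        # inverse-distance weights
    weights /= weights.sum()
    win_rate = weights[returns > 0].sum()
    order = np.argsort(returns)              # weighted median via cumulative weight
    median = returns[order][np.searchsorted(np.cumsum(weights[order]), 0.5)]
    return median, win_rate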

A meta-mistake: confusing cohort intelligence with prediction

All six mistakes above share a root: treating cohort intelligence as a forecasting tool. The median becomes a prediction. The filters become an attempt to find “the right answer.” The features become causal claims. The cohort becomes an oracle.

Cohort intelligence isn’t a forecast. It’s empirical conditioning information — what did 300 historical analogs of this situation do next, what separated them, and how does the answer change across regimes? Used that way, it’s the cleanest source of market reasoning available. Used as a forecast wrapper, it’s point prediction in a fancier coat.

The 50-0 paired-agent evaluation found exactly this pattern in the qualitative judge rationales: agents that used cohort intelligence as conditioning information dominated agents that treated it as a forecast. The methodology rewards the right read.

Frequently asked questions

Is there an automated way to detect over-filtering?
Yes — the cohort_analyze response includes a warning field when cohort_size_actual drops below the n=30 floor. We surface progressively louder warnings at n<30, n<15, n<5. AI agents should be system-prompted to respect those warnings.

What's the right read when the regime is transitioning?
Don't condition on regime at all. Stick with the headline cohort statistic and acknowledge the regime uncertainty in your write-up. Reaching for a regime-conditioned read when the regime label isn't stable is over-claiming.

How do I know if I'm reifying a feature into causation?
Ask: would I claim this feature 'causes' the pattern to work if I were writing a paper for peer review? If no, you're reifying. The honest framing is always 'analogs with feature X outperformed in this cohort,' never 'feature X causes outperformance.'

Is same-stock exclusion always the right call?
Almost always. The exception is when you specifically want a within-stock analysis — 'how does NVDA tend to behave after a setup like this?' In that case, drop the cohort layer entirely and use symbol_intelligence, which is built for per-symbol track records.

Should I always use distance weighting?
No. Equal weighting is simpler and matches what most consumers expect. Switch to distance weighting when cohort_score is low (the cohort is stretched), or when comparing cohorts that might differ in tightness. For typical use, equal weighting is fine.

What's the single most important habit when reading a cohort response?
Check cohort_size_actual + cohort_score FIRST. If the cohort isn't statistically meaningful, every other read is built on sand. If it is, then read in this order: regime_stratification → feature_importance → outcome_distribution. Working from conditional to headline preserves the right hierarchy of conclusions.
Try it

Run a cohort_analyze call.

Free Sandbox tier — 200 calls/day, no authentication. MCP install for Claude or Cursor takes 30 seconds.
