
07 — Engine Improvement Roadmap

A phased, concrete plan to evolve our engine from an additive points worksheet into something closer to how books and quants operate. Each item cites the doc that justifies it and the component it touches.

Where we are today (honest assessment)

Updated for the "defense-weighted, scenario-realistic score engine" change, which reworked the baseline and added a scheme matchup, closing part of the old Gap B. The current state and remaining gaps below reflect it.

Since then (this update): Phase 2 (QB lever) and Phase 3 (probability layer + key numbers + per-spread σ + cover%/over% on the bets CSV) shipped, and Gap F (dispersion/σ double-count) is resolved. Phase 4's ridge rating spine (the power-ratings model) is built, tested, wired as a gated term, enabled at scale 0.5, and added to the optimizer's search (with EPA-value scheme blending and a lowered HFA). The scoring engine now emits a QB-aware margin plus win_prob_home / key-number-aware cover_prob_home / over_prob from the pre-dispersion margin. What's left is a walk-forward re-tune of the rating spine's scale and Phase 5/6 (market blending + validation hardening).

Our prediction path runs through the prediction engine's score routine:

# opponent-aware, DEFENSE-weighted "envelope" baseline (replaces the
# old (off_scored + opp_allowed)/2 mean-of-averages that bunched every game ~20-23)
home_base, away_base = envelope_baseline(...)   # blends additive matchup + envelope
# + rush/pass scheme adjustment from defensive EPA ranks, emits "why" bullets
# + the same ~15 capped nudges: home_adv, coaching, rookies, travel, rest,
#   momentum, turnovers, divisional, fg_impact, recent_form, dvoa, power_rank,
#   defense_narr, line_play, redzone, third_down
# score_dispersion (1.45) stretches scores around the league mean (~22.6)
# weather applied to the total only; scores clamped to [3, 48] then per-defense envelope
# confidence = weighted blend of edge/momentum/health/turnovers/weather

The defensive model's per-team profiles now hold real data: scored/allowed means, an allowed envelope (p15/p50/p85/min/max, std ≈ 9.95 ≈ our σ_total), rush/pass EPA values + ranks + success rate + YPG, and offensive pass rate — built from 3 yrs of scores + 2 yrs of play-by-play, recency-weighted.

The Optuna tuning process tunes the weights and — to its credit — optimizes ATS profit (ats_w·0.909 − ats_l)/total, not raw error.
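A minimal sketch of that objective, assuming standard −110 pricing (a win returns 0.909 units, a loss costs 1) and that `total` counts every graded game. The function name and the handling of pushes are assumptions for illustration, not the tuner's actual code.

```python
def ats_profit_per_game(ats_wins: int, ats_losses: int, total_games: int,
                        win_payout: float = 0.909) -> float:
    """Units won per graded game: (wins * payout - losses) / total.

    Assumption: pushes are neither wins nor losses, so they only appear
    in the denominator.
    """
    if total_games <= 0:
        return 0.0
    return (ats_wins * win_payout - ats_losses) / total_games


# Win rate at which the objective is exactly zero for this payout.
break_even_rate = 1.0 / (1.0 + 0.909)

# 53.2% ATS with no pushes is only a thin positive return.
example_roi = ats_profit_per_game(532, 468, 1000)
```

The break-even rate at this payout is about 52.4%, so the objective punishes a 52% engine even though its raw win rate looks positive.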

What's good (keep it)

ATS-profit objective in the optimizer (already market-aware in spirit).
Small, capped situational weights — turnovers (0.2/0.8), rest (0.07/0.7), divisional (0.0). Doc 4 confirms these should be small. Don't inflate them.
Recent wins: opponent-aware defense-weighted baseline (Doc 2/4), a real scheme matchup off EPA ranks (Doc 3/4), and a variance/calibration fix via dispersion (Doc 5 §2 — predicted EV is flatter than real outcomes). These are genuinely in the spirit of this series.
DVOA already ingested; chronological backtests already the norm.

The remaining gaps (in priority order)

#	Gap	Doc	Severity / status
A	~~No QB starter-vs-backup model~~ — ✅ shipped (the QB-value model + `attrib['qb']`); starter−backup delta, capped 3–7 pts	2 §7, 4 #2	✅ done
C	~~No probability layer / ignores key numbers~~ ✅ shipped: engine emits `win_prob_home`, key-number-aware `cover_prob_home`, `over_prob` (the probability model's `margin_pmf`/`cover_prob_keyed`). The per-spread σ table and cover%/over% on the bets CSV have also shipped; still open: fitting the σ table on real data (it is a documented placeholder).	2 §4–5, 5 §2	🟡 mostly done
D	No de-vig / CLV — compares to the posted line, never the no-vig line; never measures Closing Line Value	1 §2/§5, 6	🟠
E	Backtest leakage fixed — real dates + real rest (done earlier), and now per-season leak-free profiles: each season is scored with profiles from ONLY prior seasons. This corrected the mirage — 2024 ATS fell 63.5% → 53.2% (ROI +21% → +1.6%), 2025 → 45.4%. Remaining: prior-season PBP for the scheme term in backtests (currently score-only), and a full walk-forward tune.	5 §3, 6	🟡 mostly fixed
B′	Ridge rating spine is in and enabled (the power-ratings model, leak-free in backtest + live, scale 0.5; EPA values usable via the scheme EPA blend). Remaining double-count cleanup: a walk-forward re-tune to push `power_rating_scale` up and pull `power_rank_slot_points` down.	2 §1, 5 §1	🟡 enabled; tuner to finish
F	~~`score_dispersion` double-counts variance with the probability layer~~ — ✅ resolved: probabilities are computed from the pre-dispersion margin/total with the empirical σ; dispersion now only stretches the displayed scores.	5 §2	✅ done

Algorithms & methods now in the engine (quick reference)

A consolidated map of the modeling machinery currently shipped, newest first:

Method	Where	What it does
Ridge power ratings	the power-ratings model	League-wide L2 solve of `MOV = R_home − R_away + HFA` with MOV capping + exponential time-decay — the opponent-adjusted rating spine (enabled @0.5; solved HFA ≈ 1.55). Leak-free in backtest + live.
Leak-free per-season profiles	the defensive model	Rebuilds each team's score/envelope/EPA profile from prior seasons only, so a backtest never sees the season it predicts (fixed the 63.5%→53.2% mirage).
QB lever	the QB-value model	Points-vs-replacement QB values; applies a `starter − baseline` delta (the real line move on a QB change), capped 3–7 pts.
Key-number margin PMF	the probability model (`margin_pmf`, `cover_prob_keyed`)	Re-weights a normal margin envelope by the empirical spikes on 3/7/6/10 so half-points across key numbers price correctly; push-aware.
Per-spread σ	the probability model (`sigma_for_spread`)	Margin SD that drifts with the spread (pick'em tighter, big spreads wider).
Probability layer	the prediction engine	Emits `win_prob_home` / `cover_prob_home` / `over_prob` from the pre-dispersion margin so variance is counted once.
EPA-value scheme matchup	the scheme adjustment (scheme EPA blend)	Run/pass matchup using EPA magnitude, not just defensive ranks.
Defense-weighted envelope baseline	the envelope baseline	Frames each offense inside the specific defense's historical points envelope.
Score dispersion	the prediction engine	Stretches displayed scores to reproduce real blowouts/duds (display only; probabilities use the empirical σ).
De-vig · fractional Kelly · CLV · Brier/log-loss/calibration	the probability model	The measurement layer: strip the vig, size bets, and score probabilities honestly.
Bayesian (TPE) tuner	the Optuna tuning process	Searches 30+ weights at once on real games by a composite ROI + calibration + accuracy objective, with a holdout season.
AI news narratives + rank deltas	the AI narrative process	Summarizes live online news into draft-specific blurbs and signed ranking nudges (fantasy side).

(The why behind each method lives in Docs 2–6; this table just maps them to the engine's components.)

Phase 1 — Measurement first (de-vig + CLV + calibration) — ✅ IMPLEMENTED (CLV loop now closed)

You can't improve what you don't measure. Do this before changing predictions.

Shipped. The probability model (de-vig, margin↔prob, cover/over prob, Kelly, Brier/log-loss/calibration; 23 unit tests) + an optimizer/backtester refactor: shared game-building (real dates + real rest days, was hardcoded 7) and game evaluation (ATS/OU/SU Brier + log loss + calibration + ROI + CLV scaffold). The backtest run now writes a calibration CSV and richer metrics.

First finding (validates the whole exercise): on 2024, the engine posts 62.3% ATS but ATS Brier ≈ 0.2499 — a coin flip. The win rate looks elite while the probabilities carry almost no information (the model is badly underconfident, and/or the 62% is leakage-inflated per Gap E). Raw ATS% hid this; the calibration layer exposed it. SU Brier (0.201) is healthier.

CLV loop closed (2026-06). The "n/a until closing-line capture" gap is gone. src/bet_log.py appends every flagged bet to nfl_data/predictions/bet_log.csv at flag time (first flag wins the line it carried), wired into run_predictions.py. settle_bets.py then fills the closing line (the latest pre-kickoff odds snapshot for that week), grades each bet from final scores, computes CLV in points via market_math.clv_points, and prints a scorecard (record, ROI, avg CLV, % of bets that beat the close). This is now the bar every future change must clear — sustained positive CLV, not ATS%.
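The CLV arithmetic can be sketched as follows. This is an illustration only: the real `market_math.clv_points` signature is not shown in this doc, so `clv_points_sketch`, `clv_scorecard`, and their arguments are hypothetical, and lines are assumed to be quoted from the bettor's side.

```python
def clv_points_sketch(bet_type: str, side: str,
                      line_taken: float, closing_line: float) -> float:
    """Closing Line Value in points; positive means we beat the close."""
    if bet_type == "spread":
        # Both lines are for the side we bet: +3.5 taken vs +3.0 close = +0.5.
        return line_taken - closing_line
    if bet_type == "total":
        if side == "over":
            return closing_line - line_taken   # a lower number helps the over
        if side == "under":
            return line_taken - closing_line   # a higher number helps the under
        raise ValueError(f"unknown total side: {side}")
    raise ValueError(f"unknown bet type: {bet_type}")


def clv_scorecard(clvs: list[float]) -> dict:
    """Aggregate per-bet CLV into average CLV and % of bets that beat the close."""
    if not clvs:
        return {"bets": 0, "avg_clv": 0.0, "beat_close_pct": 0.0}
    beat = sum(1 for c in clvs if c > 0)
    return {
        "bets": len(clvs),
        "avg_clv": sum(clvs) / len(clvs),
        "beat_close_pct": 100.0 * beat / len(clvs),
    }
```

Taking −2.5 on a favorite that closes −3 and taking +3.5 on a dog that closes +3 both score +0.5, which is the property the scorecard relies on.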

Add a de-vig utility. A new probability model exposing american_to_prob(odds), devig(prob_a, prob_b) -> (fair_a, fair_b), prob_to_american(p), spread_to_winprob(s, sigma=13.5), cover_prob(mu, line, sigma=13.5), and over_prob(t_hat, line, sigma=10). These implement Doc 2 §5 / Doc 5 §2. One small, well-tested module everything else calls.
Store opening + closing lines and prices. Extend the odds-data pipeline — currently it has no notion of a closing line, price, or de-vig. Persist open_spread, open_total, open_price, close_spread, close_total, close_price.
Add CLV + calibration to the backtester: for each flagged bet record number-taken vs. closing number → per-bet and aggregate CLV; add Brier score and log loss on the cover/over probabilities, and a reliability-diagram dump (Doc 5 §2).
Output probabilities on tickets — add cover% / over% columns to the weekly tickets / bets CSVs so every pick carries a calibrated probability, not just a point estimate.
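A minimal sketch of the de-vig utility listed above, using only the standard library. The function names and default sigmas come from the list; the bodies are assumptions. In particular, the de-vig here is the simple proportional method, spreads are assumed to be home spreads (−3 means the home team is favored by 3), and `mu` is the predicted home margin.

```python
import math


def american_to_prob(odds: float) -> float:
    """Implied probability (vig included) from American odds."""
    if odds < 0:
        return -odds / (-odds + 100.0)
    return 100.0 / (odds + 100.0)


def devig(prob_a: float, prob_b: float) -> tuple[float, float]:
    """Proportional de-vig: rescale the two implied probabilities to sum to 1."""
    total = prob_a + prob_b
    return prob_a / total, prob_b / total


def prob_to_american(p: float) -> float:
    """Fair American odds for probability p (0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if p >= 0.5:
        return -100.0 * p / (1.0 - p)
    return 100.0 * (1.0 - p) / p


def _phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def spread_to_winprob(s: float, sigma: float = 13.5) -> float:
    """Win probability for the team carrying spread s (negative = favorite)."""
    return _phi(-s / sigma)


def cover_prob(mu: float, line: float, sigma: float = 13.5) -> float:
    """P(home covers): the home margin must exceed -line."""
    return _phi((mu + line) / sigma)


def over_prob(t_hat: float, line: float, sigma: float = 10.0) -> float:
    """P(total goes over the posted line) given projected total t_hat."""
    return _phi((t_hat - line) / sigma)
```

For a standard −110 / −110 market, `devig(american_to_prob(-110), american_to_prob(-110))` returns an even split, which is the fair line the rest of the roadmap compares against.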

Exit criteria: every backtested bet reports CLV; we can draw a reliability diagram; we compare against the de-vigged line everywhere. ✅ The live CLV loop (bet_log + settle_bets) now satisfies the first clause for real bets.
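The calibration half of those criteria can be sketched as below: Brier score, log loss, and the binned table behind a reliability diagram. Function names and the ten-bin default are assumptions for illustration, not the backtester's actual interface.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def reliability_bins(probs, outcomes, n_bins=10):
    """Rows of (bin_low, bin_high, mean_predicted, hit_rate, count) for non-empty bins."""
    buckets = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        buckets[idx].append((p, y))
    rows = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        hit_rate = sum(y for _, y in bucket) / len(bucket)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_pred, hit_rate, len(bucket)))
    return rows
```

A model that always says 0.5 scores a Brier of exactly 0.25 whatever its win rate, which is why an ATS Brier near 0.2499 reads as a coin flip.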

Live data feeds — ✅ replaced the random stubs (2026-06)

Two modules fed the engine sampled random numbers on every prediction — literal noise worth a couple of points. Both are now real and deterministic.

Weather (src/weather_analysis.py) — real National Weather Service forecasts (api.weather.gov, free, no key, ~7-day window) for outdoor games. Outside that window (historical/backtest games, far-future) or on any API failure the impact is a deterministic zero, never sampled. Domes still short-circuit to zero. Same BetSpec impact thresholds (wind/precip/temp, ±4 cap).
Injuries (src/injury_data.py) — get_team_injuries(team, season, week) returns the official nflverse weekly report when it exists, else the live ESPN injuries API (which covers preseason/PUP/IR from season start, before any weekly report is published). Statuses normalize to Out/Doubtful/Questionable/Healthy. sentiment_analysis.py is now deterministic: injury impact comes from these feeds (BetSpec position values, −10/team cap), and momentum is 0 until a real news feed lands (see Phase 7 below) — zero is the honest neutral, and the momentum_cap weight stays tunable.

Phase 2 — The QB lever (highest single-feature ROI) — ✅ IMPLEMENTED

Doc 2 §7, Doc 4 #2. The only position that reliably moves a spread.

Shipped. The QB-value model — points-vs-replacement values (~0 elite-backup to ~7 elite-starter), player-keyed so they survive trades, loadable from refreshable QB-value data over a tiered seed. The engine applies a delta, not the raw value: the team baseline already encodes its established starter, so a normal start moves nothing; only actual_starter − baseline_starter (injury/benching) hits the spread, via attrib['qb'] in the scoring engine, scaled by qb_value_scale and capped by qb_value_cap. It reads the per-game actual/baseline starter for each team, so it's live-only and a harmless 0 on historical backtest games (which carry no starter). 9 unit tests lock the exit criterion (a starter-out swings the spread 3–7 pts; a capable backup barely moves it). qb_value_scale/qb_value_cap are now in the optimizer's search. Still open: populate the actual + baseline starter onto live games from the roster-data pipeline (the lever needs an injury/change signal to fire — it's a harmless 0 until then).

Build a QB value table — points-vs-replacement per starter (~0 to 7). Seed from public oddsmaker values (Allen/Mahomes ~7; league spread ~3–7), then refine from our own data. Store the backup's value too — the line move is starter − backup, not just starter.
Wire it into the scoring engine as a new attributed term (attrib['qb']), driven by depth-chart/starter data we already scrape via the roster-data pipeline. When the listed starter ≠ the rated starter (injury/benching), apply the delta.
Add weights qb_value_scale, qb_value_cap to the default weights config and to the optimizer's search space.
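The delta logic in these steps can be sketched as a single function. `qb_value_scale` and `qb_value_cap` are the weights named above; the function name, the `None` convention for missing starters, and the example QB values are assumptions.

```python
from typing import Optional


def qb_adjustment(actual_value: Optional[float],
                  baseline_value: Optional[float],
                  qb_value_scale: float = 1.0,
                  qb_value_cap: float = 7.0) -> float:
    """Signed points to add to a team's side of the margin for a QB change.

    Values are points-vs-replacement. The team baseline already encodes its
    established starter, so only the difference moves the spread.
    """
    if actual_value is None or baseline_value is None:
        return 0.0  # no starter info, e.g. historical backtest games
    delta = qb_value_scale * (actual_value - baseline_value)
    return max(-qb_value_cap, min(qb_value_cap, delta))


# Hypothetical values: a 6.5-point starter replaced by a 1.0-point backup.
starter_out = qb_adjustment(actual_value=1.0, baseline_value=6.5)
normal_start = qb_adjustment(actual_value=6.5, baseline_value=6.5)
```

A normal start returns 0, so the term only fires when the listed starter differs from the rated one.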

Exit criteria: a QB-out scenario shifts our spread by a sane 3–7 points; ATS and CLV improve on weeks with QB news in the backtest.

Phase 3 — The probability layer & key numbers — 🟡 MOSTLY IMPLEMENTED

Doc 2 §4–5, Doc 5 §2. Turns a point estimate into a priced bet.

Shipped. The scoring engine now emits, alongside the scores: attrib['pred_margin'] (pre-dispersion), win_prob_home, key-number-aware cover_prob_home (when the game carries a line), and over_prob. The key-number machinery lives in the probability model: KEY_NUMBER_WEIGHTS + margin_pmf (normal envelope re-weighted by the empirical spikes on 3/7/6/10/…), cover_prob_keyed (push-aware, so +3.5 beats +2.5 correctly), and key_number_half_point_value. Gap F resolved — probabilities use the pre-dispersion margin/total with the empirical σ; dispersion stays on the displayed scores only, so variance is counted once. margin_sigma/total_sigma are now weights. 10 unit tests. Also shipped: a per-spread σ table (sigma_for_spread, opt-in so existing numbers don't move) wired into the game-evaluation cover probability and the live bet builder; and cover%/over% now ride the weekly bets CSV (cover_prob column), with attr_* probability columns on the predictions CSV. Still open: fit the σ table on real data (it's a documented placeholder), and decide whether the ATS point estimate (not just the probability) should drop dispersion too.
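The key-number idea can be illustrated with a small stand-alone sketch. It is not the shipped `margin_pmf` / `cover_prob_keyed`: the multipliers in `KEY_WEIGHTS_SKETCH` are made-up placeholders rather than fitted values, and reporting the cover probability conditional on no push is an assumption about what "push-aware" means.

```python
import math

# Hypothetical multipliers applied to the normal density at key margins.
KEY_WEIGHTS_SKETCH = {3: 2.5, 7: 1.8, 6: 1.3, 10: 1.3}


def margin_pmf_sketch(mu: float, sigma: float = 13.5, lo: int = -60, hi: int = 60) -> dict:
    """PMF over integer home margins: a normal envelope boosted at key numbers."""
    raw = {}
    for m in range(lo, hi + 1):
        density = math.exp(-0.5 * ((m - mu) / sigma) ** 2)
        raw[m] = density * KEY_WEIGHTS_SKETCH.get(abs(m), 1.0)
    total = sum(raw.values())
    return {m: v / total for m, v in raw.items()}


def cover_prob_keyed_sketch(mu: float, home_line: float, sigma: float = 13.5) -> float:
    """P(home covers | no push); home_line is the home spread, e.g. -3.0 or +3.5."""
    pmf = margin_pmf_sketch(mu, sigma)
    win = sum(p for m, p in pmf.items() if m + home_line > 0)
    loss = sum(p for m, p in pmf.items() if m + home_line < 0)
    return win / (win + loss)


# Value of the half-point that crosses 3 versus one that crosses nothing.
across_three = cover_prob_keyed_sketch(0.0, 3.5) - cover_prob_keyed_sketch(0.0, 2.5)
across_five = cover_prob_keyed_sketch(0.0, 5.5) - cover_prob_keyed_sketch(0.0, 4.5)
```

Because the mass at a margin of 3 is boosted, moving from +2.5 to +3.5 gains more cover probability than moving from +4.5 to +5.5, which a plain normal would price almost identically.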

Replace "predict a margin, clamp scores" with "predict a margin, then a distribution." Keep emitting home_score/away_score, but compute cover_prob and over_prob from the projected margin/total via the probability model.
Add a key-number-aware margin model. Start with the empirical NFL margin distribution (3 ≈ 15%, 7 ≈ 9%, 6 ≈ 6–7%, …; Doc 2 §4) as a lookup, layered on the σ≈13.5 normal. This makes half-points across 3 and 7 worth their true ~3–4%, so the engine can prefer +3.5 over +2.5 correctly.
Use a per-spread σ table (Doc 5 §2) instead of one constant if data supports it.
Reconcile with score_dispersion (Gap F). Dispersion stretches each team's points by 1.45× around the league mean to widen the score distribution for display realism. That inflates the predicted margin's spread too — so if you then feed the margin into Φ((μ−m)/13.5) you will double-count variance (dispersion once, σ again) and produce over-confident probabilities. Pick one home for variance: either keep dispersion for the displayed scores but compute cover/over probability from the pre-dispersion margin with the empirical σ, or drop dispersion and let σ carry all the spread. Decide this explicitly when building the probability layer; tune the chosen knob walk-forward, not on a leaky 2024 fit.
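The double-count in the last item is easy to see numerically. The sketch below uses the 1.45 stretch and ~22.6 league mean quoted earlier; the example scores and helper names are hypothetical.

```python
import math


def stretch(points: float, k: float = 1.45, league_mean: float = 22.6) -> float:
    """Dispersion: push a team's points away from the league mean by factor k."""
    return league_mean + k * (points - league_mean)


def phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


home, away = 24.0, 21.0                              # pre-dispersion projection
margin = home - away                                 # 3.0
stretched_margin = stretch(home) - stretch(away)     # league mean cancels: 1.45 * 3.0

sigma = 13.5
p_variance_once = phi(margin / sigma)                # pre-dispersion margin + empirical sigma
p_variance_twice = phi(stretched_margin / sigma)     # stretched margin + the same sigma
```

Stretching both scores multiplies the margin by the dispersion factor, so feeding the stretched margin into the same σ always produces the more extreme (over-confident) win probability.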

Exit criteria: the engine outputs calibrated cover/over probabilities; reliability diagram is near the 45° line; key-number half-points are valued distinctly.

Phase 4 — The rating spine (✅ IMPLEMENTED & ENABLED @0.5)

Doc 2 §1–2, Doc 5 §1. The structural upgrade.

Shipped & enabled. The power-ratings model — a real league-wide ridge solve of MOV = R_home − R_away + HFA with MOV capping (28) and exponential time-decay (8-wk half-life), team ratings shrunk by λ=100 and a free HFA, centered. 6 unit tests (recovers strength order + planted HFA; cap/decay behave). Wired as a capped attrib['power_rating'] term (neutral-field rating diff — HFA stays in home_adv, no double-count). The scheme term can now use EPA values (the scheme EPA blend, Doc 4 #1); HFA lowered (cap 2.5→2.2, default 2.5→1.9, Doc 2 §6); the solved HFA on real data is 1.55 (bang on modern levels).

Leak-free & measured. Game evaluation solves ratings from games strictly before each one; the live path does the equivalent from the last two completed seasons. On a real backtest over the historical scores dataset, turning the spine on (scale 0.5) lowers straight-up Brier in every season 2022–25 (e.g. 2024 0.2047→0.2039, 2025 0.2128→0.2123) and improves margin MAE in 2 of 3, with ATS within noise. So it ships on at 0.5 — a conservative, calibration-positive level — added to the optimizer's search alongside power_rank_slot_points so the tuner can push it higher and demote the legacy term (Gap B′) on a walk-forward fit. A per-venue HFA table is the remaining refinement.

A recent change already replaced the mean-of-averages baseline with the opponent-aware, defense-weighted envelope baseline. That's real progress — but it's still a per-game additive estimate, not a league-wide solved rating, and it left two threads dangling. Finish the job:

Add a ridge power-rating module — the power-ratings model: solve MOV = R_home − R_away + HFA league-wide via ridge regression on game results, with MOV capping (min(|MOV|, 28)) and exponential time-decay. Tune λ walk-forward. (The profiles already hold the inputs this needs.)
Remove the double-count (Gap B′). Team strength is now encoded in three places — the envelope baseline, the power_rank_slot term, and DVOA. Pick one spine (the rating) and demote the rest to adjustments, or they fight each other and the optimizer over-fits to reconcile them.
Use the EPA values, not just ranks. The scheme term consumes rush/pass_def_epa_rank; the profiles also store *_epa values + success rate (Doc 4 #1 — EPA is the top predictor; ranks throw away magnitude).
Lower / regionalize HFA — default home_advantage toward ~1.7 (cap from 2.5 → ~2.2) and, ideally, a per-venue table (Doc 2 §6: KC/PIT/DEN high, many low). Let the optimizer tune the global level.
(Optional, later) Elo or state-space secondary (Doc 5 §1b/§1d) for an ensemble and uncertainty estimates.
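A dependency-free sketch of the ridge solve described in the first item: a weighted least-squares fit of MOV = R_home − R_away + HFA with an L2 penalty on the ratings only, MOV capping, and exponential time-decay. The defaults mirror the numbers quoted in this doc (λ=100, cap 28, 8-week half-life); the input tuple layout and function names are assumptions, and this is not the shipped power-ratings module.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_ratings(games, lam=100.0, mov_cap=28.0, half_life_weeks=8.0):
    """games: list of (home, away, home_margin, weeks_ago). Returns (ratings, hfa)."""
    teams = sorted({t for g in games for t in g[:2]})
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams) + 1                        # last unknown is the HFA
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for home, away, mov, weeks_ago in games:
        y = max(-mov_cap, min(mov_cap, mov))              # blowout dampening
        w = 0.5 ** (weeks_ago / half_life_weeks)          # exponential time-decay
        cols = ((idx[home], 1.0), (idx[away], -1.0), (n - 1, 1.0))
        for i, xi in cols:
            xty[i] += w * xi * y
            for j, xj in cols:
                xtx[i][j] += w * xi * xj
    for i in range(len(teams)):
        xtx[i][i] += lam                      # shrink ratings; HFA stays unpenalized
    beta = solve_linear(xtx, xty)
    mean = sum(beta[:-1]) / len(teams)
    ratings = {t: beta[idx[t]] - mean for t in teams}     # centered on zero
    return ratings, beta[-1]


# Toy check: plant strengths and a 2-point HFA in a noiseless double round-robin.
true_strength = {"A": 6.0, "B": 0.0, "C": -6.0}
toy_games = [(h, a, true_strength[h] - true_strength[a] + 2.0, 0)
             for h in true_strength for a in true_strength if h != a]
toy_ratings, toy_hfa = ridge_ratings(toy_games)
```

On this toy schedule the solve recovers the planted HFA and the strength order, while λ=100 shrinks the ratings heavily toward zero because there are only six games. That shrinkage is the reason λ should be tuned walk-forward rather than fixed.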

Exit criteria: straight-up accuracy ≈ market baseline (~63–66%) on a holdout season; ratings are opponent-adjusted; no strength double-counting.

Phase 5 — Market-aware blending & objective alignment

Doc 5 §4, Doc 6. Where measurable edge comes from.

Blend toward the de-vigged line: final = w·model + (1−w)·fair_line, w tuned (likely < 0.5). The market is a strong prior; this both improves accuracy and isolates our residual signal.
Report ΔR² (combined-vs-market) so we can see whether we add information at all (Doc 5 §4). If ΔR² ≈ 0, we're just reprinting the line.
Retarget the optimizer — ✅ partly done. The tuner now searches 24 variables at once (incl. the envelope/scheme/dispersion knobs) on real games via the game-evaluation path, with a configurable composite objective (ROI + calibration + accuracy), a holdout season, and a leaderboard CSV. Still to do: score profit vs. the de-vigged close and add CLV as a secondary objective once closing lines are captured.
Bet-flagging policy (Doc 6 §6): flag only when |model − fair| > thresh and p_model − p_fair > min_edge, bonus for crossing a key number. Our configurable spread/total edge thresholds stay but compare to the fair line.
Fractional Kelly sizing — add a sizing column (start ¼ Kelly, per-bet cap ≤2–3% bankroll, weekly total-exposure cap) using p_model from Phase 3.
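The blend and the flagging rule from this list can be sketched in a few lines. `market_blend_w` and `min_edge_prob` follow the weight names used in this doc; `point_thresh` and the function names are hypothetical, and the key-number bonus is left out of the sketch.

```python
def blend_margin(model_margin: float, fair_margin: float,
                 market_blend_w: float = 0.4) -> float:
    """final = w * model + (1 - w) * fair_line, with w the weight on our model."""
    return market_blend_w * model_margin + (1.0 - market_blend_w) * fair_margin


def should_flag(model_margin: float, fair_margin: float,
                p_model: float, p_fair: float,
                point_thresh: float = 1.5,      # hypothetical points threshold
                min_edge_prob: float = 0.03) -> bool:
    """Flag only when both the points edge and the probability edge clear their bars."""
    points_edge = abs(model_margin - fair_margin)
    prob_edge = p_model - p_fair
    return points_edge > point_thresh and prob_edge > min_edge_prob


blended = blend_margin(model_margin=5.0, fair_margin=3.0)
```

With w below 0.5 the blended number stays closer to the de-vigged line than to the model, so a bet is flagged only when the model disagrees with the market by a wide margin on both tests.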

Exit criteria: backtest reports positive CLV and ROI vs. the close on a holdout season; sizing is fractional-Kelly with exposure caps.
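The sizing half of these criteria, sketched under stated assumptions: the standard Kelly formula f* = (p·b − q)/b with b the net payout per unit staked, a quarter-Kelly multiplier, and a 3% per-bet cap as suggested in this doc. The 10% weekly exposure cap and all function names are hypothetical.

```python
def kelly_fraction(p: float, american_odds: float = -110) -> float:
    """Full-Kelly bankroll fraction; zero when the bet has no positive edge."""
    if american_odds < 0:
        b = 100.0 / -american_odds      # net payout per unit staked
    else:
        b = american_odds / 100.0
    return max(0.0, (p * b - (1.0 - p)) / b)


def stake_pct(p: float, american_odds: float = -110,
              kelly_mult: float = 0.25, max_bet_pct: float = 0.03) -> float:
    """Fractional Kelly with a hard per-bet cap."""
    return min(kelly_mult * kelly_fraction(p, american_odds), max_bet_pct)


def apply_weekly_cap(stakes: list[float], weekly_cap: float = 0.10) -> list[float]:
    """Scale all stakes down proportionally if their sum exceeds the weekly cap."""
    total = sum(stakes)
    if total <= weekly_cap:
        return list(stakes)
    scale = weekly_cap / total
    return [s * scale for s in stakes]
```

At −110, a 55% cover probability gives a full-Kelly fraction of 5.5% and a quarter-Kelly stake of about 1.4% of bankroll, comfortably under the per-bet cap.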

Phase 6 — Validation hardening

Doc 5 §3, §6.

Fix backtest fidelity/leakage in the tuning process: ✅ done — both the backtest run and the tuner now use real dates and real rest days via the shared game-building step (rest derived from each team's previous game). The old hardcoded {season}-09-01 / rest_days=7 path is gone.
Strict walk-forward with a held-out season the optimizer never sees: ✅ wired and now enforced end-to-end. The holdout-year option validates the winner on an excluded season. The profile-leakage caveat is fixed (2026-06): the tuner now builds per-year leak-free as-of profiles for each training season (the same build_profiles_asof the backtest uses) and scores each year against its own profiles, and the holdout is evaluated with as-of profiles too. Previously the tuner scored all training years against the global 2023–25 profiles, a leak that inflated training metrics.
- Latest run (2026-06, train 2022–24, holdout 2025, 1000 trials): the winner improved the 2024 training backtest (54.7% ATS) but lost on the 2025 holdout (46.7% ATS / −10.9% ROI) vs. the incumbent config (46.9% / −10.5%). Per the adopt-only-if-the-holdout-wins rule, the incumbent config/best_weights.json was kept (restored from backup). This is the third independent confirmation that the engine has no exploitable ATS edge and sits near its honest ceiling.
- scheme_epa_blend A/B (2026-06, scripts/eval_epa_blend.py): swept {0, 0.3, 0.6, 1.0} on leak-free 2024 + 2025. Every nonzero value hurt ATS%, ROI, MAE, and Brier monotonically on both seasons. Kept at 0 (ranks only). Evidence: nfl_data/predictions/backtests/epa_blend_eval.csv.
Leakage audit: ensure no feature uses post-game info or season aggregates that include the target game, and that we never train on the closing line if we intend to bet the opener.
Run the pitfalls checklist (Doc 5 §6) before declaring any improvement real; expect live results below backtest and demand margin.
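The walk-forward discipline in this list reduces to two rules: each season is scored only with profiles built from earlier seasons, and a candidate is adopted only if it beats the incumbent on the held-out season. A minimal sketch, with hypothetical function names and a three-season lookback assumed from the "3 yrs of scores" profile window:

```python
def asof_profile_seasons(target_season: int, available_seasons, lookback: int = 3):
    """Seasons allowed to feed the profiles used to score target_season."""
    prior = sorted(s for s in available_seasons if s < target_season)
    return prior[-lookback:]


def walk_forward_plan(train_seasons, holdout_season: int, available_seasons):
    """List of (scored_season, profile_seasons) for training years, then the holdout."""
    if holdout_season in train_seasons:
        raise ValueError("holdout season must be excluded from training")
    ordered = sorted(train_seasons) + [holdout_season]
    return [(s, asof_profile_seasons(s, available_seasons)) for s in ordered]


def adopt_candidate(candidate_holdout_roi: float, incumbent_holdout_roi: float) -> bool:
    """Adopt-only-if-the-holdout-wins: training metrics do not count."""
    return candidate_holdout_roi > incumbent_holdout_roi


plan = walk_forward_plan([2022, 2023, 2024], 2025, range(2019, 2026))
# The 2026-06 run from this doc: candidate -10.9% vs incumbent -10.5% on 2025.
keep_incumbent = not adopt_candidate(-10.9, -10.5)
```

Using ROI as the comparison metric is itself an assumption; the same gate works with CLV once enough settled bets exist.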

Exit criteria: holdout-season CLV/ROI hold up; no leakage; results survive walk-forward.

Phase 7 — Structured news extraction (the "online chatter" idea, done right) — 📋 PLANNED

The instinct to ingest news/X/beat-reporter feeds is right; the framing is wrong. Generic sentiment ("fans are hyped") is priced into the line within minutes and barely moves ATS. The value is speed on facts: a starter ruled out, a surprise inactive, an OL shuffle — reacting before the line does.

Build an entity-extraction layer, not a sentiment scalar:
- Ingest the RSS feeds the repo already knows (ESPN, FantasyPros) plus beat-reporter sources; the news_collector.py plumbing exists.
- Use an LLM to emit typed events, not a mood score: {player, team, event_type: ruled_out|limited|benched|role_change, severity, source, timestamp}.
- Route those onto levers that already exist: the QB lever (±7), the sentiment_analysis.py positional injury impacts, and config/narrative_overrides_*.json deltas. momentum_score stays the home for any genuine momentum signal this produces (it's 0 today by design).
- Rookie/camp chatter feeds the fantasy app and player props, not game spreads: rookie usage is one of the few genuinely under-priced early-season signals.
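One way the typed event and its routing could look, as a sketch. The field names follow the schema above; the `position` field, the 0 to 1 severity scale, and the lever labels returned by `route` are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

EVENT_TYPES = {"ruled_out", "limited", "benched", "role_change"}


@dataclass(frozen=True)
class NewsEvent:
    player: str
    team: str
    event_type: str
    severity: float                   # assumed scale: 0.0 (minor) to 1.0 (certain/major)
    source: str
    timestamp: datetime
    position: Optional[str] = None    # hypothetical extra field used for routing

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event_type: {self.event_type}")
        if not 0.0 <= self.severity <= 1.0:
            raise ValueError("severity must be in [0, 1]")


def route(event: NewsEvent) -> str:
    """Name of the existing lever that should consume this event."""
    if event.position == "QB" and event.event_type in {"ruled_out", "benched"}:
        return "qb_lever"
    if event.event_type in {"ruled_out", "limited"}:
        return "injury_impact"
    return "narrative_override"


example = NewsEvent(
    player="Example QB", team="KC", event_type="ruled_out", severity=1.0,
    source="beat_reporter", position="QB",
    timestamp=datetime(2026, 9, 13, 15, 0, tzinfo=timezone.utc),
)
```

Validating the event type at construction keeps free-form LLM output from leaking an unrecognized category into the levers.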

Exit criteria: a news event changes a prediction within minutes, and the change shows up as positive CLV in bet_log.csv (not just a nicer-looking pick).

Phase 8 — Player knowledgebase & matchup simulation — 📋 PLANNED (highest ceiling)

Today everything aggregates to team level. The data to go deeper — depth charts, NGS-style tracking, block-win rates — is loaded but unused. This is where totals/team-totals/props edges live (softer lines than sides).

Knowledgebase: one record per player blending career + recent form + role; RAG a premium analytics/fantasy publication into it (prose in, capped numbers out — the DefensiveNarrative.md → rank-differential pattern is the model to copy; never let a publication's number flow in unbounded).
Drive-level Monte Carlo: simulate possessions from team pace + EPA/success by play type, with player modifiers on the key matchups (WR1 vs CB1 separation, pass-block win rate vs pass-rush win rate). 10k sims/game yields a full score distribution, not a point estimate — which is what spreads/totals actually are, and lets the key-number PMF in market_math be driven by the sim instead of a fitted normal.
Aim it at totals and props first (softest markets), then sides.
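A deliberately tiny drive-level Monte Carlo, to show the shape of the idea rather than a usable model. Each drive ends in a touchdown (7), a field goal (3), or nothing; the per-drive probabilities, the 11 drives per team, and every function name are hypothetical, and player modifiers are not modeled.

```python
import random


def simulate_game(rng, home, away, drives_per_team=11):
    """One simulated game; home and away are (p_touchdown, p_field_goal) per drive."""
    def team_points(p_td, p_fg):
        points = 0
        for _ in range(drives_per_team):
            u = rng.random()
            if u < p_td:
                points += 7
            elif u < p_td + p_fg:
                points += 3
        return points
    return team_points(*home), team_points(*away)


def simulate_distribution(home, away, n_sims=10_000, seed=0):
    """Empirical PMFs of home margin and game total over n_sims simulated games."""
    rng = random.Random(seed)           # seeded so runs are reproducible
    margins, totals = {}, {}
    for _ in range(n_sims):
        h, a = simulate_game(rng, home, away)
        margins[h - a] = margins.get(h - a, 0) + 1
        totals[h + a] = totals.get(h + a, 0) + 1
    margin_pmf = {m: c / n_sims for m, c in margins.items()}
    total_pmf = {t: c / n_sims for t, c in totals.items()}
    return margin_pmf, total_pmf


def prob_home_covers(margin_pmf, home_line):
    """P(home covers | no push) read directly off the simulated margin PMF."""
    win = sum(p for m, p in margin_pmf.items() if m + home_line > 0)
    loss = sum(p for m, p in margin_pmf.items() if m + home_line < 0)
    return win / (win + loss)


def prob_over(total_pmf, total_line):
    """P(over | no push) read off the simulated total PMF."""
    over = sum(p for t, p in total_pmf.items() if t > total_line)
    under = sum(p for t, p in total_pmf.items() if t < total_line)
    return over / (over + under)


margin_pmf, total_pmf = simulate_distribution(home=(0.30, 0.15), away=(0.20, 0.15))
```

Because scores are built from 7s and 3s, the simulated margins cluster on football-shaped numbers without any fitted key-number weights, which is the property that would let a simulation replace the normal-envelope PMF.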

Why this is how Vegas "is so good": books don't out-predict the field — they post a number and let the sharpest money move it to the close. The closing line is an aggregate of every good model + inside info. We can't beat that head-on on marquee sides; we can beat openers, stale derivative lines, and slow-to-move props with speed (Phase 7) and distributional depth (Phase 8). That's why CLV is the scoreboard, not accuracy.

Suggested new/changed weights (for the default weights config)

{
  // Phase 2 — QB lever
  "qb_value_scale": 1.0,          // points per unit of starter-vs-backup delta
  "qb_value_cap": 7.0,            // max QB swing on the spread

  // Phase 4 — rating spine & HFA
  "home_advantage": 0.8,          // lower the multiplier; modern HFA ~1.5–2
  "ridge_lambda": 100.0,          // tuned walk-forward
  "mov_cap": 28.0,                // blowout dampening for ratings
  "time_decay_half_life_weeks": 8,

  // Phase 3 — probability layer
  "margin_sigma": 13.5,           // spread→prob SD
  "total_sigma": 10.0,            // total→prob SD

  // Phase 5 — market blending & sizing
  "market_blend_w": 0.4,          // weight on our model vs. the fair line
  "kelly_fraction": 0.25,         // quarter-Kelly
  "max_bet_pct": 0.03,            // per-bet bankroll cap
  "min_edge_prob": 0.03           // required p_model − p_fair to flag a bet
}

(Values are starting points to be tuned walk-forward, not gospel.)

Priority order (if you do nothing else)

~~Phase 1 — de-vig + CLV + calibration~~ — ✅ done.
~~Phase 2 — QB lever~~ — ✅ done (the QB-value model).
~~Phase 3 — probability layer + key numbers~~ — ✅ done (per-spread σ + cover%/over% on the bets CSV included).
~~Phase 4 — ridge rating spine~~ — ✅ built, tested, leak-free, and enabled @0.5 (improves SU Brier every season on a real backtest). Left: walk-forward re-tune to push it higher + demote power_rank_slot; per-venue HFA.
Phase 5/6 — ✅ market blending shipped (market_blend_w 0.7 — sharply better forecast MAE/calibration), and the leak-free walk-forward tune is now the enforced default (per-year as-of profiles for training and holdout). Re-run 2026-06 (train 2022–24 / holdout 2025, 1000 trials): the candidate again lost on the holdout, so the incumbent weights were kept — a third confirmation the engine is near its honest ceiling with no exploitable ATS edge. scheme_epa_blend evaluated and rejected (hurts both holdouts; kept at 0).
✅ CLV capture is live — bet_log.py + settle_bets.py close the last open Phase-1 clause: real bets now get a closing line, result, and CLV. This is the scoreboard for everything after.
Phase 7 — structured news extraction (typed events → existing levers; speed beats sentiment). 📋 planned.
Phase 8 — player knowledgebase + drive-level Monte Carlo for distributions; aim at totals/props. 📋 planned, highest ceiling.

Each phase is independently shippable and independently testable against the backtester. Build Phase 1's measurement first so every later phase can prove it actually helped — by CLV and ROI vs. the close, not by score accuracy.
