Reading the Board — Learn — A Second Foundation

Why a leaderboard

Most research projects report their results in prose, which is exactly the problem: prose can be arranged. A leaderboard is harder to arrange. Every model takes the same exam — the same sealed events, graded by the same frozen scorer — and the numbers land in the same table, sorted by a rule no one gets to bend for a favorite. On this project the leaderboard is where structural theories of political crisis stop being debated and start being scored.

Two things the board is not. It is not a ranking of truth: every number on it is an estimate from one small sealed sample, with all the noise that implies (the third section of this chapter is about exactly that). And it is not an admission list: as chapter 2 explained, rank never admits a model to the official ensemble; only a pre-registered severe test and the Philosopher gate do. The board’s job is humbler and more useful: to make every claim comparable, public, and impossible to quietly forget.

One absence is worth noticing. The original 8-dimensional grand formula, the project’s first 44 sessions of work, is not on the board at all. It sits at Tier 0: unable to emit a probability, it cannot take the exam. Every variant listed below, whatever its score, already does something the old design could not do at all.

Column by column

Here is what each column means, in reading order. Model is the variant: one runnable program, one structural hypothesis. Family is its mechanism class; families matter more than individuals, because a lone good row is probably noise while a family that keeps surviving tests is a lead. Brier is the headline score from chapter 3: lower is better, 0.25 is chance, 0.18 is roughly market consensus. Resolution is the unfakeable part of skill (does the model actually separate crisis from calm?) and is arguably the column that matters most. Neg-ctrl is the negative-control Brier, the crying-wolf detector, always read jointly with the headline score. Tier grades the raw score against the reference ladder. Status says whether the row is admitted to the ensemble or experimental, that is, on the board but counting for nothing.
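To make the Brier and Tier columns concrete, here is a minimal sketch in Python. The Brier formula is the standard one; the `tier` function is a hypothetical rule inferred from the rows of the table (below the market line reads T2, between market and chance reads T1, at or above chance reads T0) and is not taken from the project's own code. The outcomes are made up.

```python
from typing import Sequence

CHANCE_LINE = 0.25   # Brier of a constant 0.5 forecast
MARKET_LINE = 0.18   # rough market-consensus Brier quoted in the text


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def tier(brier: float) -> str:
    """Hypothetical tier rule inferred from the table, not the project's code."""
    if brier < MARKET_LINE:
        return "T2"
    if brier < CHANCE_LINE:
        return "T1"
    return "T0"


# A forecaster that always says 0.5 lands exactly on the chance line,
# whatever the outcomes turn out to be.
outcomes = [1, 0, 0, 1, 0, 1, 1, 0]            # made-up outcomes
coin_flip = brier_score([0.5] * len(outcomes), outcomes)
print(coin_flip, tier(coin_flip))               # 0.25 T0
```

This is why 0.25 is the line to beat: a model needs no knowledge at all to reach it.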

Now explore the real thing. This is the live board, the same data shown on the redesign page: click any column header for the lesson, any row for a reading generated from its numbers.


Model	Family	Brier	Resolution	Neg-ctrl	Tier	Status
ensemble	equal-weight	0.291	0.096	0.068	T0	admitted-only
pitf_logit	regime_logit	0.175	0.141	0.256	T2	excl.
hierarchical_bayes	empirical_bayes	0.219	0.084	0.278	T1	excl.
hazard_spline	hazard_spline	0.220	0.095	0.405	T1	excl.
conformal_wrapper	calibration_meta	0.221	0.121	0.158	T1	excl.
sdt_turchin	structural_demographic	0.230	0.173	0.222	T1	excl.
train_freq	empirical_frequency	0.234	0.064	0.161	T1	admitted
firth_logit	penalised_logit	0.269	0.192	0.297	T0	excl.
gbm_honest	gradient_boosting	0.281	0.115	0.144	T0	excl.
reign_logit	duration_logit	0.330	0.060	0.209	T0	excl.
null_baseline	null	0.370	0.095	0.038	T0	admitted

Hold-out events: 26 | Negative controls: 10 | Chance line: 0.25 | Market line: 0.18 | PBO: 0.70

How to read this board

Click a column header to learn what that number means, or click a row for a reading generated from its live data. Two defaults to hold onto while you explore: rank on this table is not admission — the official ensemble counts only admitted variants — and as of June 2026 the luck-corrected (EVT-deflated) best Brier sits exactly at the chance line of 0.25. No validated skill yet.

A detail that often surprises readers: the row with the best raw Brier on the board is marked excl. That is not an oversight — it is chapter 3’s pitf_logit story rendered as a table cell. The board keeps falsified and experimental models visible, scores and all, because hiding them would itself be a form of result-shopping.

Overlapping uncertainty

Every Brier on the board is computed from 26 sealed events. That is a deliberately precious sample (curated, balanced with 10 negative controls, and never expanded casually), but it is small, and small samples make noisy scoreboards. Re-draw the events and the scores would shuffle; some of the gaps between adjacent rows are real, and some are weather.

The honest tool for that problem is the bootstrap confidence interval: re-score each model thousands of times on random re-samples of the test events and record the range its score wanders over. On samples this size the resulting intervals are wide — and when two models’ intervals overlap heavily, the data simply cannot tell those models apart, whatever their ranks say. The project’s internal leaderboard carries a 90 percent interval on every row; the public table shows point estimates, so carry the rule with you instead (as of June 2026): refuse to read drama into small gaps. A model two ranks higher is not “winning” — it is indistinguishable.
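The resampling procedure described above can be sketched in a few lines. This is a plain percentile bootstrap under stated assumptions, not the project's internal implementation: the two models, their forecasts, and the outcomes are synthetic stand-ins, and the names `bootstrap_ci`, `model_a`, and `model_b` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def brier(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def bootstrap_ci(probs: Sequence[float], outcomes: Sequence[int],
                 n_boot: int = 5000, level: float = 0.90,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the Brier score."""
    rng = random.Random(seed)
    n = len(outcomes)
    scores: List[float] = []
    for _ in range(n_boot):
        # Re-sample the test events with replacement and re-score.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum((probs[i] - outcomes[i]) ** 2 for i in idx) / n)
    scores.sort()
    tail = (1.0 - level) / 2.0
    lo = scores[int(tail * n_boot)]
    hi = scores[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lo, hi


def clip(p: float) -> float:
    return min(0.95, max(0.05, p))


# Synthetic stand-in data: 26 events, two forecasters of slightly different quality.
data_rng = random.Random(42)
outcomes = [1 if data_rng.random() < 0.4 else 0 for _ in range(26)]
model_a = [clip(0.4 + 0.25 * (y - 0.4) + data_rng.gauss(0, 0.15)) for y in outcomes]
model_b = [clip(0.4 + 0.15 * (y - 0.4) + data_rng.gauss(0, 0.15)) for y in outcomes]

intervals: Dict[str, Tuple[float, float]] = {}
for name, probs in (("model_a", model_a), ("model_b", model_b)):
    intervals[name] = bootstrap_ci(probs, outcomes)
    lo, hi = intervals[name]
    print(f"{name}: Brier {brier(probs, outcomes):.3f}, 90% interval [{lo:.3f}, {hi:.3f}]")

# Two intervals overlap when the larger lower bound sits below the smaller upper bound.
overlap = max(intervals["model_a"][0], intervals["model_b"][0]) <= \
    min(intervals["model_a"][1], intervals["model_b"][1])
print("intervals overlap:", overlap)
```

Changing the seed or the sample size is a quick way to see how slowly the intervals narrow as events are added.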

And one number summarizes how seriously to take the top of the table: PBO, the probability of backtest overfitting from chapter 4, currently reads 0.7 — a 70% estimated chance that the board's raw leader is leading by luck. Split the data in half, crown a champion on one half, and odds are it fails to repeat on the other. A leaderboard whose own metadata says “the leader is probably a fluke” is not being self-deprecating; it is being accurate about sample size.
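The split-half intuition can be sketched directly. What follows is a simplified illustration, not the procedure from chapter 4 (which may differ, for example by enumerating splits combinatorially): it draws random half-splits, crowns the best model on one half, and counts how often that champion lands in the worse half of the field on the other. The function name `split_half_pbo` and the ten no-skill models are hypothetical.

```python
import random
from typing import List


def split_half_pbo(sq_errors: List[List[float]], n_splits: int = 2000,
                   seed: int = 0) -> float:
    """sq_errors[m][i] is the squared error of model m on event i.

    Returns the fraction of random half-splits in which the champion of
    half A is beaten by at least half of the models on half B.
    """
    rng = random.Random(seed)
    n_models = len(sq_errors)
    n_events = len(sq_errors[0])
    events = list(range(n_events))
    failures = 0
    for _ in range(n_splits):
        rng.shuffle(events)
        half_a, half_b = events[: n_events // 2], events[n_events // 2:]
        score_a = [sum(row[i] for i in half_a) / len(half_a) for row in sq_errors]
        score_b = [sum(row[i] for i in half_b) / len(half_b) for row in sq_errors]
        champion = min(range(n_models), key=score_a.__getitem__)
        # Count the models that strictly beat the champion out of sample.
        better = sum(1 for s in score_b if s < score_b[champion])
        if better >= n_models / 2:
            failures += 1
    return failures / n_splits


# Ten made-up models with no skill: forecasts ignore the outcomes entirely.
rng = random.Random(1)
outcomes = [1 if rng.random() < 0.4 else 0 for _ in range(26)]
noise_models = [[(rng.uniform(0.3, 0.7) - y) ** 2 for y in outcomes]
                for _ in range(10)]
pbo = split_half_pbo(noise_models)
print(f"estimated PBO on pure-noise models: {pbo:.2f}")
```

A field with one genuinely dominant model drives this number toward zero; a field of look-alikes keeps it high, which is the situation a reading of 0.7 describes.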

The honest headline

So how should the whole table be summarized? Not by its best cell. The raw best Brier on the board belongs, as of June 2026, to a falsified model, and even setting that aside, a best-of-N raw score is precisely the quantity chapter 4 showed to be manufactured by luck. The project’s rule is to headline the luck-corrected number: after EVT deflation (the haircut sized to the number of models tried), the best Brier on the board is 0.2500. Exactly the chance line. As of June 2026, the honest headline is: no deflated evidence of forecasting skill yet.
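Why a best-of-N score needs a haircut at all can be seen with a small Monte Carlo. This sketch is only an illustration of the selection effect; it is not the project's EVT deflation, whose details are not given here. It assumes made-up no-skill models whose forecasts are drawn uniformly from [0.3, 0.7] and scored on 26 coin-flip events, and it uses ten models only because the board lists ten non-ensemble rows.

```python
import random


def mean_best_of_n(n_models: int, n_events: int = 26,
                   n_sims: int = 2000, seed: int = 0) -> float:
    """Average of the best (lowest) Brier among n_models no-skill forecasters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        outcomes = [1 if rng.random() < 0.5 else 0 for _ in range(n_events)]
        # Each model guesses at random, independent of the outcomes.
        best = min(
            sum((rng.uniform(0.3, 0.7) - y) ** 2 for y in outcomes) / n_events
            for _ in range(n_models)
        )
        total += best
    return total / n_sims


single = mean_best_of_n(1)      # what one no-skill model scores on average
best_of_ten = mean_best_of_n(10)  # what the luckiest of ten scores on average
print(f"one model:   {single:.3f}")
print(f"best of ten: {best_of_ten:.3f}")
print(f"luck bonus:  {single - best_of_ten:.3f}")
```

The gap between the two averages is pure selection: none of the simulated models knows anything, yet the best of ten looks better than any one of them is entitled to. A deflation step exists to take that gap back out of the headline.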

The number the project stakes its name on is the official ensemble’s Brier: currently 0.291, read live from the same file that renders the board. Set it against the chance line of 0.25 and the conclusion states itself: because lower is better, the ensemble currently scores worse than chance.

What would change the headline

Watch for one specific event: a variant that passes its pre-registered severe test and is admitted, whose deflated confidence interval clears the chance line, with resolution intact and negative controls unharmed. That conjunction — not a pretty raw Brier, not a rank-one row, not an exciting hypothesis from the nightly sweep — is what evidence of skill will look like here. If it ever appears, this page will say so. Until then, the board says “not yet,” and means it.
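The conjunction above can be written as a single predicate. This is a sketch under stated assumptions: the `Candidate` fields are hypothetical names, the two qualitative checks are taken as already-decided booleans, and "clears the chance line" is read as the whole deflated interval sitting below 0.25, since lower Brier is better.

```python
from dataclasses import dataclass

CHANCE_LINE = 0.25


@dataclass
class Candidate:
    admitted: bool               # passed its pre-registered severe test and was admitted
    deflated_ci_upper: float     # upper end of the deflated Brier interval (hypothetical field)
    resolution_intact: bool      # resolution has not collapsed
    neg_controls_unharmed: bool  # negative-control Brier has not degraded


def is_evidence_of_skill(c: Candidate) -> bool:
    """All four conditions must hold at once; any single one is not enough."""
    return (
        c.admitted
        and c.deflated_ci_upper < CHANCE_LINE
        and c.resolution_intact
        and c.neg_controls_unharmed
    )


# Made-up candidates: a pretty score without admission does not count.
pretty_but_excluded = Candidate(False, 0.21, True, True)
admitted_but_noisy = Candidate(True, 0.27, True, True)
the_real_thing = Candidate(True, 0.23, True, True)

print(is_evidence_of_skill(pretty_but_excluded))  # False
print(is_evidence_of_skill(admitted_but_noisy))   # False
print(is_evidence_of_skill(the_real_thing))       # True
```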

The board flatters; the deflation tells the truth. Read the headline from the deflated number, always.

One chapter remains in the machinery tour: the bets, and what happens when the models leave sealed history and stake real probabilities on the future.

What to remember

✓ The leaderboard makes every model take the same sealed exam under the same frozen grader; it is a comparison device, not a truth ranking and not an admission list.
✓ Read Brier, resolution, and negative-control Brier together: the headline can be hedged or lucky, resolution cannot.
✓ Scores on 26 sealed events are noisy; overlapping confidence intervals mean adjacent ranks are indistinguishable, so never read drama into small gaps.
✓ The board’s own PBO says the raw leader would more likely than not fail to repeat out of sample.
✓ The honest headline comes from the deflated best (0.2500, exactly chance, as of June 2026), and what would change it is an admitted variant clearing chance after deflation with resolution and negative controls intact.