Autonomous research systems tend to game their own evaluation. This is the machinery built to make that as hard as possible: frozen scorers, sealed data, and statistical haircuts for luck.
The most reliable way for a research system to fail is not to be wrong. It is to grade itself, and pass. This is the best-documented failure mode of autonomous research efforts — agents, human or artificial, drifting toward the metric that flatters them — and this project was designed around the assumption that it would happen here too, given the chance. Every mechanism in this chapter exists to take a specific form of self-deception and make it structurally hard: not forbidden by politeness, but blocked by architecture.
Start with the most basic attack: if you cannot raise your grade, change the grading. Any system that controls its own evaluation code can quietly redefine success and call it progress.
So in this project, exactly one program is allowed to produce official numbers, and that program is frozen. The frozen scoreris locked with a cryptographic fingerprint — a SHA-256 hash, a short string that changes completely if even one character of the file changes. Before any score is trusted, the file is re-fingerprinted and checked against the recorded value. No agent in the system, however autonomous, has permission to edit it; the nightly pipeline goes further and treats any detected modification as tampering, automatically reverting the file and quarantining whatever touched it. Changing the metric is possible — but only as a rare, deliberate human act, recorded with its reasons in the project’s permanent history file.
The analogy to hold onto: students may study however they like, but the exam is printed and sealed, and the grading machine sits in a glass case. Anyone — including you, reading the public repository — can check the seal.
The second attack is subtler and far more common: study the test itself. A model that has seen the answer key does not need skill — memory will do. In statistics this contamination is called leakage, and the defense is the hold-out: data locked away from everyone who builds models, touched only at final scoring time.
Open — training data
Sealed — hold-out data
Who may read what
The sealing is a ceremony with a paper trail, not a habit. In June 2026, the panel hold-out — 1,198 country-years covering 2010 to 2015 — was sealed with its cryptographic fingerprint recorded in the history file before any model was permitted to fit the data it would be judged on. The training panel, 8,410 country-years spanning 1946 to 2004, stays open: models may study it freely. The curated event hold-out of 26 historical cases, including the negative controls from chapter 3, is sealed the same way.
Around the seal sits a rule about people — or rather, about roles. Any agent that proposes, builds, or tunes models is proposer-forbidden: it may never read the hold-out, for any reason. The proposer suggests; the frozen scorer judges; nobody grades their own homework. Reading the hold-out to inform a model’s design is called, in the project’s own integrity contract, the cardinal sin — because it silently converts “prediction” into “memory” while leaving every number looking legitimate. It is the one error that voids the project’s claims outright, which is why the claim that it has not happened is backed by architecture and audit trail rather than by promise. The same logic, applied to forecasts about the future instead of tests on the past, is called pre-registration— lock your answer before the world reveals its own — and chapter 6 shows it in action.
Historical data has a property that quietly breaks the standard textbook way of testing models. The standard way — cross-validation — splits the data into random folds, trains on most, tests on the rest, and rotates. Randomly splitting history, though, means your training set routinely contains the years just before and just after the very crisis you are testing on. And 1992 looks a great deal like 1991: countries change slowly, so adjacent years are nearly copies of each other. A model trained on 1991 and 1993 is not predicting 1992 — it is interpolating between two photographs of it. The test comes back glowing; the skill is an illusion of autocorrelation.
The fix is to cut the leakage out of the timeline explicitly. Purged cross-validation removes from the training set any years that overlap the test window, and an embargoadditionally removes the years immediately after it — because the aftermath of a crisis encodes the crisis too.
One fold of purged + embargoed cross-validation
In this repository, plain random cross-validation on the panel is not discouraged — it is banned. Every internal evaluation, including the nightly parameter tuning, runs purged and embargoed, so that even the numbers no outsider ever sees are computed under the honest split.
Now for the deepest problem, the one that survives even perfect data hygiene. Freeze the scorer, seal the data, purge the splits — and then simply try many models. The best of them will look skilled by pure luck. Test scores scatter randomly around their true level; pick the minimum of N random scatters and you have manufactured a number that flatters you, without a single dishonest step. This project’s nightly pipeline evaluates thousands of candidates. It is, by construction, a machine for generating lucky winners — so it must also be a machine for refusing to believe them.
Run the experiment yourself. Every model below knows nothing:
Try it — the luck of the draw
Every model below is pure noise: it answers each of the 26 events with a random probability. None of them knows anything, by construction. Watch what the best of them scores anyway.
Best of 25
0.226
Below the chance line — looks like skill. It is not.
Expected lucky edge
0.160
≈ sd × √(2·ln N) = 0.063 × √(2·ln 25)
Deflated best
0.385
Add the luck back out, and the “winner” lands at or above the chance line of 0.25.
Switch from 5 models to 500 and draw a few times: the more attempts you make, the more skilled the luckiest attempt looks — which is exactly why the leaderboard charges itself for every model it has tried.
The corrections the project applies are exactly the ones the widget demonstrates. First, measure the luck bandwidth: the permutation nullkeeps a model’s predictions fixed, shuffles the real outcomes at random hundreds of times, and records the spread of scores — the scatter that pure chance produces on this exact data. Second, charge for every attempt: EVT deflation subtracts from the best observed score the gap the luckiest of N tries is expectedto achieve, which grows like the square root of 2·ln N. The formula’s shape carries the lesson: trying more models always makes the luckiest one look better, but slowly — so the haircut grows relentlessly with N, and a system that tries thousands of candidates pays a real toll before it may claim anything.
Third, ask the cruelest question of all: PBO, the probability of backtest overfitting. Split the evaluation data in half; find the winner on the first half; check whether it stays the winner on the second. Repeat over many splits. PBO estimates the probability that the leaderboard’s apparent champion would fail to repeat out of sample. The current value, read live from the leaderboard data, is 0.7— meaning the apparent best model is more likely than not a creature of this particular sample. And after the EVT haircut, as of June 2026, the best deflated Brier on the board sits at 0.2500 — exactly the chance line. The raw board flatters; the deflation tells the truth; the project headlines the deflated number.
The luck problem returns at larger scale every night. The grammar sweep introduced in chapter 2tests thousands of candidate model structures per run. Suppose none of them — not one — contains any real signal. With conventional significance testing, roughly one in twenty would clear the bar anyway; out of a thousand candidates, that is fifty “findings” manufactured from nothing. Mass search without correction is a false-discovery factory.
The correction is the Benjamini–Hochberg procedure, which controls the false discovery rate: instead of asking “is this candidate impressive?”, it adjusts the bar across the whole batch so that of everything allowed through, at most a chosen fraction — ten percent here — is expected to be a fluke. The survivors earn a deliberately modest label: hypotheses, queued for the gauntlet of severe testing — never discoveries, never findings, never results.
A worked example
In the first panel-scale sweep, in June 2026, 236 candidate structures were tested and 11 survived the false-discovery filter. Intriguingly, every survivor involved elite factionalism — a lead the project is now chasing through proper severe tests. Note what was notsaid: not “factionalism predicts instability,” only “eleven hypotheses survived, and they rhyme.” The restraint is the method.
Everything in this chapter converges on a single document: the falsification register, the project’s book of locked bets against itself. Each entry is a severe test— a pass-or-fail criterion for one causal mechanism, written down before the evidence is examined.
What makes a test severe is not difficulty but design: it must be a test the mechanism would probably failif it were not real. “The model fits history well” is not severe — nearly anything flexible fits history. “Remove this feature set and TRAIN-side discrimination must drop by at least the pre-registered amount, while the negative controls must not get worse” is severe: a decorative feature flunks it. The tools of severity are the ablation— remove the part, measure what it was really contributing — and the hunt for confounds, hidden variables that explain a gain away. The register’s entry 7 records a textbook confound hunt: an apparent improvement that owed more to one feature set having better data coverage in recent decades than to any causal signal — the “signal” was partly which decade is this, dressed up as insight.
The register’s rules are append-only, in both directions. A criterion, once locked, may be marked PASSED or FAILED but never weakened — there is no “the test was too strict,” no quiet redefinition of success after the fact. And a mechanism marked FALSIFIEDstays falsified: re-entry requires a materially different mechanism, not a rematch. The register’s status ladder — FETCHED-UNWIRED (data acquired, not yet used), LIVE-UNTESTED(running, but its severe test still pending), SUPPORTED, FALSIFIED — is the honest answer to “what does this project actually know?”, mechanism by mechanism.
The pitf_logit story from chapter 3is the register working as designed: the best raw score in project history walked into a criterion locked before that score existed, failed it, and was recorded as a falsification — permanently, publicly, next to the flattering number it failed to justify.
Falsification is progress. The project now knows something it did not — and, more importantly, it can prove it did not move the goalposts to learn it.
With the machinery in hand, you are equipped for the leaderboard itself. Chapter 5 reads it column by column.
What to remember