How do you grade a probability? The Brier score, the difference between being calibrated and being useful, and the trap that makes a model look brilliant while learning nothing.
Suppose a model says: the probability of an irregular regime change in country X within five years is 70 percent. Five years pass and nothing happens. Was the model wrong?
You genuinely cannot tell — not from one case. A 70 percent forecast explicitly reserves a 30 percent chance of quiet, and quiet is what occurred. The forecast might have been excellent; the unlikely branch simply came up. It might also have been nonsense. A single resolved event cannot distinguish the two, any more than one hand of poker can tell you whether a player is skilled.
This is the basic predicament of serious forecasting, and everything in this chapter follows from it. Probabilistic forecasters can only be graded in bulk: across many forecasts, scored by a rule that rewards honesty and punishes both cowardice and bluster. Models in this project speak the same dialect everywhere: a hazard, the probability that a given kind of event hits a given country within a given window. The natural starting point for any such number is the base rate, meaning how often events of that kind happen historically, before you know anything special about the case at hand. Chapter 6 returns to base rates in detail; here, hold onto one idea: a forecast earns credit only by improving on that historical default.
The grading rule this project lives and dies by is the Brier score, and it fits in one sentence: take the gap between the stated probability and what happened, square it, and average over all events.
The Brier score
For each event, write the outcome as 1 (it happened) or 0 (it did not). The Brier score is the average of (forecast − outcome)² across all events. It runs from 0 (forecasting perfection) to 1 (perfect confidence in the wrong direction). Lower is better.
Three worked examples make it concrete. You say 90 percent and the event happens: (0.9 − 1)² = 0.01 — a tiny penalty, almost a perfect call. You say 90 percent and it does not happen: (0.9 − 0)² = 0.81 — a brutal one. You shrug and say 50 percent: you score 0.25 either way, no matter what happens. That shrug is the most important landmark on the whole scale, because a forecaster who knows nothing can always achieve it. Scoring 0.25 on average means scoring at chance.
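The arithmetic above is short enough to check by hand, but a few lines of code make it reusable. The following is a minimal sketch, not the project's scorer; the function name `brier_score` and the variable names are chosen here purely for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between stated probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty lists of equal length")
    squared_gaps = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)


# The three worked examples, each scored as a single forecast.
confident_hit = brier_score([0.9], [1])    # said 90%, it happened
confident_miss = brier_score([0.9], [0])   # said 90%, it did not
shrug_if_yes = brier_score([0.5], [1])     # said 50%, it happened
shrug_if_no = brier_score([0.5], [0])      # said 50%, it did not

print(round(confident_hit, 4), round(confident_miss, 4), shrug_if_yes, shrug_if_no)
```

Passing longer lists gives the averaged score that actually matters; a single-element list is only ever one draw.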
[Interactive widget: set a forecast with the slider, pick the outcome, and read the Brier score for that single forecast against the reference lines. One forecast is one draw; the real grade is the average over many.]
The reference lines in the widget are the same ones drawn on the project’s leaderboard, read from the same data file. Chance is 0.25: the score of pleading ignorance on every question. Around 0.18 is roughly what prediction markets (the pooled bets of thousands of people with money at stake) achieve on questions like these. And around 0.15 is the territory of elite human forecasting teams, the so-called superforecasters. Any claim that a model here “works” cashes out as a claim about where it sits on that ladder.
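Reading a score against the ladder can be sketched as a simple lookup. The thresholds below are the three reference values quoted above (two of them only approximate); the names `REFERENCE_LINES` and `ladder_position` are hypothetical and do not come from the project's data file or code.

```python
# Reference lines from the text, best (lowest) first. The market and
# superforecaster values are "around" figures, not exact constants.
REFERENCE_LINES = [
    ("superforecaster", 0.15),
    ("prediction-market", 0.18),
    ("chance", 0.25),
]


def ladder_position(brier):
    """Name the best reference line that a Brier score reaches."""
    for name, line in REFERENCE_LINES:
        if brier <= line:
            return f"at or below the {name} line"
    return "worse than chance"


for score in (0.12, 0.17, 0.22, 0.30):
    print(score, "->", ladder_position(score))
```

The lookup only makes sense for an average over many forecasts; applied to one forecast it describes a single draw, not a model.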
The Brier score has a sharper-tongued companion called log-loss, which the scorer also computes. Its distinguishing feature is how it treats arrogance: saying 99 percent and being wrong is catastrophic under log-loss, far worse than under Brier. A model that looks acceptable on one score and terrible on the other is usually telling you something about how it fails.
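The difference in temperament is easy to see numerically. The sketch below scores a single wrong forecast under both rules; it assumes log-loss is computed with the natural logarithm, which is the common convention but is not confirmed here for the project's scorer, and the function names are illustrative.

```python
import math


def brier_single(forecast, outcome):
    """Squared gap for one forecast; can never exceed 1."""
    return (forecast - outcome) ** 2


def log_loss_single(forecast, outcome, eps=1e-15):
    """Negative log of the probability assigned to what actually happened.

    Assumes natural log. Forecasts are clipped away from 0 and 1 so a
    flat-out 100% miss yields a very large number instead of an error.
    """
    p = min(max(forecast, eps), 1 - eps)
    return -math.log(p if outcome == 1 else 1 - p)


# The event did NOT happen (outcome 0); watch both penalties as confidence rises.
for forecast in (0.7, 0.9, 0.99, 0.999):
    print(forecast,
          round(brier_single(forecast, 0), 4),
          round(log_loss_single(forecast, 0), 3))
```

Brier saturates just below 1 as the wrong forecast approaches certainty, while log-loss keeps growing without bound, which is why the two scores can disagree sharply about an overconfident model.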
Here is the deepest idea in this chapter — the one that, once seen, changes how you read every forecasting claim anywhere, not just on this site. Forecast skill is not one quantity. The Brier score quietly adds together two very different virtues, and one of them can be faked.
Calibration asks: when you say 30 percent, does the event happen about 30 percent of the time? It is a kind of statistical honesty: your numbers mean what they say. Resolution asks something harder: do your forecasts actually separate the cases? Do you say high numbers before the crises and low numbers before the quiet years, or do you say the same safe number every time?
The distinction matters because calibration alone is cheap. Consider two weather forecasters in a region where it rains 30 percent of days:
Two calibrated forecasters, ten days, three of them rainy
The hedger
Says 30% every single day, which is the regional average. Over the ten days its 30%s come true 30% of the time, so it is perfectly calibrated. It has also told you nothing about which day to bring an umbrella. Zero resolution.
The discriminator
Says 85% ahead of the rainy days and 5% ahead of the dry ones. Equally calibrated — and actually useful, because its forecasts separate the two kinds of day. High resolution. This is the quantity a model cannot fake.
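One standard way to put numbers on the two virtues is the Murphy decomposition, which splits the Brier score as reliability − resolution + uncertainty, where reliability is the calibration penalty (lower is better) and resolution is the separation reward (higher is better). Whether the project's scorer uses exactly this decomposition is an assumption; the sketch below simply applies it to the two forecasters above, taking the ten-day example literally and picking arbitrary positions for the three rainy days.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty).

    Forecasts are grouped by their exact stated value, so
    Brier = reliability - resolution + uncertainty holds exactly.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


rain = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]                # three rainy days in ten
hedger = [0.30] * 10                                  # the regional average, every day
discriminator = [0.85 if r else 0.05 for r in rain]   # high before rain, low otherwise

for name, forecasts in (("hedger", hedger), ("discriminator", discriminator)):
    rel, res, unc = murphy_decomposition(forecasts, rain)
    print(name, "reliability", round(rel, 4), "resolution", round(res, 4),
          "brier", round(rel - res + unc, 4))
```

In this literal reading the hedger has zero resolution and a Brier score equal to the uncertainty term alone, while the discriminator captures essentially all the available resolution. The discriminator also picks up a small reliability penalty, because 85% and 5% are slightly underconfident when every such call comes true; over a longer record with occasional misses that penalty would shrink.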
Now the trap. Because the Brier score blends both virtues, a model can improve its Brier score while learning nothing, simply by hedging its answers toward the base rate. Squash every forecast toward the historical average and the squared errors shrink on a typical sample, the headline number gets prettier, and the model has moved backwards in the only sense that matters: it now says less about which society is actually in danger. The hedger’s path to a better score is always open, costs nothing, and teaches nothing.
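The trap can be reproduced on a toy sample. In the sketch below the outcomes and forecasts are invented for illustration: an overconfident model that is right about as often as it is wrong gets every forecast pulled halfway toward the base rate. The helper `shrink_toward` is a hypothetical name, not a function from the project.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between stated probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def shrink_toward(forecasts, base_rate, weight):
    """Pull each forecast toward the base rate; weight=1 leaves it unchanged."""
    return [base_rate + weight * (f - base_rate) for f in forecasts]


# Invented data: three events in ten cases, and a loud but unreliable model.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
raw = [0.9, 0.9, 0.1, 0.1, 0.1, 0.9, 0.1, 0.9, 0.1, 0.1]

base_rate = sum(outcomes) / len(outcomes)
hedged = shrink_toward(raw, base_rate, weight=0.5)

print("raw Brier   ", round(brier_score(raw, outcomes), 4))
print("hedged Brier", round(brier_score(hedged, outcomes), 4))
print("spread raw   ", round(max(raw) - min(raw), 2))
print("spread hedged", round(max(hedged) - min(hedged), 2))
```

The hedged forecasts rank the ten cases in exactly the same order as the raw ones, so nothing new has been learned about which case is dangerous, yet the headline score improves and the gap between the "high" and "low" forecasts halves.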
That is why this project treats resolution — the unfakeable part — as the quantity that has to move before anyone is allowed to get excited, and why a falling resolution is treated as an alarm even when the headline score is improving. You are about to see both rules earn their keep on a real case.
Medicine does not test a drug only on the sick; it needs to know what happens to people who were never going to deteriorate. Forecasting needs the same discipline. A negative control is a test case chosen because the event did not happen: a society under visible, severe stress — economic crisis, mass protest, institutional strain — that nevertheless held together.
The project’s sealed evaluation set deliberately contains 10 of them, and they exist to catch a specific cheat. A model can buy a flattering headline score on crisis-heavy data by simply predicting doom everywhere, crying wolf as a strategy. The negative controls are where that strategy bleeds: a model with real signal should push its probabilities down on the societies that held, not up.
This gives the project one of its sharpest diagnostic patterns. When a change makes the headline Brier better while making the negative-control Brier worse, that combination is the anti-signature: the model has not learned to see crises coming, it has learned to shout more often. The scorer computes the negative-control score separately and the leaderboard prints it as its own column, precisely so this pattern cannot hide inside an average.
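The pattern is easy to state as a check. The sketch below uses an invented crisis-heavy sample of eight crises and two negative controls, and compares a shrugging baseline with a cry-wolf variant; the function `is_anti_signature` and all the numbers are illustrative assumptions, not the project's scorer or data.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between stated probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def control_brier(forecasts, outcomes, is_control):
    """Brier score restricted to the negative-control cases."""
    picked = [(f, o) for f, o, c in zip(forecasts, outcomes, is_control) if c]
    return brier_score([f for f, _ in picked], [o for _, o in picked])


def is_anti_signature(headline_before, headline_after, control_before, control_after):
    """True when the headline improves while the negative controls get worse."""
    return headline_after < headline_before and control_after > control_before


# Invented, crisis-heavy sample: 8 crises, then 2 stressed societies that held.
outcomes = [1] * 8 + [0] * 2
is_control = [False] * 8 + [True] * 2

baseline = [0.5] * 10   # shrugs on everything
cry_wolf = [0.8] * 10   # predicts doom everywhere

flag = is_anti_signature(
    brier_score(baseline, outcomes),
    brier_score(cry_wolf, outcomes),
    control_brier(baseline, outcomes, is_control),
    control_brier(cry_wolf, outcomes, is_control),
)
print("anti-signature:", flag)
```

Because every negative control has outcome 0, its Brier contribution is just the squared forecast, so any across-the-board increase in stated risk shows up there immediately even when the overall average looks better.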
None of the above is hypothetical. In June 2026 it all happened at once, on this project, to the best-scoring model it had ever produced — and the paper trail is public.
A variant called pitf_logit, encoding the published findings of a long-running US government instability-forecasting program, posted a raw Brier of 0.1745 on the sealed hold-out. Look at the ladder above: that is not just better than chance, it is better than the market line. The single best raw score in the project’s history, before or since. A less paranoid process would have published a victory announcement that afternoon.
Instead the score was put through the checks this chapter has been building, and all three came back bad. The model’s resolution had dropped relative to the baseline it was meant to improve: its forecasts separated crisis from calm less well, even as its headline number shone. Its negative-control Brier got worse, which is the anti-signature. And on a pre-registered ablation test (remove the new features, re-score on training data, and measure how much discrimination they actually added) the contribution came in below the bar the project had locked in writing before looking. The flattering Brier was recalibration absorbing the gap: the model had drifted toward well-hedged answers, not understanding.
So the mechanism was declared FALSIFIED, permanently, in the project’s falsification register — best raw score on the board notwithstanding. It remains visible on the leaderboard today, rank one by raw Brier, excluded from the official ensemble, a standing exhibit of why the headline number is never enough.
Brier alone is never admission evidence. Resolution, negative-control behavior, and a pre-registered severe test are.
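As a rough illustration of that rule, the three kinds of evidence can be written as a gate that never looks at the headline Brier at all. Everything in the sketch below is hypothetical: the `Evidence` fields, the function name, and the example numbers are invented for illustration, and the real pass bar is whatever the project locked in writing beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    resolution_change: float      # candidate minus baseline; higher is better
    control_brier_change: float   # candidate minus baseline; lower is better
    ablation_gain: float          # discrimination added by the new features
    ablation_bar: float           # threshold fixed in writing before scoring


def admission_failures(evidence):
    """List every check that failed; an empty list means the candidate passes.

    The headline Brier is deliberately not an input.
    """
    failures = []
    if evidence.resolution_change < 0:
        failures.append("resolution dropped")
    if evidence.control_brier_change > 0:
        failures.append("negative-control Brier worsened")
    if evidence.ablation_gain < evidence.ablation_bar:
        failures.append("ablation gain below pre-registered bar")
    return failures


# Invented numbers shaped like the case described above: all three checks fail.
example = Evidence(resolution_change=-0.01, control_brier_change=0.02,
                   ablation_gain=0.005, ablation_bar=0.02)
print(admission_failures(example))
```

Keeping the bar as a field that must be supplied up front mirrors the point of pre-registration: the threshold exists before the result does.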
The general failure being guarded against has a name: HARKing, or hypothesizing after the results are known. Find a good number first, construct the story afterwards, and present the story as if it had been the prediction. The defense is always the same and always boring: write down what would count as success before you look. The pass bar that killed pitf_logit was written before its score existed, which is the only reason the kill is trustworthy. Chapter 4 tours the full machinery built on that principle.
What to remember