Architecture redesign · throughput build · June 6, 2026

From One Formula to a System of Models

For 41 sessions the project polished a single equation it could never actually run. We rebuilt it the way real forecasting sciences work: a population of competing, runnable models on a leaderboard, scored by a frozen oracle, driven by an autonomous nightly loop. The June throughput build then broke the binding data constraint (a country-year panel of 11,130 observations), gave the night a mass structural-hypothesis search and a bounded agentic builder, and put the whole thing behind a multiple-testing-honest evaluation stack.

The honest question 41 sessions never answered — what is the actual forecast skill? — now has an honest answer: EVT-deflated, still at chance. We say so, and we built the machinery to move it.

Why we were stuck

A research engine that never computed could only accumulate citations

A ground-truth audit of the repository plus a 10-report research review converged on the same conclusion: the stagnation was architectural, not a matter of effort. An autonomous discovery loop needs four legs. The old design had none of them.

No population

One living formula, refined by consensus.

There was exactly one CURRENT.md. Every session re-polished the same document — gradient descent from a single seed. Productive discovery needs a population of competing candidates (FunSearch, AlphaEvolve, ShinkaEvolve). With one model, the diversity term that powers every ensemble is zero by construction.

No scalar oracle

The gate was prose, not a number.

A formula update was approved by the Philosopher's written verdict. Prose cannot detect overfitting or compute calibration — and it selects for coherence with the existing formula, filtering out exactly the diverse variants an ensemble needs. Every serious system gates on an immutable scalar an agent cannot edit.

No execution

Agents read papers; they never ran code.

All nine lead agents and their 135 sub-agents only searched and scored literature. None fit a model, ran a simulation, or executed a backtest. The 36 'calibrated' parameters were argued into existence from papers, never fit to data — curve-fitting by citation.

No held-out data

Backtests were circular and qualitative.

Historical 'tests' were narrative precondition assessments scored PASS/PARTIAL, with the calibration cases reused as validation cases. No locked hold-out, no numeric score, no negative controls. After 40 sessions the state vector and solver were still undefined — so the formula had never produced a single numerical prediction.

The forks we chose

Six decisions that define the new system

Scope

Clean-slate engine, keep the corpus

Rebuild the session engine around the model zoo; archive the single-formula apparatus; reuse the 74 accumulated parameters as priors.

Mathematics

A portfolio of tractable models now

Runnable models that can be backtested today (hazard models, regime logits, ABMs, ensembles). The grand unified PDE becomes one variant and a non-blocking long-horizon research track.

Compute

Agents that run code

A new tier of compute agents fit models, run simulations, and execute backtests every session. Literature research becomes an input, not the deliverable.

Targets

Dense, scoreable benchmarks

UCDP, V-Dem, PITF, the Cline Center coup data, Seshat, and IMF data supply continuous, numerically scored feedback. Polymarket becomes one signal among several.

Publication

Publish pre-resolution, labeled

Predictions are hash-locked before they are published, then shown live tagged by reflexivity class. The clean validation signal is held-out retrodiction; live markets are secondary.

Cadence

Nightly autonomous loop

A scheduled ratchet runs many fast backtest experiments each night: keep what improves the score, revert what does not.
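The keep-or-revert rule in the nightly loop can be sketched in a few lines. This is a minimal illustration, assuming a lower-is-better score such as Brier; `ratchet`, `toy_score`, and `toy_propose` are hypothetical names, and the proposal step is a plain random perturbation, not the project's actual search engine.

```python
import random


def ratchet(score, start, propose, n_trials, seed=0):
    """Keep-if-better search; lower score is better (for example a Brier score)."""
    rng = random.Random(seed)
    best, best_score = dict(start), score(start)
    history = [best_score]
    for _ in range(n_trials):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score < best_score:
            # Keep: the candidate becomes the new champion.
            best, best_score = candidate, candidate_score
        # Otherwise revert: the candidate is discarded and the champion stands.
        history.append(best_score)
    return best, best_score, history


# Toy stand-ins (hypothetical): a one-parameter model whose ideal beta is 0.3.
def toy_score(params):
    return (params["beta"] - 0.3) ** 2


def toy_propose(params, rng):
    return {"beta": params["beta"] + rng.gauss(0.0, 0.1)}


best, best_score, history = ratchet(toy_score, {"beta": 1.0}, toy_propose, n_trials=200)
```

Because a worse candidate is never accepted, `history` can only stay flat or fall, which is what makes the loop a ratchet. The same property is why the score it optimizes must be kept separate from the sealed hold-out.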

The model zoo

The formula is now a population of ten competing structures

Each variant is a genuinely different structural hypothesis with a runnable model — not a parameter tweak. They are scored identically by the frozen oracle and combined into an ensemble of the ADMITTED variants only. A variant earns admission by beating the null baseline on discrimination AND passing a pre-registered severe test; until then it stays on the board marked experimental. The June build added six new variants, most fit on the country-year panel.

null_baseline · planned

null

Reference-class base rate only. The mandatory benchmark — beat it or you are noise.

train_freq · planned

empirical

Smoothed frequency learned from training events — a control that exposes data contamination.

pitf_logit · planned

regime logit

PITF regime-type inverted-U with FIXED literature betas on infant mortality, factionalism, neighbor conflict.

firth_logit · planned

regime logit (fitted)

The same PITF channels but with coefficients FITTED on the panel under a Firth penalty — the Philosopher-sanctioned re-entry route. Board-record discrimination.

sdt_turchin · planned

structural-demographic

Turchin's Political Stress Index — elite overproduction, mass mobilization, state fiscal stress; now fed real youth-bulge data.

hierarchical_bayes · planned

hazard (partial pooling)

Beta-Binomial empirical-Bayes shrinkage across reference classes — the statistically correct small-N regularizer.

hazard_spline · planned

hazard (splines)

Discrete-time logistic hazard with natural cubic splines on polity score and IMR — captures the anocracy inverted-U linear logits miss.

reign_logit · planned

leader tenure

Regime/leader-duration hazard (REIGN): risk falls with regime durability and leader tenure, rises with factionalism.

gbm_honest · planned

ML ensemble

Gradient boosting on panel features with hard anti-overfit constraints (ViEWS paradigm), early-stopped on a chronological tail.

conformal_wrapper · planned

calibration meta-layer

Split-conformal calibration over the admitted ensemble — inherits, never creates, discrimination; reports honest intervals.
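To make the conformal_wrapper entry concrete, here is a minimal split-conformal sketch for binary outcomes. It assumes the nonconformity score is the absolute residual |y - p| on a separate calibration split; the function names and the calibration data are hypothetical, and the project's actual wrapper may use a different score.

```python
import math


def conformal_radius(cal_probs, cal_outcomes, alpha=0.1):
    """Split-conformal quantile of |y - p| computed on a calibration split."""
    scores = sorted(abs(y - p) for p, y in zip(cal_probs, cal_outcomes))
    n = len(scores)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); beyond n the interval is trivial.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0
    return scores[rank - 1]


def conformal_interval(prob, radius):
    """Symmetric interval around the ensemble probability, clipped to [0, 1]."""
    return max(0.0, prob - radius), min(1.0, prob + radius)


# Hypothetical calibration split: ensemble probabilities and observed outcomes.
cal_probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_outcomes = [0, 0, 0, 1, 0, 1, 1, 1, 1]

radius = conformal_radius(cal_probs, cal_outcomes, alpha=0.2)
low, high = conformal_interval(0.3, radius)
```

The interval is centered on the incoming probability, so the ranking of cases is untouched. That is the sense in which a wrapper like this inherits discrimination and never creates it.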

The autonomous nightly pipeline

Four stages, every night at 01:00

The June build turned a near-idle parameter ratchet into a four-stage engine. Stages 1–2 are pure compute; stage 2b runs a mass structural search; stage 3 is a single bounded LLM build (the AlphaEvolve pattern — LLM as mutation operator, frozen scorer as selection); stage 4 resolves live predictions. Anything the LLM builds is auto-marked experimental and can never enter the ensemble without the Philosopher gate.

Ratchet

Differential evolution searches each variant's parameter box against TRAINING purged cross-validation (keep-if-better, ~400 evals). It replaced the old UCB1 + Gaussian engine, fixing a same-day-replay seed bug and a champion-regression bug.

Grammar sweep

Thousands of structural candidates — feature subsets × transforms × link functions × bases — fit and ranked by out-of-fold discrimination; survivors of a permutation null + Benjamini-Hochberg FDR become queued hypotheses (labeled hypotheses, never discoveries).

Evolve

ONE bounded headless agent build per night: it pulls the top queued hypothesis and writes a runnable variant, and the frozen scorer judges it. Hard dollar cap, wall-clock cap, protected-file hash quarantine, automatic experimental flag.

Resolve

Registered predictions are checked against live markets (Manifold; Polymarket via an in-process DNS bypass) and appended to an append-only scoreboard.
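The grammar sweep's filter (a permutation null followed by Benjamini-Hochberg FDR) can be sketched as below. This is an illustration under assumptions: the discrimination statistic is a simple mean gap between positive and negative cases, the FDR level `q` is arbitrary, and every name and number here is hypothetical.

```python
import random


def mean_gap(scores, labels):
    """Toy discrimination statistic: mean score of events minus mean score of non-events."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def permutation_p_value(stat, scores, labels, n_perm=999, seed=0):
    """One-sided permutation p-value with the usual +1 correction."""
    rng = random.Random(seed)
    observed = stat(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between scores and outcomes
        if stat(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values, q=0.10):
    """Return the indices of hypotheses that survive FDR control at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank  # largest rank whose p-value clears its threshold
    return sorted(order[:cutoff])


# Hypothetical candidate p-values from a sweep.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
survivors = benjamini_hochberg(p_values, q=0.05)
```

Survivors of a filter like this are still only queued hypotheses: FDR control bounds the expected share of false leads in the queue, it does not validate any single one.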

Anti-degeneracy safeguards

MAP-Elites archive (targets unoccupied mechanism families; preserves diversity)
Purged + embargoed CV, plus leave-one-region-out (no temporal or spatial leakage)
Proposer ≠ evaluator; the hold-out is proposer-forbidden (no gaming your own metric)
EVT √(2·ln N) deflation + bootstrap CIs + PBO (multiple-testing honesty that scales with trial count)
Protected-file hash check with git-restore quarantine on the nightly build
Severe tests pre-registered BEFORE any score is read
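The √(2·ln N) term comes from extreme-value theory: the expected maximum of N independent standard-normal draws is bounded above by √(2·ln N) and approaches it as N grows. A minimal sketch of a haircut built on that fact follows. How the project's oracle applies the deflation is not specified here, so the additive form, the `noise_sd` input, and the function names are all assumptions.

```python
import math


def evt_haircut(n_trials):
    """sqrt(2 * ln N): asymptotic size of the largest of N standard-normal draws."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    return math.sqrt(2.0 * math.log(n_trials))


def deflate_best_brier(best_brier, noise_sd, n_trials):
    """Push a best-of-N Brier score back toward worse by the selection haircut.

    noise_sd is the sampling standard deviation of a single variant's Brier
    score (for example from a bootstrap). Lower Brier is better, so the
    haircut is added rather than subtracted.
    """
    return best_brier + noise_sd * evt_haircut(n_trials)


# Hypothetical inputs, not values taken from the leaderboard.
for n in (1, 10, 1000):
    print(n, round(evt_haircut(n), 3), round(deflate_best_brier(0.20, 0.02, n), 3))
```

With a single trial the haircut is zero, and with ten trials the multiplier is already about 2.15 noise standard deviations. This is why a raw best score drawn from a large search cannot be read at face value.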

The frozen oracle + honesty stack

One number no agent can edit — and the machinery to keep it honest at scale

The single evaluation function for the whole project is frozen and hash-verified. Around it sits an honesty stack that scales with the thousands of hypotheses now tested per night, because testing more only stays honest if the haircut grows with the trial count.

Integrity rules

✓ Frozen and hash-verified: integrity is checked before any score is trusted.
✓ Two sealed hold-outs are proposer-forbidden: the 26-event curated suite and the 1,198-row country-year panel hold-out (2010–2015).
✓ All out-of-sample evaluation uses purged + embargoed cross-validation; standard k-fold is banned; leave-one-region-out guards spatial clustering.
✓ Predictions are pre-registered: the probability is hash-locked before the outcome can be read.
✓ Leaderboard claims carry an extreme-value √(2·ln N) deflation and a 90% bootstrap CI; mass-search output passes a permutation null + FDR.
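The pre-registration rule above is a commitment scheme: publish a digest now, reveal the prediction later, and let anyone check that the two match. A minimal sketch using SHA-256 follows. The record layout, the nonce, and the function names are assumptions for illustration, not the project's actual format.

```python
import hashlib
import hmac
import json
import secrets


def commit(prediction, nonce=None):
    """Return (digest, nonce) for a prediction record.

    The random nonce stops anyone from brute-forcing a low-entropy value such
    as a two-decimal probability out of the published digest.
    """
    nonce = nonce or secrets.token_hex(16)
    # Canonical JSON so the same record always hashes to the same digest.
    payload = json.dumps(prediction, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((nonce + "|" + payload).encode("utf-8")).hexdigest()
    return digest, nonce


def verify(prediction, nonce, digest):
    """Recompute the digest at reveal time and compare in constant time."""
    recomputed, _ = commit(prediction, nonce)
    return hmac.compare_digest(recomputed, digest)


# Hypothetical prediction record.
record = {"event": "example-event-2027", "probability": 0.12}
digest, nonce = commit(record)          # publish the digest before the outcome
ok = verify(record, nonce, digest)      # reveal record + nonce after resolution
```

Any edit to the record after the fact, even nudging the probability by 0.01, changes the digest and fails verification. That is what makes the honesty technical and not procedural.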

Live leaderboard

Where the variants actually stand

Scored by the frozen oracle on a locked hold-out of 26 historical events including 10 negative controls (high-stress societies that did NOT collapse). The set is deliberately crisis-skewed, so the real test is resolution (discrimination), not a low average. With ten variants now tried, PBO is 0.70 and the binding number is the EVT-deflated best Brier, which sits exactly at the chance line (0.25) — i.e. no deflated evidence of skill yet. New variants are auto-experimental and excluded from the official ensemble until they pass a pre-registered severe test.

Model	Family	Brier	Resolution	Neg-ctrl	Tier
ensemble	equal-weight	0.291	0.096	0.068	T0
pitf_logit (excl.)	regime_logit	0.175	0.141	0.256	T2
hierarchical_bayes (excl.)	empirical_bayes	0.219	0.084	0.278	T1
hazard_spline (excl.)	hazard_spline	0.220	0.095	0.405	T1
conformal_wrapper (excl.)	calibration_meta	0.221	0.121	0.158	T1
sdt_turchin (excl.)	structural_demographic	0.230	0.173	0.222	T1
train_freq	empirical_frequency	0.234	0.064	0.161	T1
firth_logit (excl.)	penalised_logit	0.269	0.192	0.297	T0
gbm_honest (excl.)	gradient_boosting	0.281	0.115	0.144	T0
reign_logit (excl.)	duration_logit	0.330	0.060	0.209	T0
null_baseline	null	0.370	0.095	0.038	T0

Hold-out events: 26 · Negative controls: 10 · Legacy formula: Tier 0 · Chance line: 0.25 · Market line: 0.18 · PBO: 0.70

The headline is a falsification and a lead, not a victory. The fixed-prior pitf_logit posts the lowest raw Brier (0.175) but its pre-registered F1 ablation FAILED a third time as feature coverage widened — so the fixed-beta PITF hypothesis is FALSIFIED, and that low Brier is a calibration artifact (its discrimination dropped and its negative-control error rose). The genuinely interesting result is firth_logit: the SAME PITF channels, but with coefficients fitted on the country-year panel, reach the board's highest discrimination ever (resolution 0.192) — the Philosopher-sanctioned re-entry route, now awaiting its own pre-registered severe test. The admitted ensemble (just the two baselines) stays conservative on purpose. The binding honest number: EVT-deflated best Brier = 0.25, exactly chance. No validated skill yet — admission flows only through the six pre-registered gates.
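For readers who want the two headline columns in code, here is a minimal sketch of the Brier score and of a resolution term in the style of the Murphy decomposition. The binning scheme (five equal-width bins) and the function names are assumptions; the frozen oracle's exact definition may differ.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def resolution(probs, outcomes, n_bins=5):
    """Bin-weighted squared gap between each bin's event rate and the base rate.

    Higher is better: it rewards forecasts that sort cases into groups whose
    observed frequencies differ from the overall base rate.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    bins = {}
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins.setdefault(k, []).append(y)
    return sum(
        len(ys) / n * (sum(ys) / len(ys) - base_rate) ** 2 for ys in bins.values()
    )


# A forecaster who always says 0.5 scores exactly the chance line and has
# zero resolution, whatever the outcomes are.
outcomes = [1, 0, 1, 0]
coin_flip_brier = brier([0.5] * 4, outcomes)
coin_flip_resolution = resolution([0.5] * 4, outcomes)
```

The two quantities can move apart, which is the point of reporting both: a model can post a low Brier through cautious calibration while discriminating poorly, and the reverse is also possible.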


The progress ladder

What counts as real progress

A single objective metric, tracked every night: ensemble Brier on the frozen hold-out, against three reference lines. The legacy formula sits at Tier 0 — it cannot emit a probability at all. A variant can dip under the chance line yet be held experimental (excluded from the ensemble) until it passes a severe test, so the admitted ensemble stays conservative. We don't claim robust Tier-1 skill yet, and we say so.

Tier 0

Produces a numeric prediction at all

—

legacy formula is here (cannot)

Tier 1

Beats chance

Brier < 0.25

current target

Tier 2

Beats market consensus

Brier < 0.18

future

Tier 3

Approaches superforecaster level

Brier < 0.15

genuine psychohistory progress
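The ladder reduces to three thresholds on the Brier score. A minimal sketch follows, assuming strict inequalities as written in the ladder and treating a model that cannot emit a probability (represented here as `None`, a hypothetical convention) as Tier 0.

```python
CHANCE_LINE = 0.25          # Tier 1 threshold
MARKET_LINE = 0.18          # Tier 2 threshold
SUPERFORECASTER_LINE = 0.15  # Tier 3 threshold


def tier(brier_score):
    """Map a hold-out Brier score to a ladder tier; lower Brier is better."""
    if brier_score is None:
        return 0  # no numeric prediction at all, like the legacy formula
    if brier_score < SUPERFORECASTER_LINE:
        return 3
    if brier_score < MARKET_LINE:
        return 2
    if brier_score < CHANCE_LINE:
        return 1
    return 0


# Raw Brier scores copied from the leaderboard table.
board = {"pitf_logit": 0.175, "hierarchical_bayes": 0.219,
         "train_freq": 0.234, "ensemble": 0.291, "null_baseline": 0.370}
tiers = {name: tier(score) for name, score in board.items()}
```

The tier is a label on the raw score only. It says nothing about admission, which still depends on the pre-registered severe test.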

The research team, rebuilt

From 150 readers to seven roles that each own an artifact

The 2024–2026 evidence is clear: homogeneous multi-agent debate underperforms, and beyond a small panel more agents degrade results. The ceremonial roster of 10 leads × 15 prose-only sub-agents was retired to domain knowledge bases (workers cite them; nobody 'runs' them). In its place: seven heterogeneous roles, each owning a concrete artifact, with strict proposer ≠ scorer ≠ critic separation and parallel cold-briefed workers instead of sequential deliberation.

Judgment (large model)

The Orchestrator routes each session on a lean context; the Philosopher of Science runs the adversarial admission gate. Both reason; neither writes the model code.

Execution (worker model)

Cold-briefed workers whose output is an artifact: the Data Engineer owns the panel and feature pipelines, the ML/Forecasting Engineer owns variant construction, the Demographer owns age-structure features, the Calibration Auditor owns reliability and CI reporting, the Red-Team Forecaster owns the adversarial counter-case before any prediction is registered.

Lead agent	Verdict	What changes
Orchestrator	ROUTES	Lean-context session router; runs the integrity pre-flight and chooses the one move.
Data Engineer	OWNS DATA	The country-year panel, feature pipelines, and source caches — the role that broke the data constraint.
ML / Forecasting Engineer	OWNS THE ZOO	Builds and maintains the runnable variants and grooms the hypothesis backlog.
Demographer	NEW	Owns age-structure / youth-bulge / urbanization features (UN WPP) — feeds the structural-demographic variant.
Calibration Auditor	NEW	Reads reliability curves and confidence intervals after every session; flags train-vs-hold-out drift.
Red-Team Forecaster	NEW	Writes the adversarial counter-case before any prediction is registered.
Philosopher of Science	JUDGES ONLY	Owns the falsification register and the admission gate; pre-registers severe tests before scores are read — and never selects the work.
Legacy 10 leads	→ KNOWLEDGE BASES	Cliodynamicist, Econophysicist, Stat Physicist, et al. survive as domain references the workers cite, not as dispatch targets.

Validation, rebuilt

How we keep ourselves honest

A locked hold-out

Twenty-six events, 10 of them negative controls, which the model-building agents are forbidden to read.

Numeric Brier, not PASS/FAIL

Every retrodiction emits a probability and is scored. Narrative precondition assessments are gone.

Leakage-safe backtesting

Purged k-fold with an embargo longer than the cycle being modeled; standard k-fold is banned.

Pre-registration

The probability is hash-locked before the outcome can be read — technical, not procedural, honesty.

Severe testing

Every variant admitted to the zoo carries a pre-specified falsification criterion (Mayo severity ≥ 0.8).

Reflexivity audit

Each published prediction is classified immune, self-fulfilling, or self-defeating — the Seldon problem, handled explicitly.
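The leakage-safe backtesting item above can be sketched as a fold generator. This is a minimal illustration, assuming observations carry a numeric time stamp such as a year; `purge`, `embargo`, and the function name are hypothetical parameters, and the project's actual splitter may differ.

```python
def purged_embargoed_folds(times, n_folds, purge, embargo):
    """Yield (train_idx, test_idx) pairs over time-ordered observations.

    Training rows are dropped if they fall within `purge` time units before
    the test block or within `embargo` time units after it, so information
    cannot leak across the boundary in either direction.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    fold_size = -(-len(order) // n_folds)  # ceiling division
    for f in range(n_folds):
        test = order[f * fold_size:(f + 1) * fold_size]
        if not test:
            continue
        t_lo, t_hi = times[test[0]], times[test[-1]]
        train = [
            i for i in order
            if times[i] < t_lo - purge or times[i] > t_hi + embargo
        ]
        yield train, test


# Hypothetical yearly series, 2000 through 2019.
years = list(range(2000, 2020))
folds = list(purged_embargoed_folds(years, n_folds=4, purge=1, embargo=2))
```

In the first fold the test block is 2000 to 2004, so training starts only at 2007. The two embargoed years are simply unused in that fold, which costs data but removes the overlap that standard k-fold would allow.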

What's next

Run the six pre-registered severe tests — fitted coefficients first

The data constraint is broken and ten variants are on the board, but none has earned admission. The next session runs the six severe tests that were locked before any score was read — starting with FL-1 for firth_logit, the fitted-coefficient re-entry route that now posts the board's best discrimination. That test, not the leaderboard, decides whether the PITF channels carry real structural signal at panel scale. In parallel, the nightly engine keeps running its four stages, and the open operator items are a free Metaculus API token, an optional permanent DNS fix for live Polymarket data, and confirming one candidate market link.
