AI we there yet?

Scoring rubric v1

This is how a single event is sized on the board. It is the same document the depth pass reads before it scores and the same document you read on /rubric — there is only one copy, and this is it.

The board asks one question of every event: how much did this actually move us toward AGI, and in which direction? Not how exciting the headline is — how much the *body* of the story supports a real, measurable shift. A cold read, bound to evidence, biased toward caution.

Magnitude: 0 to 5

Every event gets a whole number from 0 to 5. The number comes from what the article body demonstrates, never from how the headline is phrased.

| Score | Name | What it takes |
| --- | --- | --- |
| 0 | No movement | A re-report of something already counted, pure opinion or commentary, a rumor, or a body that does not support its own headline. Also the score when the body could not be read. |
| 1 | Marginal | A minor increment: a small product tweak, a benchmark bump within noise, an intention or roadmap with nothing shipped yet. |
| 2 | Modest | A real but narrow result: a shipped feature, a mid-tier benchmark gain, a concrete policy proposal, a funding round of ordinary size. |
| 3 | Notable | A solid, measured advance a specialist would stop to note: a meaningful model or product release, a binding rule entering force, a significant compute or capital commitment. |
| 4 | Major | A clear step-change the field will cite: a frontier model that moves the state of the art broadly, a landmark law or enforcement action. |
| 5 | Landmark | Rare and hard to dispute. It redefines the frontier: the kind of event you would still remember by name a year later. |

The deflation rule: when in doubt, the lower one

When the evidence leaves you torn between two scores, take the lower one. The board would rather under-count a real event than let an inflated one set the pace. Concretely:

  • If the body inflates the headline — "X solves Y" over a narrow benchmark with caveats — drop at least one step from what the headline implies.
  • If the body contradicts the headline, score by the body, not the claim.
  • A re-report of an event already on the board is a 0. The original already counts; counting it twice is the one error the board cannot see.
  • If the body is unavailable, paywalled, or truncated, do not guess. Score 0 with the reason stated. An unread event is not a small event — it is an unknown one, and unknowns do not move the needle.
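The rule above can be sketched as a small function. This is a minimal illustration, not the board's actual code: the function name `deflated_score`, its parameters, and the reason strings are all hypothetical. It assumes the caller already holds two candidate scores read from the body and, when the body inflates the headline, the score the headline alone would imply.

```python
from typing import Optional, Tuple


def deflated_score(
    low: int,
    high: int,
    *,
    body_readable: bool = True,
    is_re_report: bool = False,
    headline_implied: Optional[int] = None,
) -> Tuple[int, str]:
    """Return (score, reason) after applying the deflation rule.

    All names here are hypothetical. Pass `headline_implied` only when
    the body inflates the headline.
    """
    # An unread body is an unknown, not a small event: score 0 and say why.
    if not body_readable:
        return 0, "body unavailable, paywalled, or truncated"
    # The original event already counts; a re-report adds nothing.
    if is_re_report:
        return 0, "re-report of an event already on the board"
    # Torn between two scores: take the lower one.
    score = min(low, high)
    reason = "lower of the candidate scores"
    # Inflated headline: stay at least one step below what it implies.
    if headline_implied is not None:
        ceiling = max(headline_implied - 1, 0)
        if score > ceiling:
            score = ceiling
            reason = "body inflates the headline; dropped one step"
    return score, reason


print(deflated_score(2, 3))                        # (2, 'lower of the candidate scores')
print(deflated_score(4, 4, headline_implied=4))    # (3, 'body inflates the headline; dropped one step')
print(deflated_score(3, 4, body_readable=False))   # (0, 'body unavailable, paywalled, or truncated')
```

Scoring by the body when it contradicts the headline needs no branch of its own here, because both candidate scores are assumed to come from the body in the first place.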

Source tiers

Where an event is reported bounds how far it can move the board on its own. A landmark claim needs a primary source; a blog post cannot certify one by itself.

| Tier | Sources | Magnitude cap |
| --- | --- | --- |
| A (Primary) | The paper or preprint (arXiv), the lab or company's own release, peer-reviewed venues, official registers and regulators, primary government filings. | none (0–5) |
| B (Established press) | Outlets with editorial standards and a correction record: Reuters, Bloomberg, the FT, the NYT, The Verge, Ars Technica, Wired, MIT Technology Review. | 4 |
| C (Secondary) | Aggregators, personal blogs, social posts, and outlets of unknown or low signal. | 3 |

When an event's evidence would earn a higher score than its tier allows, the score is lowered to the cap and the event is flagged as capped. This is not a penalty on the outlet; it reflects that a report without a primary source cannot carry a landmark on its own. Once the primary source appears, the same event can be re-scored under Tier A.
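As a sketch, the capping step is a lookup followed by a clamp. The names `TIER_CAPS` and `apply_tier_cap` are hypothetical, and Tier A's "none" is represented here as 5, the top of the scale, which is an assumption made for convenience.

```python
from typing import Tuple

# Caps from the tier table. Tier A has no cap, modeled here as the scale maximum.
TIER_CAPS = {"A": 5, "B": 4, "C": 3}


def apply_tier_cap(evidence_score: int, tier: str) -> Tuple[int, bool]:
    """Return (final_score, capped) for a score earned on the evidence alone."""
    if tier not in TIER_CAPS:
        raise ValueError(f"unknown tier: {tier!r}")
    cap = TIER_CAPS[tier]
    if evidence_score > cap:
        # The evidence earns more than the tier can carry: lower and flag.
        return cap, True
    return evidence_score, False


print(apply_tier_cap(5, "C"))  # (3, True)
print(apply_tier_cap(4, "B"))  # (4, False)
print(apply_tier_cap(5, "A"))  # (5, False)
```

Returning the flag alongside the score keeps the capped status visible, so an event can be found and re-scored once its primary source appears.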

The six axes

Each event touches exactly one axis. The depth pass proposes one; a human confirms it before it counts.

| Axis | Weight | Role |
| --- | --- | --- |
| Autonomy | ×1.5 | Agency and self-improvement: agents that act, plan, and improve themselves. |
| Capability | ×1.2 | Raw ability: benchmark jumps, emergent skills, distance to human-level. Safety and alignment work lives here too. |
| Friction | ×1.2 | The brake: regulation, limits, failures, backlash. Slows the speedometer; never touches the odometer. |
| Power | ×1.0 | The fuel: training compute, chips, data centres, energy. |
| Diffusion | ×0.9 | How fast it spreads: adoption into products, infrastructure, decisions. |
| Vibes | ×0.5 | Public narrative and sentiment: viral moments, influential essays, waves of fear or optimism. |

Magnitude and axis are independent: a Vibes event and an Autonomy event can both score 4, but the axis weight decides how much each one finally bends the board.
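To make the independence concrete, here is a minimal sketch that treats an event's contribution as magnitude times axis weight. The rubric gives the weights but does not spell out how they are combined, so the plain product, like the names `AXIS_WEIGHTS` and `weighted_contribution`, is an assumption for illustration only.

```python
# Weights from the axis table.
AXIS_WEIGHTS = {
    "Autonomy": 1.5,
    "Capability": 1.2,
    "Friction": 1.2,
    "Power": 1.0,
    "Diffusion": 0.9,
    "Vibes": 0.5,
}


def weighted_contribution(magnitude: int, axis: str) -> float:
    """Assumed combination: magnitude multiplied by the axis weight."""
    if isinstance(magnitude, bool) or not isinstance(magnitude, int):
        raise ValueError("magnitude must be a whole number")
    if not 0 <= magnitude <= 5:
        raise ValueError("magnitude must be between 0 and 5")
    if axis not in AXIS_WEIGHTS:
        raise ValueError(f"unknown axis: {axis!r}")
    return magnitude * AXIS_WEIGHTS[axis]


# Same magnitude, different axes.
print(weighted_contribution(4, "Vibes"))     # 2.0
print(weighted_contribution(4, "Autonomy"))  # 6.0
```

Under this assumed product, an Autonomy event at magnitude 4 weighs three times as much as a Vibes event at the same magnitude. How Friction events act on the board differently from the other axes is not modeled here.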

What the depth pass returns

For each event, the depth pass reads the full body and returns a cold, structured judgment:

  • whether the body supports, inflates, or contradicts the headline;
  • the event type (result, announcement, promise, or re-report);
  • the single axis it proposes and its direction (progress or friction);
  • the magnitude, with the exact sentence from the body that anchors it;
  • the source tier and whether a cap was applied;
  • a short, specific reasoning line.

It scores. It never decides what enters the board; a human sets the cutoff.
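One way to picture the returned judgment is as a validated record. The sketch below is hypothetical throughout: the class name `DepthJudgment`, the field names, and the string values are illustrative choices, not the depth pass's actual schema.

```python
from dataclasses import dataclass

HEADLINE_RELATIONS = {"supports", "inflates", "contradicts"}
EVENT_TYPES = {"result", "announcement", "promise", "re-report"}
AXES = {"Autonomy", "Capability", "Friction", "Power", "Diffusion", "Vibes"}
DIRECTIONS = {"progress", "friction"}
TIERS = {"A", "B", "C"}


@dataclass(frozen=True)
class DepthJudgment:
    """Hypothetical shape of one depth-pass result."""

    headline_relation: str  # how the body relates to the headline
    event_type: str
    axis: str               # the single proposed axis; a human confirms it
    direction: str
    magnitude: int          # whole number, 0 to 5
    anchor_sentence: str    # exact sentence from the body behind the magnitude
    tier: str
    capped: bool            # True if the tier cap lowered the score
    reasoning: str

    def __post_init__(self) -> None:
        checks = [
            ("headline_relation", self.headline_relation, HEADLINE_RELATIONS),
            ("event_type", self.event_type, EVENT_TYPES),
            ("axis", self.axis, AXES),
            ("direction", self.direction, DIRECTIONS),
            ("tier", self.tier, TIERS),
        ]
        for name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{name} must be one of {sorted(allowed)}")
        if isinstance(self.magnitude, bool) or not isinstance(self.magnitude, int):
            raise ValueError("magnitude must be a whole number")
        if not 0 <= self.magnitude <= 5:
            raise ValueError("magnitude must be between 0 and 5")
```

Nothing in this record says whether the event enters the board, which mirrors the split described above: the depth pass scores, and the cutoff stays with a human.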


Rubric v1. Changes to this document are recorded in the [methodology changelog](/rubric/changelog).