Disagreement methodology
We publish what we don't post.
Every PR AntFleet reviews is read by two frontier models independently. The unanimous gate posts only findings both agree on. Everything else — solo flags from one model, severity mismatches, classification conflicts — is filtered out. This page explains how we classify and surface those filtered findings.
The taxonomy
A solo finding means one model flagged something and the other didn't mention the file at all.
A mismatched classification means both models flagged the same line range, but assigned a different severity or category.
A third case, overlapping-different-evidence, is not classified yet: both models flag the same range with the same classification but give divergent explanations.
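The two classified cases can be sketched roughly as follows. This is a minimal illustration, not AntFleet's implementation: the Finding shape, the field names, and the exact-match rule for "same line range" are all assumptions.

```typescript
// Hypothetical shape of one model's finding; every field name is an assumption.
interface Finding {
  file: string;
  startLine: number;
  endLine: number;
  severity: string;
  category: string;
}

type Disagreement =
  | { kind: "solo"; finding: Finding }
  | { kind: "mismatched_classification"; a: Finding; b: Finding };

// Assumption: "same line range" means an exact match on file, start, and end.
function sameRange(a: Finding, b: Finding): boolean {
  return a.file === b.file && a.startLine === b.startLine && a.endLine === b.endLine;
}

function classify(modelA: Finding[], modelB: Finding[]): Disagreement[] {
  const out: Disagreement[] = [];
  const filesA = new Set(modelA.map((f) => f.file));
  const filesB = new Set(modelB.map((f) => f.file));

  for (const a of modelA) {
    const match = modelB.find((b) => sameRange(a, b));
    if (match) {
      // Same range, different severity or category: mismatched classification.
      if (match.severity !== a.severity || match.category !== a.category) {
        out.push({ kind: "mismatched_classification", a, b: match });
      }
      // Same range and same classification is agreement, or the
      // not-yet-classified divergent-explanation case; neither is emitted.
    } else if (!filesB.has(a.file)) {
      // The other model never mentioned this file: solo finding.
      out.push({ kind: "solo", finding: a });
    }
  }
  for (const b of modelB) {
    if (!filesA.has(b.file)) out.push({ kind: "solo", finding: b });
  }
  return out;
}
```

Under these assumptions, a finding whose file the other model did mention, but at a different range, falls into neither category and is not emitted.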
The opt-in gate
Disagreements use the same public boundary as receipts: the parent review must have public_receipt = true. Non-opted-in reviews stay off this surface.
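As a rough sketch, the gate amounts to a filter on the parent review. Only the public_receipt field comes from this page; the ReviewRow type and the publicReviews helper are hypothetical names used for illustration.

```typescript
// Hypothetical row type; only `public_receipt` is named on this page.
interface ReviewRow {
  id: string;
  public_receipt: boolean;
}

// Keep only reviews that opted in; everything else never reaches the public surface.
function publicReviews<T extends ReviewRow>(reviews: T[]): T[] {
  return reviews.filter((r) => r.public_receipt === true);
}
```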
How the data is computed
The archive is computed from reviews.provider_responses JSONB. IDs are deterministic, and there is no new persistence layer; every row is reproducible from the source review data.
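One way to get deterministic IDs without a new table is to hash stable fields of the source review. The sketch below assumes a particular choice of fields and a truncated SHA-256; the actual ID scheme is not described on this page, and disagreementId is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

// Hypothetical ID scheme: hash the fields that identify a disagreement.
// The choice of fields and the 16-character truncation are assumptions.
function disagreementId(
  reviewId: string,
  kind: string,
  file: string,
  startLine: number,
  endLine: number,
): string {
  // Join with a NUL separator so ("ab", "c") and ("a", "bc") cannot collide.
  const key = [reviewId, kind, file, String(startLine), String(endLine)].join("\u0000");
  return createHash("sha256").update(key).digest("hex").slice(0, 16);
}
```

Because the ID depends only on the inputs, recomputing the archive from the same reviews.provider_responses data yields the same rows and the same IDs.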
AI Scorecard methodology
Every week, AntFleet publishes a scorecard comparing the two frontier models that power the unanimous gate. Scorecard data is computed from the same reviews.provider_responses JSONB used by receipts and disagreements.
Sample gate: only reviews where public_receipt = true contribute. Non-opted-in installs never reach this surface.
Date windows: weekly, Monday 00:00 UTC through Sunday 23:59:59 UTC. Each scorecard also shows a 4-week rolling average alongside the per-week numbers. Weeks with no reviews are excluded from the rolling average (not interpolated).
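The window and rolling-average rules can be sketched as below. The function names are hypothetical, and the sketch assumes the rolling average is the mean of per-week values (rather than a pooled figure over all reviews), with a review-less week represented as null.

```typescript
// Monday 00:00 UTC of the week containing `d`.
function weekStartUtc(d: Date): Date {
  const day = d.getUTCDay(); // 0 = Sunday ... 6 = Saturday
  const daysSinceMonday = (day + 6) % 7; // Monday -> 0, Sunday -> 6
  // Date.UTC normalises out-of-range day values, so month boundaries are handled.
  return new Date(
    Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate() - daysSinceMonday),
  );
}

// Mean of the last four weekly values, skipping weeks with no reviews (null).
// Returns null when none of the four weeks had any reviews.
function rollingAverage(weeks: Array<number | null>): number | null {
  const present = weeks.slice(-4).filter((v): v is number => v !== null);
  if (present.length === 0) return null;
  return present.reduce((sum, v) => sum + v, 0) / present.length;
}
```

Skipping null weeks means the divisor is the number of weeks that had reviews, not always four.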
Immutability: once a weekly snapshot is published, it never changes, even if underlying reviews are backfilled, opt-in status changes, or finding_status rows are updated. Credibility requires stable historical numbers.
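A write-once store is one way to picture this rule. The sketch below is an in-memory illustration under assumed names (WeeklySnapshot, SnapshotStore); how snapshots are actually persisted is not described here.

```typescript
// Hypothetical snapshot shape, keyed by the week's Monday (for example "2024-05-13").
interface WeeklySnapshot {
  weekStart: string;
  metrics: Record<string, number>;
}

class SnapshotStore {
  private readonly published = new Map<string, WeeklySnapshot>();

  // Publishes a snapshot once. Later attempts for the same week return the
  // original snapshot unchanged, whatever the recomputed numbers say.
  publish(snapshot: WeeklySnapshot): WeeklySnapshot {
    const existing = this.published.get(snapshot.weekStart);
    if (existing) return existing;
    const frozen = Object.freeze({
      ...snapshot,
      metrics: Object.freeze({ ...snapshot.metrics }),
    });
    this.published.set(snapshot.weekStart, frozen);
    return frozen;
  }
}
```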
Small-N caveat: at the current review rate (~2–5 reviews per day, or roughly 14 to 35 per week), per-week samples are small. The 4-week rolling average shown alongside the per-week numbers mitigates noise. The first few weeks of scorecard data will be especially noisy.
Cost limitations: patch generation cost (cost_patch_usd) is currently always 0 due to a Patch Agent v1.5 blocker. Where available, an estimated token cost is shown from the review-level cost_estimated_usd.
Reproducibility: the aggregator at apps/web/lib/scorecard.ts is open source. It uses the same code path as /api/v1/installations/{id}/review.
Why we publish them
The unanimous gate filters PR comments down to findings both frontier models agree on; this page shows the public, opted-in findings that the gate filtered out.