AntFleet

Disagreement methodology

We publish what we don't post.

Every PR AntFleet reviews is read by two frontier models independently. The unanimous gate posts only findings both agree on. Everything else — solo flags from one model, severity mismatches, classification conflicts — is filtered out. This page explains how we classify and surface those filtered findings.

The taxonomy

A solo finding means one model flagged something and the other didn't mention the file at all.

A mismatched classification means both models flagged the same line range, but assigned a different severity or category.

We don't classify overlapping-different-evidence yet: same range, same classification, but divergent explanations.
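As a rough illustration of the two categories above, here is a minimal sketch in TypeScript. The Finding shape, its field names, and the classify function are assumptions made for this example, not AntFleet's actual types or code.

```typescript
// Hypothetical shape of one model's finding; all field names are assumptions.
interface Finding {
  file: string;
  startLine: number;
  endLine: number;
  severity: string;
  category: string;
}

type Disagreement =
  | { kind: "solo"; model: "a" | "b"; finding: Finding }
  | { kind: "mismatched_classification"; a: Finding; b: Finding };

// Two findings cover "the same line range" when file and both bounds match.
const sameRange = (x: Finding, y: Finding): boolean =>
  x.file === y.file && x.startLine === y.startLine && x.endLine === y.endLine;

function classify(a: Finding[], b: Finding[]): Disagreement[] {
  const out: Disagreement[] = [];
  const filesA = new Set(a.map((f) => f.file));
  const filesB = new Set(b.map((f) => f.file));

  // Mismatched classification: same range, different severity or category.
  for (const fa of a) {
    for (const fb of b) {
      const differs = fa.severity !== fb.severity || fa.category !== fb.category;
      if (sameRange(fa, fb) && differs) {
        out.push({ kind: "mismatched_classification", a: fa, b: fb });
      }
    }
  }

  // Solo: the other model did not mention the file at all.
  for (const fa of a) {
    if (!filesB.has(fa.file)) out.push({ kind: "solo", model: "a", finding: fa });
  }
  for (const fb of b) {
    if (!filesA.has(fb.file)) out.push({ kind: "solo", model: "b", finding: fb });
  }
  return out;
}
```

In this sketch, findings with the same range and the same classification produce no row (they are agreements), and findings in a shared file with different ranges are left unclassified.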

The opt-in gate

Disagreements use the same public boundary as receipts: the parent review must have public_receipt = true. Non-opted-in reviews stay off this surface.

How the data is computed

The archive is computed from reviews.provider_responses JSONB. IDs are deterministic, and there is no new persistence layer; every row is reproducible from the source review data.
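One way to get deterministic IDs without a persistence layer is to hash the fields that identify a row. The sketch below is an assumption about how that could look, not the scheme AntFleet uses: the DisagreementKey fields, the dis_ prefix, and the choice of a 32-bit FNV-1a hash are all hypothetical.

```typescript
// Hypothetical identifying fields for one disagreement row (assumed names).
interface DisagreementKey {
  reviewId: string;
  kind: string;
  file: string;
  startLine: number;
  endLine: number;
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
// Chosen only because it needs no imports; a real system would likely
// prefer a longer cryptographic hash to make collisions negligible.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return ("0000000" + (h >>> 0).toString(16)).slice(-8);
}

// The same key always yields the same ID, so rows can be recomputed
// from the source review data instead of being stored.
function disagreementId(k: DisagreementKey): string {
  const canonical = [k.reviewId, k.kind, k.file, k.startLine, k.endLine].join("\u0000");
  return "dis_" + fnv1a(canonical);
}
```

Because the ID depends only on the input fields, recomputing the archive from the same review data yields the same IDs.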

AI Scorecard methodology

Every week, AntFleet publishes a scorecard comparing the two frontier models that power the unanimous gate. Scorecard data is computed from the same reviews.provider_responses JSONB used by receipts and disagreements.

Sample gate: only reviews where public_receipt = true contribute. Non-opted-in installs never reach this surface.

Date windows: weekly, Monday 00:00 UTC through Sunday 23:59:59 UTC. Each scorecard also shows a 4-week rolling average alongside the per-week numbers. Weeks with no reviews are excluded from the rolling average (not interpolated).
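The window and averaging rules can be sketched as follows. This is an illustration, not the code in apps/web/lib/scorecard.ts: the function names are hypothetical, and it assumes the rolling window covers the last four calendar weeks, with empty weeks (represented as null) dropped from both the sum and the count.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns Monday 00:00 UTC of the week containing `d`.
function weekStartUtc(d: Date): Date {
  const midnight = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate());
  // getUTCDay(): Sunday = 0 ... Saturday = 6. Shift so Monday = 0 and Sunday = 6.
  const daysSinceMonday = (d.getUTCDay() + 6) % 7;
  return new Date(midnight - daysSinceMonday * DAY_MS);
}

// `weeks` is ordered oldest to newest; null marks a week with no reviews.
// Empty weeks are skipped rather than interpolated or counted as zero.
function rollingAverage(weeks: (number | null)[]): number | null {
  const recent = weeks.slice(-4).filter((v): v is number => v !== null);
  if (recent.length === 0) return null;
  return recent.reduce((sum, v) => sum + v, 0) / recent.length;
}
```

For example, a Sunday 23:59:59 UTC timestamp maps to the Monday six days earlier, and a four-week window with one empty week is averaged over the three weeks that have data.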

Immutability: once a weekly snapshot is published, it never changes, even if underlying reviews are backfilled, opt-in status changes, or finding_status rows are updated. Credibility requires stable historical numbers.

Small-N caveat: with the current review rate (~2–5 reviews per day), per-week samples are small. Showing the 4-week rolling average alongside the per-week numbers dampens that noise, but the first few weeks of scorecard data will be especially noisy.

Cost limitations: patch generation cost (cost_patch_usd) is currently always 0 because of a Patch Agent v1.5 blocker. Where available, the estimated token cost is shown via the review-level cost_estimated_usd.

Reproducibility: the aggregator at apps/web/lib/scorecard.ts is open source. It uses the same code path as /api/v1/installations/{id}/review.

Why we publish them

We publish what we don't post. The unanimous gate filters PR comments down to findings both frontier models agree on; this page shows the public, opted-in findings that gate filtered out.
