GPT-5 finding
Benchmark backfill logs report candidate count, not actual flipped rows
- apps/web/scripts/backfill-benchmark-flag.ts:97-116
Outside dry-run mode, the script correctly computes the number of rows actually updated (`flipped`), but the log still prints the size of the candidate set (`group.reviewIds.length`). Whenever fewer rows are updated than there are candidates, the log overstates the change and can mislead operators.
Recommendation
Log the actual flipped count when not in dry-run, e.g. use `${flipped} row(s) flipped` and optionally include `(${group.reviewIds.length} candidates)` for clarity.
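A minimal sketch of the suggested log line, not the script's actual code. Only `flipped` and `group.reviewIds` come from the finding; the `BackfillGroup` interface, the `formatBackfillLog` helper, the `dryRun` parameter, and the dry-run wording are hypothetical names introduced for illustration.

```typescript
// Hypothetical shape of one backfill group; only `reviewIds` is named in the finding.
interface BackfillGroup {
  reviewIds: string[];
}

// Hypothetical helper that builds the log line for a single group.
function formatBackfillLog(
  group: BackfillGroup,
  flipped: number,
  dryRun: boolean,
): string {
  const candidates = group.reviewIds.length;
  if (dryRun) {
    // Nothing is written in dry-run, so the candidate count is the only meaningful number.
    return `[dry-run] ${candidates} candidate row(s) would be flipped`;
  }
  // Report what actually changed, with the candidate count as context.
  return `${flipped} row(s) flipped (${candidates} candidates)`;
}

const group: BackfillGroup = { reviewIds: ["r1", "r2", "r3"] };
console.log(formatBackfillLog(group, 2, false));
console.log(formatBackfillLog(group, 0, true));
```

Keeping both numbers in the real-run message lets an operator see at a glance when some candidates were not flipped.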