GPT-5 finding
Benchmark backfill logs can misreport flipped row count
low | maintainability | high
- apps/web/scripts/backfill-benchmark-flag.ts
In non-dry-run mode, flipRows() can update fewer rows than the group contains, for example when some rows were already flipped. The decision.flipped value records the accurate count, but the log line always prints group.reviewIds.length, so it overstates what actually flipped and can mislead operators.
Recommendation
Log the actual flipped count: use the computed flipped variable in the message instead of group.reviewIds.length.
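
A minimal sketch of the recommended change, not the script's actual code. Only flipRows, group.reviewIds, and decision.flipped come from the finding; the Group and Decision shapes, the key field, formatFlipLog, processGroup, the message wording, and the assumption that flipRows resolves to the number of rows it updated are hypothetical.

```typescript
// Hypothetical shapes: the finding names only group.reviewIds and decision.flipped.
interface Group {
  key: string;
  reviewIds: string[];
}

interface Decision {
  key: string;
  flipped: number;
}

// Assumption: flipRows resolves to the number of rows it actually updated.
type FlipRows = (reviewIds: string[]) => Promise<number>;

// Build the log message from the real flipped count rather than the group size.
function formatFlipLog(group: Group, flipped: number, dryRun: boolean): string {
  const total = group.reviewIds.length;
  if (dryRun) {
    // Nothing is written in dry-run mode, so the group size is the best estimate.
    return `[dry-run] would flip ${total} row(s) for ${group.key}`;
  }
  const unchanged = total - flipped;
  return unchanged > 0
    ? `flipped ${flipped} of ${total} row(s) for ${group.key} (${unchanged} unchanged)`
    : `flipped ${flipped} row(s) for ${group.key}`;
}

async function processGroup(
  group: Group,
  flipRows: FlipRows,
  dryRun: boolean,
  log: (message: string) => void = console.log,
): Promise<Decision> {
  // Assumption: a dry run records zero flipped rows.
  const flipped = dryRun ? 0 : await flipRows(group.reviewIds);
  log(formatFlipLog(group, flipped, dryRun));
  return { key: group.key, flipped };
}
```

Printing both the flipped count and the group size when they differ lets an operator see at a glance that some rows were skipped, while the log and decision.flipped stay in agreement.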