AntFleet

Anatomy · 8ff8c1af-0

Retention policy keeps RETENTION_LIMIT+1 rows (off-by-one)

medium · bug
repo 56f59a0d · PR #2 · reviewed 4 days ago

The vulnerable code

src/providers/telemetry.ts:206-214

Code snippet unavailable.

The reasoning

Opus

Telemetry flush() swallows all errors silently — data loss with no signal

medium · data-loss · high
  • src/providers/telemetry.ts:195-202
On a failed transaction, the queued metricUpdates/decisionQueue/failureQueue are NOT cleared after rollback. That means they will be retried on next flush, which is OK for retry, BUT: (a) if the failure is deterministic (e.g., bad schema column), the queues grow without bound; (b) the error is completely silent: no console.warn, no logging. Any telemetry loss therefore happens with zero diagnostics. Combined with FLUSH_EVENT_COUNT=10, the queues can keep growing until memory is exhausted if the DB stays broken.

Recommendation

Log errors (at least once), and add an upper bound on queue size with policy to drop oldest events when retention threshold is exceeded.
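
The recommendation above could look roughly like the following. This is a minimal sketch, not the code in src/providers/telemetry.ts: BoundedQueue, flushOnce, and the write callback are hypothetical names introduced for illustration, and the real flush() may be structured differently.

```typescript
// Hypothetical sketch: BoundedQueue and flushOnce are illustrative names,
// not identifiers from src/providers/telemetry.ts.
class BoundedQueue<T> {
  private items: T[] = [];
  private readonly maxSize: number;
  droppedCount = 0;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  push(item: T): void {
    this.items.push(item);
    // Drop the oldest events once the cap is exceeded.
    while (this.items.length > this.maxSize) {
      this.items.shift();
      this.droppedCount += 1;
    }
  }

  snapshot(): T[] {
    return [...this.items];
  }

  clear(): void {
    this.items = [];
  }

  get size(): number {
    return this.items.length;
  }
}

// Tracks whether the current run of failures has already been reported.
let flushFailureLogged = false;

// `write` stands in for the transactional DB write; it is an assumption.
function flushOnce<T>(queue: BoundedQueue<T>, write: (batch: T[]) => void): boolean {
  try {
    write(queue.snapshot());
    queue.clear();
    flushFailureLogged = false; // a later failure will be logged again
    return true;
  } catch (err) {
    if (!flushFailureLogged) {
      console.warn("telemetry flush failed; keeping queued events for retry", err);
      flushFailureLogged = true;
    }
    return false;
  }
}
```

With this shape, a deterministic failure still keeps events for retry, but memory use is capped at maxSize entries per queue and the first failure in a run is visible in the logs.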

GPT-5

Retention policy keeps RETENTION_LIMIT+1 rows (off-by-one)

medium · bug · high
  • src/providers/telemetry.ts:206-214
Using id < (MAX(id) - LIMIT) retains rows with id >= (MAX - LIMIT), which is LIMIT+1 rows due to inclusive range (e.g., keep 50..1050 inclusive when MAX=1050, LIMIT=1000). If the intent is to keep exactly RETENTION_LIMIT rows, the condition should be id <= (MAX - LIMIT).
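
The arithmetic above can be checked with a small in-memory model. This is only a sketch: retainedIds is a hypothetical helper that mimics the DELETE predicate on an array, and it assumes contiguous ids 1..1050, as in the example.

```typescript
// Hypothetical in-memory model of the retention DELETE; not the code from telemetry.ts.
function retainedIds(ids: number[], limit: number, inclusive: boolean): number[] {
  if (ids.length === 0) return [];
  const cutoff = Math.max(...ids) - limit;
  // inclusive = false models `DELETE ... WHERE id <  cutoff` (rows with id >= cutoff survive)
  // inclusive = true  models `DELETE ... WHERE id <= cutoff` (rows with id >  cutoff survive)
  return ids.filter((id) => (inclusive ? id > cutoff : id >= cutoff));
}

// Contiguous ids 1..1050, LIMIT = 1000, so the cutoff is 50.
const ids = Array.from({ length: 1050 }, (_, i) => i + 1);
const keptWithLessThan = retainedIds(ids, 1000, false);
const keptWithLessOrEqual = retainedIds(ids, 1000, true);

console.log(keptWithLessThan.length);    // 1001 (ids 50..1050)
console.log(keptWithLessOrEqual.length); // 1000 (ids 51..1050)
```

If ids have gaps (for example after earlier deletes), both predicates keep fewer rows than this model suggests, but the strict `<` form still keeps the extra boundary row whenever that row exists.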

Recommendation

Use <= instead of <. Example:

DELETE FROM routing_decisions WHERE id <= (SELECT MAX(id) - ? FROM routing_decisions);
DELETE FROM failures WHERE id <= (SELECT MAX(id) - ? FROM failures);

Bind RETENTION_LIMIT as a parameter to avoid string interpolation. Optionally, guard against a NULL MAX(id) on an empty table with COALESCE.
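
One way to wire the recommended statements into TypeScript is sketched below. The SqlRunner interface is a hypothetical stand-in for whatever database client telemetry.ts uses (the source does not show it), and retentionStatement and applyRetention are illustrative names. Table names come from a fixed allowlist, so only the limit is passed as a bound parameter.

```typescript
// Hypothetical minimal DB interface; the real client in telemetry.ts is not shown in the source.
interface SqlRunner {
  run(sql: string, params: unknown[]): void;
}

// The two tables named in the recommendation.
const RETENTION_TABLES = ["routing_decisions", "failures"] as const;
type RetentionTable = (typeof RETENTION_TABLES)[number];

// Builds the DELETE with `<=` and a COALESCE guard for an empty table.
// The table name is interpolated only from the fixed allowlist above.
function retentionStatement(table: RetentionTable): string {
  return `DELETE FROM ${table} WHERE id <= (SELECT COALESCE(MAX(id), 0) - ? FROM ${table})`;
}

// Applies retention to every table, binding the limit as a parameter.
function applyRetention(db: SqlRunner, retentionLimit: number): void {
  for (const table of RETENTION_TABLES) {
    db.run(retentionStatement(table), [retentionLimit]);
  }
}
```

On an empty table the guarded subquery evaluates to 0 minus the limit, so no positive id matches and nothing is deleted.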

The agreement

Both frontier models flagged this within the same line range. AntFleet's unanimous gate fired — the finding posted on the PR.

Closure

Tweet thread template

tweet 1 of 8 · 140 / 280

Two frontier models reviewed PR #2 on 56f59a0d. Both found this bug: medium bug: Retention policy keeps RETENTION_LIMIT+1 rows (off-by-one)

tweet 2 of 8 · 120 / 280

The vulnerable code (src/providers/telemetry.ts:206-214): (full snippet at https://www.antfleet.dev/anatomy/8ff8c1af-0)

tweet 3 of 8 · 280 / 280

What Opus saw: "On a failed transaction, the queued metricUpdates/decisionQueue/failureQueue are NOT cleared after rollback. That means they will be retried on next flush, which is OK for retry, BUT: (a) if the failure is deterministic (e.g., bad schema column), the queues grow…

tweet 4 of 8 · 280 / 280

What GPT-5 saw: "Using id < (MAX(id) - LIMIT) retains rows with id >= (MAX - LIMIT), which is LIMIT+1 rows due to inclusive range (e.g., keep 50..1050 inclusive when MAX=1050, LIMIT=1000). If the intent is to keep exactly RETENTION_LIMIT rows, the condition should be id <= (MAX…

tweet 5 of 8 · 97 / 280

Both flagged the same line range. AntFleet's unanimous gate fired — the finding posted on the PR.

tweet 6 of 8 · 93 / 280

The fix landed in commit pending: (view diff at https://www.antfleet.dev/anatomy/8ff8c1af-0)

tweet 7 of 8 · 81 / 280

AntFleet reviews every PR with two frontier models. Only unanimous findings post.

tweet 8 of 8 · 77 / 280

Full anatomy + reasoning + diffs: https://www.antfleet.dev/anatomy/8ff8c1af-0

Paste into X composer one tweet at a time. X has no multi-tweet intent API.