AntFleet

Anatomy · d9ae4fa5-0

Circuit breaker is tripped after a single retry exhaustion, even for the first provider tried

mediumbug
repo 56f59a0d·PR #3·reviewed 4 days ago

The vulnerable code

src/providers/orchestrator.ts:263-305

Code snippet unavailable.

The reasoning

Opus

Circuit breaker is tripped after a single retry exhaustion, even for the first provider tried

mediumbughigh
  • src/providers/orchestrator.ts:263-305
  • src/providers/orchestrator.ts:405-460
retryWithBackoff trips the circuit breaker after RETRY_BACKOFFS_MS attempts (3) and throws. The caller's catch then also marks fallbackTriggered and continues to the next provider. The circuit breaker cooldown is 5 minutes, so a single transient burst (e.g. brief 503) will mark the provider 'degraded' for 5 minutes after just 3 retries on one call. This is too aggressive and contradicts a typical failure-rate-based circuit breaker, and combined with EMA recordFailure (which already reduces successRate) it will rapidly demote providers from a single bad call. Worse, non-retryable errors (e.g. auth) never trip the breaker at all because retryWithBackoff returns early on non-retryable.

Recommendation

Trip the circuit breaker based on rolling failure rate or N consecutive failures instead of one retry-exhausted call. Also trip on non-retryable hard failures (e.g. 401) when appropriate.

GPT-5

Watchdog timer is not cleared on error/fallback, leaking a pending timer per attempt

mediumbughigh
  • src/providers/orchestrator.ts:391-409
  • src/providers/orchestrator.ts:431-433
  • src/providers/orchestrator.ts:514-518
On success, the watchdog timer is cleared. On exceptions/fallback (catch path), the timer is not cleared, leaving a pending timeout that will fire later and abort the now-orphaned controller. This creates dangling timers per failed attempt and unnecessary abort callbacks, wasting resources and complicating debugging.

Recommendation

Clear watchdogTimer in a finally block so it runs on both success and error. Example: move `if (watchdogTimer) clearTimeout(watchdogTimer);` into the finally or ensure it executes before continuing to the next candidate.

The agreement

Both frontier models flagged this within the same line range. AntFleet's unanimous gate fired — the finding posted on the PR.

Closure

Tweet thread template

tweet 1 of 8175 / 280

Two frontier models reviewed PR #3 on 56f59a0d. Both found this bug: medium bug: Circuit breaker is tripped after a single retry exhaustion, even for the first provider tried

tweet 2 of 8123 / 280

The vulnerable code (src/providers/orchestrator.ts:263-305): (full snippet at https://www.antfleet.dev/anatomy/d9ae4fa5-0)

tweet 3 of 8280 / 280

What Opus saw: "retryWithBackoff trips the circuit breaker after RETRY_BACKOFFS_MS attempts (3) and throws. The caller's catch then also marks fallbackTriggered and continues to the next provider. The circuit breaker cooldown is 5 minutes, so a single transient burst (e.g. brie…

tweet 4 of 8280 / 280

What GPT-5 saw: "On success, the watchdog timer is cleared. On exceptions/fallback (catch path), the timer is not cleared, leaving a pending timeout that will fire later and abort the now-orphaned controller. This creates dangling timers per failed attempt and unnecessary abort…

tweet 5 of 897 / 280

Both flagged the same line range. AntFleet's unanimous gate fired — the finding posted on the PR.

tweet 6 of 893 / 280

The fix landed in commit pending: (view diff at https://www.antfleet.dev/anatomy/d9ae4fa5-0)

tweet 7 of 881 / 280

AntFleet reviews every PR with two frontier models. Only unanimous findings post.

tweet 8 of 877 / 280

Full anatomy + reasoning + diffs: https://www.antfleet.dev/anatomy/d9ae4fa5-0

Paste into X composer one tweet at a time. X has no multi-tweet intent API.