AntFleet

Disagreement · d9ae4fa5-anthropic-0

Circuit breaker is tripped after a single retry exhaustion, even for the first provider tried

mismatch
repo 56f59a0d·PR #3·reviewed 4 days ago

Primary finding

Circuit breaker is tripped after a single retry exhaustion, even for the first provider tried

mediumbughigh
  • src/providers/orchestrator.ts:263-305
  • src/providers/orchestrator.ts:405-460
retryWithBackoff trips the circuit breaker after RETRY_BACKOFFS_MS attempts (3) and throws. The caller's catch then also marks fallbackTriggered and continues to the next provider. The circuit breaker cooldown is 5 minutes, so a single transient burst (e.g. brief 503) will mark the provider 'degraded' for 5 minutes after just 3 retries on one call. This is too aggressive and contradicts a typical failure-rate-based circuit breaker, and combined with EMA recordFailure (which already reduces successRate) it will rapidly demote providers from a single bad call. Worse, non-retryable errors (e.g. auth) never trip the breaker at all because retryWithBackoff returns early on non-retryable.

Recommendation

Trip the circuit breaker based on rolling failure rate or N consecutive failures instead of one retry-exhausted call. Also trip on non-retryable hard failures (e.g. 401) when appropriate.

Counterpart finding

Fallback reason is not preserved in success events and sendMessage telemetry logs generic errorType

mediummaintainabilityhigh
  • src/providers/orchestrator.ts:456-460
  • src/providers/orchestrator.ts:568-571
  • src/providers/orchestrator.ts:586-592
Although extractFallbackReason derives a specific reason in streamMessage, the success event that follows a fallback uses a generic 'server_error'. In sendMessage, failures are always logged as 'server_error' without extracting a specific reason. This reduces observability and complicates debugging and routing analytics.

Recommendation

Carry forward the specific fallback reason into the success event (store it when first determined) and use extractFallbackReason in sendMessage to log an accurate errorType. Consider including the reason in success metadata as well.

Why this didn't post

This finding didn't meet AntFleet's unanimous agreement threshold. Both frontier models review every PR independently; only findings they both flag with the same severity and category are posted to the PR. This one fell through.

read the methodology →