AntFleet

Anatomy · b1d71d08-0

State file rollback uses non-existent .bak and can cause data loss on write failure

high · data-loss · closed in 4b9b492
repo 6f7fc663 · PR #24 · reviewed 1 week ago · closed 1 week ago

The vulnerable code

skills/fleet-state/SKILL.md:291-306

291  TMP=$(mktemp)
292  jq --arg ts "$(date -u +%FT%TZ)" \
293    --arg today "$(date -u +%F)" \
294    --arg parent "$PARENT_REPO" \
295    --argjson totals "{\"total\":$N_TOTAL,\"power\":$N_POWER,\"active\":$N_ACTIVE,\"stale\":$N_STALE,\"cold\":$N_COLD,\"running\":$N_RUNNING,\"unreadable\":$N_UNREADABLE}" \
296    --argjson release_count "$RELEASE_COUNT" \
297    --arg spotlight_fork "$SPOTLIGHT_FORK" \
298  '
299    .parent = $parent |
300    .last_run = $ts |
301    .snapshot = {totals: $totals, release_count: $release_count, spotlight_fork: $spotlight_fork} |
302    .history = ((.history // []) + [{run_date: $today, totals: $totals, release_count: $release_count, spotlight_fork: $spotlight_fork}] | sort_by(.run_date) | .[-12:])
303  ' memory/topics/fleet-state.json > "$TMP"
304  mv "$TMP" memory/topics/fleet-state.json
305  jq empty memory/topics/fleet-state.json || { cp memory/topics/fleet-state.json.bak memory/topics/fleet-state.json; exit 1; }
306  ```

The reasoning

Opus

State write has no `.bak` creation step but recovery path references `memory/topics/fleet-state.json.bak`

medium · data-loss · high
  • skills/fleet-state/SKILL.md:270-290
  • skills/fleet-state/SKILL.md:292
  • skills/fleet-state/SKILL.md:39
The recovery branch unconditionally `cp memory/topics/fleet-state.json.bak memory/topics/fleet-state.json` but no step ever creates `.bak`. On the first run, or on any run where `.bak` was deleted, the recovery `cp` fails AND the just-written invalid JSON remains in place. The prose says "Keep one `.bak` rolling" but provides no command to do so — operators / LLMs will follow the code block literally and skip backup creation. Result: a single bad `jq` write irreversibly corrupts fleet-state.json, losing the 12-week history.
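
To make the failure mode concrete, here is a minimal reproduction sketch that runs in a throwaway directory. `WORKDIR`, `STATE`, and `RESTORED` are illustrative names that do not appear in the reviewed file, and the `exit 1` from the flagged line is left out so the script can report what happened.

```bash
#!/usr/bin/env bash
# Reproduction sketch: the rollback branch when no .bak exists on disk.
# WORKDIR, STATE and RESTORED are hypothetical names used only for this demo.
WORKDIR=$(mktemp -d)
STATE="$WORKDIR/fleet-state.json"
RESTORED=unset

# Simulate a truncated write that `mv` has already promoted to canonical.
printf '{"history": [' > "$STATE"

# Same shape as the flagged recovery line, minus the `exit 1`.
if ! jq empty "$STATE" 2>/dev/null; then
  if cp "$STATE.bak" "$STATE" 2>/dev/null; then
    RESTORED=yes
  else
    RESTORED=no   # no .bak was ever created, so there is nothing to restore
  fi
fi

echo "restored from .bak: $RESTORED"
echo "state file now contains: $(cat "$STATE")"
```

Because nothing in the write path creates `.bak`, the `cp` should fail and the truncated JSON should still be in the state file when the script finishes.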

Recommendation

Before `mv "$TMP" memory/topics/fleet-state.json`, add `cp memory/topics/fleet-state.json memory/topics/fleet-state.json.bak 2>/dev/null || true`. Also validate `$TMP` with `jq empty` *before* the mv, so the corrupt file is never promoted to canonical in the first place.

GPT-5

State file rollback uses non-existent .bak and can cause data loss on write failure

high · data-loss · high
  • skills/fleet-state/SKILL.md:291-306
  • skills/fleet-state/SKILL.md:308
The write flow moves the new file into place before validation, then attempts to restore from a .bak that is never created in this step (and not created anywhere else in the document). If jq empty fails (e.g., interrupted write, disk error), the restore path will also fail due to missing backup, leaving a corrupt or empty state file and losing prior history.

Recommendation

- Create a backup before overwriting: `if [ -f memory/topics/fleet-state.json ]; then cp memory/topics/fleet-state.json memory/topics/fleet-state.json.bak; fi`
- Prefer an atomic update: write to `$TMP`, validate it (`jq empty "$TMP"`), then move it into place (`mv "$TMP" memory/topics/fleet-state.json`). Only update the on-disk file after successful validation.
- If validation after the move is kept, ensure the backup is created before `mv`, and have the error path log `FLEET_STATE_STATE_CORRUPT` and restore from the newly created `.bak`.
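
The recommendations above can be sketched as a single validate-then-promote helper. This is an illustration under stated assumptions, not the contents of commit 4b9b492: `promote_state` is a hypothetical function name, and the non-empty check (`-s`) is an extra guard added here because `jq empty` reads an empty file without error.

```bash
#!/usr/bin/env bash
# Sketch of validate-then-promote. promote_state is a hypothetical helper,
# not a function from the reviewed repository.
promote_state() {
  local candidate=$1 state=$2

  # Validate the candidate before it can replace the canonical file.
  # The -s (non-empty) test is an extra guard added in this sketch.
  if ! { [ -s "$candidate" ] && jq empty "$candidate" 2>/dev/null; }; then
    echo "FLEET_STATE_STATE_CORRUPT: rejected $candidate" >&2
    rm -f "$candidate"
    return 1
  fi

  # Keep one rolling backup of the last good state, if there is one.
  if [ -f "$state" ]; then
    cp "$state" "$state.bak" || return 1
  fi

  # Promote only after validation succeeded.
  mv "$candidate" "$state"
}
```

In the original flow this would replace the `mv` and the post-move `jq empty` check: redirect the `jq` output to `"$TMP"` as before, then call `promote_state "$TMP" memory/topics/fleet-state.json || exit 1`. One design note: `mv` is only an atomic rename when the temporary file lives on the same filesystem as the target, so creating `$TMP` next to the state file rather than in the default `mktemp` location may be worth considering.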

The agreement

Both frontier models flagged this within the same line range. AntFleet's unanimous gate fired — the finding posted on the PR. Closed in 4b9b492.

The fix

291  TMP=$(mktemp)
292  jq --arg ts "$(date -u +%FT%TZ)" \
293    --arg today "$(date -u +%F)" \
294    --arg parent "$PARENT_REPO" \
295    --argjson totals "{\"total\":$N_TOTAL,\"power\":$N_POWER,\"active\":$N_ACTIVE,\"stale\":$N_STALE,\"cold\":$N_COLD,\"running\":$N_RUNNING,\"unreadable\":$N_UNREADABLE}" \
296    --argjson release_count "$RELEASE_COUNT" \
297    --arg spotlight_fork "$SPOTLIGHT_FORK" \
298  '
299    .parent = $parent |
300    .last_run = $ts |
301    .snapshot = {totals: $totals, release_count: $release_count, spotlight_fork: $spotlight_fork} |
302    .history = ((.history // []) + [{run_date: $today, totals: $totals, release_count: $release_count, spotlight_fork: $spotlight_fork}] | sort_by(.run_date) | .[-12:])
303  ' memory/topics/fleet-state.json > "$TMP"
304  mv "$TMP" memory/topics/fleet-state.json
305  jq empty memory/topics/fleet-state.json || { cp memory/topics/fleet-state.json.bak memory/topics/fleet-state.json; exit 1; }
306  ```

Closure

Closed 1 week ago

SHA: 4b9b49251c8c9808bf147d55aa2930352af2e8c0

Tweet thread template

tweet 1 of 8 · 170 / 280

Two frontier models reviewed PR #24 on 6f7fc663. Both found this bug: high data-loss: State file rollback uses non-existent .bak and can cause data loss on write failure

tweet 2 of 8 · 121 / 280

The vulnerable code (skills/fleet-state/SKILL.md:291-306): (full snippet at https://www.antfleet.dev/anatomy/b1d71d08-0)

tweet 3 of 8 · 280 / 280

What Opus saw: "The recovery branch unconditionally `cp memory/topics/fleet-state.json.bak memory/topics/fleet-state.json` but no step ever creates `.bak`. On the first run, or on any run where `.bak` was deleted, the recovery `cp` fails AND the just-written invalid JSON remain…

tweet 4 of 8 · 280 / 280

What GPT-5 saw: "The write flow moves the new file into place before validation, then attempts to restore from a .bak that is never created in this step (and not created anywhere else in the document). If jq empty fails (e.g., interrupted write, disk error), the restore path wi…

tweet 5 of 8 · 97 / 280

Both flagged the same line range. AntFleet's unanimous gate fired — the finding posted on the PR.

tweet 6 of 8 · 93 / 280

The fix landed in commit 4b9b492: (view diff at https://www.antfleet.dev/anatomy/b1d71d08-0)

tweet 7 of 8 · 81 / 280

AntFleet reviews every PR with two frontier models. Only unanimous findings post.

tweet 8 of 8 · 77 / 280

Full anatomy + reasoning + diffs: https://www.antfleet.dev/anatomy/b1d71d08-0

Paste into X composer one tweet at a time. X has no multi-tweet intent API.