Skip to main content

Why State Files Break Down for AI Agents

State files were designed around assumptions that AI agents violate: one writer at a time, changes batched into infrequent human-reviewed applies, and a workflow that tolerates the state being stale between refreshes. Agents are concurrent, act continuously, and reason from whatever facts they are given - if those facts come from a state file, the agent reasons from a stale, partial copy of reality.

The assumptions, made explicit

The state-file model works when:

  1. Writes are serialized. State locking assumes contention is rare. Human teams apply a few times a day; lock contention is an edge case.
  2. Reads happen at plan time. State is refreshed when a human runs plan; staleness in between is invisible because nobody is looking.
  3. The managed set is the relevant set. The state file describes only resources the tool created. The workflow assumes nothing important exists outside it.
  4. Errors are repaired by experts. Corrupted or divergent state is fixed by a practitioner doing state surgery (imports, moves, manual edits).

Each assumption fails under agentic operation. Agents act in seconds, so serialized writes become a throughput ceiling and lock contention the norm. Agents observe continuously, so between-refresh staleness becomes a stream of false premises. Agents are routinely asked about the whole environment ("is anything publicly exposed?"), and the answer outside the managed set is precisely what matters. And an agent that corrupts state cannot reliably perform its own state surgery - the repair loop assumed human judgment.

The deeper problem: the copy is the interface

All of these reduce to one design decision: the state file makes a copy of infrastructure state the system's working interface. Every copy has a synchronization problem, and every consumer of the copy inherits it. Infrastructure drift is not an operational accident; it is the steady-state behavior of a copy in an environment where reality changes through other channels - consoles, autoscalers, other tools, other agents.

Human workflows absorb this with process: change freezes, single-tool mandates, periodic reconciliation. Agentic environments cannot - the value of agents is that they act often and independently, which is exactly the traffic pattern that makes a coordinated copy unmaintainable.

The replacement primitive

The control plane itself is already a consistent, authoritative, concurrent-safe state store - the providers run it. What agents need is an interface to it that fits how they work: declarative queries over current state, uniform across providers, with mutations expressed in the same grammar and gated by policy. That is the model StackQL implements:

SELECT instance_id, instance_type, launch_time
FROM aws.ec2.instances
WHERE region = 'us-east-1';

The result is true at execution time, includes unmanaged resources, requires no lock, and needs no repair workflow - there is nothing to corrupt. Identifier mapping and dependency ordering (the state file's legitimate jobs) move into the query engine and deployment orchestration (stackql-deploy) rather than into a mutable artifact the agent must keep consistent.