Guardrails for Autonomous Agents: What I Learned by Building One
The Backstory
I am building a consumer mobile application. I have multiple projects running alongside a full-time role at a large UK bank and the usual demands of a life outside work. Time and cost are the binding constraints on everything I want to build.
That constraint is what made autonomous agents genuinely attractive, not as a novelty but as a practical answer to a real problem. If agents could do the heavy lifting across research, legal analysis, design, build and testing, I could bring something to market at a pace and cost that would otherwise be impossible. The question I wanted to answer was not whether agents could do impressive things in a demonstration. It was whether they could do real, consequential work autonomously, and how much of the build I could genuinely hand over to them.
I also came in with my eyes open. The failure stories were not hard to find. Agents burning through token budgets before anyone noticed. Autonomous systems making decisions nobody had anticipated or authorised. The gap between what an agent demo looks like and what an agent deployment actually does in an unsupervised environment. I had heard enough of those stories to know that the agent workflow question and the control question were inseparable. You cannot think seriously about one without thinking seriously about the other.
So the guardrail design was not an afterthought. It was part of the original intent. I wanted to understand how an autonomous agent pipeline actually operates in practice, where the failure modes are, and what genuine human oversight looks like when the system is doing real work rather than generating text for review. The build became a learning vehicle as much as a product build. Both of those things are true simultaneously and neither one diminishes the other.
The pipeline is being built using Claude Code and the Anthropic SDK, with Python and CrewAI as the orchestration layer. The agents are not running in a sandbox. They are calling real APIs, writing real code, and operating with real cost consequences attached to every decision.
What is an Agent
Most people who have spent time with Claude or ChatGPT have a working mental model of how a large language model behaves. You give it a prompt. It gives you a response. The model does not do anything between your inputs. It does not act in the world.
An agent is different in one specific and important way. An agent is a language model that has been given tools and the ability to decide when to use them. It can search the web, read and write files, call APIs, execute code, and pass outputs to other agents. It does all of this in a loop, deciding what to do next based on what it just did, without waiting for a human to prompt each step.
When you prompt a model, the worst outcome is a bad answer. You discard it and try again. Nothing happened in the world. When an agent runs, things happen in the world. Files get written. APIs get called. Costs accumulate. The model is no longer generating text for a human to evaluate. It is taking actions with consequences.
That distinction is why guardrails are not optional. They are a must-have, and their design deserves the same seriousness as any other critical system control.
What Guardrails Are and What They Need to Do
When an autonomous agent is doing real work with real consequences, the question of control becomes immediate. How do you ensure the system stays within intended boundaries when it is making decisions and taking actions without waiting for human direction at each step?
The instinct is to write detailed instructions. Tell the agent what it must not do. This feels like control. It is not. An instruction exists inside the agent’s reasoning, one input among many that the agent weighs as it pursues its objective. A sufficiently goal-directed agent will route around an instruction if doing so appears to serve the task. Not because it is malicious. Because it is literal.
A guardrail operates independently of the agent’s reasoning. Some intercept actions before they execute. Others log and flag for human review. Others are phase boundary gates the system cannot pass without explicit human authorisation. Together they form a layered control framework, not a single mechanism but a system of interlocking constraints. The distinction is between a sign that says do not enter and a locked door. The sign relies on compliance. The door does not care what the agent intended.
This is not a new governance problem. Regulated industries have been solving structural enforcement for decades. Segregation of duties. Four eyes principles. Mandatory audit trails. Hard limits on what any single actor can authorise without a second signatory. Autonomous agent systems sit inside exactly the same class of problem. The terminology is new. The governance challenge is not.
What We Built and How
The pipeline I am building has six phases. Research, legal analysis, design, build, test, and deployment. Each phase is handled by specialist agents with defined roles and defined outputs. The agents do not carry context between phases in memory. Every output is written to disk. Every subsequent phase reads from the previous phase’s output directory. That design choice was deliberate. It means every intermediate output is human readable and auditable at any point without having to interrogate the system.
Before any phase begins, the system reads a plain text decision log file on my local machine and scans for a specific approval marker. A phase will not start unless that file contains the approval marker for the previous phase. That marker has to be written by a human. There is no mechanism in the system for an agent to generate its own approval. That is not an instruction to the agent. It is a structural impossibility built into the pipeline.
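As a rough illustration rather than the pipeline's actual code, the gate can be sketched in a few lines. The marker format `APPROVED: <phase>`, the phase identifiers, and the function names are all assumptions made for this example, as is the choice to let the first phase start without a marker.

```python
# Phase order taken from the article; the identifiers are illustrative.
PHASES = ["research", "legal_analysis", "design", "build", "test", "deployment"]


def phase_is_approved(log_text: str, phase: str) -> bool:
    """Return True if a line of the decision log is the approval marker for `phase`.

    The "APPROVED: <phase>" format is an assumption made for this sketch.
    """
    marker = f"APPROVED: {phase}"
    return any(line.strip() == marker for line in log_text.splitlines())


def can_start(phase: str, log_text: str) -> bool:
    """Allow a phase to start only if the previous phase has been approved."""
    index = PHASES.index(phase)  # raises ValueError for an unknown phase
    if index == 0:
        # Assumption: the first phase has no predecessor and needs no marker.
        return True
    return phase_is_approved(log_text, PHASES[index - 1])
```

The caller would read the decision log from disk and pass its text in. The gate only holds if that file sits somewhere the agents cannot write to, which is a job for the write restrictions rather than for this function.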
That phase gate design is the spine of the human oversight model. But it operates at a coarse level, one decision point per phase. Within each phase, agents are making dozens of decisions and taking dozens of actions autonomously. That is where the layered guardrail framework operates.
The framework has four layers, each doing different work, escalating in the severity of their response.
The first layer is mandatory action logging. Before every significant action, the agent writes a structured entry to the action log: what it is about to do, why, which guardrail category the action falls under, and its confidence level. This log runs in real time. It is not consumed by the pipeline. It exists for the human reviewer and, if necessary, for an auditor. Every guardrail trigger, every flag, every block, every cost event is permanently recorded in a human-readable format that survives the session.
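One way such an entry could look, as a minimal sketch: one JSON object per line, appended to a file. The field names, the `ActionLogEntry` class, and the JSON Lines layout are assumptions for illustration, not the format the pipeline actually uses.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class ActionLogEntry:
    action: str              # what the agent is about to do
    reason: str              # why it is doing it
    guardrail_category: str  # hypothetical values: "log", "flag", "block", "hard_stop"
    confidence: str          # the agent's stated confidence level
    timestamp: str = ""      # filled in at write time if left empty


def format_entry(entry: ActionLogEntry) -> str:
    """Serialise one entry as a single line of JSON."""
    record = asdict(entry)
    if not record["timestamp"]:
        record["timestamp"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(record, sort_keys=True)


def append_entry(path: str, entry: ActionLogEntry) -> None:
    """Append the entry to the log file; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_entry(entry) + "\n")
```

Append-only writes keep the record intact across the session, and one object per line stays readable in a plain text editor.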
The second layer is flags. These are conditions that do not stop work but require explicit human attention at the phase review. Any research finding below a defined confidence threshold. Any API response returning an unexpected schema. Any cost projection exceeding the phase budget by more than a defined margin. The flag goes into the action log and into the phase report. The human must actively resolve it, not just acknowledge it. The agent continues working. The flag ensures the human cannot miss what happened.
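The three flag conditions above can be sketched as one pure function that returns the flags raised, leaving the agent free to carry on. The ten percent budget margin, the flag names, and the key-set comparison used for "unexpected schema" are assumptions for this example. The seventy percent confidence threshold is the one used later in this article.

```python
from typing import Dict, List, Set

CONFIDENCE_THRESHOLD = 0.70  # seventy percent, as used elsewhere in the article
BUDGET_MARGIN = 0.10         # hypothetical margin; the real value is not stated


def collect_flags(
    confidence: float,
    response: Dict,
    expected_keys: Set[str],
    projected_cost: float,
    phase_budget: float,
) -> List[str]:
    """Return the flags raised by one step. An empty list means nothing to review."""
    flags = []
    if confidence < CONFIDENCE_THRESHOLD:
        flags.append("low_confidence")
    if set(response.keys()) != expected_keys:
        flags.append("unexpected_schema")
    if projected_cost > phase_budget * (1 + BUDGET_MARGIN):
        flags.append("cost_projection_over_budget")
    return flags
```

Nothing here stops the work. The returned flags would be written to the action log and the phase report, where a human has to resolve them.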
The third layer is blocks. These halt the specific task that triggered the violation, log it, and surface it in the phase report for mandatory human resolution before sign-off. A build agent importing directly from an infrastructure provider rather than through the designated service abstraction layers triggers a block. An agent writing outputs outside the approved directory structure triggers a block. The distinction from a flag is precise: a flag lets the agent continue. A block stops the task. The phase cannot be closed until every block is resolved and recorded.
The fourth layer is hard stops. These terminate the running process immediately. No warning. No opportunity for the agent to self-correct. The program ends. Hard stops cover the absolute boundary: any attempt to deploy or publish to a live environment, any attempt to commit to a paid external service, any outbound network call not on the approved allowlist, any write outside the project folder. These are not instructions to the agent. They are enforced at the Python wrapper layer, which intercepts the action before it executes. The agent never gets to decide whether to comply. The action cannot complete.
The same principle applies to blocks. The wrapper intercepts the triggering action before execution, logs the violation, and halts the task. The agent’s intent is irrelevant. The structural layer acts first.
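A minimal sketch of two of those interceptions, the write check and the network allowlist, assuming the wrapper sees every path and URL before the action runs. `PROJECT_ROOT`, `ALLOWED_HOSTS`, `HardStop`, and the function names are hypothetical, and the single allowlisted host is only an example.

```python
from pathlib import Path
from urllib.parse import urlparse

PROJECT_ROOT = Path("project").resolve()  # hypothetical project folder
ALLOWED_HOSTS = {"api.anthropic.com"}     # illustrative allowlist, not the real one


class HardStop(SystemExit):
    """Raised to end the run. Left uncaught, it terminates the Python process."""


def check_write(path: str) -> Path:
    """Resolve `path` against the project folder and refuse anything outside it."""
    target = (PROJECT_ROOT / path).resolve()
    try:
        target.relative_to(PROJECT_ROOT)
    except ValueError:
        raise HardStop(f"write outside project folder: {target}") from None
    return target


def check_host(url: str) -> str:
    """Refuse any outbound call whose host is not on the allowlist."""
    host = urlparse(url).hostname
    if host not in ALLOWED_HOSTS:
        raise HardStop(f"host not on allowlist: {host}")
    return host
```

Resolving the path before comparing it is what defeats `../` tricks. A stricter variant could call `os._exit` instead of raising, so that no `except` clause further up the stack can swallow the stop.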
One additional control sits across all four layers. A token budget with two thresholds. A warning at seventy percent consumption that creates a window for human intervention, and a hard stop at one hundred percent. This matters more than it sounds. During Phase 1 of the build the budget check was initially implemented at task completion rather than after each API call. The phase consumed 652,878 tokens against a 500,000 limit before the check triggered. The budget had already been exceeded by roughly thirty percent (152,878 tokens) by the time the control ran. The warning threshold, implemented after that event, transforms the budget from a retrospective cost control into a prospective one. You find out the system is approaching its limit while you can still act, not after it has already overrun.
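The two-threshold budget can be sketched as a small counter that is updated after every API call rather than at task completion. The class and method names are assumptions for this example. The 70 and 100 percent thresholds and the token figures come from the text above.

```python
class TokenBudget:
    """Running token counter with a warning threshold and a hard ceiling."""

    def __init__(self, limit: int, warn_percent: int = 70) -> None:
        self.limit = limit
        self.warn_percent = warn_percent
        self.used = 0

    def record(self, tokens: int) -> str:
        """Add the tokens from one API call and return the resulting status."""
        self.used += tokens
        if self.used >= self.limit:
            return "hard_stop"
        # Integer arithmetic avoids floating point edge cases at the threshold.
        if self.used * 100 >= self.limit * self.warn_percent:
            return "warn"
        return "ok"


def overrun_percent(used: int, limit: int) -> float:
    """Percentage by which `used` exceeds `limit`, to one decimal place."""
    return round((used - limit) / limit * 100, 1)
```

With the Phase 1 figures, `overrun_percent(652_878, 500_000)` gives 30.6. Even with a per-call check, a single large call can still carry the total past the ceiling. What the check guarantees is that no further call is made once it has.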
There is one further design decision worth naming. Agents in the test phase run a guardrail compliance audit of the previous phase’s action log as their first task before any testing begins. Any discrepancy between what the build agents logged they would do and what the outputs show they did is flagged before testing proceeds. Guardrails do not just operate within phases. They compound across them.
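At its simplest, that audit is a comparison between two sets: the outputs the build agents logged and the outputs actually found on disk. The sketch below assumes both are available as lists of file paths. The function name and the two result keys are hypothetical.

```python
from typing import Dict, Iterable, List


def audit_outputs(
    logged_paths: Iterable[str], actual_paths: Iterable[str]
) -> Dict[str, List[str]]:
    """Compare what the action log says was written with what actually exists."""
    logged, actual = set(logged_paths), set(actual_paths)
    return {
        # Files present on disk that no log entry announced.
        "unlogged_outputs": sorted(actual - logged),
        # Files the log promised that were never produced.
        "missing_outputs": sorted(logged - actual),
    }
```

Either list being non-empty would be raised as a flag before testing proceeds.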
What the Build Actually Taught Me
The wrapper has no reasoning. You have to supply it.
The agent reasons. It evaluates context, weighs options, and decides what to do next based on a sophisticated understanding of its task and environment. The Python wrapper does none of that. It operates on rules, not reasoning. It pattern matches against a defined set of action types and either permits or intercepts. It has no understanding of why the agent made a decision or what it was trying to achieve.
This means every prohibited action type has to be anticipated and defined in advance. Every permitted action has to be explicitly allowlisted. The wrapper cannot infer. It cannot give the agent the benefit of the doubt. It has to be written with a precise and complete model of what agents are allowed to do in each phase, encoded as rules rather than reasoning.
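In code, "rules rather than reasoning" comes down to a lookup table with a default of deny. The sketch below is illustrative only: the action names, the per-phase allowlists, and the choice to treat any unlisted action as a block are assumptions, not the pipeline's real rule set.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    HARD_STOP = "hard_stop"


# Actions that must never execute in any phase (hypothetical names).
NEVER = {"deploy_live", "purchase_service"}

# Explicit allowlist per phase (hypothetical contents).
ALLOWED = {
    "research": {"web_search", "write_output"},
    "build": {"write_code", "write_output", "run_tests"},
}


def decide(phase: str, action_type: str) -> Decision:
    """Pure rule lookup: no context, no inference, no benefit of the doubt."""
    if action_type in NEVER:
        return Decision.HARD_STOP
    if action_type in ALLOWED.get(phase, set()):
        return Decision.ALLOW
    # Assumption: anything not explicitly allowlisted is blocked.
    return Decision.BLOCK
```

The function is trivial. The specification burden sits in the two tables, which have to be complete before the first agent runs.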
That is a significant specification burden. It requires the builder to think through every action category an agent might take, which ones are safe, which ones require logging, which ones require halting, and which ones must never execute under any circumstances. The wrapper is not intelligent. But the thinking required to build it has to be.
Instructions are not guardrails.
This sounds obvious when stated plainly. It was not obvious when I was designing the system for the first time.
The instinct is to write detailed instructions. Tell the agent what it must not do. Be explicit about the boundaries. That feels like control, but an instruction exists inside the agent’s reasoning. It is one input among many that the agent weighs as it pursues its objective. A sufficiently goal-directed agent will route around an instruction if doing so appears to serve the task. Not because it is malicious. Because it is literal.
A guardrail operates outside the agent’s reasoning entirely. It intercepts before the action executes. It does not ask the agent to comply. The distinction between an instruction and a structural guardrail is the difference between a sign that says do not enter and a locked door. Building the system made that difference visceral in a way that reading about it never could.
Confidence scoring has to be designed, not assumed.
The initial design included a confidence threshold. Any finding below seventy percent would be flagged for human review. What was not designed was how agents would determine their confidence level or how they would communicate it.
In practice, without explicit design, agents either express uniform high confidence, which makes the threshold useless, or express vague uncertainty, which gives the human reviewer nothing to act on. The solution was to require agents to label not just their confidence level but its basis. High confidence, multiple verified sources. Medium confidence, single source. Unverified, no reliable source found. That labelling gives the human reviewer enough information to calibrate their own judgement rather than simply trusting or distrusting a number. The insight is that a confidence threshold is only as useful as the confidence reporting that feeds it. One without the other is not a control.
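The labelling scheme can be sketched as a mapping from evidence to a level plus its basis. The three labels and their wording come from the paragraph above. Counting verified sources as the input, and sending everything below "high" to human review, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceLabel:
    level: str  # "high", "medium" or "unverified"
    basis: str  # why the agent claims that level


def label_confidence(verified_sources: int) -> ConfidenceLabel:
    """Derive the label from the number of sources backing a finding."""
    if verified_sources >= 2:
        return ConfidenceLabel("high", "multiple verified sources")
    if verified_sources == 1:
        return ConfidenceLabel("medium", "single source")
    return ConfidenceLabel("unverified", "no reliable source found")


def needs_review(label: ConfidenceLabel) -> bool:
    """Assumption: anything below high confidence is flagged for a human."""
    return label.level != "high"
```

Because the basis travels with the level, the reviewer sees why a finding was labelled as it was and is not left with a bare number.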
The token budget is a product decision, not a technical one.
The initial budget for Phase 1 was set at 500,000 tokens. It felt like a large number. The phase consumed 652,878 tokens before the budget check triggered, because the check was implemented at task completion rather than after each API call. By the time the control ran the budget had already been exceeded by roughly thirty percent.
The lesson was not just to move the check. It was to recognise that a token budget requires two thresholds, not one. A warning at seventy percent consumption and a hard stop at one hundred. The warning is the more important of the two. It creates a window for human intervention before the ceiling is hit. Without it the budget functions as a retrospective record of what was spent. With it the budget becomes a prospective control. That distinction, between knowing what happened and being able to act before it does, is the difference between cost reporting and cost governance.
The phase report is where the value actually is.
The phase report was conceived as a summary document, a convenient handoff between phases. In practice it turned out to be the primary mechanism through which human oversight is viable at scale.
Without a well-structured phase report, a human reviewing an autonomous agent run must either read hundreds of action log entries or trust the outputs without understanding how they were produced. Neither is acceptable when the outputs have real consequences. The phase report does the synthesis work. It distils a complex autonomous run into the specific decisions made, the flags raised, the blocks triggered, and the assumptions the agent made that the human needs to evaluate. It is what makes thirty minutes of human review meaningful rather than perfunctory.
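The mechanical part of that synthesis can be sketched as a reduction over the action log. The entry shape used here (`category`, `summary`, `resolved`) and the report keys are assumptions for illustration. A real report would also carry the decisions and assumptions that need human judgement.

```python
from typing import Dict, List


def build_phase_report(entries: List[Dict]) -> Dict:
    """Condense action log entries into the items a reviewer must look at."""
    flags = [e for e in entries if e["category"] == "flag"]
    blocks = [e for e in entries if e["category"] == "block"]
    unresolved = [
        e["summary"] for e in flags + blocks if not e.get("resolved", False)
    ]
    return {
        "actions_logged": len(entries),
        "flags": [e["summary"] for e in flags],
        "blocks": [e["summary"] for e in blocks],
        "unresolved": unresolved,
        # The phase cannot be closed while anything remains unresolved.
        "can_close": not unresolved,
    }
```

Hundreds of log lines collapse into a handful of items, each of which a human can actually evaluate.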
The investment in designing the phase report structure returned more value than almost any other single decision in the build. It was the last thing I thought about and should have been one of the first.
Architectural principles need structural enforcement, not just documentation.
Architecture documents routinely state that systems should not be tightly coupled to specific infrastructure providers. Teams agree. Standards get written. And then during the build, the path of least resistance is to import directly from the infrastructure SDK, because it is faster and the standard is not watching.
In this build, infrastructure portability was made a block-level guardrail. Any agent importing directly from the infrastructure provider rather than through the designated service abstraction layers triggers a block. The violation is logged, halted, and surfaced in the phase report for human resolution.
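If the generated code were Python, a check of this kind could be sketched with the standard library's `ast` module, which reads the imports without executing the file. The forbidden package names below are placeholders for whichever provider SDK applies, and the function name is hypothetical.

```python
import ast
from typing import List, Tuple

# Placeholder provider SDKs; the real list depends on the infrastructure in use.
FORBIDDEN_ROOTS = {"boto3", "firebase_admin"}


def find_direct_provider_imports(source: str) -> List[Tuple[int, str]]:
    """Return (line number, module) for every import of a forbidden package."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            if name.split(".")[0] in FORBIDDEN_ROOTS:
                violations.append((node.lineno, name))
    return sorted(violations)
```

A non-empty result would trigger the block. Imports that go through the abstraction layer pass untouched because their root package is not on the forbidden list.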
The portability was achieved not because the agents were instructed to build in a portable way, but because violating it triggered a structural consequence. The guardrail enforced the architectural principle in a way that the principle stated in a document never could. That observation extends beyond this build. A significant proportion of what sits in architecture standards documents in large organisations exists as aspiration rather than enforcement. The question worth asking is how much of it would survive if the path of least resistance led somewhere it was not supposed to go.
When a Business Process Becomes AI First
The guardrail question becomes real the moment an organisation takes a business process and redesigns it with autonomous agents as the primary actor. Not AI assisted, where a human does the work and AI supports them. AI first, where the agent does the work and the human provides oversight at defined points.
That distinction matters because it changes the nature of the control problem. In an AI assisted process the human is in the loop continuously. Their judgement is the primary control. In an AI first process the agent is making decisions and taking actions autonomously, at a volume and speed that makes continuous human review impossible. The controls that governed the human process do not disappear. They have to be re-expressed for a system that does not reason the way a human does, does not self-correct the way a human does, and does not know when it is operating outside its reliable range unless you design that knowledge in explicitly.
Every business process that makes this transition needs to answer four questions before a single agent runs in production. What is the agent permitted to do and not do, expressed as a defined and auditable boundary rather than as a prompt instruction? What confidence level is required before an agent output can be acted on without human review, and how was that threshold derived? Where are the human oversight points, who is accountable for them, and what happens when a reviewer waves something through that should have been escalated? And what does the audit trail look like, who owns it, and what obligations are attached to it?
These are not technology questions. They are process design and governance questions. The technology function builds the mechanisms. The answers to these questions are owned elsewhere.
Where This Leaves Me
I started this build with two questions. How much of a real software product could autonomous agents genuinely deliver? And how much control would I actually have when things went wrong?
I have partial answers to both. The agents can do substantial, consequential work across research, legal analysis, design, and build. The control question turned out to be more demanding than I anticipated, not because the mechanisms are complicated, but because designing structural enforcement from first principles requires a level of anticipation and specificity that is easy to underestimate from the outside.
The more substantial the work agents are trusted to do, the more substantial the controls need to be. That is not a reason to limit what agents do. It is the condition under which expanding what they do is responsible.
What I can say at this point is that the distinction between instruction and structure, between telling an agent what not to do and building a system that prevents it, is not theoretical. It is the difference between a system you can govern and one you are hoping behaves.