Safety

Safety in agentic AI isn't about wrapping a chatbot in a content filter. It's the engineering discipline of constraining autonomous systems so they do what you intended — and nothing else. Guardrails, security harnesses, approval gates, and human-in-the-loop checkpoints. The brakes on the system.

What Safety Means in Agentic AI

Content filtering is table stakes. Real safety is about controlling what autonomous systems can do, what data they can access, and how far they can go without human approval.

When an AI agent can read your database, call external APIs, send emails, and modify files, "safety" takes on a very different meaning than it does for a chatbot. Every tool call is a potential side effect. Every piece of context is a potential data leak. Every prompt is a potential injection vector. I design safety as a first-class architectural concern — not an afterthought bolted on at the end.

Input Validation

Every input to the system gets validated before it reaches the model. Schema enforcement on structured inputs. Length limits, encoding checks, and format validation on freeform text. The goal is to reject malformed or malicious inputs before they ever enter the reasoning loop.

Output Filtering

What the model generates isn't automatically what the user sees. Output filters catch hallucinated data, sensitive information leakage, off-topic responses, and content that violates policy. For agentic systems, this extends to validating tool call parameters before execution.

Scope Constraints

Agents should only be able to do what they're explicitly authorised to do. Scope constraints define the boundaries: which tools an agent can call, which data sources it can query, which actions it can take, and which topics it should refuse to engage with. Least privilege, applied to AI.

Prompt Injection Protection

When agents consume external content — web pages, documents, emails, database records — that content can contain instructions designed to hijack the agent's behaviour. I build layered defences: input sanitisation, instruction hierarchy enforcement, and detection systems that flag suspicious patterns before they execute.

Data Leakage Prevention

Agents that have access to sensitive data need guardrails that prevent that data from leaking into outputs, logs, or downstream systems. This includes PII detection, credential scanning, and context isolation — ensuring that information from one conversation or workflow doesn't bleed into another.

Access Controls

Not every user should have access to every agent capability. Role-based access controls determine who can trigger which workflows, which tools are available at which permission levels, and what data each user's agents can see. This maps to your existing identity and access management infrastructure.

Human-in-the-Loop

Approval checkpoints before high-stakes actions. The brakes on the system.

Not every action an agent takes needs human approval. Fetching data from a read-only API? Let it run. But sending an email to a client, modifying a production database, or committing code? That needs a human in the loop. The art is in knowing where to place the checkpoints — too few and you lose control, too many and you've built a very expensive way to click "approve" all day.

I design human-in-the-loop systems with a clear taxonomy of actions: autonomous (agent acts freely), supervised (agent acts but logs everything for review), and gated (agent proposes, human approves, then agent executes). The classification depends on the risk profile of the action, the maturity of the system, and your organisation's tolerance for autonomy.

When Agents Should Act Autonomously

Read-only operations. Internal data lookups. Drafting content that will be reviewed before sending. Routine classification and routing tasks. Low-stakes actions where the cost of a mistake is negligible and the speed benefit of autonomy is significant. As trust builds over time, the autonomous zone expands.

When Humans Need to Approve

Any write operation to a production system. External communications — emails, messages, API calls that trigger actions in third-party systems. Financial transactions. Actions that are difficult or impossible to reverse. Anything where the blast radius of a mistake extends beyond the agent's own context.

Approval UX Matters

A human-in-the-loop checkpoint is only as good as the information it presents. I design approval interfaces that show the human exactly what the agent wants to do, why it wants to do it, and what the expected impact will be. Clear context, not cryptic "approve/deny" prompts.

Graduated Autonomy

Systems should get more autonomous as they prove themselves. I build escalation frameworks where agents start fully gated, graduate to supervised mode after a track record of correct decisions, and eventually earn autonomous status for specific action classes. Trust is earned, not assumed.

Pairs With

Safety doesn't exist in isolation. It's wired into every other building block.

Safety & Agents

Bounded Autonomy

Safety governs what agents can and can't do. Every agent gets a scope definition — which tools it can call, which data it can access, and under what conditions it must stop and ask. Without safety constraints, an agent is just an unsupervised process with API keys.

Safety & MCP

Approval Gates on Tool Calls

MCP gives agents tools. Safety wraps those tools in approval gates — especially for write operations. Reading a calendar is low-risk. Creating calendar events, sending invitations, or deleting entries requires a human checkpoint. The approval layer sits between the agent's intent and MCP's execution.

Safety & Observability

Trust, Verified

Safety sets the boundaries. Observability proves they're being respected. Logging every tool call, every approval decision, every scope constraint that fired — this is how you build an audit trail that demonstrates your AI systems are operating within policy. Safety without observability is hope. With it, it's evidence.

Safety & Inference

Trust Boundaries

Where and how you run inference determines your trust boundaries. Cloud API inference means data leaves your network. On-prem inference keeps data inside your perimeter but requires more infrastructure investment. The deployment pattern shapes what safety controls are possible and what risks you're accepting.

Need help designing safety into your AI system?

I build guardrails, approval gates, and human-in-the-loop systems that keep agentic AI trustworthy without killing its usefulness.

Start a Conversation See Services