LLM Sandboxing

This post gives a brief overview of why LLM agents need sandboxing, what kinds of sandboxing are commonly used, and which tradeoffs matter in practice.

My main takeaway is simple: once an agent can touch the file system, execute programs, and reach the network, safety stops being just a prompting problem and becomes a sandboxing problem.

Why Sandboxing

Modern LLM agents are no longer just models that answer questions. As soon as they are given tools, they start looking much more like real software actors, and that means they need real boundaries.

Agent Risks

The core reason is straightforward: capable agents already have dangerous primitives.

  • They can write arbitrary files to the system and may delete important data.
  • They can execute arbitrary code and may run malicious commands or scripts.
  • They can access the network and may interact with remote servers or expose sensitive data.

If we give an agent real power, we also need real boundaries.

Prompting Is Not Enough

The most naive line of defense is to write stricter instructions. For example: "Don't remove any data!"

Models make mistakes, just like humans do.

That may help in some cases, but it is obviously not a robust protection mechanism. Models make mistakes, misunderstand instructions, or follow a harmful local objective while missing the broader constraint. If the action is truly dangerous, it should be blocked by the system, not merely discouraged in the prompt.

Search-Based Protection

The next step up is usually search-based protection: look at tool inputs or shell commands, try to match dangerous patterns, and reject them.

This is better than nothing, but still brittle. It works poorly for languages like bash, where the same intent can be expressed through many equivalent commands, scripts, pipes, or layers of indirection. As soon as the agent is allowed to generate code instead of a single obvious command, simple matching rules become easy to bypass accidentally or deliberately.

OpenCode protection. Search-based protection is poor, especially for a language like bash and for script execution.

File System Sandbox

File system sandboxing is much more convincing because it changes what the process can actually see. Instead of trying to guess whether a command looks dangerous, we restrict the namespace itself. The agent can only access directories that are explicitly exposed to it.

File system sandboxing: only selected directories are accessible.

How bwrap Works

bwrap works in two important ways:

  • It leverages Linux user namespaces to obtain a kind of pseudo-root privilege inside the sandbox.
  • It uses mount --bind to construct a customized file system view for the sandboxed process.

Network Sandbox

Restricting the file system is not enough if the agent can still talk to the outside world freely. Unrestricted network access is another escape hatch.

Network escape: the agent can access arbitrary servers on the Internet, which may lead to data leakage or other security issues.

One practical mitigation is to manually insert proxies and only expose network access through those controlled paths. In other words, network access should also be mediated, not granted wholesale. Anthropic's write-up on Claude Code sandboxing and its sandbox runtime are both useful references here.12

Claude Code sandboxing.

Other Scenarios

Cloud Agents

When the agent runs in the cloud, sandboxing is still central. The difference is just where the boundary sits.

Systems such as OpenAI Codex Cloud and Claude Code on the Web run agents on the cloud side rather than directly on the user's machine.

  • They usually rely on Docker- or microVM-based sandboxing to protect the host system and limit resource usage.

Benchmarking

Sandboxing is not only for protection. Sometimes it is part of the benchmark or operating interface itself. For example, OSWorld requires a VM or sandbox environment to run GUI tasks and evaluate agents on them.3

OSWorld uses a sandbox environment to run the agent and evaluate its performance on GUI tasks.

Code Execution

Another common scenario is lightweight code execution, such as running Python for a chatbot. In that setting, the interface may only need basic stdin/stdout, so the sandbox can be much lighter than a full interactive operating-system environment.

For example, Zeroboot uses copy-on-write forking to create a sandbox environment in milliseconds.

Zeroboot is a fast sandboxing solution for code execution scenarios.

Two More Points

Sandbox Interface

One point I find easy to miss is that sandboxing is not only about protection. It also defines what the agent can see and how the agent interacts with the environment.

  • In OSWorld, the sandbox provides a clean interface to the environment and also helps with evaluation through recording and replaying.
  • For today's code agents, the bwrap + socat style used by Claude Code is already enough for many use cases because its interface is close to the operating system itself.
  • But if the target task needs a richer or more structured interface, then the design of the sandbox also directly affects the agent's capabilities.

In other words, the "power" of a sandbox is heavily constrained by the interface it exposes. Protection matters, but functionality matters too. A sandbox that is perfectly safe but strips away the right abstractions may also make the agent much weaker.

Startup Cost

The other point is startup latency. Sandbox startup is often on the critical path of agent execution.

  • In chat-like or otherwise lightweight workloads, sandbox startup overhead may even exceed the time spent on the task itself.
  • This makes startup latency an important design dimension, not just an implementation detail.

The comparison above is adapted from the excellent awesome-sandbox collection.4

Features of different sandboxing techniques.

Taken together, this leaves me with a fairly simple takeaway: sandbox design is a balancing act between protection, interface quality, and startup cost. If agents keep getting more capable, sandboxing will increasingly look like core systems infrastructure rather than an optional safety add-on.


  1. Anthropic: Making Claude Code more secure and autonomous with sandboxing↩︎

  2. Anthropic Experimental: Sandbox Runtime for Claude Code↩︎

  3. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments↩︎

  4. restyler/awesome-sandbox↩︎