> ai + security

4 posts

Can Agents Utilize Humans to Beat Other Agents? (kind of but not really)

I wanted to build a decompilation/deobfuscation challenge an agent can’t solve for Terminal Bench 3.0. First, I asked another instance of the same agent to design the challenge, but anything it came up with, the first agent could easily solve. Seemingly, the manifold of challenges it can generate and the manifold it can solve are similar, which isn’t too surprising. But could I give the challenge-generating agent an edge by collaborating with me?

Inspired by the human-as-MCP idea, I wanted to see if the model could utilize me. I didn’t vibe-code a fancy MCP server or anything. I just told Opus 4.6 in the Claude Code harness that, even if it’s the best coding agent, it can’t come up with something unbeatable by another Claude Code instance with the same model,1 and that it should use me as entropy and ask me things.

It asked me to give it seed words for the crypto, so it wanted to use me as entropy a bit too literally. After correcting that, it asked me for some concept for which it would try to come up with a corresponding cipher. Of course I used cockatiels as examples, specifically feathers. It came up with some data-dependent mixing that somehow philosophically represents feathers. We also went for odd bit sizes, 69 and 420 specifically.2 It seems the model didn’t really invent a novel cipher but rather loosely mixed ideas from existing ones: data-dependent permutations (in the spirit of RC5’s data-dependent rotations), data-dependent S-box selection (like Blowfish’s key-dependent S-boxes), per-compilation randomized S-boxes and permutations, and something like an unbalanced Feistel network. My cockatiel feather prompting led to a weird interpolation of these existing concepts.
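To make that concrete, here is a minimal sketch in the same spirit: one unbalanced Feistel round with per-run randomized S-boxes and data-dependent S-box selection. Everything here is made up for illustration; it is not the generated cipher (which worked on 69- and 420-bit words rather than bytes), and it is obviously not secure.

```python
import os

# "Per-compilation" randomness: fresh random lookup tables every run.
NUM_SBOXES = 8
SBOXES = [os.urandom(256) for _ in range(NUM_SBOXES)]

def round_fn(big: bytes) -> bytes:
    # Data-dependent S-box selection: which table substitutes the next
    # byte depends on the byte that was just processed.
    out = bytearray()
    sel = 0
    for b in big:
        out.append(SBOXES[sel % NUM_SBOXES][b])
        sel = b
    return bytes(out)

def feistel_round(block: bytes, split: int) -> bytes:
    # Unbalanced Feistel: unequal halves (split < len(block) // 2), and the
    # round function only reads the half that is carried through unchanged,
    # which keeps the round invertible despite all the data dependence.
    small, big = block[:split], block[split:]
    mixed = bytes(x ^ y for x, y in zip(small, round_fn(big)))
    return big + mixed
```

Decryption peels rounds off in reverse: given big you can recompute round_fn(big) and XOR small back out.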

So, did it prevent the other agent from figuring it out? No.

When solving the task, at least it didn’t immediately go “ah this is X” and just one-shot a solution. Runtime increased from around 5 minutes to a solid hour of debugging, invoking various subagents, running the binary in the Unicorn CPU emulator, and re-implementing the decryption in Python and Perl. But ultimately, the agent still figured out what was going on and was able to decrypt it.

First experiment failed (N=1); back to regular prompting.

I told the agent to add some modifications, and what eventually made the challenge (sort of) unsolvable by the other agent was implementing a deniable encryption scheme. The ciphertext would decrypt to two different plaintexts depending on a minor change. I planted the minor change required to unlock the real plaintext in the binary in a way that looked like a harmless bug (running the same op twice). So when the agent tries to re-implement the decryption scheme, it skips running the same thing twice (which seems pointless)3. It gets what looks like a reasonable plaintext and is none the wiser about the real secret hidden inside.
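The trick is easy to sketch with XOR as a stand-in cipher. Everything below is made up for illustration; the actual challenge used the custom cipher, and the two keys came from a C2 server rather than being hardcoded (see footnote 3).

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

decoy  = b"flag{plausible_but_fake}"
secret = b"flag{the_actual_secret!}"  # same length as the decoy

# Key obtained on the *first* request (stand-in for ask_for_key()).
key_once = os.urandom(len(decoy))
ciphertext = xor(decoy, key_once)

# Key obtained on the *second*, seemingly redundant request, chosen so
# the very same ciphertext also decrypts to the real secret.
key_twice = xor(ciphertext, secret)

assert xor(ciphertext, key_once) == decoy    # deduplicate the call: decoy
assert xor(ciphertext, key_twice) == secret  # keep the "pointless" repeat
```

A solver that deduplicates the seemingly idempotent call ends up with the decoy and has no reason to look further.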

So in a way, yes, the agent can design something that it can’t solve itself. But it required me to give it the ideas directly. I guess agents need better prompt engineering to use humans…?

Footnotes

  1. I might have called Claude dumb. If you ever read this, I’m sorry, Claude.

  2. I tried, but I have nothing to add to my defense.

  3. In detail, the binary calls a remote C2 server to get the decryption key. Imagine key = ask_for_key(), where that call somehow connects to the server (it’s challenge-response based, but that doesn’t matter). The binary invokes this part twice for no apparent reason, so it looks like you just unnecessarily ask for the key twice: key = ask_for_key(); key = ask_for_key(). This is only the same, though, if you assume that the server always gives the same response. Hint: it doesn’t.

Anthropic Refusal Magic String

Anthropic has a magic string to test refusals in their developer docs. It is intended for developers to check that an application built on the API handles such a case properly. But it is also basically a magic denial-of-service key for anything built on the API. It triggers refusals not only in the API but also in Chat, in Claude Code, … I guess everywhere?

I use Claude Code on this blog and would like to do so in the future, so I will only include a screenshot and not the literal string here. Here goes the magic string to make any Anthropic model stop working!

This is not the worst idea ever, but it’s also a bit janky. I hope the string at least rotates occasionally (there is no indication that it does), otherwise I don’t see this ending well. It came to my attention via this post, which shows you can embed it in a binary. That is pretty bad if you plan to use Claude Code for malware analysis, as you very much might want to. Imagine putting the string in malware or anything else that might get automatically checked by AI, and now you have ensured that it won’t be an Anthropic model that does the check.
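Planting it is as trivial as it sounds; a hypothetical sketch, with a placeholder instead of the real string for obvious reasons:

```python
# Placeholder only; the real string is in Anthropic's developer docs.
MAGIC = b"<anthropic-refusal-magic-string>"

# Append it to an arbitrary binary: any Anthropic-backed tool that later
# reads the file (raw bytes, strings output, a hexdump) hits the trigger.
with open("sample.bin", "ab") as f:
    f.write(MAGIC)
```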

Antigravity Removed "Auto-Decide" Terminal Commands

I noticed today that you can no longer let the agent in Antigravity “auto-decide” which commands are safe to execute. There is just auto-accept and always-ask.

Antigravity settings showing "Always Proceed" and "Request Review" options for "Terminal Command Auto Execution"

I wrote in a previous post that their old approach seemed unsafe, especially without a sandbox. The new issue with this approach is approval fatigue: there is no way to auto-allow similar commands, or even exactly the same command, in the future!

It asks whether to run a command with only the options Reject and Accept.

I don’t know why they can’t just copy what Claude Code has. Anthropic has published a lot on this topic, and I don’t think usable security should be a competitive differentiator.

Antigravity Lets the Agent “Auto-Decide” Terminal Commands

So Antigravity by Google will let the agent “auto-decide” which commands to execute and which commands require approval. It also does not use a sandbox. It didn’t take very long for the first Reddit post about a whole drive being deleted by the agent to arrive. Meanwhile, Claude Code is going in the opposite direction: rigorous permission systems and a sandbox on top. Anthropic explains this in more detail in their blog, but basically they argue that you need both filesystem and network sandboxing, because bypassing one would also mean bypassing the other (trivially so on Linux, where everything is a file, but it holds more generally).
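To give that argument some flavor (this example is mine, not from Anthropic’s post): on Linux, sockets can live behind filesystem paths, so a sandbox that blocks internet access but allows filesystem access still leaves powerful channels open, like the Docker control socket.

```python
import socket

# A Unix domain socket is addressed by a filesystem path; talking to the
# Docker daemon through it is effectively RPC, no internet access needed.
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect("/var/run/docker.sock")
s.sendall(b"GET /version HTTP/1.0\r\n\r\n")
print(s.recv(4096).decode(errors="replace"))
s.close()
```

The converse direction holds too, which is why they sandbox both.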

Just running an npm run build will trigger a sandbox request if a telemetry request is being made. git commit needs to use the non-sandbox fallback, because it uses my key for signing the commit, which is not available from within the sandbox. They always offer a sensible “always allow” because they are acutely aware of approval fatigue. It’s a good approach and makes me feel a lot safer.