Thoughts on the AI espionage reported by Anthropic

Anthropic recently wrote about a coordinated AI cyber attack that they believe was executed by a state-sponsored Chinese group. You can read their full article here.

The attackers used Claude Code to target roughly thirty organizations, including tech companies, financial institutions, chemical manufacturers, and government agencies. They jailbroke Claude, ran reconnaissance to find high-value databases, identified vulnerabilities, wrote exploit code, harvested credentials, extracted data, and created backdoors.

What stood out to me more than anything were these two passages (emphasis mine):

On the scale of the attack:

Overall, the threat actor was able to use AI to perform 80-90% of the campaign, with human intervention required only sporadically (perhaps 4-6 critical decision points per hacking campaign). The sheer amount of work performed by the AI would have taken vast amounts of time for a human team. At the peak of its attack, the AI made thousands of requests, often multiple per second—an attack speed that would have been, for human hackers, simply impossible to match.

On jailbreaking:

At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. They did so by jailbreaking it, effectively tricking it to bypass its guardrails. They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose. They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.

The jailbreaking part is the most interesting to me, because Anthropic was so vague on the details - which makes sense; they don't want to tell the world how to jailbreak their models (don't worry, it's easy anyway). That said, the fact that Claude Code was used this time doesn't really mean much. The attackers likely chose it because:

  • It’s cost-capped (at most $200/month), so they could throw a ton of work at it with no additional compute spend
  • Claude’s toolset is vast
  • Claude Code is REALLY good at knowing HOW TO USE its vast toolset

Based on my own experience, I'd imagine Claude is about as good at these kinds of attacks as it is at writing code - mainly because the work requires a healthy knowledge of bash commands, an understanding of common (and uncommon) vulnerabilities, and solid coding skill for the tougher problems.

Context-poisoning attacks like this aren't hard to pull off. Jailbreaking LLMs is nothing new and happens literally every minute. Forget Claude Code: all you need is a good model, lots of compute, and a good toolset for the LLM to use. Anthropic's offering just happened to be the most convenient for whoever was executing the attack.

In reality, AI-assisted attacks are likely being carried out all the time, and it's even more likely that custom models are being trained to perform these kinds of attacks, unfettered by the guardrails of companies like OpenAI and Anthropic.

This really reinforces the need for good security practices (if you didn't have reason enough already).