Posts tagged "Claude"

4 posts


Dec 11, 2025

Responding to "The highest quality codebase"

This post made its way to the Hacker News frontpage today. The premise was interesting: the author asked Claude to improve a codebase 200 times in a loop.

First things first: I’m a fan of running Claude in a loop, overnight, unattended. I’ve written real customer-facing software this way, with a lot of success. I’ve sold fixed-bid projects where 80% of the code was written by Claude unsupervised - and delivered them too. (Story for another day!)

This writer did NOT have the same success. In fact, quite the opposite - the codebase grew from 47k to 120k lines of code (a 155% increase), tests ballooned from 700 to 5,369, and comment lines grew from 1,500 to 18,700. The agent optimized for vanity metrics rather than real quality. Any programmer will tell you that lines of code written do NOT equal productivity.

The premise of the post is really interesting - once you dig in and actually look at the prompt though, all of the intrigue falls away:

Ultrathink. You’re a principal engineer. Do not ask me any questions. We need to improve the quality of this codebase. Implement improvements to codebase quality.

Let’s deconstruct the prompt here.

First, “ultrathink” is a magic word in Claude that means “think about this problem really hard” - it dials the model’s thinking budget up to its max.

Secondly, the rest of the prompt - “improve the codebase, don’t ask questions” - was almost bound to fail (if we define success as filling gaps in test coverage, fewer lines of code, and bugs fixed), and anyone who uses LLMs to write code would see this right away.

This post is equivalent to someone saying LLMs are useless because they can’t count the R’s in strawberry - it ignores the fact that LLMs are very useful in somewhat narrow ways.

To be fair, I think the author knew this was just a funny experiment, and wanted to see what Claude would actually do. As a fun exercise and/or as a way to gather data, I think it’s interesting.

I do fear that people will see this post, continue to blithely say “LLM bad”, and go about their day. Hey, if inertia is your thing, go for it!

How would I improve this prompt?

  • Have the LLM first write a memory - an architecture MD file, a list of tasks to do, and so on. The author literally threw it a codebase and said “go”. No senior engineer would immediately start “improving” the codebase before getting a grasp on the project at hand.
  • Define what success looks like to the LLM. What constitutes high quality? I’d say adding tests for particularly risky parts and reducing lines of code would be a good start.

While there are justifiable comments here about how LLMs behave, I want to point out something else: There is no consensus on what constitutes a high quality codebase. —mbesto

  • Give an LLM the ability to check its own work. In this case, I’d have run Claude twice: once to improve the codebase, and once to check that the new code was actually “high quality”. Claude can use command line tools to run a git diff, so why not instruct it to do so? Better still, have it run the tests after each iteration and fix any failures (there’s a rough sketch of this whole loop below).
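To make this concrete, here’s a rough sketch of the kind of loop I’d run instead - plan first, define success, check the work every iteration. It assumes the `claude` CLI’s non-interactive `-p` (print) mode and a test command for the repo; the path, iteration count, and prompts are illustrative, not a recipe:

```python
import subprocess

REPO = "/path/to/the/codebase"     # hypothetical path
TEST_CMD = ["dotnet", "test"]      # whatever the project actually uses
ITERATIONS = 20                    # far fewer than 200, checked every time

def claude(prompt: str) -> str:
    # Run Claude Code headless; -p prints the result and exits.
    result = subprocess.run(
        ["claude", "-p", prompt],
        cwd=REPO, capture_output=True, text=True, check=True,
    )
    return result.stdout

def tests_pass() -> bool:
    return subprocess.run(TEST_CMD, cwd=REPO).returncode == 0

# 1. Build a memory first: architecture notes plus a task list, before touching code.
claude(
    "Read the codebase and write ARCHITECTURE.md describing the major modules, "
    "then write TASKS.md listing concrete quality improvements. Do not change any code yet."
)

# 2. Iterate with an explicit definition of success, and 3. check the work each time.
for _ in range(ITERATIONS):
    claude(
        "Pick the next unfinished item in TASKS.md and implement it. "
        "Success means: tests added for risky code paths, total lines of code "
        "reduced or unchanged, and no behavior changes. Run `git diff` and review "
        "your own change before finishing, then mark the task done in TASKS.md."
    )
    if not tests_pass():
        claude("The test suite is failing. Fix the failures before doing anything else.")
```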

The Hacker News Discussion

The HN thread had some (surprisingly) good takes:

hazmazlaz: “Well of course it produced bad results… it was given a bad prompt. Imagine how things would have turned out if you had given the same instructions to a skilled but naive contractor who contractually couldn’t say no and couldn’t question you. Probably pretty similar.”

samuelknight: “This is an interesting experiment that we can summarize as ‘I gave a smart model a bad objective’… The prompt tells the model that it is a principal engineer, then contradicts that role with the imperative ‘We need to improve the quality of this codebase’. Determining when code needs to be improved is a responsibility for the principal engineer but the prompt doesn’t tell the model that it can decide the code is good enough.”

xnorswap: “There’s a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can’t deal with unstructured problems very well… But right now, the best way to help an LLM is have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you’d find boring.”

asmor: “I asked Claude to write me a python server to spawn another process to pass through a file handler ‘in Proton’, and it proceeded a long loop of trying to find a way to launch into an existing wine session from Linux with tons of environment variables that didn’t exist. Then I specified ‘server to run in Wine using Windows Python’ and it got more things right… Only after I specified ‘local TCP socket’ it started to go right. Had I written all those technical constraints and made the design decisions in the first message it’d have been a one-hit success.”

ericmcer: “LLMs are good at mutating a specific state in a specific way. They are trash at designing what data shape a state should be, and they are bad at figuring out how/why to propagate mutations across a system.”

The experiment feels interesting, but in reality isn’t anything noteworthy - bad prompting gets bad results. This has been driven into every developer since the dawn of time - garbage in, garbage out. I mean, it’s kind of cute. But anyone concluding “Claude can’t write good code” from this misses the point entirely.

LLMs are tools. Like any tool, the results depend on how you use them. Give vague instructions to a circular saw and you’ll get messy cuts. Give vague instructions to Claude and you’ll get 18,000 lines of comments.


Nov 24, 2025

Thoughts on Recent New Models: Claude Opus 4.5, GPT 5.1 Codex, and Gemini 3

Three major model releases hit in the past two weeks. All three are gunning for the developer market - here’s what stood out to me.

IMO, we’re beyond the point where major model releases move the needle significantly, but they’re still interesting to me.

We’ll start with the one I’m most excited about:

Claude Opus 4.5

Anthropic announced Opus 4.5 today. If you’re not aware, Anthropic’s models, in order from most to least powerful, are Opus (most capable, most expensive), Sonnet (mid-range, great price/performance), and Haiku (cheapest, fastest, and smallest - still a decent model). Pricing dropped to $5/million input tokens and $25/million output tokens. That’s a third of what Opus 4.1 cost.
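For a sense of what that cut means in practice, here’s the back-of-the-envelope math, assuming Opus 4.1’s previous list price of $15/$75 per million tokens and a made-up session size:

```python
# Back-of-the-envelope cost comparison; the session size is invented for illustration.
input_tokens, output_tokens = 2_000_000, 500_000   # a long agentic session, say

opus_4_1 = (input_tokens / 1e6) * 15 + (output_tokens / 1e6) * 75   # old list price
opus_4_5 = (input_tokens / 1e6) * 5  + (output_tokens / 1e6) * 25   # new list price

print(opus_4_1, opus_4_5)   # 67.5 vs 22.5 dollars - exactly a third of the cost
```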

Anthropic claims Opus 4.5 handles ambiguity “the way a senior engineer would.” Bold claim. (Claude has already shown itself to be “senior” when it estimates a 10 minute task will take a couple of weeks. 🥁)

The claim that it will use fewer tokens than Sonnet on a given task because of its better reasoning is interesting, but I’ll have to see how it plays out when actually using it.

Why am I most excited about this one, you might ask? Mainly because Claude Code is my daily driver, and I see no reason to change that right now.

GPT 5.1 Codex

OpenAI’s GPT 5.1 Codex-Max also dropped last week. I really like the idea of the Codex models - models that are specifically tuned to use the Codex toolset - that just makes a ton of sense to me.

Still… the name. Remember when GPT-5 was supposed to bring simpler model names? We now have gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, and gpt-5.1-codex-max. Plus reasoning effort levels: none, minimal, low, medium, high, and now “xhigh.” For my money, I’m typically on medium, but I’m interested to try xhigh (sounds like Claude’s Ultrathink?).

I don’t think this does much to alleviate confusion, but all APIs trade flexibility for simplicity, and I’d prefer to have more levers to pull, not fewer.
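If you’re hitting these models through the API rather than the Codex CLI, the effort knob is just a request parameter. A minimal sketch, assuming the Responses API shape; the model name and effort values below are taken from the post’s list, so check the current docs for what your account actually accepts:

```python
from openai import OpenAI

client = OpenAI()

# Sketch only: model name and effort level come from the post, not verified
# against the live API - adjust to whatever your account exposes.
response = client.responses.create(
    model="gpt-5.1-codex",
    reasoning={"effort": "medium"},   # the lever: none/minimal/low/medium/high/xhigh per the post
    input="Refactor this function to remove the duplicated branch: ...",
)
print(response.output_text)
```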

One highlighted feature is “compaction,” which lets the model work across multiple context windows by summarizing and pruning history. Correct me if I’m wrong, but Claude Code has been doing this for a while. When your context runs out, it summarizes previous turns and keeps going (you can also trigger compaction manually, which I do - that and /clear for a new context window). Nice to see Codex get on this; it’s a rather basic tenet of LLM work that “less is more,” so compaction frankly should have been there from the get-go.
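For the unfamiliar, here’s a toy sketch of the general idea - not how Claude Code or Codex actually implement it, just the shape: once the transcript gets too long, fold the older turns into a summary and keep only the recent ones.

```python
# Toy compaction sketch. `summarize` is whatever you use to condense text
# (typically another LLM call); it's passed in here, not a real library function.

def compact(history: list[dict], summarize, max_chars: int = 200_000,
            keep_recent: int = 10) -> list[dict]:
    """history is a list of {"role": ..., "content": ...} turns."""
    total = sum(len(turn["content"]) for turn in history)
    if total <= max_chars:
        return history                      # still fits, nothing to do

    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarize("\n".join(turn["content"] for turn in old))
    # Replace the old turns with a single synthetic "summary so far" message.
    return [{"role": "system", "content": f"Summary of earlier work: {summary}"}] + recent
```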

I think the cost is the same as 5.1 Codex - honestly the blog post doesn’t make that super clear.

Gemini 3

Google’s Gemini 3 launched November 18th alongside Google Antigravity, their new coding IDE. I’m happily stuck with CLI coding tools + JetBrains Rider, and IDEs come and go frequently, so I haven’t tried it (and probably won’t unless people tell me it’s amazing).

Before release, X was filled with tweets saying Gemini 3 was going to be a gamechanger. As far as I can tell, it’s great, but it’s not anything crazy. I did use it to one-shot a problem that Claude Code had been stuck on for a while on a personal project (a new .NET language - more on that later!) - which was really cool. I’m excited to use it more, alongside Codex CLI.

The pricing is aggressive: $2/million input, $12/million output. Cheapest of the three for sure.

What Actually Matters

The benchmark numbers are converging. Every announcement leads with SWE-bench scores that differ by a few percentage points. However, that doesn’t excite me and never has. Simon Willison put it well while testing Claude Opus:

I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I can’t say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two. … “Here’s an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5” would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.

Show me what the new model can do that the old one couldn’t. That’s harder to market but more useful to know. Right now, I’m reliant on vibes for the projects that I run.

What am I still using?

For my money, Sonnet 4.5 remains the sweet spot. I haven’t missed Opus 4.1 since Sonnet launched. These new flagship releases might change that, but the price-to-performance ratio still favors the tier below.

IMO, the dev tooling race is more interesting than the model race at this point. As I pointed out in my article “Claude Code and GitHub Copilot using Claude are not the same thing,” models tuned to a specific toolset will outperform a “generic” toolset in most cases.


Nov 21, 2025

Claude Code and GitHub Copilot using Claude are not the same thing

Stop telling me “GitHub Copilot can use Claude, so why would I buy Claude Code”. There are about a million reasons why you should stop using GitHub Copilot, and the main one is that it’s not a good product. Sorry not sorry, Microsoft.

Over and over again I’ve heard how much coding agents suck (often from .NET devs), and the bottom line is they’re doing it wrong. If you aren’t at least TRYING multiple coding tools, you’re doing yourself a disservice.

This may sound a bit contemptuous, but I mean it with love - .NET devs LOVE to be force-fed stuff from Microsoft, including GitHub Copilot. Bonus points if there is integration with Visual Studio. (The “first party” problem with .NET is a story for another time.)

I ran a poll on X asking .NET devs what AI-assisted coding tool they mainly use. The results speak for themselves - nearly 60% use GitHub Copilot, with the balance being a smattering across different coding tools.

(I know I’m picking on .NET devs specifically, but this applies equally to anyone using a generic one-size-fits-all coding tool. The points I’m making here are universal.)

Here’s the bottom line: GitHub Copilot will not be as good as a model-specific coding tool like OpenAI’s Codex, Claude Code (which is my preferred tool), or Google’s Gemini.

Why?

  • Sonnet 4.5 is trained to specifically use the toolset that Claude Code provides
  • GPT-5-Codex is trained to specifically use the toolset that Codex CLI provides
  • Gemini is trained to specifically use the toolset that Gemini CLI provides

OpenAI has explicitly said this is the case, even if the others haven’t.

“GPT‑5-Codex is a version of GPT‑5 further optimized for agentic software engineering in Codex. It’s trained on complex, real-world engineering tasks such as building full projects from scratch, adding features and tests, debugging, performing large-scale refactors, and conducting code reviews.” (Source)

Why not Copilot?

  • Giving several models the same generic toolset (with maybe a different system prompt per model) simply will NOT work as well as training a specific model for a specific toolset.
  • Model selection paralysis - which model is best suited to which task is really left up to the user, and .NET devs are already struggling with AI as is. (This is totally anecdotal of course, but I talk to LOTS of .NET devs.)
  • Microsoft has married itself to OpenAI a little too much, which means their own model development is behind. I know it feels good to back the winning horse, but I’d love to see custom models come out of Microsoft/GitHub, and I see no signs of that happening anytime soon.

My advice

  • PAY THE F***ING $20 A MONTH AND TRY Claude Code, or Codex, or Gemini, or WHATEVER. I happily pay the $200/month for Claude Code.
  • Get comfortable with the command line, and stop asking for UI integration for all the things. Visual Studio isn’t the end-all, be-all.
  • Stop using GitHub Copilot. When it improves, I’ll happily give it another go.

Nov 19, 2025

Thoughts on the AI espionage reported by Anthropic

Anthropic recently wrote about a coordinated AI cyber attack that they believe was executed by a state-sponsored Chinese group. You can read their full article here.

The attackers used Claude Code to target roughly thirty organizations including tech companies, financial institutions, chemical manufacturers, and government agencies. They jailbroke Claude, ran reconnaissance to identify high-value databases, identified vulnerabilities, wrote exploit code, harvested credentials, extracted data, and created backdoors.

What stood out to me more than anything were these two passages:

On the scale of the attack:

Overall, the threat actor was able to use AI to perform 80-90% of the campaign, with human intervention required only sporadically (perhaps 4-6 critical decision points per hacking campaign). The sheer amount of work performed by the AI would have taken vast amounts of time for a human team. At the peak of its attack, the AI made thousands of requests, often multiple per second—an attack speed that would have been, for human hackers, simply impossible to match.

On jailbreaking:

At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. They did so by jailbreaking it, effectively tricking it to bypass its guardrails. They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose. They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.

The jailbreaking part is most interesting to me, because Anthropic was so vague with details - which makes sense: they don’t want to tell the world how to jailbreak their models (don’t worry, it’s easy anyway). That said, just because Claude Code was used this time doesn’t really mean much - they were likely using it because:

  • It’s cost-controlled (max $200/month), so they could throw a ton of work at it with no additional compute spend
  • Claude’s toolset is vast
  • Claude Code is REALLY good at knowing HOW TO USE its vast toolset

I would imagine that Claude would be as good at these kinds of attacks as it is at code, based on my own experience - mainly because this would require a healthy knowledge of bash commands, understanding of common (and uncommon) vulnerabilities, and good coding skill for tougher problems.

Context poisoning attacks like this aren’t hard to pull off. Jailbreaking LLMs is nothing new and goes on literally every minute. Forget Claude Code, all you need is a good model, lots of compute, and a good toolset for the LLM to use. Anthropic just so happened to be the most convenient for whoever was executing the attack.

In reality, AI-assisted attacks are likely being carried out all the time, and it’s even more likely that custom models are being trained to perform these kinds of attacks, unfettered from the guardrails of companies like OpenAI and Anthropic.

This really reinforces the need for good security practices (if you didn’t have a reason enough already).
