Writing - December 2025

3 posts from December 2025


Dec 11, 2025

Responding to "The highest quality codebase"

This post made its way to the Hacker News frontpage today. The premise was interesting - the author asked Claude to improve the codebase 200 times in a loop.

First things first: I’m a fan of running Claude in a loop, overnight, unattended. I’ve written real customer-facing software this way, with a lot of success. I’ve sold fixed-bid projects where 80% of the code was written by Claude unsupervised - and delivered, too. (Story for another day!)

This writer did NOT have the same success. In fact, quite the opposite: the codebase grew from 47k to 120k lines of code (a 155% increase), tests ballooned from 700 to 5,369, and comment lines grew from 1,500 to 18,700. The agent optimized for vanity metrics rather than real quality. Any programmer will tell you that lines of code written do NOT equal productivity.

The premise of the post is really interesting - once you dig in and actually look at the prompt, though, all of the intrigue falls away:

Ultrathink. You’re a principal engineer. Do not ask me any questions. We need to improve the quality of this codebase. Implement improvements to codebase quality.

Let’s deconstruct the prompt here.

First, “ultrathink” is a magic word in Claude Code that means “think about this problem really hard” - it dials the model’s thinking budget up to its maximum.

Second, the rest of the prompt - “improve the codebase, don’t ask questions” - was almost bound to fail (if we define success as adding test coverage where it’s thin, fewer lines of code, and bugs fixed), and anyone who uses LLMs to write code would see this right away.

This post is equivalent to someone saying LLMs are useless because they can’t count the R’s in strawberry - it ignores the fact that LLMs are very useful in somewhat narrow ways.

To be fair, I think the author knew this was just a funny experiment, and wanted to see what Claude would actually do. As a fun exercise and/or as a way to gather data, I think it’s interesting.

I do fear that people will see this post, continue to blithely say “LLM bad”, and go about their day. Hey, if inertia is your thing, go for it!

How would I improve this prompt?

  • Have the LLM first write a memory - an architecture MD file, a list of tasks to do, and so on. The author literally threw it a codebase and said “go”. No senior engineer would immediately start “improving” the codebase before getting a grasp on the project at hand.
  • Define what success looks like to the LLM. What constitutes high quality? I’d say adding tests for particularly risky parts and reducing lines of code would be a good start.

While there are justifiable comments here about how LLMs behave, I want to point out something else: There is no consensus on what constitutes a high quality codebase. —mbesto

  • Give an LLM the ability to check its own work. In this case, I’d have run Claude twice: once to improve the codebase, and once to check that the new code was actually “high quality”. Claude can use command line tools to run a git diff, so why not instruct it to do so? Better yet, have it run the tests after each iteration and fix any problems - a sketch of that loop follows.
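
Something like the sketch below is what I have in mind. It’s hypothetical, not what the author ran: it assumes the Claude Code CLI’s non-interactive -p (print) mode, a dotnet test suite, and ARCHITECTURE.md/TASKS.md memory files written in an earlier pass.

using System.Diagnostics;

// Runs a command and captures stdout. ArgumentList handles quoting for us.
static async Task<string> RunAsync(string fileName, params string[] args)
{
    var psi = new ProcessStartInfo(fileName) { RedirectStandardOutput = true };
    foreach (var arg in args) psi.ArgumentList.Add(arg);
    using var process = Process.Start(psi)!;
    var output = await process.StandardOutput.ReadToEndAsync();
    await process.WaitForExitAsync();
    return output;
}

for (var i = 0; i < 10; i++)
{
    // Improve pass: memory files plus an explicit definition of success.
    await RunAsync("claude", "-p",
        "Read ARCHITECTURE.md and TASKS.md, pick the highest-value task, and implement it. " +
        "Success means fewer lines of code, bugs fixed, and tests added for the riskiest paths.");

    // Something objective to check the work against.
    var testOutput = await RunAsync("dotnet", "test");

    // Review pass: a fresh session inspects the diff instead of grading its own work.
    var review = await RunAsync("claude", "-p",
        "Run git diff, compare it against ARCHITECTURE.md, and list anything to revert " +
        "(vanity tests, pointless comments, regressions). Test output follows:\n\n" + testOutput);

    Console.WriteLine(review);
}

The specifics don’t matter much - the point is that the loop bakes in memory, an explicit definition of done, and a check that isn’t the same session grading its own homework.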

The Hacker News Discussion

The HN thread had some (surprisingly) good takes:

hazmazlaz: “Well of course it produced bad results… it was given a bad prompt. Imagine how things would have turned out if you had given the same instructions to a skilled but naive contractor who contractually couldn’t say no and couldn’t question you. Probably pretty similar.”

samuelknight: “This is an interesting experiment that we can summarize as ‘I gave a smart model a bad objective’… The prompt tells the model that it is a principal engineer, then contradicts that role [with] the imperative ‘We need to improve the quality of this codebase’. Determining when code needs to be improved is a responsibility for the principal engineer but the prompt doesn’t tell the model that it can decide the code is good enough.”

xnorswap: “There’s a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can’t deal with unstructured problems very well… But right now, the best way to help an LLM is have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you’d find boring.”

asmor: “I asked Claude to write me a python server to spawn another process to pass through a file handler ‘in Proton’, and it proceeded a long loop of trying to find a way to launch into an existing wine session from Linux with tons of environment variables that didn’t exist. Then I specified ‘server to run in Wine using Windows Python’ and it got more things right… Only after I specified ‘local TCP socket’ it started to go right. Had I written all those technical constraints and made the design decisions in the first message it’d have been a one-hit success.”

ericmcer: “LLMs are good at mutating a specific state in a specific way. They are trash at designing what data shape a state should be, and they are bad at figuring out how/why to propagate mutations across a system.”

The experiment feels interesting, but in reality it isn’t anything noteworthy - bad prompting gets bad results. That’s been drilled into every developer since the dawn of time: garbage in, garbage out. I mean, it’s kind of cute. But anyone concluding “Claude can’t write good code” from this misses the point entirely.

LLMs are tools. Like any tool, the results depend on how you use them. Give vague instructions to a circular saw and you’ll get messy cuts. Give vague instructions to Claude and you’ll get 18,000 lines of comments.

Dec 9, 2025

Microsoft Reactor Livestream: Building AI Agents with Microsoft Agent Framework and Postgres

I did a livestream for Microsoft Reactor on building AI agents using the Microsoft Agent Framework and Postgres. The recording is now available.


All the code is available on GitHub: github.com/schneidenbach/using-agent-framework-with-postgres-microsoft-reactor

What we covered

The talk walks through building AI agents in C# using Microsoft’s Agent Framework (still in preview). Here’s a breakdown of what’s in the repo:

Demo 1: Basic Agent - A customer support agent with function tools. Shows how agents are defined with instructions (system prompt), model type, and tools (C# functions).
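
Here’s roughly what that looks like. The framework is in preview, so the exact API shape may drift, and the deployment name and GetOrderStatus tool below are placeholders of mine, not anything from the repo:

using System.ClientModel;
using System.ComponentModel;
using Azure.AI.OpenAI;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;

var endpoint = Environment.GetEnvironmentVariable("REACTOR_TALK_AZURE_OPENAI_ENDPOINT")!;
var apiKey = Environment.GetEnvironmentVariable("REACTOR_TALK_AZURE_OPENAI_API_KEY")!;

// Instructions (system prompt) + a model deployment + tools = an agent.
AIAgent supportAgent = new AzureOpenAIClient(new Uri(endpoint), new ApiKeyCredential(apiKey))
    .GetChatClient("gpt-4o-mini")   // your Azure OpenAI deployment name
    .CreateAIAgent(
        instructions: "You are a customer support agent. Use your tools to answer order questions.",
        tools: [AIFunctionFactory.Create(GetOrderStatus)]);

Console.WriteLine(await supportAgent.RunAsync("Where is order 1234?"));

// A plain C# function the agent can call as a tool.
[Description("Looks up the status of a customer's order.")]
static string GetOrderStatus([Description("The order number.")] string orderNumber)
    => $"Order {orderNumber} shipped yesterday and arrives tomorrow.";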

Demo 2: Sequential Agents - Chained agents: Draft, Review, Polish. Also covers observability with Jaeger tracing.
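
Setting the Jaeger piece aside, the simplest way to picture the chain is three single-purpose agents feeding each other. This is a hand-rolled sketch against the preview API (the instructions and the response Text property are my assumptions; the framework also has its own workflow orchestration for this):

using System.ClientModel;
using Azure.AI.OpenAI;
using Microsoft.Agents.AI;

var chatClient = new AzureOpenAIClient(
        new Uri(Environment.GetEnvironmentVariable("REACTOR_TALK_AZURE_OPENAI_ENDPOINT")!),
        new ApiKeyCredential(Environment.GetEnvironmentVariable("REACTOR_TALK_AZURE_OPENAI_API_KEY")!))
    .GetChatClient("gpt-4o-mini");  // your Azure OpenAI deployment name

AIAgent drafter = chatClient.CreateAIAgent(instructions: "Draft a reply to the customer's email.");
AIAgent reviewer = chatClient.CreateAIAgent(instructions: "Review the draft for accuracy and tone. Return a corrected draft.");
AIAgent polisher = chatClient.CreateAIAgent(instructions: "Polish the reviewed draft into a short, friendly final reply.");

// Each agent's output becomes the next agent's input.
var draft = await drafter.RunAsync("My order arrived broken. What do I do?");
var reviewed = await reviewer.RunAsync(draft.Text);
var polished = await polisher.RunAsync(reviewed.Text);

Console.WriteLine(polished.Text);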

Demo 3: Postgres Chat Store - Storing conversation history in Postgres instead of relying on OpenAI’s thread storage. Why? Retention policies are unclear, you can’t query across threads, and you can’t summarize-and-continue without starting a new thread.
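
The idea is deliberately boring: one row per message, keyed by a thread id you control. Here’s an illustrative version with plain Npgsql - the table and column names are mine, not the repo’s:

using Npgsql;

// Illustrative connection string and schema only - the repo's will differ.
var connectionString = "Host=localhost;Username=postgres;Password=postgres;Database=chat";
await using var dataSource = NpgsqlDataSource.Create(connectionString);

await using (var create = dataSource.CreateCommand(@"
    CREATE TABLE IF NOT EXISTS chat_messages (
        id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        thread_id uuid NOT NULL,
        role text NOT NULL,
        content text NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    )"))
{
    await create.ExecuteNonQueryAsync();
}

// Append each user/assistant message as it happens...
var threadId = Guid.NewGuid();
await using (var insert = dataSource.CreateCommand(
    "INSERT INTO chat_messages (thread_id, role, content) VALUES (@thread_id, @role, @content)"))
{
    insert.Parameters.AddWithValue("thread_id", threadId);
    insert.Parameters.AddWithValue("role", "user");
    insert.Parameters.AddWithValue("content", "Where is my order?");
    await insert.ExecuteNonQueryAsync();
}

// ...and rebuild the history whenever the conversation continues, or query
// across threads - something a provider-hosted thread won't let you do.
await using (var history = dataSource.CreateCommand(
    "SELECT role, content FROM chat_messages WHERE thread_id = @thread_id ORDER BY id"))
{
    history.Parameters.AddWithValue("thread_id", threadId);
    await using var reader = await history.ExecuteReaderAsync();
    while (await reader.ReadAsync())
        Console.WriteLine($"{reader.GetString(0)}: {reader.GetString(1)}");
}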

Demo 4: RAG with Hybrid Search - Using pgvector for similarity search combined with full-text search. The demo uses a hybrid approach because RAG isn’t a problem with a single, one-size-fits-all solution.
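
Here’s a sketch of what a hybrid query can look like, assuming a hypothetical documents table (bigint id, text content, an embedding vector column, a search_tsv tsvector column) and a simple weighted blend of the two scores - the repo’s actual query and weights will differ:

using Npgsql;
using Pgvector;

// Expects a data source built with the Pgvector.Npgsql plugin, e.g.:
//   var builder = new NpgsqlDataSourceBuilder(connectionString);
//   builder.UseVector();
//   await using var dataSource = builder.Build();
// queryEmbedding comes from the same embeddings model used to index the documents.
static async Task SearchAsync(NpgsqlDataSource dataSource, string queryText, Vector queryEmbedding)
{
    const string hybridSql = @"
        WITH semantic AS (
            SELECT id, 1 - (embedding <=> @query_embedding) AS vector_score
            FROM documents
            ORDER BY embedding <=> @query_embedding
            LIMIT 20
        ),
        keyword AS (
            SELECT id, ts_rank(search_tsv, plainto_tsquery('english', @query_text)) AS text_score
            FROM documents
            WHERE search_tsv @@ plainto_tsquery('english', @query_text)
            ORDER BY text_score DESC
            LIMIT 20
        )
        SELECT d.id, d.content,
               (COALESCE(s.vector_score, 0) * 0.7 + COALESCE(k.text_score, 0) * 0.3)::float8 AS score
        FROM documents d
        LEFT JOIN semantic s ON s.id = d.id
        LEFT JOIN keyword k ON k.id = d.id
        WHERE s.id IS NOT NULL OR k.id IS NOT NULL
        ORDER BY score DESC
        LIMIT 5";

    await using var cmd = dataSource.CreateCommand(hybridSql);
    cmd.Parameters.AddWithValue("query_text", queryText);
    cmd.Parameters.AddWithValue("query_embedding", queryEmbedding);

    await using var reader = await cmd.ExecuteReaderAsync();
    while (await reader.ReadAsync())
        Console.WriteLine($"{reader.GetInt64(0)} ({reader.GetDouble(2):F3}): {reader.GetString(1)}");
}

This is a simple weighted blend; reciprocal rank fusion is another common way to merge the two result sets, and the 0.7/0.3 split is just a starting point to tune.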

Why Postgres?

Postgres with pgvector handles both your relational data needs and vector similarity search. You get:

  • Chat thread storage with full query capabilities
  • Vector embeddings for semantic search
  • Full-text search for keyword matching
  • Hybrid search combining both approaches

Azure Database for PostgreSQL Flexible Server supports pgvector out of the box. For local development, the demos use Testcontainers with Docker.
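
For what it’s worth, spinning up a throwaway pgvector-enabled Postgres with Testcontainers for .NET looks like this (the pgvector/pgvector image tag is my pick, not necessarily what the repo pins):

using Testcontainers.PostgreSql;

// Disposable Postgres container with the pgvector extension baked into the image.
await using var postgres = new PostgreSqlBuilder()
    .WithImage("pgvector/pgvector:pg17")
    .Build();

await postgres.StartAsync();

// Hand this to Npgsql; run CREATE EXTENSION IF NOT EXISTS vector; once per database.
var connectionString = postgres.GetConnectionString();
Console.WriteLine(connectionString);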

Running the demos

The repo has six projects. Set your Azure OpenAI credentials:

export REACTOR_TALK_AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export REACTOR_TALK_AZURE_OPENAI_API_KEY="your-api-key"

Projects 3 and 4 need Docker for Testcontainers. The “External” variants connect to your own Postgres instance if you prefer.

Key takeaways

  • Agent Framework provides abstractions for common patterns but is still in preview
  • Store your own chat threads rather than relying on LLM provider storage
  • Hybrid search (vector + full-text) often works better than pure semantic search, but there’s no RAG silver bullet

The slides are in the repo as a PDF if you want to follow along.

Enjoy!

Dec 8, 2025

How Did Microsoft Fumble the AI Ball So Badly?

A client of mine who runs an AI consultancy asked if I could do any Copilot consulting for one of their clients. (Not GitHub Copilot, mind you. Microsoft’s Copilot products. As if the branding weren’t confusing enough.)

I asked them if they were looking for AI strategy overall, or if it was a case of “some exec bought Copilot, now we need to make Copilot work because Boss Said So”.

Of course it was the latter.

My response was: I’ve heard Copilot is a bad product, and I haven’t invested time or effort into trying to use it. Not to mention there are about 100 different “Copilot” things. Which one am I supposed to focus on? Copilot for Office? Copilot Studio? Copilot for Windows? Copilot for Security?

The big bet I made early was that learning AI primitives was the best path forward, and that was 100% the right call.

My question is: Azure OpenAI aside (and even that has plenty of shortcomings - slow thinking models, an incomplete Responses API, and so on), how has Microsoft fumbled the ball so badly?

Google seemed so far behind when they launched Bard, and yet Gemini keeps getting better. It’s a first-party model, unlike Microsoft’s marriage to OpenAI.

It’s looking more and more like OpenAI came out ahead in its deal with Microsoft.

The Numbers Don’t Lie

Windows Central reported that Microsoft reduced sales targets across several divisions because enterprise adoption is lagging.

The adoption numbers are brutal. As of August 2025, Microsoft has around 8 million active licensed users of Microsoft 365 Copilot. That’s a 1.81% conversion rate across 440 million Microsoft 365 subscribers. Less than 2% adoption in almost 2 years.

Tim Crawford, a former IT exec who now advises CIOs, told CNBC: “Am I getting $30 of value per user per month out of it? The short answer is no, and that’s what’s been holding further adoption back.”

Even worse: Bloomberg reports that Microsoft’s enterprise customers are ignoring Copilot in favor of ChatGPT. Amgen bought Copilot for 20,000 employees, and thirteen months later their employees are using ChatGPT. Their SVP said “OpenAI has done a tremendous job making their product fun to use.”

The OpenAI Dependency Problem

Microsoft bet the farm on OpenAI. Every Copilot product runs on OpenAI models. But they don’t control the roadmap. They don’t control the pace of improvement. They’re a reseller with extra steps.

Compare this to Google. Gemini is a first-party model. Google controls the training, the infrastructure, the integration. When they want to ship a new capability, they ship it. No waiting on a partner. No coordination overhead.

Microsoft sprinted out of the gate - the OpenAI partnership looked like a huge head start. Fast forward to late 2025, and Google’s AI products simply work better and feel more in tune with how people actually use them.

I’m not saying Microsoft can’t turn this around. But right now, they’re losing to a competitor they were supposed to have lapped, and they’re losing on products built with technology they’re supposed to own.

That’s a fumble.