Posts tagged "OpenAI"

2 posts


Nov 24, 2025

Thoughts on Recent New Models: Claude Opus 4.5, GPT 5.1 Codex, and Gemini 3

Three major model releases hit in the past two weeks. All three are gunning for the developer market - here’s what stood out to me.

IMO, we’re beyond the point where major model releases move the needle significantly, but they’re interesting to me nonetheless.

We’ll start with the one I’m most excited about:

Claude Opus 4.5

Anthropic announced Opus 4.5 today. If you’re not aware, Anthropic’s models, from most to least powerful, are Opus (most capable, most expensive), Sonnet (mid-range, great price/performance), and Haiku (cheapest, fastest, and smallest - still a decent model). Pricing dropped to $5/million input tokens and $25/million output tokens - a third of what Opus 4.1 cost.

Anthropic claims Opus 4.5 handles ambiguity “the way a senior engineer would.” Bold claim. (Claude has already shown itself to be “senior” when it estimates a 10 minute task will take a couple of weeks. 🥁)

The claim that it will use fewer tokens than Sonnet on a given task because of its better reasoning is interesting, but I’ll have to see how it plays out when actually using it.

Why am I most excited about this one, you might ask? Mainly because Claude Code is my daily driver, and I see no reason to change that right now.

GPT 5.1 Codex

OpenAI’s GPT 5.1 Codex-Max also dropped last week. I really like the idea of the Codex models - models specifically tuned to use the Codex toolset - that just makes a ton of sense to me.

Still… the name. Remember when GPT-5 was supposed to bring simpler model names? We now have gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, and gpt-5.1-codex-max. Plus reasoning effort levels: none, minimal, low, medium, high, and now “xhigh.” For my money, I’m almost always on medium, but I’m interested to try xhigh (sounds like Claude’s Ultrathink?).

I don’t think this does much to alleviate confusion, but all APIs trade flexibility for simplicity, and I’d rather have more levers to pull, not fewer.

One highlighted feature is “compaction,” which lets the model work across multiple context windows by summarizing and pruning history. Correct me if I’m wrong, but Claude Code has been doing this for a while: when your context runs out, it summarizes previous turns and keeps going (you can also trigger compaction manually, which I do - that and /clear for a new context window). Nice to see Codex catch up here - “less is more” is a pretty basic tenet of LLM work, so compaction frankly should have been there from the get-go.
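If you haven’t seen the pattern before, here’s a minimal sketch of the idea (hypothetical code, not how Codex or Claude Code actually implement it): once the transcript exceeds a budget, older turns get collapsed into a model-written summary and only the recent turns are kept verbatim.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Compaction
{
    // Hypothetical sketch: summarizeAsync stands in for a call back to the model
    // asking it to summarize the older portion of the conversation.
    public static async Task<List<string>> CompactAsync(
        List<string> transcript,
        int maxTurns,
        Func<IReadOnlyList<string>, Task<string>> summarizeAsync)
    {
        if (transcript.Count <= maxTurns)
            return transcript;

        // Keep the most recent turns verbatim; summarize everything older.
        int keep = maxTurns / 2;
        var older = transcript.Take(transcript.Count - keep).ToList();
        var recent = transcript.Skip(transcript.Count - keep).ToList();

        string summary = await summarizeAsync(older);

        var compacted = new List<string> { $"[Summary of earlier turns] {summary}" };
        compacted.AddRange(recent);
        return compacted;
    }
}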

I think the cost is the same as 5.1 Codex - honestly the blog post doesn’t make that super clear.

Gemini 3

Google’s Gemini 3 launched November 18th alongside Google Antigravity, their new coding IDE. I’m happily stuck with CLI coding tools + JetBrains Rider, and IDEs come and go frequently, so I haven’t tried it (and probably won’t unless people tell me it’s amazing).

Before release, X was filled with tweets saying Gemini 3 was going to be a game-changer. As far as I can tell, it’s great, but it’s not anything crazy. I did use it to one-shot a problem that Claude Code had been stuck on for a while on a personal project (a new .NET language - more on that later!) - which was really cool. I’m excited to use it more, alongside Codex CLI.

The pricing is aggressive: $2/million input, $12/million output. Cheapest of the three for sure.

What Actually Matters

The benchmark numbers are converging. Every announcement leads with SWE-bench scores that differ by a few percentage points. However, that doesn’t excite me and never has. Simon Willison put it well while testing Claude Opus:

I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I can’t say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two. … “Here’s an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5” would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.

Show me what the new model can do that the old one couldn’t. That’s harder to market but more useful to know. Right now, I’m reliant on vibes for the projects that I run.

What am I still using?

For my money, Sonnet 4.5 remains the sweet spot. I haven’t missed Opus 4.1 since Sonnet launched. These new flagship releases might change that, but the price-to-performance ratio still favors the tier below.

IMO, the dev tooling race is more interesting than the model race at this point. As I pointed out in my article on why Claude Code and GitHub Copilot using Claude are NOT the same thing, a model tuned to a specific toolset will beat a “generic” model-and-tool pairing in most cases.


Aug 21, 2025

How Two Words Broke My LLM-Powered Chat Agent

TLDR: LLMs are weird, even between different model versions.

I manage a fairly complex chat agent for one of my clients. It’s a nuanced system for sure, even if it’s “just a chatbot” - it makes the company money and our users are delighted by it.

As is tradition (and NECESSARY) for LLMs, we have a huge suite of evals covering the functionality of the chat agent, and we wanted to move from gpt-4o to gpt-4.1. So we did what any normal AI engineer would do - we ran our evals against the old and the new models, fixed a few minor regressions, and moved on with our lives. This is a short story about one bug that didn’t get caught right away.

Recently, one of the QA folks at the client found an odd bug - requests made through the chat interface to the LLM would randomly fail. Like, maybe 1% of the time.

Here’s what we were seeing in our logs:

Tool call exception: Object of type 'System.String' cannot be converted to type 'client.Controllers.AIAgent.SemanticKernel.Plugins.FilterModels.AIAgentConversationGeneralFilters'.
Stack trace:    at System.RuntimeType.CheckValue(Object& value, Binder binder, CultureInfo culture, BindingFlags invokeAttr)
   at System.Reflection.MethodBaseInvoker.InvokeWithManyArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Microsoft.SemanticKernel.KernelFunctionFromMethod.Invoke(MethodInfo method, Object target, Object[] arguments)
   at Microsoft.SemanticKernel.KernelFunctionFromMethod.<>c__DisplayClass21_0.<GetMethodDetails>g__Function|0(Kernel kernel, KernelFunction function, KernelArguments arguments, CancellationToken cancellationToken)
   at Microsoft.SemanticKernel.KernelFunctionFromMethod.InvokeCoreAsync(Kernel kernel, KernelArguments arguments, CancellationToken cancellationToken)
   at Microsoft.SemanticKernel.KernelFunction.<>c__DisplayClass32_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.SemanticKernel.Kernel.InvokeFilterOrFunctionAsync(NonNullCollection`1 functionFilters, Func`2 functionCallback, FunctionInvocationContext context, Int32 index)
   at Microsoft.SemanticKernel.Kernel.OnFunctionInvocationAsync(KernelFunction function, KernelArguments arguments, FunctionResult functionResult, Boolean isStreaming, Func`2 functionCallback, CancellationToken cancellationToken)
   at Microsoft.SemanticKernel.KernelFunction.InvokeAsync(Kernel kernel, KernelArguments arguments, CancellationToken cancellationToken)
   at Microsoft.SemanticKernel.Connectors.FunctionCalling.FunctionCallsProcessor.<>c__DisplayClass10_0.<<ExecuteFunctionCallAsync>b__0>d.MoveNext()
...and so on...

The Investigation Begins

The thing that stood out to me was this:

at System.RuntimeType.CheckValue(Object& value, Binder binder, CultureInfo culture, BindingFlags invokeAttr)
at Microsoft.SemanticKernel.KernelFunctionFromMethod.Invoke(MethodInfo method, Object target, Object[] arguments)
at Microsoft.SemanticKernel.KernelFunctionFromMethod.InvokeCoreAsync(Kernel kernel, KernelArguments arguments, CancellationToken cancellationToken)

My best guess was that Semantic Kernel was failing to deserialize the filters parameter for some reason, which makes sense since OpenAI sends tool call parameters as strings:

"parameters": {
    "filters": "{\"start_date\":\"2024-07-01T00:00:00Z\",\"end_date\":\"2024-07-31T23:59:59Z\"}"
}

My thinking was that, okay, for some reason it’s failing to deserialize the JSON object and is therefore passing the still-a-string parameter to the method represented by the MethodInfo object above.

Digging Into Semantic Kernel’s Source

The .NET team tends to err on the side of abstraction to the point of hiding lots of important details in the name of “making it easier” - sometimes they even accomplish that goal (though more often than not it’s just more obscure). Looking at Semantic Kernel’s KernelFunctionFromMethod.cs, I found this gem:

private static bool TryToDeserializeValue(object value, Type targetType, JsonSerializerOptions? jsonSerializerOptions, out object? deserializedValue)
{
    try
    {
        deserializedValue = value switch
        {
            JsonDocument document => document.Deserialize(targetType, jsonSerializerOptions),
            JsonNode node => node.Deserialize(targetType, jsonSerializerOptions),
            JsonElement element => element.Deserialize(targetType, jsonSerializerOptions),
            _ => JsonSerializer.Deserialize(value.ToString()!, targetType, jsonSerializerOptions)
        };

        return true;
    }
    catch (NotSupportedException)
    {
        // There is no compatible JsonConverter for targetType or its serializable members.
    }
    catch (JsonException)
    {
        //this looks awfully suspicious
    }

    deserializedValue = null;
    return false;
}

If I was sure before, I was SUPER sure now: a JsonException in that fallback case gets swallowed, TryToDeserializeValue returns false, and the still-a-string argument gets handed to the method via reflection - which is exactly the conversion error in our logs.

Time to Get Visible

Unless you dig into the source code or create a custom DelegatingHandler for your HttpClient, it’s difficult to see how Semantic Kernel ACTUALLY sends your tools along to OpenAI - and difficult to see how OpenAI responds. This sort of makes sense, since there can be sensitive data in those requests, but the lack of hooks makes life a little harder - frustrating when you’re trying to debug issues like this. So I did just that: created a DelegatingHandler and logged everything to the console.

public class DebugHttpHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, 
        CancellationToken cancellationToken)
    {
        // Log the request
        if (request.Content != null)
        {
            var requestBody = await request.Content.ReadAsStringAsync();
            Console.WriteLine($"Request: {requestBody}");
        }

        var response = await base.SendAsync(request, cancellationToken);

        // Log the response
        if (response.Content != null)
        {
            var responseBody = await response.Content.ReadAsStringAsync();
            Console.WriteLine($"Response: {responseBody}");
        }

        return response;
    }
}
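For reference, wiring the handler in looks roughly like this (a sketch - it assumes the Semantic Kernel OpenAI connector overload that accepts a custom HttpClient, and the model ID and API key here are placeholders):

// A DelegatingHandler needs an inner handler when you construct the HttpClient yourself.
var httpClient = new HttpClient(new DebugHttpHandler
{
    InnerHandler = new HttpClientHandler()
});

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "gpt-4.1",   // placeholder
        apiKey: Environment.GetEnvironmentVariable("OPENAI_API_KEY")!,
        httpClient: httpClient)
    .Build();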

I was right all along

With my custom handler in place, I finally saw what the LLM was sending back for the tool call parameters:

{
  "start_date": "2024-07-01T00:00:00 AM",
  "end_date": "2024-07-31T23:59:59 PM"
}

There it is - the LLM was incorrectly appending meridiem indicators (AM/PM) to what should be ISO 8601 formatted dates.

The Root Cause

I went back to look at our model’s property attributes:

[Required]
[JsonPropertyName(StartDateParameterName)]
[Description("The start date of the conversation. Time must always be set to 12:00:00 AM.")]
public DateTime StartDate { get; set; }

[Required]
[JsonPropertyName(EndDateParameterName)]
[Description("The end date of the conversations. Time must always be set to 23:59:59 PM.")]
public DateTime EndDate { get; set; }

There it was. In the Description attributes. We were literally telling the LLM to include “AM” and “PM” in the time. And very rarely the LLM would take us literally and append those characters to what should have been an ISO-formatted datetime string.

The best part? This was never seen with GPT-4o. Only when we switched to GPT-4.1 did it suddenly behave differently.

The Fix

Obviously the fix was super easy - just change the prompt:

[Required]
[JsonPropertyName(StartDateParameterName)]
[Description("The start date of the conversation. Time must always be set to midnight (00:00:00).")]
public override DateTime StartDate { get; set; }

[Required]
[JsonPropertyName(EndDateParameterName)]
[Description("The end date of the conversations. Time must always be set to end of day (23:59:59).")]
public override DateTime EndDate { get; set; }

No more AM/PM in the descriptions. Problem solved.

(I very deliberately call this a prompt, by the way, because it IS. Any tool descriptions that are passed along to an LLM - whether it be the tool itself OR its parameters - are like mini-prompts and should be treated as such.)

The Lessons

This whole adventure taught me a few things:

  1. LLMs will take what you say literally - When you tell an LLM to format something a certain way, sometimes it takes you at your word. Even when that conflicts with the expected data format.
  2. Model differences matter - This only started happening when we upgraded from GPT-4o to GPT-4.1. Different models interpret instructions differently. This is why you need solid evaluation suites for all changes to your system - prompts, models, you name it.
  3. Observability is crucial - Semantic Kernel’s opacity made this harder to debug than it needed to be. After this, we started logging our tool call parameters BEFORE Semantic Kernel gets them - its filter capabilities made this super easy (see the sketch below).
  4. Description attributes are prompts - nuff said.
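Since I keep mentioning filters: here’s roughly what that logging looks like - a sketch using Semantic Kernel’s IFunctionInvocationFilter, with illustrative names rather than our production code.

public class ToolCallLoggingFilter : IFunctionInvocationFilter
{
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Log every argument exactly as it arrived from the tool call,
        // before the plugin method (and its parameter marshaling) runs.
        foreach (var argument in context.Arguments)
        {
            Console.WriteLine(
                $"Tool call: {context.Function.Name}, {argument.Key} = {argument.Value}");
        }

        await next(context);
    }
}

// Registration, wherever the kernel gets built:
// kernel.FunctionInvocationFilters.Add(new ToolCallLoggingFilter());

Because the filter wraps the function invocation, the arguments you log should be the raw values from the LLM’s tool call - before they get converted into your typed parameter models, which is exactly where this bug lived.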