
Choosing a model (Claude vs GPT)

Sonnet 4.6, Haiku 4.5, or GPT-5: when each one wins, what each costs per resolution, and how to mix them.

By Christopher · Updated · 4 min read

Ochre supports the current shipping catalog from Anthropic and OpenAI: Claude Opus 4.7/4.5, Claude Sonnet 4.6/4.5, Claude Haiku 4.5, plus GPT-5/mini/nano, GPT-4.1/mini/nano, GPT-4o/mini, and the reasoning series (o3, o3-mini, o4-mini, o1, o1-mini).

You set a workspace default per provider on AI → Keys and can override per channel. This article is the cheat sheet.

The short version

  • Default to Claude Sonnet 4.6 or GPT-4.1. Best mix of quality, speed, and price.
  • Use Claude Haiku 4.5 or GPT-4o-mini on high-volume, low-stakes channels. A fraction of the cost, often fast enough.
  • Use Claude Opus 4.7 or GPT-5 when accuracy on a hard, multi-step ticket is worth the extra dollars.
  • Use the o-series (o3, o3-mini, o4-mini) when you need explicit reasoning on complex debugging tickets and you do not mind a slower response.
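The four rules above can be sketched as a routing function. This is an illustrative heuristic only, not part of the Ochre product: the function name, arguments, and model-ID strings are all made up for the example.

```python
# Hypothetical routing heuristic mirroring the cheat sheet above.
# Model names follow this article's catalog; the function and its
# inputs are illustrative, not an Ochre API.

def pick_model(volume: str, stakes: str, needs_reasoning: bool) -> str:
    """Pick a default model for a ticket stream."""
    if needs_reasoning:                      # complex debugging; slow is fine
        return "o4-mini"
    if stakes == "high":                     # hard multi-step tickets
        return "claude-opus-4.7"
    if volume == "high" and stakes == "low":  # high-volume, low-stakes channel
        return "claude-haiku-4.5"
    return "claude-sonnet-4.6"               # the everyday default
```

The order matters: reasoning need and stakes trump volume, so a high-volume VIP channel still gets the flagship.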

Quality

All of them handle the support job well. Differences show up at the edges:

  • Sonnet 4.6 / GPT-4.1 are the most consistent on tone. Drafts read like the sample replies you fed into Voice and tone.
  • Opus 4.7 / GPT-5 are sharper on complex, multi-step debugging. If your tickets are mostly engineering edge cases, try them.
  • Haiku 4.5 / GPT-4o-mini / GPT-5-nano are clear steps down on multi-step reasoning, but on FAQs and routine questions they are hard to tell apart from the flagships.
  • o-series reasoning models trade latency for accuracy on math, code, and chained logic. Slow, but they are right more often on the tickets that matter.

The Playground is the right place to compare. Paste five real tickets, switch models, and read the drafts.

Cost per resolution

Approximate, including classification, retrieval, drafting, and labeling. A "resolution" is the full work of one AI reply.

  • Haiku 4.5 / GPT-4o-mini / GPT-5-nano: $0.005 to $0.01.
  • Sonnet 4.6 / GPT-4.1 / GPT-4o: $0.01 to $0.03.
  • Opus 4.7 / GPT-5: $0.05 to $0.15 on long threads.
  • o-series: $0.02 to $0.10 depending on reasoning depth.

Long threads with lots of KB context land at the higher end of each range. Every reply records the exact dollar cost in AI receipts.

For comparison, the legacy AI helpdesks bill $1.50 to $2.00 per resolution.
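To see what those per-resolution ranges mean at monthly volume, here is a back-of-envelope calculation using the midpoint of each range above. The tier labels, the 5,000-replies-a-month volume, and the helper function are all illustrative assumptions, not Ochre features:

```python
# Midpoint-of-range monthly spend estimate. The ranges come from the
# article; the tier labels and sample volume are illustrative only.

COST_PER_RESOLUTION = {  # (low, high) dollars per AI resolution
    "haiku-tier":      (0.005, 0.01),
    "sonnet-tier":     (0.01, 0.03),
    "opus-tier":       (0.05, 0.15),
    "o-series":        (0.02, 0.10),
    "legacy-helpdesk": (1.50, 2.00),
}

def monthly_cost(tier: str, resolutions: int) -> float:
    """Estimated monthly spend at the midpoint of the tier's range."""
    lo, hi = COST_PER_RESOLUTION[tier]
    return round(resolutions * (lo + hi) / 2, 2)

# e.g. monthly_cost("sonnet-tier", 5000) -> 100.0
#      monthly_cost("legacy-helpdesk", 5000) -> 8750.0
```

At 5,000 AI replies a month, the gap between the Sonnet tier and a legacy per-resolution helpdesk is roughly two orders of magnitude.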

Speed

Median latency, first token to last:

  • Haiku 4.5 / GPT-4o-mini: about 1 to 2 seconds.
  • Sonnet 4.6 / GPT-4.1 / GPT-4o: about 2 to 4 seconds.
  • Opus 4.7 / GPT-5: about 3 to 7 seconds.
  • o-series: 5 to 30 seconds depending on the problem (they "think" before answering).

For draft mode the agent rarely notices the difference. For auto-send, faster is nicer.
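If you are deciding which tiers are viable for auto-send, the latency table above reduces to a budget check. This helper is a sketch with made-up names; the numbers are the worst-case ends of the ranges listed above:

```python
# Worst-case median latencies (seconds) from the table above.
# The tier labels and helper are illustrative, not an Ochre API.

LATENCY_S = {  # (typical_low, typical_high) seconds, first to last token
    "haiku-tier":  (1, 2),
    "sonnet-tier": (2, 4),
    "opus-tier":   (3, 7),
    "o-series":    (5, 30),
}

def fits_budget(tier: str, budget_s: float) -> bool:
    """True if the tier's worst typical latency fits the budget."""
    return LATENCY_S[tier][1] <= budget_s
```

With a 3-second auto-send budget, only the Haiku tier clears it; for human-reviewed drafts, every tier is usually acceptable.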

Context length

All of them handle the threads we see in practice. Sonnet, Opus, GPT-4.1, and GPT-5 are comfortable past 100k tokens, which matters if you have customers who reply on the same thread for months. Haiku and the nano tier are fine for normal threads but start to drop nuance on very long histories.

How to set the default

  1. Open AI → Keys.
  2. On the Anthropic or OpenAI card, click rotate or Add key.
  3. Pick the default model in the model picker.
  4. Save.

To override per channel, open the channel's settings and pick a model from the same list. Whichever provider you draft on first is the primary; the other is the backup. See Multiple keys + automatic fallback.

A common pattern:

  • Email and Slack Connect: Sonnet 4.6 or GPT-4.1.
  • Chat widget: Haiku 4.5 or GPT-4o-mini.
  • VIP customer segment: Opus 4.7 or GPT-5.
  • Engineering bug debugging: o3 or o4-mini.
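That channel-to-model pattern is just a mapping you configure in each channel's settings. Written out as data (the keys and model IDs below are illustrative labels, not an Ochre configuration format):

```python
# The common per-channel pattern above, written as a plain mapping.
# You set these in each channel's settings in the Ochre UI; this
# dict is only an illustration of the assignment, not a config file.

CHANNEL_MODEL = {
    "email":         "claude-sonnet-4.6",
    "slack-connect": "claude-sonnet-4.6",
    "chat-widget":   "claude-haiku-4.5",   # high volume, low stakes
    "vip-segment":   "claude-opus-4.7",    # accuracy worth the cost
    "bug-debugging": "o3",                 # reasoning over speed
}
```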

Mixing models

You can run different models on different channels at the same time. The receipts always tag which model produced the reply, so you can compare quality and cost over time.

One useful habit: draft on Sonnet, then every few weeks run a sample of tickets through Opus or GPT-5 in the Playground to spot blind spots.

When the model is the wrong knob

If drafts are off, the model is rarely the problem. Before switching, revisit the sample replies in Voice and tone and check what the AI is pulling from your knowledge base by testing real tickets in the Playground.

Switching models

Switching the default takes effect immediately. New conversations use the new model on the next message. In-flight drafts are not regenerated, and there is no batch reprocessing of past tickets.

A note on changes

Anthropic and OpenAI roll out new models on their own clock. When new versions become available we add them to the catalog. We do not silently swap your default. You stay on the version you picked until you change it.

Most teams start with Sonnet 4.6 or GPT-4.1 everywhere, then revisit after the first month using their AI receipts and the spend chart.
