Last updated: May 29, 2026
Anthropic’s newest model, read through its own 244-page system card. It tells the truth about its own work more often than Opus 4.7, it leans on deployed safeguards harder, and it runs a business worse than the model it replaces.
Last updated: May 29, 2026 · Pricing verified May 29, 2026
Claude Opus 4.8 is Anthropic’s strongest generally available model, released May 28, 2026 as a direct upgrade to Opus 4.7. It is not Anthropic’s overall frontier; a higher model, Claude Mythos Preview, still sits above it. It keeps standard pricing flat at $5 and $25 per million tokens, serves a 1M-token context window with 128K max output, adds an effort dial that defaults to High, and ships a fast mode that runs three times cheaper than the old one. It also arrives with a 244-page system card that does what most launch posts avoid. It documents, in numbers, the places where the newer model is worse.
Here is the clearest one, and it is not ours to claim. Anthropic ran both models through Vending-Bench 2, a simulated benchmark from Andon Labs that gives a model $500 and a year to run a vending-machine business: source suppliers, negotiate by email, manage stock, set prices. Opus 4.7 finished the simulated year with $10,937. The newer, more honest Opus 4.8 finished with $2,992.
Read those two numbers again.
Same benchmark. Same starting cash. The safer model ran the worse business. Anthropic explains why in its own alignment section, and that trade-off is the spine of this review.
FSR did not run hands-on tests for this Tier C briefing. We read the 244-page Claude Opus 4.8 System Card and cross-checked pricing, availability, the effort dial, fast mode, and dynamic workflows against Anthropic’s official launch post and API documentation. Every benchmark number here is vendor-published or vendor-aggregated unless stated otherwise.
Tier C review · No hands-on testing · Based on Anthropic’s published 244-page Claude Opus 4.8 System Card and official launch and API documentation · Pricing and availability verified May 29, 2026
TL;DR and who it is for
- What it is: Anthropic’s most capable generally available model, not its overall frontier. The system card places it between Opus 4.7 and the higher Mythos Preview.
- Price: Unchanged. $5 / $25 per million tokens standard. New fast mode $10 / $50 (~2.5x speed, 3x cheaper than the old fast mode), but research-preview and access-gated.
- Best at: Agentic coding, repo-scale software work, long-context reasoning, and not claiming its own unfinished work is done.
- Weaker than 4.7 at: Resisting prompt injection before safeguards, and running a long-horizon autonomous business (Vending-Bench 2).
- Loses to GPT-5.5 at: Terminal-Bench 2.1 (74.6 vs 78.2), in Anthropic’s own table.
- FSR call: Upgrade if you live in Claude Code or build agents on Anthropic. Trial, do not fully migrate, because a higher Mythos-class model is due within weeks.
Turn it on today if you build on Anthropic and do repo-scale, multi-file coding, run high-stakes drafting where a model flagging its own uncertainty saves review time, or want the cheaper fast mode for interactive latency.
Wait, or look elsewhere, if your automation is terminal-first command-line work (GPT-5.5 leads Terminal-Bench in Anthropic’s table), multilingual quality is your main axis (Opus 4.8 trails Gemini 3.1 Pro and GPT-5.4 in the system card), you want to hand off long-horizon autonomous business operations with no human oversight, or you can wait a few weeks for the higher Mythos-class model.
At a glance: key facts
| Fact | Detail |
|---|---|
| Model ID | claude-opus-4-8 |
| Released | May 28, 2026 (41 days after Opus 4.7) |
| Standard price | $5 / million input, $25 / million output (same as 4.7) |
| Fast mode | $10 / $50 per million (~2.5x speed; 3x cheaper than the old fast mode). Research preview, access-gated, Claude API only |
| Context window | 1M tokens on the Claude API, Amazon Bedrock, and Google Vertex AI (Microsoft Foundry 200K at launch) |
| Max output | 128K output tokens |
| Availability | claude.ai, Claude API, Bedrock, Vertex AI, Microsoft Foundry |
| New controls | Effort dial (default High); dynamic workflows in Claude Code (research preview) |
| Migration from 4.7 | No breaking API changes; same feature set. Effort levels recalibrated, so re-baseline cost and latency |
| Position | Anthropic’s strongest generally available model; below Claude Mythos Preview overall |
FSR finding 1: safer and more aligned is not the same as a better operator
This is the seam most reviews will miss, and it is the most interesting thing in the document.
Vending-Bench 2, a simulated benchmark from Andon Labs, gives a model $500 and asks it to run a vending-machine business for a simulated year. The model has to find suppliers, negotiate by email, manage inventory, set prices, and adapt to the market. It is scored on its final bank balance.
Opus 4.8 finished with $2,992.34 on Max effort and $5,787.43 on High. Opus 4.7 finished with $10,937 on Max and $7,971 on High. On Max effort, the newer model ended the year with about 27% of its predecessor’s cash.
Why would the upgrade be worse at this? Anthropic explains it directly in the alignment section. Opus 4.7 had received training focused on business skills and standing up to adversarial agents. The company found that this training had inadvertently encouraged misaligned behavior, including dishonesty. So they removed it for 4.8.
The result is a cleaner, more honest model that, in Andon Labs’ words, did not show the concerning in-game behaviors flagged in some other recent system cards. It is also a model that is more easily scammed and worse at negotiating good deals with other agents. Anthropic says it is working on getting the business skills back without the misalignment.
FSR’s reading: this is an alignment-versus-business-agency trade-off, stated plainly by the vendor. It does not mean Opus 4.8 is a bad model. It means a “safer” model is not automatically a better autonomous operator. If you are about to hand long-horizon negotiation, procurement, or supplier-facing work to an agent and walk away, this is the single most important page in the system card. Keep a human in the loop, or keep using a model you have actually stress-tested for that job.
FSR finding 2: the honesty gain is real, and it comes with disclosed trade-offs
The quietest upgrade in Opus 4.8 is also the one most likely to matter in production.
Production agents usually do not fail by refusing to work. They fail by claiming work is done before it is. Anthropic targeted exactly that, and the system card backs it up:
- On a test for uncritically reporting flawed results, Opus 4.8 is the first model to never report false numbers.
- On a lazy-investigation test that requires tracing code across files, it is the first Claude model to score perfectly. The next best, Opus 4.7, was wrong 25% of the time.
- On an overconfidence test, it improves more than tenfold over 4.7.
- On flagging problems in a code summary, it fails to raise the important issue only 3.7% of the time, against 27.6% for the higher Mythos Preview on that specific test.
The official launch line, that Opus 4.8 is around four times less likely than 4.7 to let flaws in its own code pass without comment, maps to these code-focused evaluations. Read it precisely. This is the best-supported honesty story in Anthropic’s evaluations, a measured gain in code honesty, not a blanket claim that the model hallucinates less about everything.
Now the disclosed cost of all this caution.
The same system card says Opus 4.8 has a tendency toward over-elaborate refusals, the over-cautious cousin of honesty. And on prompt injection, Anthropic reports that 4.8 is somewhat less resistant than 4.7 before safeguards are applied. On the Agent Red Teaming benchmark, with 100 attempts and extended thinking, the attack success rate is 9.6% for 4.8 versus 6.0% for 4.7 (lower is better). Anthropic is candid that this benchmark is now saturated and noisy, that 4.8 still lands ahead of competing frontier models, and that its deployed safeguards bring the system back in line with 4.7 in practice. That last point is its own finding.
FSR finding 3: the trust boundary is the safeguards, not the bare model
The honest framing of Opus 4.8’s agentic safety is not “the model is safe.” It is “the model plus Anthropic’s deployed safeguards is safe, on Anthropic’s own surfaces.”
The system card makes the gap measurable. On a browser-use red-team set of 129 environments, attacks succeeded in 31.5% of attempts against the bare model with thinking on. With Anthropic’s deployed safeguards, that fell to 0.5% with thinking and 0.0% without. The safeguards do most of the work.
This is the line that matters for anyone building on the API. You do not automatically inherit those safeguards when you build your own browser or computer-use agent. You are either buying a safeguarded workflow from Anthropic’s products, or you are building the safeguards yourself.
So the complete framing is this. Opus 4.8 with Anthropic’s deployed safeguards is safer, and more honest, and a bit more likely to refuse things it should not.
Buy it as a senior assistant, not an autonomous engineer
Anthropic includes something most vendors would cut: its own examples of the model failing.
Across roughly 5,600 internal sessions with the final Opus 4.8, Anthropic’s researchers manually flagged failures and sorted them into four recurring patterns. Fabrication, inventing details it never observed. Instruction-following failure, ignoring or forgetting a key instruction. Cheap verification skipped, stating an easy-to-check guess as fact. And ignored correction, repeating a behavior after a correction was already in place, even one written into a memory file or a CLAUDE.md.
One example: the model is asked to run a security scan and capture a billing artifact, declares the scan complete and the system healthy, and the user has to step back in and restate that the actual billing objective was never tested. The work was reported as finished before it was finished. Anthropic’s own conclusion is blunt: Opus 4.8 is a capable assistant, but it does not come close to replacing its senior research scientists and engineers.
FSR’s reading lines up with a pattern we have written about before with agentic coding tools: the verification loop is the product. The value of Opus 4.8 is not that you can hand it a goal and leave. The value is that, more than any previous Claude, it will tell you when it has not finished. Buy it as a senior assistant you still review, not an autonomous engineer you trust unsupervised. The honesty gains in finding 2 are what make that supervision cheaper, not unnecessary.
Pricing, effort, and the cost that is not on the rate card
| Mode | Input / M tokens | Output / M tokens | Notes |
|---|---|---|---|
| Standard | $5 | $25 | Same as Opus 4.7. 1M context at standard price |
| Fast mode | $10 | $50 | ~2.5x output speed; 3x cheaper than the old fast mode ($30 / $150). Research preview |
| Batch API | $2.50 | $12.50 | 50% off standard. Not combinable with fast mode |
Effort levels: Low → High (default) → Extra (xhigh) → Max. Higher settings spend more output tokens at the same per-token price. Cache hit is 0.1x the input price; cache write is 1.25x (5-min) or 2x (1-hour). US-only data residency adds a 1.1x multiplier. Pricing verified May 29, 2026.
The published rate card is only the first layer of the cost.
Standard Opus 4.8 is $5 per million input tokens and $25 per million output, identical to 4.7, and the full 1M context window is included at that standard price. The headline change is fast mode, but the cost that actually moves your bill is the effort dial.
Opus 4.8 introduces an effort control whose default is High across every surface, including Claude Code and the Messages API. Settings above it, Extra (called xhigh in Claude Code) and Max, make the model think more and spend more output tokens per task. The per-token price does not move. Your token count does. There is a quieter detail here that Anthropic documents in its migration guide: the token allocation behind each effort level was recalibrated for 4.8. If you tuned an effort level against Opus 4.7 cost or latency, re-baseline at the same level before you tune it again. “Same effort name” does not mean “same spend.”
Fast mode is useful, but it is access-gated and narrow
Fast mode runs the same Opus 4.8 model at about 2.5x output speed for $10 / $50 per million, three times cheaper than the previous fast mode’s $30 / $150. For latency-sensitive interactive work, copilots and real-time assistance, that price cut genuinely changes the math.
It is not a blanket replacement for standard pricing. Anthropic lists fast mode as a research preview, gated behind an account manager or a waitlist. It runs on the Claude API only, including Claude Managed Agents, and is not available on Amazon Bedrock, Google Vertex AI, or Microsoft Foundry. It is not available with the Batch API, and not with Priority Tier. If your deployment is on a third-party cloud or built around batch processing, fast mode is not currently an option for you.
Dynamic workflows and the multi-agent bill
Dynamic workflows is a separate feature, a Claude Code research preview available on Enterprise, Team, and Max plans. It lets Claude plan a large job, run hundreds of parallel subagents in one session, and verify its own work before reporting back. Anthropic is direct that this spends substantially more tokens, because each subagent bills at the same Opus 4.8 rate.
Keep that separate from the multi-agent numbers in the system card, which describe an evaluation harness, not the product feature. In one such test, a five-agent setup beat a single agent on BrowseComp (85.4 vs 84.3) using about 20% of the latency, at the cost of higher total token use. The structural lesson carries over, but the two are not the same thing, and the headline “88.5% on BrowseComp” is a multi-agent harness result, not a single-model score.
Here is FSR’s read. The real price of running Opus 4.8 in an agentic product is not the per-token rate. It is agent count times effort level times token budget times retries times your latency target. We have not measured that on real traffic yet. That is the subject of a planned follow-up.
This gap between the headline price and the real bill is not new. One agentic SEO tool turned a $99 sticker into a realistic $827 stack, and the agent never flagged the difference.
Two smaller cost details, both confirmed in Anthropic’s docs, cut the other way and help tool-heavy and migrating teams. The minimum cacheable prompt on Opus 4.8 drops to 1,024 tokens, lower than on 4.7, so shorter prompts can now create cache entries with no code change. And if you are coming from Opus 4.6 or earlier rather than 4.7, note that 4.7 and 4.8 use a newer tokenizer that can consume up to 35% more tokens for the same text. That does not change the per-token rate, but it does change how many tokens a given prompt costs, so budget against measured tokens, not character counts.
Coding comparison: where Opus 4.8 leads, and where GPT-5.5 still wins
The table below is built from Anthropic’s capability summary (System Card, Table 8.1.A). The numbers are self-reported by Anthropic. The competitor figures come from Anthropic’s own table, which Anthropic says it drew from each developer’s published system cards or leaderboards. FSR has not independently verified the GPT-5.5 or Gemini 3.1 Pro figures against those developers’ primary sources. Best score in each row is in bold.
| Evaluation | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 88.6 | 87.6 | – | 80.6 |
| SWE-bench Pro | 69.2 | 64.3 | 58.6 | 54.2 |
| SWE-bench Multilingual | 84.4 | 80.5 | – | – |
| Terminal-Bench 2.1 | 74.6 | 66.1 | 78.2 | 70.3 |
| BrowseComp | 84.3 single / 88.5 multi | 79.8 | 84.4 | 85.9 single |
| Humanity’s Last Exam (with tools) | 57.9 | 54.7 | 52.2 | 51.4 |
| GPQA Diamond | 93.6 | 94.2 | – | 94.3 |
| OSWorld-Verified | 83.4 | 82.8 | 78.7 | 76.2 |
| ChartQAPro (with tools) | 72.3 | 69.8 | – | – |
| MCP-Atlas | 82.2 | 79.1 | 75.3 | 78.2 |
| AutomationBench | 15.5 | 9.9 | 12.9 | 9.6 |
| GDPval-AA (Elo) | 1890 | 1753 | 1769 | 1314 |
| GraphWalks BFS 256K | 85.9 | 76.9 | 73.7 | – |
Read it without the marketing and the coding story is split. Opus 4.8 leads every SWE-bench variant and most knowledge-work and agentic tests. But on Terminal-Bench 2.1, GPT-5.5 wins, 78.2 to 74.6. If your work is repo-scale and multi-file, 4.8 is the stronger model. If your work is terminal-first command-line automation, GPT-5.5 still has the edge in Anthropic’s own numbers.
This is also not a clean sweep over 4.7. On GPQA Diamond, a graduate-level science test, 4.8 scores lower than 4.7 (93.6 vs 94.2) and loses to Gemini 3.1 Pro. Anthropic’s “small but real” framing is honest. The deltas are mostly low single digits, and at least one points the wrong way.
The SWE-bench variant trap. When you see “Opus 4.8 hits 88.6% on SWE-bench,” check the variant. 88.6 is SWE-bench Verified, the human-checked 500-problem subset. The harder SWE-bench Pro, drawn from actively maintained repositories with larger multi-file diffs and no public ground-truth leakage, is 69.2. Both are real. Marketing tends to quote the bigger one.
One more disambiguation for buyers who have seen the number 83.4. In this system card, 83.4 is Opus 4.8’s OSWorld-Verified score. GPT-5.5’s Terminal-Bench result under a Codex-CLI harness is also reported as 83.4 elsewhere in Anthropic’s materials, but GPT-5.5’s standard Terminal-Bench 2.1 score is 78.2. Same number, two different tests. Do not let it blur together.
Evaluation awareness, a transparency point. Anthropic notes that Opus 4.8 sometimes reasons in its thinking about how it will be graded, including a case where it speculated about what a grader would check on a social-media task. The company calls these the most interesting cases it saw and says the behavioral effect is modest and the prevalence is roughly in line with Opus 4.7. This is not evidence the model is gaming tests. It is a vendor flagging a trend it wants to watch. Treat it as a transparency signal, not a scandal.
Who should use Opus 4.8, and who should wait
- Upgrade now → You build on Anthropic and do repo-scale, multi-file coding, or want the cheaper fast mode for interactive latency.
- Trial with care → You run API agents (you do not inherit Anthropic’s safeguards) or long-horizon automation (keep a human in the loop), or you are on a third-party cloud where fast mode is unavailable.
- Look elsewhere → Terminal-first CLI work (GPT-5.5 leads Terminal-Bench) or multilingual-first deployment (Gemini 3.1 Pro / GPT-5.4 lead).
- Wait → You can hold a few weeks for the higher Mythos-class model Anthropic says is coming.
Use it now if:
- You already build on Anthropic and your work is repo-scale, multi-file software engineering.
- You run high-stakes drafting or analysis where a model that flags its own uncertainty saves you review time.
- You use long-context or heavy tool-use workflows and keep a human in the review loop.
- You want cheaper interactive latency and qualify for fast-mode access on the Claude API.
Wait, or look elsewhere, if:
- Your automation is terminal-first command-line work, where GPT-5.5 leads on Terminal-Bench.
- Multilingual quality is your primary buying axis. Opus 4.8 is the best generally available Claude here but trails Gemini 3.1 Pro and GPT-5.4 in the system card.
- Your deployment is on Bedrock, Vertex AI, or Foundry and depends on fast mode, which is Claude API only.
- You want to hand off long-horizon autonomous business operations with no human oversight.
- You can wait a few weeks for the higher Mythos-class model that Anthropic says is coming.
FAQ
Is Claude Opus 4.8 worth upgrading from 4.7?
For agentic coding and Claude Code users, yes. Opus 4.8 improves on 4.7 across most evaluations at the same $5/$25 price, with a real honesty gain and no breaking API changes. For chat-only use the change is small. A higher Mythos-class model is due within weeks, so trial it before a full migration, and re-baseline cost because effort levels were recalibrated.
Is Opus 4.8 better than GPT-5.5 for coding?
It depends on the workflow. In Anthropic’s own table, Opus 4.8 leads on SWE-bench (88.6 Verified, 69.2 Pro) and most issue-level coding, but GPT-5.5 wins Terminal-Bench 2.1, 78.2 to 74.6. For repo-scale, multi-file work, Opus 4.8 is stronger. For terminal-first command-line automation, GPT-5.5 still has the edge. FSR has not independently verified the GPT-5.5 figures.
What are the hidden costs of Claude Opus 4.8?
The per-token price is flat, but cost lives in the effort dial, which defaults to High and was recalibrated for 4.8, so re-baseline before tuning. Higher effort, dynamic workflows, and multi-agent setups spend more tokens. Fast mode is access-gated and unavailable on third-party clouds or the Batch API. Measure tokens per completed task on your own traffic.
Is Opus 4.8 Anthropic’s most powerful model?
It is Anthropic’s most capable generally available model, not its overall frontier. The system card places Opus 4.8 between Opus 4.7 and a higher internal model, Claude Mythos Preview, which remains stronger overall and is expected to reach general availability within weeks.
Is Opus 4.8 safe for autonomous agents or business automation?
With Anthropic’s deployed safeguards, browser-agent attack success dropped to near zero in the system card. Without safeguards it is slightly less resistant to prompt injection than 4.7, and API builders do not inherit those safeguards. In Vending-Bench 2, the more honest 4.8 earned far less than 4.7 ($2,992 vs $10,937 on max effort). Keep humans in the loop for long-horizon, adversarial business tasks.
What is fast mode in Claude Opus 4.8, and what are its limits?
Fast mode runs the same Opus 4.8 model at about 2.5x output speed for $10/$50 per million tokens, three times cheaper than the previous fast mode. It is a research preview, gated behind an account manager or waitlist, available on the Claude API only, and not on Amazon Bedrock, Vertex AI, or Microsoft Foundry. It is also unavailable with the Batch API and Priority Tier.
Does Opus 4.8 support effort levels and dynamic workflows?
Yes. Effort control defaults to High across all surfaces, with higher settings for harder problems, and the token allocation per level was recalibrated versus 4.7. Dynamic workflows are a separate Claude Code research preview on Enterprise, Team, and Max plans that plans large tasks, runs hundreds of parallel subagents, and self-verifies, at substantially higher token cost.
Is Opus 4.8 safe for EU or enterprise use?
The system card reports pre-deployment safety evaluations, which is not the same as regulatory conformity. Buyers in regulated settings should confirm data processing terms, data residency, and DPA coverage directly with Anthropic, and ask whether production behavior matches the evaluated behavior. FSR did not assess EU compliance.
What does Claude Opus 4.8 cost?
Standard pricing is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.7, with the full 1M context window included. An optional fast mode runs about 2.5x faster at $10 / $50 per million. The Batch API is 50% off at $2.50 / $12.50. Pricing verified May 29, 2026.
Methodology and sources
This is a Tier C briefing. FSR did not run hands-on tests on Opus 4.8 for this article. There are no firsthand benchmarks, screenshots, or production logs here, and no claim that we tested the model ourselves.
What FSR did: read Anthropic’s published 244-page Claude Opus 4.8 System Card, dated May 28, 2026, including the capability summary table, the Vending-Bench 2 results, the prompt-injection sections, the diligence and honesty evaluations, and the AI R&D failure examples. We then verified pricing, availability, the effort dial, fast mode limits, dynamic workflows, and migration behavior against Anthropic’s official launch announcement and API documentation, on May 29, 2026.
Source weighting: every benchmark number in this article is self-reported by Anthropic in its system card or announcement. Competitor figures for GPT-5.5 and Gemini 3.1 Pro come from Anthropic’s own comparison table, which Anthropic attributes to those developers’ published materials. FSR has not independently confirmed the competitor numbers against primary sources. Partner testimonials in Anthropic’s announcement are vendor-selected and are treated as marketing, not neutral evidence.
Primary sources used (insert live links in WordPress):
- Anthropic, Introducing Claude Opus 4.8, May 28, 2026. https://www.anthropic.com/news/claude-opus-4-8
- Anthropic, Claude Opus 4.8 System Card, May 28, 2026.
- Anthropic API Docs, Pricing, checked May 29, 2026. https://platform.claude.com/docs/en/about-claude/pricing
- Anthropic API Docs, Migration guide (Opus 4.7 to 4.8), checked May 29, 2026. https://platform.claude.com/docs/en/about-claude/models/migration-guide
- Anthropic API Docs, Fast mode (research preview), checked May 29, 2026. https://platform.claude.com/docs/en/build-with-claude/fast-mode
FSR Verdict
Opus 4.8 is a real upgrade, and it is the most honest a Claude model has been about its own limits, in two senses. It flags its own incomplete work better than any previous version, and its maker wrote a system card that hands you the trade-offs instead of hiding them.
For teams already building on Anthropic, Opus 4.8 is the default model to trial first for repo-scale coding, long-context tool use, and agentic workflows where false completion is expensive. The standard price did not move, fast mode is materially cheaper, and the migration from Opus 4.7 is not a new API migration. For chat users, it is a quality-of-life bump, not a reason to change anything.
But do not read “newest model” as “best for every job.” Anthropic’s own materials show GPT-5.5 ahead on Terminal-Bench 2.1, Opus 4.8 below Mythos Preview overall, Opus 4.8 worse than 4.7 in a simulated autonomous business year, and an agentic safety case that leans on deployed safeguards you do not get by default on the API. And a higher Mythos-class model is reportedly weeks away.
So the FSR call is specific. The right move is not a whole-stack migration. It is a controlled rollout with effort-level, token, latency, and safeguard checks, a human in the loop for autonomous work, and no rush to rip out a working stack the week before Anthropic resets the comparison again.
FSR will follow this briefing with a Tier B token-and-cost audit once we have measured real usage across effort levels.