Is Claude 4.6 really worse than at launch?

Quick Summary
A recent GitHub report claims that Claude Opus 4.6, Anthropic's most powerful model to date, has suffered a severe decline in capability, stalling many business automation workflows.
On Reddit, Hacker News, and Anthropic's GitHub, hundreds of developers are reporting the same issue: Claude Opus 4.6 and Sonnet 4.6 perform significantly worse on real-world tasks than they did at launch. One GitHub user recorded their performance score dropping from 92/100 to 38/100 when using Opus 4.6. The question is whether this is driven by cost pressure from ongoing business losses, a technical issue at Anthropic, or a more complex story.
What the Community is Reporting About Claude Opus 4.6
The Most Clearly Documented Complaints
Complaints on social media are not always the most reliable, but when they appear in Anthropic's own GitHub repository, where developers report bugs with Claude Code, they deserve to be taken seriously. These are professional users with measured processes, not subjective impressions.
A developer reported that a production automation pipeline, which had been running stably for over 2 weeks, suddenly produced chaotic results on March 6th with the same Opus 4.6 model. According to this person, when asked to evaluate the quality of the conversation, the model consistently rated itself at the level of Sonnet 4, not Opus 4.6. In other words, even the model judged that it was performing below expectations. (Source: GitHub Issue #31480, anthropics/claude-code)
Another report gives a more concrete example: when Opus 4.6 was asked to generate 3 emails from a template for 3 insurance companies, it produced only 1. When prompted again, the model generated all 3, but after the user made a minor edit, it went back to generating 1 email. The loop repeated without any consistent logic, and the reporter noted that their performance score dropped from 92/100 to 38/100 after switching to Opus 4.6. (Source: GitHub Issue #24991, anthropics/claude-code)
Beyond these two reports, a Hacker News thread collected accounts from many independent developers who confirmed similar behavior and said they had reverted to Claude 4.5 while waiting for a response from Anthropic. (Source: Hacker News thread)
Real-world Comparison Between Opus 4.6 at Launch and Recently
Below are some specific examples from the community, along with my own comparison of the model's earlier and more recent behavior:
Example 1 — Instruction Adherence:
Prompt: "Write an email to a customer. NEVER mention the price in this email."
- Previous Opus 4.6: Complied correctly, with no mention of price.
- Opus 4.6 (after some point in March 2026): Mentioned "suitable pricing package" in the second paragraph despite the clear "NEVER" rule.
Example 2 — Reading Reference Files:
Prompt: asked the model to read a style guide file and apply it to the output.
- Previous Opus 4.6: Read the file accurately and applied the specified style correctly.
- Opus 4.6 (at the time of the report above): Skipped reading the file and produced a completely different format.
Example 3 — Multi-part Task Handling:
Prompt: "Create 3 scenarios for 3 different situations."
- Previous Opus 4.6: Generated all 3 scenarios in one go, with a clear structure.
- Opus 4.6 (according to the February 2026 report): Generated only 1 scenario; when prompted to continue, it lost track of the scenarios it had already been asked for, leading to an endless loop.
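Behaviors like the ones in Examples 1 and 3 can be checked mechanically instead of by eye. The following is a minimal sketch, assuming model outputs are already captured as plain strings; the function names `find_forbidden` and `count_items`, and the idea of counting items by a line prefix, are illustrative assumptions and do not come from any of the reports.

```python
import re


def find_forbidden(text: str, forbidden: list[str]) -> list[str]:
    """Return the forbidden terms found in text (case-insensitive, matched at word start)."""
    hits = []
    for term in forbidden:
        # \b anchors the match at a word boundary, so "price" also catches "prices".
        if re.search(r"\b" + re.escape(term), text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


def count_items(text: str, marker: str) -> int:
    """Count output items by counting lines that start with marker (hypothetical convention)."""
    return sum(1 for line in text.splitlines() if line.strip().startswith(marker))


# Example 1: the "NEVER mention the price" rule.
email = "Thank you for reaching out. We can offer a suitable pricing package."
print(find_forbidden(email, ["price", "pricing"]))  # ['pricing']

# Example 3: three scenarios were requested, only one came back.
output = "Scenario 1: Late delivery\nThe customer calls to ask about the delay."
print(count_items(output, "Scenario") == 3)  # False
```

Running checks like these on a fixed set of prompts every day gives a dated record of when a behavior changed, which is more useful in a bug report than a general impression that the model got worse.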
Is Reverting to Opus 4.5 the Best Solution?
Reverting to Opus 4.5 Even Though Opus 4.6 is Still Quite Good
Many people have suggested reverting to Opus 4.5 as a temporary solution. However, judging by official benchmarks alone, Opus 4.6 outperforms Opus 4.5 on almost every important criterion, especially for those who need long contexts. Opus 4.5 is limited to a 200k context window, while Opus 4.6 can extend to 1M. On BrowseComp, a benchmark that evaluates multi-step web research, Opus 4.6 achieved 84.0% while Opus 4.5 reached only 67.8%, an improvement of 16.2 percentage points. On SWE-bench Verified, which assesses real-world coding, Sonnet 4.6 achieved 79.6% compared to Sonnet 4.5's 77.2%. On ARC-AGI 2, a test of novel problem-solving ability, Opus 4.6 nearly doubled Opus 4.5's score.
However, there is one interesting exception: on the SWE-Bench Multi-Agent benchmark, which measures the ability to coordinate multiple tools simultaneously, Opus 4.5 achieved 62.3% while Opus 4.6 reached only 59.5%. That is a small but real decline of 2.8 percentage points, and it appears to match the scenario most users are complaining about.
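The percentage-point gaps quoted above are simple subtractions, and it helps to see them side by side. This is a small sketch that uses only the scores stated in the text; the dictionary labels and the helper name `point_delta` are my own.

```python
# Scores in percent, as quoted in the text: (4.5 generation, 4.6 generation).
scores = {
    "BrowseComp (Opus)": (67.8, 84.0),
    "SWE-bench Verified (Sonnet)": (77.2, 79.6),
    "SWE-Bench Multi-Agent (Opus)": (62.3, 59.5),
}


def point_delta(old: float, new: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


for name, (old, new) in scores.items():
    print(f"{name}: {point_delta(old, new):+.1f} pp")
# BrowseComp (Opus): +16.2 pp
# SWE-bench Verified (Sonnet): +2.4 pp
# SWE-Bench Multi-Agent (Opus): -2.8 pp
```

Only the multi-tool benchmark moves in the negative direction, which is why it stands out against the otherwise positive numbers.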
Subjective and Objective Causes for Opus 4.6's Poor Experience?
This is the most important part for understanding the problem correctly. At least three different causes can produce the same symptom of a "model performing worse":
- Temporary Technical Issues: Anthropic has confirmed multiple official incidents on its status page, including "Elevated errors on Claude Opus 4.6" on February 28, 2026, a similar incident on March 31, 2026, and "Opus 4.6 and Sonnet 4.6 error rate elevated" on the same day. These are not subjective complaints but officially recorded technical incidents, and many "regression" reports occurred precisely during these periods.
- Default Behavior Changes: Opus 4.6 is designed to think more by default through "adaptive thinking", meaning it decides for itself when to engage in deep reasoning and when not to. This makes it slower and sometimes more cumbersome on simple tasks, so users accustomed to 4.5 feel that the model is "overthinking" instead of responding quickly.
- Anthropic is Still Profit-Oriented: (This is a personal opinion) Anthropic's main goal still appears to be profit, so it may have reduced the compute allocated to Opus 4.6 to ease its cost burden, much as OpenAI reportedly shut down Sora to cut costs.
What Other Solutions Are People Suggesting?
First, Switching to Codex
Based on Opus's track record, the current issues with Opus 4.6 appear to be temporary, but in the meantime they are a significant windfall for OpenAI's Codex, as people flock to it and its GPT-5.3 Codex model.
Codex currently offers more generous quotas than Claude Code, but I don't think this will seriously threaten Anthropic: in my experience, Opus 4.6 on both Antigravity and Claude Code is much better than Codex. For instance, when I only needed to modify one file, Opus 4.6 made exactly that change, whereas Codex also modified other files and broke my entire website, which was truly frustrating.
Deep Edits in the Settings File
Someone has shown how to adjust Claude Opus 4.6's "thinking" behavior in Claude Code by editing the ~/.claude/settings.json file. If you have tried it, please comment on your experience so others know.
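If you do experiment with that file, it is worth changing it in a way that is easy to undo. The sketch below backs up a JSON settings file and merges new keys into it without discarding existing ones. The helper name `merge_settings` is hypothetical. The original tip does not say which settings were changed, so the `env` / `MAX_THINKING_TOKENS` pair in the commented usage is only an assumption and should be checked against the Claude Code settings documentation before use.

```python
import json
import shutil
from pathlib import Path


def merge_settings(path: Path, overrides: dict) -> dict:
    """Back up the JSON file at path (if present), merge overrides into it, and write it back."""
    settings = {}
    if path.exists():
        # Keep a copy next to the original so the change can be reverted by hand.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        settings = json.loads(path.read_text(encoding="utf-8"))
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(settings.get(key), dict):
            settings[key].update(value)  # one-level merge keeps existing entries
        else:
            settings[key] = value
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings


# Usage (left commented out so that running this file never touches real settings).
# ASSUMPTION: the key and value below are illustrative, not taken from the original tip.
# merge_settings(Path.home() / ".claude" / "settings.json",
#                {"env": {"MAX_THINKING_TOKENS": "8000"}})
```

To revert, copy settings.json.bak back over settings.json.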

Is This an Industry Standard?
Yes. OpenAI, Google, and Anthropic all have a history of releasing new models that score better on benchmarks yet draw complaints about real-world experience, often because optimizing for a benchmark set doesn't reflect the full diversity of actual workflows. This is why large companies often don't upgrade as soon as a new version is released but first test it thoroughly on their own workloads.
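Testing on your own workload before upgrading can be as simple as scoring both models on the same tasks and refusing to switch if any task regresses. This is a minimal sketch under that assumption; the function name `upgrade_report`, the task names, and the tolerance value are hypothetical, and only the 92 and 38 scores echo the figures reported earlier.

```python
def upgrade_report(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 2.0,
) -> tuple[bool, dict[str, tuple[float, float]]]:
    """Compare per-task scores and list tasks where the candidate drops by more than tolerance."""
    regressions = {
        task: (baseline[task], candidate[task])
        for task in baseline
        if task in candidate and candidate[task] < baseline[task] - tolerance
    }
    # Upgrade only when no shared task regressed beyond the tolerance.
    return (not regressions), regressions


# Hypothetical workload scores out of 100 for the current and the new model.
current = {"email_templates": 92.0, "style_guide": 88.0}
new = {"email_templates": 38.0, "style_guide": 89.0}

ok, regressions = upgrade_report(current, new)
print(ok)           # False
print(regressions)  # {'email_templates': (92.0, 38.0)}
```

A per-task view matters here because an average can hide a single workflow that has collapsed while the others improved slightly.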
If you are using Claude Opus 4.6 for research workflows, computer use, or long-horizon reasoning tasks, the best approach for now is still to revert to Opus 4.5 so your work can continue without interruption.



