Is Claude 4.6 really worse than at launch?

Published on 14 April, 2026

Quick Summary

A recent report on GitHub raises questions about Claude Opus 4.6 – Anthropic's most powerful model to date – experiencing a severe decline in capabilities, causing many business automation processes to stall.

On Reddit, Hacker News, and Anthropic's GitHub, hundreds of developers are reporting the same issue: Claude Opus 4.6 and Sonnet 4.6 are performing significantly worse on real-world tasks than they did at launch. One GitHub user recorded their performance score dropping from 92/100 to 38/100 when using Opus 4.6. The question is whether this is due to cost pressure on the business, a technical issue at Anthropic, or a more complex story.

What the Community is Reporting About Claude Opus 4.6

The Most Clearly Documented Complaints

Complaints on social media are often not the most reliable, but when they come from Anthropic's own GitHub repository, where developers report bugs with Claude Code, they deserve to be taken seriously. These are professional users with measured processes, not subjective feelings.

A developer reported that a production automation pipeline, which had been running stably for over 2 weeks, suddenly produced chaotic results on March 6th with the same Opus 4.6 model. According to this person, when asked to self-evaluate the conversation quality, the model consistently rated itself at the level of Sonnet 4, not Opus 4.6. In other words, even Opus 4.6 appeared to recognize that it was performing below expectations. (Source: GitHub Issue #31480, Anthropic/claude-code)

Another report documented more specifically with a real-world example: requesting Opus 4.6 to generate 3 emails based on a template for 3 insurance companies, the result was only 1 email. When prompted again, the model generated all 3, but when the user made a minor edit, the model reverted to generating 1 email. This loop repeated without any consistent logic — the reporter noted their performance score dropped from 92/100 to 38/100 after switching to Opus 4.6. (Source: GitHub Issue #24991 — Anthropic/claude-code)

In addition to the two reports above, a compiled thread on Hacker News noted many independent developers confirming similar situations and stating they reverted to using Claude 4.5 while awaiting a response from Anthropic. (Source: Hacker News thread)

Real-world Comparison Between Opus 4.6 at Launch and Recently

Below are some specific examples from the community, along with my own comparison of the model's behavior at launch and more recently:

Example 1 — Instruction Adherence:

Prompt: "Write an email to a customer. NEVER mention the price in this email."

Previous Opus 4.6: Complied correctly, with no mention of price.
Opus 4.6 (after some point in March 2026): Mentioned "suitable pricing package" in the second paragraph despite the clear "NEVER" rule.
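
A "NEVER mention X" rule like the one in this example is easy to check automatically, which is how a regression of this kind can be caught in a pipeline rather than by eye. The sketch below is only an illustration: the function name, the term list, and the two sample emails are assumptions made up for this example, not anything taken from the GitHub reports.

```python
import re


def find_forbidden_terms(text: str, forbidden: list[str]) -> list[str]:
    """Return the forbidden terms that occur in text.

    Matching is case-insensitive and anchored at the start of a word,
    so "cost" also flags "costs" or "costly".
    """
    hits = []
    for term in forbidden:
        if re.search(rf"\b{re.escape(term)}\w*", text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


# Hypothetical model outputs for the prompt "NEVER mention the price".
compliant_email = "Thank you for your interest. Our team will follow up with details shortly."
violating_email = "We can recommend a suitable pricing package for your team."

terms = ["price", "pricing", "cost"]
print(find_forbidden_terms(compliant_email, terms))  # []
print(find_forbidden_terms(violating_email, terms))  # ['pricing']
```

A keyword check like this only catches literal mentions. It would miss a paraphrase such as "what you will pay", so it complements a manual review rather than replacing it.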

Example 2 — Reading Reference Files:

Prompt requested reading a style guide file and applying it to the output.

Previous Opus 4.6: Read the file accurately and applied the specified style correctly.
Opus 4.6 (at the time of the report above): Skipped reading the file and produced a completely different format.

Example 3 — Multi-part Task Handling:

Prompt: "Create 3 scenarios for 3 different situations."

Previous Opus 4.6: Generated all 3 scenarios in one go, with a clear structure.
Opus 4.6 (according to the February 2026 report): Generated only 1 scenario and, when prompted to continue, forgot the scenarios it had already produced, leading to an endless loop.

Is Reverting to Opus 4.5 the Best Solution?

Reverting to Opus 4.5 Even Though Opus 4.6 is Still Quite Good

Many people have suggested reverting to Opus 4.5 as a temporary solution to this problem. However, if we only look at official benchmarks, Opus 4.6 outperforms Opus 4.5 on almost all important criteria, especially for those who need long contexts. Opus 4.5 currently offers only a 200k context window, while Opus 4.6 can expand to 1M. Regarding scores, on BrowseComp, a benchmark evaluating multi-step web research capabilities, Opus 4.6 achieved 84.0% while Opus 4.5 only reached 67.8%, an improvement of 16.2 percentage points. On SWE-bench Verified, which assesses real-world coding, the Sonnet line shows a similar direction: Sonnet 4.6 achieved 79.6% compared to Sonnet 4.5's 77.2%. On ARC-AGI 2, a test of novel problem-solving abilities, Opus 4.6 nearly doubled its score compared to 4.5.

However, there's an interesting point: on the SWE-Bench Multi-Agent benchmark, which measures the ability to coordinate multiple tools simultaneously, Opus 4.5 achieved 62.3% while Opus 4.6 only reached 59.5% – a small but real decline, which seems to be the scenario most users are complaining about.
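
The gaps quoted in this section are percentage-point differences, that is, new score minus old score. The short sketch below recomputes them from the figures given in the article. The dictionary keys are just labels for this example, and the benchmark names are copied from the text as written.

```python
# (older model score, newer model score), in percent, as quoted in the article.
scores = {
    "BrowseComp (Opus 4.5 -> Opus 4.6)": (67.8, 84.0),
    "SWE-bench Verified (Sonnet 4.5 -> Sonnet 4.6)": (77.2, 79.6),
    "SWE-Bench Multi-Agent (Opus 4.5 -> Opus 4.6)": (62.3, 59.5),
}


def point_delta(old: float, new: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


for name, (old, new) in scores.items():
    print(f"{name}: {point_delta(old, new):+.1f} points")
# BrowseComp (Opus 4.5 -> Opus 4.6): +16.2 points
# SWE-bench Verified (Sonnet 4.5 -> Sonnet 4.6): +2.4 points
# SWE-Bench Multi-Agent (Opus 4.5 -> Opus 4.6): -2.8 points
```

Seen side by side, the one negative delta is small compared to the gains, which fits the article's point that the regression is narrow but real.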

Subjective and Objective Causes for Opus 4.6's Poor Experience?

This is the most important part of understanding the problem correctly. There are at least three different causes that lead to the same symptom of "model performing worse":

Temporary Technical Issues: Anthropic has confirmed multiple official incidents on its status page, including "Elevated errors on Claude Opus 4.6" on February 28, 2026, a similar incident on March 31, 2026, and "Opus 4.6 and Sonnet 4.6 error rate elevated" on the same day. These are not subjective complaints — these are officially recorded technical incidents, and many "regression" reports occurred precisely during these periods.
Default Behavior Changes: Opus 4.6 is designed to think more by default through "adaptive thinking" — meaning it decides when to engage in deep reasoning and when not to. This makes it slower and sometimes feel more cumbersome on simple tasks, making users accustomed to 4.5 feel like the model is "overthinking" instead of performing quickly.
Anthropic is Still Profit-Oriented: (This is a personal opinion) Anthropic's main goal still seems to be profit, so it may have reduced the compute allocated to Opus 4.6 to ease its cost burden, much as OpenAI, as is widely known, had to shut down Sora to reduce costs.

So, Are People Mentioning Other Solutions?

First, Switching to Codex

Based on Opus's track record, Opus 4.6's current issues appear temporary, but in the meantime they are inadvertently giving OpenAI's Codex a significant boost as people flock to it with GPT-5.3 Codex.

Codex currently offers more generous quotas than Claude Code, but I don't think this will significantly threaten Anthropic, as my experience with Opus 4.6 on both Antigravity and Claude Code is much better than with Codex. For instance, when I only needed to modify one file, Opus 4.6 did it correctly and precisely, but Codex also modified other files, messing up my entire website, which was truly frustrating.

Deep Edits in the Settings File

Someone has shown how to modify Claude Code to address Claude Opus 4.6's "thinking" part by editing the ~/.claude/settings.json file. Anyone who has tried it, please comment on your experience so others know.
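
If you do experiment with ~/.claude/settings.json, it is worth backing the file up and merging your change into it rather than overwriting it by hand. The sketch below shows one way to do that. The article does not say which keys were changed, so the override used here (EXAMPLE_SETTING under env) is a placeholder and not a real Claude Code option. The helper names are also made up for this example, and the demo runs against a temporary file instead of your real settings.

```python
import json
import shutil
import tempfile
from pathlib import Path


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides applied, merging nested dicts."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def update_settings(path: Path, overrides: dict) -> dict:
    """Back up a JSON settings file (if present), then write the merged result."""
    current = {}
    if path.exists():
        current = json.loads(path.read_text())
        shutil.copy2(path, path.with_name(path.name + ".bak"))  # settings.json.bak
    merged = deep_merge(current, overrides)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return merged


# Demo on a throwaway file so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "settings.json"
    demo_path.write_text('{"env": {"EXISTING": "1"}}')
    # Placeholder override: substitute whatever change you actually want to test.
    result = update_settings(demo_path, {"env": {"EXAMPLE_SETTING": "value"}})
    print(result)  # {'env': {'EXISTING': '1', 'EXAMPLE_SETTING': 'value'}}
```

To apply it for real, you would pass Path.home() / ".claude" / "settings.json" as the path. Restoring the .bak copy undoes the change.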

Is This an Industry Standard?

Yes. OpenAI, Google, and Anthropic all have a history of releasing new models with better benchmarks but causing complaints about real-world experience — often because optimization for a benchmark set doesn't reflect the full diversity of actual workflows. This is why large companies often don't upgrade models immediately upon a new version release but thoroughly test them on their specific workloads first.
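
Testing a model "on your specific workload" can start very small: a fixed list of prompts, each with a pass/fail check, run against both the version you trust and the one you are evaluating. The sketch below illustrates the idea with two stub functions standing in for the models. They are hypothetical and return canned text, so the script runs without any API access. In practice you would replace them with real calls.

```python
from typing import Callable

# A test case is a prompt plus a function that says whether the output is acceptable.
Case = tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose check passes on the model's output."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def trusted_model(prompt: str) -> str:
    """Stub for the version you currently rely on (canned output)."""
    if "3 emails" in prompt:
        return "Subject: Insurer A\nSubject: Insurer B\nSubject: Insurer C"
    return "Thanks for reaching out. We will follow up soon."


def candidate_model(prompt: str) -> str:
    """Stub for the version under evaluation; it returns only one email."""
    if "3 emails" in prompt:
        return "Subject: Insurer A"
    return "Thanks for reaching out. We will follow up soon."


cases: list[Case] = [
    ("Write 3 emails, one per insurer.", lambda out: out.count("Subject:") == 3),
    ("Write an email. NEVER mention the price.", lambda out: "pric" not in out.lower()),
]

print(pass_rate(trusted_model, cases))    # 1.0
print(pass_rate(candidate_model, cases))  # 0.5
```

Keeping the case list under version control gives you a before/after number whenever a model or a default setting changes, instead of relying on impressions.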

If you are using Claude Opus 4.6 for research workflows, computer use, or long-term reasoning tasks, the best approach currently is still to revert to Opus 4.5 to continue your work without interruption.

YC CEO's 6 forcing questions before starting any project

I'd heard a lot about the gstack repo from the CEO of Y Combinator, so I got curious and installed it to try. What surprised me most wasn't the polished workflows, it was the genuinely different mindset behind them. That mindset shows up in the very first command: /office-hours, with six questions that don't ask about code at all, only the things most people haven't thought through before they start building.

What is gstack and why did Garry Tan build it

gstack is an open-source toolkit by Garry Tan, CEO of Y Combinator, built primarily for Claude Code. The core idea: instead of using AI as a plain code writer, Garry Tan wanted to turn Claude into a small AI agent team, where each member handles a different role, from product direction and security review to testing and release. The entire workflow runs in an ordered loop: Think → Plan → Build → Review → Test → Ship → Reflect.

More specifically, gstack splits Claude Code into 23 specialized roles, and the output of each step is automatically passed to the next, with no manual handoff needed. Some of the standout commands:

/office-hours: 6 questions that force you to rethink your feature before writing a single line of code
/plan-ceo-review: checks whether you're overbuilding or underbuilding relative to what's actually needed
/review: catches serious bugs that standard automated checks miss
/qa: opens a real browser, performs real interactions, finds real bugs
/cso: runs an automated security audit against international standards
/ship: syncs, tests, pushes code and opens a pull request in a single command

How effective is gstack?

Garry Tan says his working speed in 2026 is roughly 810 times faster than in 2013, measured by lines of completed code per day (11,417 vs 14). In 60 days, he shipped 3 production services and over 40 features, all while running Y Combinator full-time.
Andrej Karpathy, co-founder of OpenAI, confirmed a similar trend, sharing that he hasn't typed a single line of code himself since December 2025 thanks to AI agents. But among all those commands, /office-hours stands out for the opposite reason from the rest: it doesn't help you work faster, it helps you avoid building the wrong thing from the start.

Why Garry Tan puts /office-hours first

Garry Tan placed /office-hours at the top of the workflow based on a simple observation: most products fail not because of poor code, but because they build the wrong thing. Teams spend weeks on a feature nobody needs, or build the right feature for the wrong audience, or solve a problem users already handle better another way. The command has two modes: Startup mode for founders and people building real products with real users, and Builder mode for side projects, hackathons, and open source. This article focuses on Startup mode, where the 6 questions are most directly applicable.

6 questions that stop you from building the wrong thing

These aren't 6 questions to answer quickly and move on. They're designed to make you think honestly, because the more truthful your answers, the more accurately Claude can match what you actually need, saving you a significant amount of time later. You can read the full original prompts at office-hours/SKILL.md.tmpl.

Demand reality: Is there a real need?

Original question: "Who specifically has this problem? How are they solving it today?"

Not "users in general" or "the marketing team": the goal is to name one real person, ideally by name, who is actively struggling with a specific problem. If you can't name someone like that, you don't yet understand what they actually need.

Concrete example: Instead of "users want better task management," it should be: "Minh, a project manager at a 20-person company, copy-pastes between Notion and Google Sheets every Monday morning because the two tools don't sync."
Apply this to your own situation accordingly.

Status quo: What are they using instead?

Original question: "What is their current workaround? How much better do you need to be for them to switch?"

Everyone is already solving their problem somehow, whether with Excel, sticky notes, or a WhatsApp group. If their current solution is good enough, they have no reason to migrate their data and learn an entirely new platform. Your solution needs to be meaningfully better before they'll even consider switching.

Desperate specificity: Who needs this badly enough?

Original question: "Who needs a solution badly enough to use your ugly beta version today?"

This is the question that separates nice-to-have from must-have. If you can't find anyone willing to use an incomplete, rough, buggy version right now, the problem you're solving isn't urgent enough. Real early users are people who need a solution badly enough to tolerate an unpolished product, as long as it's moving in the right direction.

Narrowest wedge: What is the smallest possible piece?

Original question: "What is the smallest thing you could launch tomorrow? Not the full vision — the smallest piece."

Not the first full-featured version, but something even smaller than that. This question typically cuts 80% of the scope people add because they think "might as well do it while I'm here." It's a trap many builders fall into, including myself. Launch the smallest meaningful piece first, listen to real users, then decide whether to expand.

Common mistake: Many people confuse "smallest piece" with "first full-featured version." The narrowest wedge truly means one small thing that solves one specific problem for one specific group of users, nothing more.

Observation and surprise: Have you watched real people use it?

Original question: "Have you watched real people use your product? Did they use it in ways you didn't expect?"

This question is best saved for the second iteration onward, once you have something to test.
Rather than asking for feedback through messages or surveys, sit and watch directly, or review screen recordings. The most valuable insights usually don't come from what users say, but from what they do that you didn't design for, or what they skip that you thought was important.

Note: If you're in your first iteration and don't have a product yet, you can skip this question and come back after launching the smallest piece in step 4.

Future-fit: The 2 to 3 year view

Original question: "In 2-3 years, will what you're building still be relevant — or is the trend moving against you?"

This isn't about predicting the future precisely. It's about avoiding building something that's already fading. If the trend is making your problem less urgent over the next two years, that's a clear signal to reconsider from the start. That said, if your goal is to move fast and capture the market before big tech ships something similar, this question can reasonably be set aside.

A real example: a simple idea completely flipped

In the gstack documentation, Garry Tan walks through a practical example. You open /office-hours and say: "I want to build an app that summarizes my daily work calendar." Claude doesn't agree and start executing. Instead, it pushes back: what you just described isn't a calendar summary app, it's actually a full personal AI chief of staff. These are entirely different in scope, technical complexity, and user expectations.

From that single opening description, /office-hours helps you see:

5 features you were describing without realizing it
4 assumptions that need to be validated before building
3 different implementation directions with varying levels of complexity
1 recommendation: launch the smallest piece first, treat the rest as a long-term roadmap

All of this happens before you write a single line of code. The output is saved as a document that subsequent steps in the workflow automatically pick up and continue from.
These 6 questions work even without gstack

The 6 questions from /office-hours don't require Claude Code or a gstack installation. They're a way of thinking, the same framework YC partners use to evaluate startups, and you can apply them right now with any AI tool you already have. The difference when using them through gstack is that Claude won't let you give vague answers. It pushes for specifics and won't move forward until your response is grounded enough to be useful. That's why /office-hours tends to be the most uncomfortable command in the entire toolkit: not because it's difficult to use, but because it asks exactly what you've been avoiding.

Try it today: Before starting your next project, paste these 6 questions into Claude, Gemini, or ChatGPT along with your idea. Ask it to go through each question one at a time and not let you skip any. The results are often more surprising than you'd expect, even for ideas you've already thought through carefully.

gstack currently has over 117k stars on GitHub and is still growing. For me, the most valuable part isn't the technical commands like /review or /ship. It's /office-hours, because it's the only command in the entire toolkit that forces you to stop and think before doing anything else.

Nam•

27 Jun, 2026

How to control Codex from your phone with ChatGPT app

You're out and suddenly remember a small detail in your project that needs fixing. You don't have to open your laptop or remote desktop in. With the right connection set up, ChatGPT app on your phone can become a control panel for Codex, while your computer at home or the office keeps running the actual code.

ChatGPT app doesn't run Codex on your phone

The easiest thing to misunderstand is thinking Codex is running directly on your phone. In reality, your phone only sends prompts, replies, approvals and follow-up messages, while the actual working environment lives on your Mac or Windows machine running Codex. In other words, ChatGPT app is the remote controller, and the host machine is where your repo, terminal, credentials, plugins, MCP servers and other tools actually live.

This makes complete sense because codebases typically live on your development machine, not your phone. When you send a request like fixing a TypeScript error, running tests or checking a diff, Codex processes it inside the selected project on the host and sends results back for you to review. If you want to understand the foundation before using remote access, check out What is Codex and how to use Codex to get a clear picture of where this tool fits in your workflow.

What do you need before connecting ChatGPT app to Codex?

According to the latest Codex documentation from OpenAI, ChatGPT app supports controlling Codex on both macOS and Windows, though Linux is not supported yet. Notably, this feature works with all ChatGPT account types, including Free and Go, with no paid plan required. You only need to make sure you're signed into the same account or workspace on both devices: ChatGPT mobile (latest version on iOS or Android) and Codex (latest version on your host machine, online and running).

Your host machine must stay on and Codex must keep running for the entire time you're controlling it remotely.
If the machine goes to sleep, loses its connection or Codex is closed, the connection from your phone drops immediately and any tasks in progress may be interrupted.

What's worth noting is that the entire setup process starts from Codex App on the host machine and is surprisingly simple: just scan a QR code and you're done. Inside Codex App, select the mobile setup option in the sidebar, scan the QR code with your phone, then complete the confirmation in ChatGPT app. For enterprise workspaces, an admin may need to enable Remote Control permissions before you can connect. This QR code grants control over your computer, so keep it private and never share it with anyone to avoid unauthorized access to your machine.

To summarize, connecting ChatGPT app to Codex is straightforward:

Host machine must be online and running Codex
ChatGPT app and Codex must be signed into the same account or workspace
Generate the QR code in Codex on the host and complete setup on your phone
MFA, SSO or passkey requirements may still apply depending on your workspace

What can you do once connected?

Once the host appears in Codex on your phone, you can start a new thread inside a project on the host or pick up an existing one. This is where the experience becomes genuinely useful: you can send follow-ups, answer Codex's questions, approve commands, view output, check diffs, review test results and even receive notifications when a task finishes or needs your attention.

A real example: you're at a coffee shop and remember the login form has a validation bug. You open ChatGPT app, select the connected host, and ask Codex to check the auth flow, fix the email validation error and run the related tests. Codex works directly on the repo sitting on your host machine, while you review the results, approve actions when needed and decide whether to request further changes.
This is also why people are starting to think of Codex and other AI-powered IDEs as a colleague working inside a real environment, not just a code suggestion tool anymore. Its strength lies in reading files, running commands, editing code and maintaining context across multiple rounds of back-and-forth.

Limitations to keep in mind when using Codex from your phone

Remote control depends entirely on the host machine: if your computer goes to sleep, loses its connection, closes Codex or gets signed out of the workspace, your phone loses its working environment immediately. That said, if Codex is mid-task when the connection drops, it will continue running on the host and notify you once your phone reconnects, so there's less to worry about if your phone suddenly loses signal during a running task. One more thing to note: on Windows, tasks using Computer Use require an appropriate foreground session, so this setup is not a complete replacement for sitting directly in front of your machine.

It also helps to draw a clear line between handing off a focused task and reviewing large changes. Your phone works well for small bugs, running tests, quick questions about a specific file, reviewing short tasks or checking task status. However, anything requiring a high level of attention should still be reviewed on a larger screen to avoid missing details.

How to use it effectively in practice

The most effective approach is to hand off tasks with a clear scope and specific expected outcomes. Instead of saying "fix the login", describe exactly where the error occurs, what the expected behavior should be after the fix, which tests to run and which parts of the codebase to leave untouched. Codex performs better when it knows the boundaries of a task, especially since remote mobile means each feedback loop takes longer than when you're sitting right at your machine.
A clean working rhythm might look like this: describe the task in detail whether small or medium-sized, ask Codex to read the relevant files, let it propose a solution, only approve when necessary and wait for the result report. Once you get used to this rhythm, you'll find that idle time outside can handle real work, while keeping the final decision firmly in your hands.

Compared to Claude Code Remote and Telegram bot

There are many ways to control an AI coding agent from your phone, though the three most common approaches each serve a different need.

Criteria | ChatGPT app + Codex | Claude Code Remote | Telegram + Codex
Natural conversation | ✅ Excellent | ✅ Good | ❌ Requires exact syntax
Granular control | Moderate | Highest | Low
Connection stability | Stable | Stable | Frequent drops
Mobile UI | Well optimized | Not fully optimized | Uses existing Telegram app
Initial setup | Easy, scan QR | Easy | Requires manual bot configuration
Computer must stay on | ✅ Required | ✅ Required | ✅ Required

Claude Code Remote Control offers the strongest level of control: you get direct terminal output, can intervene mid-task and generally feel much closer to what the agent is doing. That said, the UI on small phone screens isn't fully optimized yet, and some interactions are still difficult to perform without a physical keyboard.

Telegram bot has the advantage of not requiring a separate app and is easy to get started with, but the real-world experience has clear limits: it's prone to slowdowns, occasional silent disconnections mid-task, and because it lacks genuine AI context, anything slightly more complex than a simple command quickly falls apart, forcing you to type precise instructions rather than describe what you need naturally.

ChatGPT app + Codex sits at the best balance point for most users: smooth enough, smart enough, quick to set up with a QR scan and no new syntax to learn before you can get to work.
Connecting ChatGPT app to Codex doesn't turn your phone into a development machine — it turns your phone into a control surface for a development machine that's already ready to work. As long as the host stays on, permissions are configured correctly and the task is scoped tightly enough, this is the most practical way to handle real coding work when you're away from your laptop.

Nam•

22 Jun, 2026

What Is Hermes Agent? Nous Research's Self-Learning AI

The idea that learning more makes you better, a principle long assumed to apply only to humans, turns out to hold true for Hermes Agent too, an open-source AI agent from Nous Research. Every time you work with it, Hermes Agent doesn't forget. It remembers, understands you more deeply, and gets better with each session, thanks to a memory system that can recall everything about you even after the machine has been off for a week.

What Is Hermes Agent?

Hermes Agent is an open-source AI agent developed and released under the MIT license by Nous Research, the lab behind the Hermes, Nomos, and Psyche model lines. Unlike Antigravity or Codex, which depend on an IDE environment, or ordinary chatbots that ultimately remain a thin wrapper calling a single API, Hermes Agent is built to run continuously on a user's own infrastructure, from a cheap VPS to a GPU cluster or serverless infrastructure, and it operates in a way fairly similar to OpenClaw.

The core difference in Hermes Agent lies in how it manages long-term memory and converts experience into real skills. Instead of merely storing raw information or passively remembering preferences the way AI like Gemini or Claude do, Hermes runs a closed "learning loop," meaning that after every work session, it actively distills the process into new tools it can use the next time. This system is run by a background "Curator" agent that automatically scores, prunes, and merges accumulated knowledge, combined with FTS5 search technology that retrieves old memories roughly 4,500 times faster without spending any tokens. As a result, Hermes doesn't just respond and forget. It genuinely becomes a collaborator that grows more knowledgeable and capable over time.

Four Features That Set Hermes Agent Apart

Nous Research doesn't call Hermes Agent a chatbot or a copilot. It positions it as an agent with a built-in learning loop. The four feature groups below explain why that label isn't just marketing.
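
FTS5, mentioned above as the retrieval layer, is SQLite's built-in full-text search extension, and a local query against it involves no model call, which is why it costs no tokens. The sketch below shows the general technique using Python's standard sqlite3 module. It is not Hermes Agent's actual schema or code: the table name, column names, and sample notes are assumptions made up for illustration, and it assumes your Python's SQLite build includes FTS5 (most do). It does not attempt to reproduce the speed figure quoted in the article.

```python
import sqlite3


def build_memory_index(notes: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 index over (session, note) pairs."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per remembered note, tagged with its session date.
    conn.execute("CREATE VIRTUAL TABLE memories USING fts5(session, note)")
    conn.executemany("INSERT INTO memories (session, note) VALUES (?, ?)", notes)
    return conn


def search_memories(conn: sqlite3.Connection, query: str) -> list[tuple[str, str]]:
    """Full-text search over stored notes, oldest session first."""
    return conn.execute(
        "SELECT session, note FROM memories WHERE memories MATCH ? ORDER BY session",
        (query,),
    ).fetchall()


sample_notes = [
    ("2026-05-01", "User prefers short status reports on Monday"),
    ("2026-05-09", "Fixed the deploy script for the staging server"),
    ("2026-05-20", "User asked to deploy only after tests pass"),
]

conn = build_memory_index(sample_notes)
for session, note in search_memories(conn, "deploy"):
    print(session, "-", note)
# 2026-05-09 - Fixed the deploy script for the staging server
# 2026-05-20 - User asked to deploy only after tests pass
```

The default FTS5 tokenizer matches whole words without stemming, so a query for "deploy" would not match "deployed". A production memory store would likely need to handle that, along with ranking and pruning.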
Memory That Persists Across Sessions

The biggest weakness of most AI today is that memory only stores raw chat text rather than how work actually gets done. Hermes Agent addresses this through three combined mechanisms:

Fast retrieval: Uses FTS5 full-text search to pull up old memories roughly 4,500 times faster than conventional search, without spending extra tokens the way Gemini or Cowork do.
User understanding: Integrates Honcho's dialectical user-modeling approach, helping the agent understand preferences, habits, and personal context in depth across thousands of sessions.
Continuity: The agent picks up work exactly where you left off, even if that was a project from weeks earlier.

Self-Generating and Self-Improving Skills

This is the feature that makes Hermes Agent behave like a collaborator that accumulates experience, rather than just a tool that answers on request:

Learning from real use: After completing complex tasks, Hermes Agent distills the process into new skills and stores them in a library to be reused automatically next time.
Open agentskills.io standard: These skills follow an open standard, so they can be packaged, shared, and reused across different AI systems without being rewritten from scratch.
The Curator mechanism: A background administrative agent periodically scores, prunes, and merges duplicate skills, which keeps the skill library from bloating and becoming disorganized over time.

Present on More Than 23 Messaging Platforms

Hermes Agent isn't confined to a computer, it integrates directly into the messaging channels people already use on their phones every day:

Multiple channels, one brain: You can command Hermes Agent through Telegram, Discord, Slack, WhatsApp, Signal, email, or SMS.
Context retained: Whether you message via Telegram in the morning or switch to Discord at night, the agent keeps a single thread of memory, never fragmented by channel.
Multimodal interaction: Supports sending voice messages, images, and video, along with the ability to analyze multimodal content.

Flexible Runtime Infrastructure

Hermes Agent supports six backend types for executing commands: local machine, Docker, SSH, Daytona, Singularity, and Modal. With Daytona and Modal, the environment can hibernate when idle and cost almost nothing while waiting, waking up only when there's work to process. This is why Nous Research describes Hermes Agent as an always-on agent that doesn't require users to keep a server running 24/7 at high cost year-round.

Hermes Agent can be installed with a single curl command, supporting Linux, macOS, and Windows via WSL2, or, as of June 5, 2026 with version v0.16.0 "The Surface Release," through an official Native Desktop app for Windows, macOS, and Linux with a fully polished GUI, making it accessible to everyday users without needing a terminal.

Built-In Toolset and Limitations to Know

40-Plus Built-In Tools, From Web Search to Schedule Automation

Hermes Agent ships with more than 40 built-in tools, including web search, browser actions, file handling, and Python script execution via RPC to run sub-tasks without consuming the main agent's context window. A natural-language scheduling system lets you set recurring tasks like daily reports or data backups, then leaves the agent to run them without being reminded. For tasks that need full isolation, Hermes Agent also supports sub-agents with their own conversation, terminal, and scripts, allowing multiple jobs to run in parallel without diluting the main memory.

Challenges and Security Considerations

Despite rapid updates, Hermes Agent still has a few points users should keep in mind before deploying it:

Stability of the self-learning mechanism: The ability to self-improve skills boosts success rates, with a Tencent Cloud report recording gains of up to 52% along with token savings of up to 61%.
However, since this is a self-evolving mechanism, real-world effectiveness still depends on the underlying model chosen and still requires human oversight rather than full trust.
Risk from high-level permissions, with security responsibility falling on the user: Hermes Agent can intervene deeply in a system (excessive agency), so connecting it directly to multiple messaging platforms requires users to manage their own API keys and set up guardrails. Unlike closed AI services, Hermes Agent hands full control over to the user, which means the user also bears greater responsibility for configuring access permissions to avoid information leaks.

Why Is Hermes Agent Growing So Fast?

Hermes Agent's growth could be attributed to Nous Research's marketing, but in our view it comes down to three main factors.

A Frictionless Migration Path From OpenClaw

Recognizing OpenClaw's large user base, Nous Research built a migration tool that lets users carry over their persona, API keys, the entire skill set, and memory to Hermes Agent with a single command, without losing old data and, of course, without having to reconfigure anything from scratch. If you're currently using OpenClaw and want to try Hermes Agent without losing your old data, look for the hermes claw migrate migration tool built into Hermes Agent before considering a fresh install.

Betting on a Closed Learning Loop Instead of a Feature Race

While many other agents compete on the number of tools they offer, Hermes Agent positions itself as a self-evolving entity, one that distills experience into new skills and retains long-term memory to understand users more deeply over time. This approach creates lasting value, and the community has already put it to use for projects such as automating large-scale content production with high consistency across many sessions.

A Role as a Training Data Engine

Beyond serving as a personal assistant, Hermes Agent also functions as a capable research tool.
It can generate thousands of parallel tool-calling trajectories and compress them into training data for other AI models. By turning the agent's real-world experience into training data, Hermes becomes a platform that developers building the next generation of autonomous AI can't easily do without. How Is Hermes Agent Different From an Agent Harness? People new to the space often confuse Hermes Agent with the concept of an agent harness, which is the framework that decides how a model calls tools, handles the reasoning loop, and coordinates execution steps internally. If a harness is the engine and chassis that determine how a car drives, then Hermes Agent is like a car that already has that engine installed, plus seats, a navigation system, and the driver's own trip memory. In other words, a harness is the technical architecture layer underneath, while Hermes Agent is a complete end-user product that already packages memory, a skill system, communication channels, and a choice of runtime infrastructure. A developer can build their own harness to control every small detail, but most users don't need to go that deep, they just need an agent that runs right away and gets smarter through use. For a closer look at this underlying architecture layer, read more at What Is Agent Harness? The Framework That Makes AI Work Efficiently, which explains in detail how this type of framework operates. Is Hermes Agent Worth Trying Right Now? Being fully open source, collecting no user data, and supporting complete self-hosting, Hermes Agent is one of the few agents today that lets users keep full control over their own data while still getting a continuous assistant experience with real memory, not the simulated memory that only exists within a single chat. After v0.16.0, the biggest technical barrier for users unfamiliar with terminals has largely been removed, as the native desktop app for Windows, macOS, and Linux has fully replaced the pure CLI approach used before. 
What's left to judge about Hermes Agent isn't whether it runs, but what it learns after a few real weeks of use. The fastest way to find out is to install the desktop app or run the CLI on a cheap VPS, connect it to a familiar messaging channel like Telegram, then watch what skills the agent forms on its own from how you use it every day. That's also the groundwork for comparing Hermes Agent with other options on the market, from Agent Harness to OpenClaw and Claude Cowork, in the next part of this series.
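For readers who want the harness idea discussed earlier in concrete form, here is a minimal sketch of the loop a harness runs: ask the model what to do, execute the tool it requests, feed the result back, and stop on a final answer. This is not Hermes Agent's code or API. Everything in it (the `TOOLS` registry, the `fake_model` stand-in, the `run_harness` function, and the "CALL"/"FINAL" text protocol) is a hypothetical name invented for illustration, and a real harness would call an actual language model instead of a hard-coded function.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry: tool names and behavior are invented for illustration.
TOOLS: Dict[str, Callable[[str], str]] = {
    "upper": lambda text: text.upper(),
    "reverse": lambda text: text[::-1],
}


def fake_model(history: List[str]) -> str:
    """Stand-in for a language model (assumption: a real harness calls an LLM here).

    Returns "CALL <tool> <arg>" to request a tool, or "FINAL <answer>" to stop.
    """
    last = history[-1]
    if last.startswith("USER "):
        return "CALL upper " + last[len("USER "):]
    if last.startswith("TOOL "):
        return "FINAL " + last[len("TOOL "):]
    return "FINAL (nothing to do)"


def run_harness(
    user_input: str,
    model: Callable[[List[str]], str] = fake_model,
    max_steps: int = 5,
) -> str:
    """The harness loop: consult the model, run requested tools, repeat."""
    history = ["USER " + user_input]
    for _ in range(max_steps):
        decision = model(history)
        if decision.startswith("FINAL "):
            return decision[len("FINAL "):]
        # Parse "CALL <tool> <arg>"; the argument may itself contain spaces.
        _, tool_name, arg = decision.split(" ", 2)
        tool = TOOLS.get(tool_name)
        result = tool(arg) if tool else f"unknown tool {tool_name}"
        history.append("TOOL " + result)
    # Guard against a model that never produces a final answer.
    return "stopped: step limit reached"


print(run_harness("hello"))
```

In terms of the car analogy, this loop is only the engine. A complete product in the sense the article describes would add the parts deliberately left out here, such as persistent memory, a skill system, messaging channels, and a choice of runtime backend.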

Nam•

19 Jun, 2026

Gemini powers Argentina and Messi at World Cup 2026

Gemini has won big in the most literal sense, right as Messi scored his first hat-trick at the 2026 World Cup, leading Argentina to a crushing 3-0 victory over Algeria and equaling Miroslav Klose's record of 16 World Cup goals. That historic moment became the perfect launchpad for Gemini.

Back in March 2026, Google and the Argentine Football Association (AFA) made a bold decision: rather than simply printing a logo on training kits, they signed a deal for the AI to actively support tactical preparation and professional decision-making. That bet has now proven to be the right call.

From training kit to the tactical meeting room

The agreement between AFA and Google was unveiled at Times Square, New York, a venue deliberately chosen to capture global media attention. The Gemini logo appears across all training apparel for Argentina's men's, women's and youth squads, sitting alongside Adidas and American Express in AFA's top sponsorship tier.

But the interesting part isn't the jersey. According to Inside World Football, Argentina's coaching staff will use Gemini for three specific purposes: tactical analysis, injury prevention and decision support. In other words, Gemini now has a seat in meetings that previously belonged only to Scaloni and his assistants.

Google has not publicly disclosed which specific Gemini tools have been integrated into AFA's workflow. What is clear is that they are using the World Cup to bring Gemini into the reality of professional football, and the results will be graded in public.

What is Gemini actually doing in the dressing room?

Argentina arrives at the 2026 World Cup as the reigning champion. Every decision Scaloni makes, from the squad list to the starting eleven, is scrutinized more closely than at any other team, and that is precisely why Argentina has become the best testing ground Google has ever had for Gemini in professional football, especially at a major tournament.

Tactical analysis

Gemini is used to process match data for both Argentina and their opponents, covering movement statistics, attacking patterns and defensive vulnerabilities. Instead of the coaching staff spending hours reviewing footage, AI synthesizes the data and generates tactical diagrams automatically, saving significant preparation time before each match.

Injury prevention

This is a problem every major team wants to solve, especially when Messi and several key players are at an age that requires careful management of training loads. Gemini analyzes biometric data and injury history to issue early warnings, helping the coaching staff adjust intensity before problems actually occur. That is part of the reason why, immediately after Messi completed his hat-trick, Scaloni chose to substitute him, prioritizing fitness and safety for the matches ahead.

AI in injury prevention is nothing new. Premier League clubs have had Microsoft as a partner for similar purposes. What is different this time is that Gemini is integrated directly into the workflow of a national team competing at a major tournament, not just at club level.

For fans: create Messi content, follow scores without unlocking your screen

Alongside supporting the coaching staff, Gemini has also rolled out a range of features aimed at fans, and this is the side that hundreds of millions of people will actually experience.

Gemini lets you create content about players directly

Users can generate images, songs and digital content featuring Argentina players like Messi directly inside the Gemini app. The feature is designed to bring the World Cup experience closer to those who cannot attend matches in person.

Real-time scores and automated daily briefings

On Google Search, live match scores can be pinned to the lock screen and update in real time, with dedicated animations for goals and red cards, all without needing to unlock the phone.

For paid Gemini users, the Scheduled Actions feature allows an automated daily football briefing to be set up, covering scores, news and fixtures, delivered at a chosen time without needing to prompt it each day.

Match-day infrastructure

Google has updated Street View at all 16 host stadiums and optimized routing on Waze for match days. Waze also surfaces live scores when the car is stopped at red lights, so drivers do not need to pick up their phones while on the move.

The 2026 World Cup is the real test for AI in sport

Google is not sponsoring Argentina alone. Gemini also appears on the kits of France, Morocco, Iraq, Turkey and the United States, while Pixel is the official phone of the French squad, which is also using Gemini for internal communications. This is clearly a comprehensive strategy from Google, not a one-off deal.

What makes the 2026 World Cup particularly significant is that it will answer a question no lab environment can: what do users actually do with AI when a World Cup runs for six weeks across 104 matches? Features that run on initial novelty will fade after the group stage. Whatever users keep coming back to all the way through the final is the honest answer to where AI actually fits in everyday life, and Google knows it.

Google's communications director for Latin America, Flor Sabatini, stated that the 2026 World Cup will mark a before and after in the history of football because of AI. It sounds like marketing, but the reality is that this is the first time a major AI model has been integrated into the preparation of the reigning world champions, right in the middle of the most-watched sporting event on the planet.

The 2026 World Cup is Gemini's real test

The most significant part of this entire story is not the Gemini logo on Messi's jersey. It is the fact that Argentina, still the favorite and the most scrutinized team, and carrying the pressure of defending the title, has committed part of its preparation process to AI.

If Argentina succeeds, Gemini will have a case study that no advertising budget can buy. If Argentina falls short and the coaching staff attributes any part of it to AI, the narrative will flip entirely. Either way, this is the first time AI has been held accountable on a stage that genuinely matters: not a benchmark, not a demo, but the World Cup.

For AI users, what is worth watching is not just whether Argentina wins, but whether Gemini actually changes how a football team operates, or whether it turns out to be nothing more than a logo on a training kit that looks better than in previous years.

Nam•

17 Jun, 2026


