Claude Opus 4.6 Goes Exponential on METR Benchmark, Completing 14-Hour Human Tasks
Anthropic's latest model achieves a 50% time horizon of 14.5 hours on METR's task-completion benchmark, continuing an exponential trend that has AI capabilities doubling every four months.

The Chart That Has r/singularity Buzzing
A single benchmark result has become one of the most discussed data points in the AI community this year. Claude Opus 4.6, Anthropic's latest frontier model, has achieved a 50% time horizon of 14.5 hours on METR's task-completion benchmark, meaning it succeeds about half the time on tasks that would take a skilled human professional nearly 15 hours to finish.
The result has dominated discussion on r/singularity, r/MachineLearning, and r/LocalLLaMA, not just because the number is impressive, but because of what the trend line reveals.
Understanding the METR Benchmark
METR (Model Evaluation and Threat Research) runs a benchmark where AI agents are given complex, self-contained tasks drawn from software engineering, machine learning, and cybersecurity. The difficulty of each task is measured by how long it would take a skilled human professional, working without prior context, to complete the same job.
The "50% time horizon" is the task length at which a given AI model succeeds roughly half the time. It is one of the most concrete measures available for tracking how capable AI systems are becoming at sustained, real-world work.
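METR's published methodology estimates this threshold by fitting a logistic curve of success probability against the logarithm of task length; the 50% time horizon is the length at which the fitted curve crosses one half. The sketch below shows roughly how such an estimate can be computed. The task durations, outcomes, and use of scikit-learn are illustrative assumptions, not METR's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task results for one model: task length in minutes
# and whether the agent's attempt succeeded. Real METR suites span
# many tasks ranging from a few minutes to many hours.
minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960, 1920])
succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit P(success) as a logistic function of log2(task length).
X = np.log2(minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

# P(success) = 0.5 exactly on the decision boundary, i.e. where
# coef * log2(t) + intercept = 0. Solve for t.
h50 = 2 ** (-model.intercept_[0] / model.coef_[0, 0])
print(f"estimated 50% time horizon: {h50:.0f} minutes ({h50 / 60:.1f} hours)")
```

On this toy data the crossover lands in the few-hours range; the point is the shape of the computation, not the number.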
The Exponential Curve
When METR plots time horizons against model release dates, the trend is unmistakably exponential. The fitted trend line shows a doubling time of approximately 123 days — meaning AI task-completion capability has been roughly doubling every four months.
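To see how a doubling time falls out of such a plot, consider a minimal sketch: regress log2 of the time horizon against the release date, and the reciprocal of the slope is the doubling time in days. The data points here are synthetic, constructed to sit exactly on a 123-day doubling curve, so the mechanics are visible without pretending to reproduce METR's actual fit.

```python
import numpy as np

# Synthetic (days since reference, horizon in minutes) points placed
# exactly on a 123-day doubling curve; illustrative, not METR's data.
days = np.array([0, 123, 246, 369, 492, 615])
horizon_minutes = 5.0 * 2 ** (days / 123)

# On a log2 axis, exponential growth is a straight line; its slope is
# doublings per day, so the doubling time is the reciprocal.
slope, _ = np.polyfit(days, np.log2(horizon_minutes), 1)
print(f"doubling time: {1 / slope:.0f} days")  # -> 123
```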
To appreciate the trajectory:
- Mid-2024: Frontier models like GPT-4o had time horizons measured in single-digit minutes
- Early 2025: Models cleared tasks in the 15 to 30 minute range
- Late 2025: Claude Opus 4.5 reached approximately 4 hours
- Early 2026: Claude Opus 4.6 hit 14.5 hours
That is a jump from minutes to nearly two full workdays in under two years.
Why This Result Matters
The METR benchmark is significant because it measures something closer to real-world usefulness than traditional benchmarks. Scoring well on multiple-choice tests or coding puzzles is one thing. Completing a task that would take a human expert the better part of two working days is qualitatively different.
At 14.5 hours, Claude Opus 4.6 is operating at a level where it can handle substantial software engineering projects, complex data analysis workflows, and multi-step research tasks with meaningful autonomy. This is the territory where AI transitions from "assistant" to "agent" in practical terms.
Reddit's Take
The discussion on r/singularity has been characteristically intense. The original thread sharing the METR results gathered thousands of upvotes, with commenters frequently citing Sam Altman's remark that "the world is not prepared for AI takeoff" alongside the data.
Commenters on r/MachineLearning have been more measured, noting that METR's benchmark methodology has specific characteristics — isolated tasks, clear success criteria, no ambiguity — that may overstate real-world capability. Several researchers cautioned against extrapolating the exponential trend indefinitely.
"Exponential growth in a bounded domain always looks exponential until it hits the ceiling," wrote one commenter. "The question is where the ceiling is, and nobody knows."
The Misunderstood Graph
MIT Technology Review published an analysis calling the METR chart "the most misunderstood graph in AI." The piece argued that while the exponential trend is real, it measures a specific type of capability — performance on well-defined, time-bounded tasks — and should not be interpreted as a general measure of intelligence or a countdown to artificial general intelligence.
The distinction matters. An AI that can complete a 14-hour coding task is not the same as an AI that can navigate the ambiguity, politics, and judgment calls of a real work environment. But it is significantly more capable than anything available 18 months ago, and the rate of improvement shows no signs of slowing.
Competitive Implications
Claude Opus 4.6's METR performance puts Anthropic at the top of this particular leaderboard, ahead of OpenAI's GPT-5.1 and Google's Gemini Ultra. The result is especially notable given Anthropic's emphasis on safety and alignment — the company has demonstrated that responsible development and frontier performance are not mutually exclusive.
For enterprise customers evaluating AI agents for real workloads, the METR time horizon has become a key metric. A model that can sustain coherent work over 14 hours opens doors to use cases — long-running code migrations, comprehensive security audits, multi-day research synthesis — that were previously impractical.
What the Trend Means
If the 123-day doubling time holds, AI systems could be completing tasks equivalent to multiple human workdays by late 2026. Whether that trajectory continues, plateaus, or accelerates is one of the most consequential open questions in technology.
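For a rough sense of scale, the compounding can be sketched directly. Assuming the 14.5-hour horizon and a constant 123-day doubling time (the article's premises, not a forecast), extrapolation is just h(t) = 14.5 * 2^(t / 123):

```python
# Back-of-the-envelope extrapolation under the stated premises: a
# 14.5-hour horizon today and a constant 123-day doubling time.
# This illustrates the compounding; it is not a forecast.
h0_hours = 14.5
doubling_days = 123

for days_ahead in (123, 246, 369):
    horizon = h0_hours * 2 ** (days_ahead / doubling_days)
    print(f"+{days_ahead} days: {horizon:5.1f} h (~{horizon / 8:.1f} eight-hour workdays)")
```

Even a single further doubling lands at multiple workdays, which is where the late-2026 figure above comes from.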
What is not in question is that the capability frontier is moving faster than most predictions anticipated. As one r/singularity commenter put it: "We are not ready for the next data point on this chart."


