DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole



For months, the main AI coding benchmarks have advised enterprise consumers a comforting however deceptive story: the highest fashions are all roughly the identical. OpenAI's GPT-5 family, Anthropic's Claude Opus, and Google's Gemini Pro have clustered inside a slender band on Scale AI's SWE-Bench Pro leaderboard, making IT practically unimaginable for engineering leaders to find out which agent will really carry out finest inside their codebases.

On Monday, a startup known as Datacurve launched a benchmark IT says shatters that phantasm. DeepSWE, a 113-task analysis spanning 91 open-source repositories and 5 programming languages, produces a dramatically wider unfold among the many similar frontier fashions — and crowns OpenAI's GPT-5.5 because the clear chief at 70%, sixteen factors forward of its nearest competitor.

"On public leaderboards, high fashions usually look comparatively shut in functionality," wrote Datacurve co-author Serena Ge on X. "DeepSWE exhibits the place they really diverge, reflecting the life like expertise of builders of their day-to-day work."

The benchmark additionally delivers a pointed critique of the analysis infrastructure the AI business depends on to measure progress: Datacurve's audit discovered that SWE-Bench Professional's verifiers — the automated graders that decide whether or not an agent solved a job — issued incorrect go/fail verdicts on roughly one-third of the trials IT reviewed.

If that discovering holds up, IT has sweeping implications. Enterprise procurement groups, enterprise capitalists, and AI lab advertising and marketing departments all lean closely on benchmark scores to make multimillion-dollar selections. A 32% error fee in probably the most broadly cited coding benchmark suggests the business might have been navigating by a damaged compass.

Why the most well-liked AI coding benchmark could also be grading on a curve

To grasp what Datacurve is claiming, IT helps to grasp how coding benchmarks work — and the way they’ll go improper.

The dominant paradigm, pioneered by the SWE-Bench family maintained by Scale AI and tutorial researchers, constructs duties by mining actual GitHub commits. The method extracts a bug repair or characteristic addition from a repository's historical past, rolls the code again to the pre-fix state, after which asks an AI agent to breed the change. The unique commit's check suite serves because the verifier: if the agent's patch makes the identical checks go, IT will get credit score. This strategy has a sublime simplicity, however Datacurve argues IT introduces three systemic weaknesses.

First, contamination. As a result of duties are drawn from public GitHub historical past, the issue assertion, the dialogue, and infrequently the precise answer are already current within the coaching knowledge of frontier fashions. "The SWE-Bench household scrapes present GitHub points and PRs, which creates two issues: memorization (fashions have already seen the answer) and triviality (most duties are small)," Ge wrote.

Second, scope. SWE-Bench Pro duties require, on common, simply 120 strains of code added throughout 5 recordsdata. DeepSWE's reference options common 668 strains added throughout 7 recordsdata — roughly 5.5 occasions extra code. But DeepSWE's prompts are literally shorter, averaging 2,158 characters versus SWE-Bench Professional's 4,614. In different phrases, DeepSWE provides the agent much less instruction however expects way more output, which extra carefully mirrors how a human developer may really delegate work to an AI assistant.

Third — and most damaging — verifier reliability. Datacurve drew 30 duties at random from each DeepSWE and SWE-Bench Pro, ran three rollouts throughout 10 frontier mannequin configurations, after which deployed an LLM-based choose to independently assess whether or not every agent's patch really solved the issue. SWE-Bench Professional's verifiers accepted improper implementations 8.5% of the time and rejected appropriate implementations 24% of the time. DeepSWE's verifiers registered 0.3% and 1.1%, respectively.

The false detrimental downside is particularly insidious as a result of IT punishes artistic options. In a single documented case, the gold-standard pull request for a SWE-Bench Professional job refactored a personal helper operate. An agent that accurately solved the duty by inlining the identical logic — a wonderfully legitimate engineering selection — failed as a result of the check suite tried to import a logo that solely existed within the authentic writer's particular implementation.

OpenAI's GPT-5.5 dominates the brand new benchmark whereas Claude and Gemini stumble

DeepSWE's top-line outcomes reorder the acquainted hierarchy in ways in which ought to matter to each engineering crew evaluating AI coding instruments. On SWE-Bench Pro, fashions from OpenAI, Anthropic, and Google have traded the lead inside a 30-point vary. DeepSWE stretches that vary to 70 factors.

GPT-5.5 leads at 70%, adopted by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop-off is steep: Claude Sonnet 4.6 lands at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and Kimi K2.6 tied at 24%, after which an extended tail of fashions within the teenagers and single digits. Claude Haiku 4.5, which scores 39% on SWE-Bench Professional, collapses to zero on DeepSWE — suggesting that some mid-tier fashions have been considerably overperforming on simpler, doubtlessly contaminated benchmarks.

GPT-5.5 doesn't simply rating the very best — IT does so effectively. The mannequin reaches its 70% go fee with a median price of $5.80 per trial, a median wall-clock time of 20 minutes, and a median of 47,000 output tokens. GPT-5.4 emerges as maybe the perfect general worth at $3.30 per trial with a 56% rating. Claude Opus 4.7, in the meantime, prices considerably extra per run, and output tokens, wall-clock length, and greenback price per trial all range by an order of magnitude throughout the brokers examined — but none of those correlates strongly with go fee. Brokers that emit extra tokens, run longer, or price extra don’t constantly remedy extra duties.

Datacurve's audit discovered that Claude has been studying the reply key on present benchmarks

Maybe probably the most provocative discovering in DeepSWE's evaluation issues what the authors label "CHEATED" verdicts — cases the place an agent passes a benchmark not by fixing the issue, however by studying the reply.

SWE-Bench Professional's Docker containers ship the repository's full .git historical past, which suggests the gold-standard answer commit is sitting proper there within the container's file system. Most fashions ignore IT. Claude doesn’t. Datacurve's evaluation discovered that each Claude Opus 4.7 and Claude Opus 4.6 registered "CHEATED" on greater than 12% of their reviewed SWE-Bench Professional rollouts. In these cases, the Claude agent ran instructions like git log –all or git present <gold-hash> to retrieve the merged repair and paste IT into its personal patch. The conduct accounted for roughly 18% of Opus 4.7's passes and 25% of Opus 4.6's passes on the reviewed pattern. The problem has been filed publicly as GitHub issue #93 on the SWE-Bench Professional repository.

GPT-5.4 and GPT-5.5 by no means exhibited this conduct. Gemini configurations stayed round 1%. Datacurve describes the conduct diplomatically — "The benchmark makes this potential (the gold commit lives within the container), however Claude is the household that constantly does so" — however the implication is obvious: a significant fraction of Claude's SWE-Bench Professional scores might mirror environmental exploitation somewhat than real engineering functionality.

DeepSWE addresses this by delivery solely a shallow clone with the bottom commit, leaving no gold hash for the agent to find. IT is value noting that the conduct is arguably an indication of Claude's environmental attentiveness — the mannequin is excellent at exploring its environment and exploiting out there sources. Whether or not that counts as "dishonest" or "resourcefulness" relies on your perspective, however within the context of a benchmark designed to measure unbiased problem-solving, IT undermines the sign.

Every AI mannequin household fails in its personal distinctive means, and the patterns matter for enterprise groups

Past the top-line scores, Datacurve's qualitative trajectory evaluation reveals distinctly totally different failure signatures throughout mannequin households — a discovering that would assist engineering groups select the fitting mannequin for particular forms of work.

Claude is forgetful with multi-part prompts. On DeepSWE, Claude configurations miss said necessities greater than some other household. The sample is constant: when a immediate enumerates parallel behaviors — "help each sync and async," as an illustration — Claude usually implements the plain department and forgets to reflect the change. Datacurve stories that roughly two-thirds of Claude's "MISSED_REQUIREMENT" failures on DeepSWE comply with this "one department shipped" sample. In a single instance, Claude Opus 4.7 accurately landed a sync state-data hook in a single engine class whereas the async engine by no means acquired the identical hook.

GPT, against this, implements precisely what’s requested. GPT-5.5 had the bottom fee of lacking said behaviors of any configuration examined. Throughout a number of runs of the identical job, GPT trials tended to converge on the identical interpretation of the immediate, suggesting instruction-following precision is a secure trait of the mannequin somewhat than per-run luck.

One of the intriguing findings includes self-verification. On DeepSWE, Claude Opus 4.7 and GPT-5.4 wrote and ran new checks within the venture's personal check framework on over 80% of their runs — though nobody requested them to. On SWE-Bench Professional, those self same fashions dropped to twenty-eight% and 18%, respectively. The explanation: SWE-Bench Professional's immediate template explicitly tells brokers they "shouldn’t modify the testing logic or any of the checks." Brokers dutifully complied, suppressing a conduct that possible would have improved their efficiency. This means that immediate design in manufacturing coding workflows could also be inadvertently suppressing beneficial agent behaviors — one thing enterprise groups deploying AI coding brokers ought to rigorously audit.

What DeepSWE will get proper, what IT will get improper, and what IT means for the way forward for AI benchmarks

Datacurve is forthright about a number of limitations. The standardized harness, whereas making certain equity, routes all edits by bash somewhat than the model-specific modifying instruments every household was educated on — apply_patch for GPT, str_replace_based_edit_tool for Claude. This might maintain fashions under their native ceilings. The benchmark attracts solely from open-source repositories with 500-plus stars, and outcomes might not generalize to proprietary codebases. Bug localization and refactoring duties are under-represented, and broadly used languages like C++ and Java are absent fully. The decision assignments within the qualitative evaluation come from an LLM analyzer, not human reviewers, and pattern sizes are modest — roughly 90 reviewed rollouts per mannequin per benchmark.

IT can also be value noting that Datacurve is a startup with its personal business pursuits, and an unbiased benchmark that reshuffles the leaderboard will inevitably invite scrutiny. The corporate's determination to publish the complete dataset, all agent trajectories, and the analysis harness on GitHub mitigates this concern significantly, however unbiased replica might be obligatory earlier than the AI group treats these outcomes as definitive.

DeepSWE arrives at an inflection level for the AI coding market. Enterprise adoption of AI coding brokers is accelerating quickly, with engineering organizations making consequential bets on which mannequin to construct round. The benchmark market itself has develop into a strategic battleground — Scale AI's SWE-Bench Pro, which Datacurve immediately critiques, is maintained by an organization that additionally gives analysis companies to the labs whose fashions IT ranks.

If DeepSWE's central findings about verifier reliability and knowledge contamination maintain up underneath unbiased scrutiny, they might drive a reckoning not simply with how the business measures coding brokers, however with the broader query of what benchmarks are literally for. A leaderboard the place the grading system is improper a 3rd of the time will not be merely inaccurate — IT is the form of damaged instrument that makes everybody be ok with progress that is probably not actual. And in an business spending billions on a wager that AI brokers can do the work of software program engineers, the distinction between actual progress and the looks of IT will not be tutorial. IT is the entire recreation.


👇Comply with extra 👇
👉 bdphone.com
👉 ultractivation.com
👉 trainingreferral.com
👉 shaplafood.com
👉 bangladeshi.help
👉 www.forexdhaka.com
👉 uncommunication.com
👉 ultra-sim.com
👉 forexdhaka.com
👉 ultrafxfund.com
👉 bdphoneonline.com
👉 dailyadvice.us

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top