
For decades, the IQ test has been one of the most familiar, and most contested, yardsticks for human intelligence. Now, a startup called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world's most powerful language models and plotting them on a standard bell curve.
The result is a set of interactive visualizations at aiiq.org that have ricocheted across social media over the past week, drawing praise from enterprise technologists who say the charts make an impossibly complex market legible, and sharp criticism from researchers and commentators who warn the entire framework is misleading.
"That is tremendous helpful," wrote Thibaut Mélen, a Technology commentator, on X. "A lot simpler to grasp mannequin progress when IT's mapped like this as a substitute of one other large leaderboard desk."
Brian Vellmure, a enterprise strategist, supplied an identical endorsement: "That is useful. Anecdotally tracks with private expertise."
However the backlash arrived simply as rapidly. "IT's nonsense. AI is way too jagged. The map isn’t the territory," posted AI Deeply, a man-made intelligence commentary account, crystallizing a fear shared by many researchers: that lowering a language mannequin's sprawling, uneven capabilities to a single quantity creates a harmful phantasm of precision.
Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works
AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested in the early stages of several unicorns, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University.
The site's methodology rests on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. The composite IQ is a straight average of those four dimension scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).
The abstract reasoning dimension draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1–3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity's Last Exam, CritPt, and GPQA Diamond.
Each raw benchmark score is mapped to an implied IQ through what the site describes as "hand-calibrated difficulty curves." Crucially, the methodology compresses the ceilings of benchmarks considered easier or more prone to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: models need scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores down rather than up. The site states that "every derived IQ averages all four dimensions, so missing coverage can't make a model look better by omission."
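The averaging and missing-data rules described above can be sketched in a few lines. This is an illustrative reading of the published description, not the site's actual pipeline; in particular, the floor value substituted for missing dimensions is an assumption, since AI IQ does not publish its exact penalty.

```python
# Sketch of the composite-IQ rule: a straight average of four dimension
# scores, requiring at least two, with missing dimensions filled by a
# conservative floor so omission pulls the score down rather than up.
# The floor of 55.0 is an illustrative assumption, not the site's value.

def composite_iq(dims, floor=55.0):
    """Average the four dimension scores (abstract, math, prog, acad).

    Returns None when fewer than two dimensions have scores; otherwise
    substitutes the low floor for missing dimensions before averaging.
    """
    present = [v for v in dims.values() if v is not None]
    if len(present) < 2:
        return None  # not enough coverage for a derived IQ
    filled = [v if v is not None else floor for v in dims.values()]
    return sum(filled) / len(filled)

# A hypothetical model with no abstract-reasoning coverage:
iq = composite_iq({"abstract": None, "math": 130.0, "prog": 125.0, "acad": 120.0})
# -> 107.5, below the 125.0 mean of the three present scores
```

Under this reading, a model strong on three dimensions but untested on the fourth lands well below the average of its measured scores, which matches the site's claim that missing coverage cannot flatter a model.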
OpenAI leads the bell curve, but the gap between the top AI models has never been smaller
As of mid-May 2026, the AI IQ charts tell a story of rapid convergence at the top of the frontier, and of widening diversity in the tiers below.
According to the Frontier IQ Over Time chart, GPT-5.5 from OpenAI currently sits at the peak of the bell curve, with an estimated IQ near 136, the highest of any model tracked. It is closely followed by GPT-5.4 (roughly 131), Opus 4.7 from Anthropic (roughly 132), and Opus 4.6 (roughly 129). Google's Gemini 3.1 Pro lands near 131, making the top cluster extraordinarily tight.
That compression isn't unique to AI IQ's framework. Visual Capitalist, drawing on a separate Mensa-based ranking by TrackingAI, recently observed the same dynamic, noting that "the biggest takeaway is how compressed the top of the leaderboard has become." On that scale, Grok-4.20 Expert Mode and GPT-5.4 Pro tied at 145, with Gemini 3.1 Pro at 141.
Below the frontier cluster, the AI IQ charts show a crowded midfield. Models from Chinese labs (Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7) bunch between roughly 112 and 118, making the cost-performance tier increasingly competitive for enterprise buyers who don't need the best model for every task. One X user, ovsky, noted that the data "confirms experience with sonnet 4.6 being an absolute workhorse versus opus 4.5," pointing to the way the charts can validate practitioner intuitions that headline rankings often miss.
Why emotional intelligence scores are becoming the new battleground in AI model rankings
What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an "EQ" (emotional intelligence) score. The site maps each model's EQ-Bench 3 Elo rating and Arena Elo rating to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 weighted composite of the two.
The EQ scores produce a meaningfully different ranking than IQ alone. On the IQ vs. EQ scatter plot, Anthropic's Opus 4.7 leads on EQ with a score near 132, pushing it into the upper-right quadrant, the most desirable position, signaling both high cognitive and high emotional intelligence. OpenAI's GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google's Gemini 3.1 Pro sits in a strong middle position on both axes.
One notable methodological choice has drawn attention: EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges "creates potential scoring bias in favor of Anthropic models." To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected, since it uses human judges. That kind of self-correction is rare in the benchmarking world, and it suggests Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something IQ alone can't: the growing importance of conversational quality, collaboration, and trust in models deployed for user-facing work.
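The EQ pipeline described here, two Elo components mapped through piecewise-linear calibrations and blended 50/50 after a judge-bias penalty, can be sketched as follows. The calibration anchor points are invented placeholders; only the 50/50 blend and the 200-point Anthropic penalty come from the site's description.

```python
# Hedged sketch of the EQ composite. Anchor points below are placeholders,
# not the site's real calibration; the 200-point penalty and 50/50 blend
# are the only details taken from the published methodology.

def piecewise_linear(x, anchors):
    """Interpolate x along (elo, implied_eq) anchors, clamping at both ends."""
    if x <= anchors[0][0]:
        return anchors[0][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return anchors[-1][1]

EQBENCH_ANCHORS = [(800, 70.0), (1200, 100.0), (1600, 135.0)]  # assumed
ARENA_ANCHORS = [(1000, 70.0), (1300, 100.0), (1500, 135.0)]   # assumed

def composite_eq(eqbench_elo, arena_elo, is_anthropic=False):
    if is_anthropic:
        eqbench_elo -= 200  # offset the Claude-as-judge bias before mapping
    return 0.5 * piecewise_linear(eqbench_elo, EQBENCH_ANCHORS) \
         + 0.5 * piecewise_linear(arena_elo, ARENA_ANCHORS)
```

Note that the penalty is applied to the raw Elo before the calibration mapping, as the site describes, so its effect on implied EQ depends on where the model sits on the curve.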
The AI cost-performance chart that enterprise buyers actually need to see
Perhaps the most practically useful chart on the site isn't the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model's estimated IQ against an "effective cost" metric, defined as the token price for a task using 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor.
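As defined, the metric reduces to simple arithmetic on per-million-token prices. The example prices below are hypothetical, not any vendor's actual rates.

```python
# "Effective cost" as the text defines it: price of a reference task
# (2M input tokens + 1M output tokens) scaled by a usage efficiency factor.

def effective_cost(input_price_per_m, output_price_per_m, efficiency=1.0):
    """Dollar cost of the reference task, times the efficiency factor."""
    return (2 * input_price_per_m + 1 * output_price_per_m) * efficiency

# A hypothetical model priced at $3/M input and $15/M output tokens:
cost = effective_cost(3.0, 15.0)  # 2 * 3 + 15 = $21 per reference task
```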
The chart reveals a familiar pattern in enterprise technology: the best models aren't always the best value. GPT-5.5 and Opus 4.7 sit in the upper-left corner (high IQ, high cost), with effective per-task costs north of $30 and $50 respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in the middle: decent IQ scores between 112 and 120, at effective costs ranging from roughly $1 to $5 per task. At the cheapest extreme, GPT-oss-20b (an open-source OpenAI model) appears near $0.20 effective cost with an IQ around 107, likely the most economical option for bulk classification or extraction workloads.
The site also offers a 3D visualization mapping IQ, EQ, and effective cost simultaneously. A dashed line running through the cube points toward the ideal: higher IQ, higher EQ, and lower cost. Models near the "green end" of that axis are stronger all-around deals; those near the "red end" sacrifice capability, cost efficiency, or both. For CIOs watching API invoices, the implication is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing (using expensive models for hard problems and cheap ones for everything else) is no longer optional. It is the dominant architecture for serious AI deployments.
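The routing pattern the paragraph describes can be reduced to a toy dispatcher. Everything here, model names, estimated costs, and the difficulty threshold, is a placeholder chosen for illustration, not a recommendation.

```python
# Toy cost-aware router in the spirit of the passage: a frontier model for
# hard tasks, a cheap workhorse for everything else. All names, costs, and
# thresholds are hypothetical placeholders.

ROUTES = {
    "frontier": {"model": "top-tier-model", "est_cost": 30.0},
    "workhorse": {"model": "mid-tier-model", "est_cost": 2.0},
}

def route(difficulty, threshold=0.8):
    """Pick a tier for a task scored 0..1 by an upstream difficulty classifier."""
    tier = "frontier" if difficulty >= threshold else "workhorse"
    return ROUTES[tier]["model"]
```

In practice the hard part is the upstream difficulty score, but the economics the charts show (a roughly 15x cost gap for a few IQ points) is what makes even a crude router pay for itself.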
Critics say AI's "jagged" capabilities make a single IQ score dangerously misleading
The loudest objection to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model's uneven capabilities into a single score obscures more than it reveals.
"IQ as a proxy is fading — we're seeing reasoning density spikes that don't map to g-factor," posted Zaya, a Technology commentator, on X. "GPT-5.5 already hit saturation on MMLU-Professional, however nonetheless fails ClockBench 50% of the time."
That statement touches on what AI researchers name the "Technology/how-jagged-intelligence-can-reframe-the-ai-debate.html”>jaggedness" downside: giant language fashions typically exhibit wildly uneven capabilities, excelling at graduate-level physics whereas failing at duties a toddler might do. A composite rating can paper over these gaps.
Pressureangle, another X user, posted a more granular critique, calling out a "total lack of transparency" and arguing the site never fully discloses how its calibration curves were created or validated. In fairness, AI IQ does list its 12 benchmarks and shows the shape of each calibration curve in its methodology modal. But the raw data and exact mathematical transformations aren't published as open datasets, a gap that matters to researchers accustomed to fully reproducible methods.
Others questioned the premise itself. "As useless as human IQ testing," wrote haashim on X. Shubham Sharma, an AI and technology writer, offered a constructive alternative: "Why not have the models take an official (MENSA-grade) test? Wouldn't this be the most accurate and most 'human-comparable' way to benchmark intelligence?" That approach already exists via TrackingAI, which administers the Mensa Norway IQ test to language models. But Mensa-style tests measure only abstract pattern recognition, while AI IQ attempts a broader composite across coding, mathematics, and academic reasoning. As Visual Capitalist noted, "an IQ-style benchmark captures just one slice of capability." Each approach has tradeoffs, and neither has won the argument yet.
The real race isn't for the highest score but for the smartest model stack
For all the debate about methodology, the most important signal in AI IQ's data may not be any single model's score. It is the shape of the market the charts reveal.
There are now more than 50 frontier-class models available through APIs, from at least 14 major providers spanning the United States, China, and Europe. Each provider publishes its own benchmarks, often cherry-picked to showcase strengths. The result is a Tower of Babel in which no two companies measure the same thing in the same way. Academic analysis has highlighted that "most benchmarks introduce bias by focusing on a particular type of domain," and the Frontier IQ Over Time chart on AI IQ shows just how fast the targets are moving: in October 2023, GPT-4-turbo sat near an estimated IQ of 75. By early 2026, the top models were brushing 135, roughly 60 points of improvement in 30 months.
That pace raises a fundamental question about whether any scoring system can keep up. The site compresses ceilings for saturated benchmarks, but as models continue to max out even the hardest tests (ARC-AGI-2, FrontierMath Tier 4, Humanity's Last Exam), the framework will face the same ceiling effects that have plagued every AI evaluation before it. Connor Forsyth pointed to this dynamic on X: "ARC AGI 3 disagrees," he wrote, referencing a next-generation benchmark that may already be undermining current scores.
AI IQ isn't perfect. Its methodology is partially opaque. Its IQ metaphor can mislead. And its creator acknowledges known biases while potentially missing others. But the alternative, wading through dozens of provider-specific benchmark tables, each using different test suites and scoring conventions, is worse. The site gives enterprise buyers something genuinely scarce: a single framework for comparing models across providers, dimensions, and price points, updated regularly, with enough nuance to show that the right answer to "which model is best?" is almost always "it depends on the task."
As Debdoot Ghosh mused on X after viewing the charts: "Now a human's role is just to orchestrate?"
Maybe. But if the AI IQ data shows anything clearly, it is that orchestration (knowing which model to deploy, when, and at what cost) has become its own kind of intelligence. And for that, there is no benchmark yet.