OpenAI previews o3 and sparks claims that "human-level intelligence" has been born

Does the new model mark the dawn of artificial general intelligence and the imminent doom of our species - or is it all just hype?

Machine

21 Dec 2024 — 4 min read

Sam Altman and Mark Chen of OpenAI introduce the new model, o3

OpenAI has unveiled a new model called o3 that has absolutely destroyed a benchmark designed to compare the intelligence of humans and machines.

The new model beat its predecessors on the ARC-AGI test so convincingly that it prompted members of the online AI community to declare that artificial general intelligence (AGI) has officially been born. And at Christmas too, which is a great birthday for any aspiring deity.

The model also achieved incredible performance on difficult maths and coding benchmarks - although we'd advise some caution around the AGI claims for now.

ARC-AGI is managed by the ARC Prize Foundation and measures progress towards genuine intelligence, which its creators argue "lies in broad or general-purpose abilities; it is marked by skill-acquisition and generalisation, rather than skill itself".

"AGI is a system that can efficiently acquire new skills outside of its training data," wrote François Chollet, the now legendary ex-Google AI engineer who first established the benchmark.

It took four years for OpenAI's models to go from a 0% score with GPT-3 in 2020 to 5% with GPT-4o in 2024.

An illustration of o3's incredible performance on the ARC-AGI benchmark

OpenAI's new o3 system, which was trained on the ARC-AGI-1 Public Training set, scored a breakthrough 75.7% - triple the score of o1 - while a high-compute configuration using 172 times as many resources achieved 87.5%.

"All intuition about AI capabilities will need to get updated for o3," Chollet added.

However, there is bad news for anyone hoping to get on their knees and start praying to a newborn digital deity. In its low-compute mode, o3 cost between $17 and $20 to solve each task, compared with $5 for a human - although the human's cost advantage is unlikely to last long.
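To put the per-task figures quoted above in perspective, here is a minimal sketch of the arithmetic. It uses only the $17 to $20 range for o3's low-compute mode and the $5 human figure; the constant and function names are illustrative and do not come from OpenAI or the ARC Prize.

```python
# Per-task cost figures quoted in the article (US dollars).
O3_LOW_COMPUTE_COST = (17.0, 20.0)  # reported range for o3 in low-compute mode
HUMAN_COST = 5.0                    # reported cost for a human solver


def cost_ratio(model_cost: float, human_cost: float) -> float:
    """Return how many times more expensive the model is per task."""
    return model_cost / human_cost


# Ratio at the bottom and top of the reported range.
low, high = (cost_ratio(cost, HUMAN_COST) for cost in O3_LOW_COMPUTE_COST)
print(f"o3 (low compute) costs {low:.1f}x to {high:.1f}x as much as a human per task")
```

On these numbers, o3 in low-compute mode is roughly three and a half to four times as expensive per task as a person.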

"Cost-performance will likely improve quite dramatically over the next few months and years, so you should plan for these capabilities to become competitive with human work within a fairly short timeline," the ARC Prize said.

Chollet also stated that ARC-AGI is "not an acid test for AGI".

"Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet," Chollet continued. "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."

The AI don said the upcoming ARC-AGI-2 benchmark will "still pose a significant challenge to o3", potentially reducing its score to less than 30% even at high compute, while a smart human could get 95% with no training.

O3 is a maths and coding powerhouse

The new model's performance on two coding benchmarks

The model also blasted through a range of other benchmarks. On Epoch AI's FrontierMath - a compendium of novel, unpublished and mega-hard problems - it solved 25.2% of challenges. Until now, no other model had made it beyond 2%.

On SWE-bench Verified, o3 beat o1 by 22.8 percentage points. On Codeforces, it achieved a rating of 2727 - beating the 2665 scored by OpenAI's chief scientist.

During the livestream announcing o3, Altman asked Mark Chen, OpenAI SVP of research, what his own Codeforces score was.

"My best at a comparable site was about 2500," Chen said.

"That's tough," Altman replied.

One person at OpenAI has a score above 3,000, but only "has a few months to enjoy" it, the croaky-voiced CEO quipped.

Additionally, o3 scored 96.7% on AIME 2024, dropping just one question, and 87.7% on GPQA Diamond - much better than a human.

Has OpenAI achieved AGI?

OpenAI o3 is 2727 on Codeforces which is equivalent to the #175 best human competitive coder on the planet.

This is an absolutely superhuman result for AI and technology at large. pic.twitter.com/l43DTJDTqR
— Deedy (@deedydas) December 20, 2024

Despite repeated insistence that AGI has not been achieved, the launch of o3 sparked wild excitement on social media.

"Artificial Human-level intelligence (AHI) just dropped today," wrote Beff Jezos, the pseudonym of an AI entrepreneur.

Yana Welinder, CEO of AI copilot firm Kraftful, also posted: "o3 has scored over 85% on ARC-AGI 🤯. Human performance is at 85%. In other words: AGI has been achieved in 2024. 2025 is going to be wild!"

Matthew Berman, owner of AI firm Forward Future, proclaimed: "This is AGI (not clickbait). o3 is the best AI ever created, and its performance is WILD."

Meanwhile, the anonymous account AISafetyMemes wrote: "This is the date most likely so far to go down in history as the day AGI was achieved."

pic.twitter.com/uI1MDz41d7 Sam Altman already referred to o3 four days ago and gave the first hints. He is in favor of a safety framework, which indicates that we will probably only see a preview today. But since he hints at the danger, I assume that AGI really is almost here.
— Chubby♨️ (@kimmonismus) December 20, 2024

OpenAI has not yet released o3 publicly (it skipped the name o2 to avoid any clash with O2, the phone firm owned by the Spanish telecommunications company Telefónica).

The company appears to be taking extra care with this release - a move some observers have hailed as evidence that it has achieved AGI internally, although this claim cannot be confirmed.

OpenAI has invited safety researchers to apply for early access to its next frontier models. This initiative complements its existing safety testing processes, which include internal evaluations, external red teaming efforts, and collaborations with organisations such as the UK and US AI Safety Institutes.

"As models become more capable, we are hopeful that insights from the broader safety community can bring fresh perspectives, deepen our understanding of emerging risks, develop new evaluations, and highlight areas to advance safety research," it wrote.
