OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake

Humanity's Last Exam
(Image credit: Scale AI, CAIS)

  • The accuracy achieved by the top-scoring AI on the world's hardest benchmark has improved by 183% in just two weeks
  • ChatGPT o3-mini now scores up to 13% accuracy depending on its reasoning setting
  • OpenAI Deep Research obliterates competition with 26.6% accuracy result

The world's hardest AI exam, Humanity's Last Exam, was launched less than two weeks ago, and we've already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI's Deep Research topping the leaderboard.

The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man. It's so hard that when I previously wrote about Humanity's Last Exam, I couldn't even understand one of the questions, let alone answer it.

At the time of writing that last article, world phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated only on text (not multi-modal). Now, OpenAI's o3-mini, which launched earlier this week, has scored 10.5% accuracy at the o3-mini setting, and 13% accuracy at the o3-mini-high setting, which is more intelligent but takes longer to generate answers.

More impressive, however, is the score posted by OpenAI's new AI agent, Deep Research: 26.6%, a whopping 183% increase on DeepSeek R1's previous top score of 9.4% in less than 10 days. It's worth noting that Deep Research can search the web while the other AI models can't, which makes comparisons slightly unfair. The ability to search the web is helpful for a test like Humanity's Last Exam, as it includes some general knowledge-based questions.
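
For anyone wondering where the 183% comes from, it's a relative increase over the previous 9.4% top score, not a jump in percentage points. Here's a minimal sketch of that arithmetic in Python. The function name `relative_increase` is my own, purely for illustration, and the scores are the ones quoted in this article.

```python
def relative_increase(old_score: float, new_score: float) -> float:
    """Return the percentage increase from old_score to new_score."""
    return (new_score - old_score) / old_score * 100


# Accuracy scores quoted in this article (text-only evaluation for DeepSeek R1).
deepseek_r1 = 9.4
deep_research = 26.6

# Gain in percentage points: a simple difference between the two scores.
point_gain = round(deep_research - deepseek_r1, 1)

# Relative gain: the difference measured against the old top score.
relative_gain = round(relative_increase(deepseek_r1, deep_research))

print(f"{point_gain} percentage points, or a {relative_gain}% relative increase")
```

So the leaderboard's top score rose by 17.2 percentage points, which works out to roughly 183% of the old 9.4% figure.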

That said, the accuracy of models taking Humanity's Last Exam is steadily improving, and it does make you wonder just how long we'll need to wait to see an AI model come close to completing the benchmark. Realistically, AI shouldn't be able to come close any time soon, but I wouldn't bet against it.

Better, but 26.6% never got me any SATs

OpenAI Deep Research is an incredibly impressive tool, and I've been blown away by the examples that OpenAI showed off when it announced the AI agent. Deep Research is able to work as your personal analyst, taking time to conduct intense research and come up with reports and answers that would otherwise take humans hours and hours to complete.

While a score of 26.6% on Humanity's Last Exam is seriously impressive, especially considering how far the benchmark's leaderboard has come in just a couple of weeks, it's still a low score in absolute terms – no one would claim to have passed a test with anything less than 50% in the real world.

Humanity's Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, enabling us to gauge just how far they've come. How long will we have to wait to see an AI pass the 50% mark? And which model will be the first to do so?

John-Anthony Disotto
Senior Writer AI

John-Anthony Disotto is TechRadar's Senior Writer, AI, bringing you the latest news on, and comprehensive coverage of, tech's biggest buzzword. An expert on all things Apple, he was previously iMore's How To Editor, and has a monthly column in MacFormat. He's based in Edinburgh, Scotland, where he worked for Apple as a technician focused on iOS and iPhone repairs at the Genius Bar. John-Anthony has used the Apple ecosystem for over a decade, and is an award-winning journalist with years of experience in editorial.
