Could you pass 'Humanity’s Last Exam'? Probably not, but neither can AI

Humanity's Last Exam
(Image credit: Scale AI, CAIS)

Did you know some of the smartest people on the planet create benchmarks to test AI’s ability to replicate human intelligence? Well, scarily enough, most AI benchmarks are easily completed by artificial intelligence models, showcasing just how smart the likes of OpenAI’s GPT-4o, Google’s Gemini 1.5, and even the new o3-mini really are.

In the quest to create the hardest benchmark possible, Scale AI and the Center for AI Safety (CAIS) have teamed up to create Humanity’s Last Exam, a test they're calling a “groundbreaking new AI benchmark that was designed to test the limits of AI knowledge at the frontiers of human expertise.”

I’m not a genius by any means, but I had a glance at some of these questions and let me tell you, they're ridiculously tough. So much so that only the brightest minds on the planet could probably answer them. This incredible degree of difficulty means that in testing, current AI models were only able to answer fewer than 10 percent of the questions correctly.

The original name for the test was 'Humanity’s Last Stand', but that was changed to 'Exam' to take away the slightly terrifying nature of the concept. The questions were crowdsourced from expert contributors at over 500 institutions across 50 countries, who were asked to come up with the hardest reasoning questions possible.

The current Humanity’s Last Exam dataset consists of 3,000 questions, and we’ve selected a few samples below to show you just how tricky it is. Can you pass Humanity’s Last Exam? Good luck!

Are you smarter than an AI chatbot?

Question 1:

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Question 2:

I am providing the standardized Biblical Hebrew source text from the Biblia Hebraica Stuttgartensia (Psalms 104:7). Your task is to distinguish between closed and open syllables. Please identify and list all closed syllables (ending in a consonant sound) based on the latest research on the Tiberian pronunciation tradition of Biblical Hebrew by scholars such as Geoffrey Khan, Aaron D. Hornkohl, Kim Phillips, and Benjamin Suchard. Medieval sources, such as the Karaite transcription manuscripts, have enabled modern researchers to better understand specific aspects of Biblical Hebrew pronunciation in the Tiberian tradition, including the qualities and functions of the shewa and which letters were pronounced as consonants at the ends of syllables.

מִן־גַּעֲרָ֣תְךָ֣ יְנוּס֑וּן מִן־ק֥וֹל רַֽ֝עַמְךָ֗ יֵחָפֵזֽוּן (Psalms 104:7)

Question 3:

In Greek mythology, who was Jason's maternal great-grandfather?

How did you do? There’s no shame in saying "not very well". I won’t lie – I don’t think I even understood what I was being asked in that second one.

When should we panic?

Humanity's Last Exam benchmark results

(Image credit: Humanity's Last Exam, Scale AI, CAIS)

According to the initial results reported by CAIS and Scale AI, OpenAI’s GPT-4o achieved 3.3% accuracy on Humanity’s Last Exam, Grok-2 achieved 3.8%, Claude 3.5 Sonnet 4.3%, Gemini 6.2%, o1 9.1%, and DeepSeek-R1 (evaluated on text only, as it’s not multimodal) 9.4%.

Interestingly, Humanity’s Last Exam is substantially harder for AI than any other benchmark out there, including the most popular options, GPQA, MATH, and MMLU.

So what does this all mean? Well, we’re still in the infancy of AI models with reasoning functionality, and while OpenAI’s brand-new o3 and o3-mini have yet to take on this incredibly difficult benchmark, it’s likely to be a long time before any LLM comes close to acing Humanity’s Last Exam.

It’s worth bearing in mind, however, that AI is evolving at a rapid rate, with new functionality being made available to users almost daily. Just this week OpenAI unveiled Operator, its first AI agent, and it shows huge promise for a future where AI can automate tasks that would otherwise require human input. For now, no AI can come close to completing Humanity’s Last Exam, but when one does… well, we could be in trouble.

John-Anthony Disotto
Senior Writer AI

John-Anthony Disotto is TechRadar's Senior Writer, AI, bringing you the latest news on, and comprehensive coverage of, tech's biggest buzzword. An expert on all things Apple, he was previously iMore's How To Editor, and has a monthly column in MacFormat. He's based in Edinburgh, Scotland, where he worked for Apple as a technician focused on iOS and iPhone repairs at the Genius Bar. John-Anthony has used the Apple ecosystem for over a decade, and is an award-winning journalist with years of experience in editorial.

