I’m a massive Pokémon fan, and now I’m obsessed with AI models like Gemini and Claude trying to complete Pokémon Red and Blue

Pallet Town in Pokémon Red
(Image credit: The Pokémon Company)

Pokémon Red and Blue on Game Boy are some of the best-selling video games of all time, and part of the world’s biggest media franchise. Now, they’re being used as an AI benchmark, and as a lifelong fan of Pikachu, I’m absolutely hooked watching AI try to figure these games out.

You may have heard of Claude Plays Pokemon already, but if not, it’s a Twitch channel live streaming Anthropic’s AI attempting to complete Pokémon Red, and it has been playing the 1996 video game classic for months now.

I had stumbled across the project a while ago, but it wasn’t until I heard of Gemini Plays Pokemon, another Twitch livestream using Google’s Gemini 2.5 Pro Experimental, that I started to take notice.

Pokémon has turned into an actual benchmark for agentic AI models, to the extent that Anthropic even highlighted Claude 3.7 Sonnet’s ability to get deeper into the game than its predecessor, Claude 3.5 Sonnet, when announcing the new model.

Google has since showcased its own Pokémon prowess, with Logan Kilpatrick, a Lead for AI Studio, sharing on X that Gemini 2.5 Pro earned the 5th badge in just 500 hours.

But why Pokémon?

It turns out that using checkpoints in Pokémon Red is a great way to showcase each AI model's ability to problem solve and think its way to success. Pokémon on Game Boy is also proving to be a fun way to see AI’s critical thinking and its ability to complete ambiguous tasks.

Anthropic says, “The model's ability to maintain focus and accomplish open-ended goals will help developers build a wide range of state-of-the-art AI agents.”

If you’ve never played a Pokémon game before, here’s how it works: you set out on a journey to catch new monsters and earn eight badges by defeating gym leaders (essentially bosses). Once you’ve got eight badges, you set out to defeat the Elite Four (another boss rush with four increasingly difficult Pokémon trainers).

It’s not black and white

Pokémon Red and Blue Game Boy box art

(Image credit: Nintendo)

Currently, Claude 3.7 Sonnet is at Mt. Moon following a recent reset after it got stuck; the model's best result so far is earning the 3rd badge in Vermilion City. Gemini, on the other hand, is now navigating Victory Road, having earned all eight badges in Pokémon Blue.

While that comparison might make it sound like Gemini 2.5 Pro Experimental is far better than Claude 3.7 Sonnet at playing Pokémon, an AI expert on the site LessWrong has explained why it's not so clear-cut.

You see, each Twitch stream has different conditions, including how the playthrough has been set up and how the developer running it interacts with the run. If you’re remotely interested in the concept of Pokémon as a testing tool for AI, I highly encourage you to read the brilliant analysis on the subject found on LessWrong.

In the piece, “Is Gemini now better than Claude at Pokémon?”, the author explains the implementation differences between the two streams and how they aren’t directly comparable because of the way each AI model takes action, as well as the “agent harness” each one has been given.

Think of an agent harness as the external factors that are put in place to help the AI agent achieve its goal. That could be a tool to help with pathfinding, or even extra information to determine the best way to approach a playthrough. In the article, the author concludes that due to the unknowns around the exact circumstances under which each AI model is playing Pokémon, it’s impossible to say for sure which one is better at it.

That said, while it might not be the most in-depth benchmark, Pokémon is an incredibly fun way to see AI’s capabilities in action.

Are you smarter than a 6-year-old?

There’s something incredibly endearing about watching a livestream of AI try to maneuver itself through a game most of us played when we were kids. If you were born in the 90s, chances are Pokémon Red or Blue was one of the first video games you ever played.

As kids, there was an element of the unknown with Pokémon that made it enthralling; it launched at a time when you couldn’t quickly search for results on Google, and ChatGPT definitely didn’t exist to tell you where to go next. As 6-year-olds, we managed to figure it out in the end, but will Claude and Gemini?

When talking about why they watch AI play Pokémon on stream, one Reddit user said, “I've been watching ClaudePlaysPokemon for a long time now.”

“The project really highlights the weaknesses of LLMs for sure, but it's also strangely addictive. You just want to root for the little guy.”

Another said, “The funny thing is that Pokémon is a simple, railroady enough game that RNG can beat the game given enough time (and this has been done), but it turns out to take a surprising amount of cognitive architecture to play the game in a fully-sensible-looking way.”

At the time of writing, 70+ people are watching Claude Plays Pokémon on Twitch, and another 100+ are watching Gemini Plays Pokémon. While it’s likely Gemini will become a Pokémon master before Claude, these projects are more about the journey and the perfect balance between modern technology and the nostalgia of the past.

John-Anthony Disotto
Senior Writer, AI

John-Anthony Disotto is TechRadar's Senior Writer, AI, bringing you the latest news on, and comprehensive coverage of, tech's biggest buzzword. An expert on all things Apple, he was previously iMore's How To Editor, and has a monthly column in MacFormat. John-Anthony has used the Apple ecosystem for over a decade, and is an award-winning journalist with years of experience in editorial.
