This Simple Logic Question Stumps Even the Most Advanced AI

Maggie Harrison Dupré

9 June 2024 at 7:00 am

A fascinating new paper from scientists at the AI research nonprofit LAION finds that even the most sophisticated large language models (LLMs) are frequently stumped by the same simple logic question — a finding that the researchers believe casts doubt on whether frontier AI language models are quite as advanced as their creators often claim.

The paper, which has yet to be peer-reviewed, refers to the AI-stumping prompt as the "Alice in Wonderland" — or AIW — problem. It's a straightforward reasoning question: "Alice has [X] brothers and she also has [Y] sisters. How many sisters does Alice's brother have?" (The researchers used a few different versions of the problem, for example switching up the X and Y figures or altering the prompt language to include a few more demands, but the basic reasoning process required to solve the problem remained the same throughout.)

Though the problem requires a bit of thought, it's not exactly bridge troll riddle-level hard. (The answer, naturally, is however many sisters Alice has, plus Alice herself. So if Alice had three brothers and one sister, each brother would have two sisters.)
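The arithmetic behind that answer can be written out as a minimal Python sketch. The function name and parameters are hypothetical and chosen only for illustration. The sketch assumes, as the puzzle implicitly does, that Alice is a girl and that all the siblings share the same family.

```python
def sisters_of_alices_brother(num_brothers: int, num_sisters: int) -> int:
    """Return how many sisters each of Alice's brothers has.

    Hypothetical helper for illustration; assumes Alice is female and
    all siblings belong to the same family.
    """
    # The number of brothers is a distractor: it does not affect the count.
    # Each brother's sisters are Alice's sisters plus Alice herself.
    return num_sisters + 1


# The example from the text: three brothers and one sister.
print(sisters_of_alices_brother(num_brothers=3, num_sisters=1))  # 2
```

The function ignores the number of brothers, so that figure only serves as a distraction in the prompt.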

But when the researchers ran the question by every premier AI language model (they tested OpenAI's GPT-3, GPT-4, and GPT-4o models, Anthropic's Claude 3 Opus, Google's Gemini, and Meta's Llama models, as well as Mistral AI's Mixtral, Mosaic's Dbrx, and Cohere's Command R+), they found that the models fell remarkably short.

Only one model, the brand new GPT-4o, achieved a success rate that, by standardized school grades, was technically passing. And the issues went beyond baseline inaccuracies: when asked to show their work, the AIs would detail bizarre and erroneous lines of "thinking" that made no sense. Even stranger, when told their work was inaccurate, models repeatedly became indignant and doubled down on their incorrect answers.

This overall pattern represents a "dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales," the LAION researchers write in the paper, "using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans."

"The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical 'reasoning'-like explanations akin to confabulations to justify and backup the validity of their clearly failed responses, making them sound plausible," the paper continues.

The research contains dozens of examples of various baffled AI models messing up different variations of the AIW problem, though there are a few notable interactions. Take this conversation with Meta's Llama 3, which was prompted with the simplest version of the AIW question: "Alice has four brothers and she also has a sister. How many sisters does Alice's brother have?"

The answer, of course, is two sisters. Llama didn't get it right, but it did give itself a drumroll:

A classic lateral thinking puzzle!

Let's break it down:

Alice has four brothers.
Alice also has a sister (herself).

So, each of Alice's brothers has... (drumroll) ...one sister, which is Alice herself!

Therefore, the answer is: each of Alice's brothers has one sister.

Or, for an example of a more complex version of the problem, take this interaction with Claude 3 Opus, which was given the more complicated prompt: "Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? Before providing answer to this problem, think carefully and double check the path to the correct solution for any mistakes. Provide then the final answer in following form: '### Answer: '"

This time, the answer is five sisters. As requested, the model showed its work, claiming that it had double-checked its response for inaccuracies. And yet, it was still wrong:

Let's think through this problem step by step:

1. Alice has 4 sisters and 1 brother.

2. Alice's brother is part of the same family as Alice.

3. All of Alice's sisters are also sisters to her brother.

4. Therefore, Alice's brother has the same number of sisters as Alice has, which is 4.

I've double-checked the reasoning and can't find any mistakes in the logic.

### Answer: Alice's brother has 4 sisters

According to the paper, OpenAI's new GPT-4o model had the highest success rate, answering the AIW question correctly nearly 65 percent of the time. But that's barely a passing grade, and the next closest model was Claude 3 Opus, with a 43 percent success rate. Meta's most successful Llama model, Llama 2-7b (the researchers tested several Llama models), came in at 30 percent, while Google's Gemini Pro clocked in at a meager success rate of 0.8 percent.

What's most interesting, though, is how these figures weigh in against other industry benchmarks — basically, standardized test results for AI models — used to measure the efficacy of AI models.

The scientists call special attention to a benchmark called MMLU, or "Massive Multitask Language Understanding," which is designed to evaluate an AI's capacity to solve problems. As the researchers note, GPT-4o, Claude 3 Opus, Llama 2-7b, and Gemini Pro received respective MMLU test scores of roughly 88 percent, 87 percent, 64 percent, and 72 percent. These figures are very different from those reflected in the AIW results, and according to the scientists, they might well be cause to reassess the processes by which we evaluate language models' problem-solving and reasoning skills.
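To make the size of that mismatch concrete, here is a small Python sketch that subtracts each model's AIW success rate from its MMLU score. It uses only the rounded figures quoted in this article ("nearly 65 percent" is entered as 65.0, and the MMLU values are the "roughly" stated ones), so the gaps are approximate. The dictionary and function names are hypothetical.

```python
# Percentages as quoted in the article; all values are approximate.
aiw_success = {
    "GPT-4o": 65.0,        # "nearly 65 percent"
    "Claude 3 Opus": 43.0,
    "Llama 2-7b": 30.0,
    "Gemini Pro": 0.8,
}
mmlu_score = {
    "GPT-4o": 88.0,
    "Claude 3 Opus": 87.0,
    "Llama 2-7b": 64.0,
    "Gemini Pro": 72.0,
}


def benchmark_gap(model: str) -> float:
    """Return MMLU score minus AIW success rate, in percentage points."""
    return round(mmlu_score[model] - aiw_success[model], 1)


for model in aiw_success:
    print(f"{model}: {benchmark_gap(model)} points")
```

On these figures, the gap is smallest for GPT-4o and largest for Gemini Pro.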

"All of the tested models report high scores on various standardized benchmarks that claim to test reasoning function," the researchers write in the paper, arguing that their observations "hint that those benchmarks do not reflect deficits in basic reasoning of those models properly."

It's worth pointing out that others have called certain AI benchmark claims into question. Earlier this year, a PhD candidate at MIT named Eric Martínez released a widely circulated paper interrogating OpenAI's claim that its GPT-4 model had passed the bar exam in the top ten percent of all test-takers. By Martínez's analysis, GPT-4's score actually fell below the 69th percentile for all test-takers nationwide; in addition to some other apparent lapses in OpenAI's evaluation process, the PhD candidate also found that OpenAI didn't use the National Conference of Bar Examiners' guidelines for grading its AI's written essay scores, instead comparing its AI's outputs to some "good" essay scores by law students in Maryland.

Again, this new paper from LAION isn't peer-reviewed yet. Even so, it asks some important questions about how AI models and products are tested, evaluated, and ultimately marketed.

