This research explores a collection of seemingly simple problems that consistently challenge Large Language Models (LLMs). Despite their impressive capabilities on complex tasks, these models struggle with certain basic reasoning challenges that humans find intuitive.
Figure: Pass rates for each question, ordered from hardest to easiest for LLMs to solve.
Figure: Benchmark results comparing various LLMs against human-level performance.
Figure: Heatmap showing how different AI models' answers correlate with each other.
Figure: Visualization of which questions each model answered correctly (green) or incorrectly (red).
LLMs have demonstrated difficulties with linguistic understanding, stemming primarily from the limitations of their underlying mechanisms. These models often misinterpret or overlook the nuanced meanings conveyed in human language, resulting in inaccuracies or misjudgements on linguistic tasks, particularly when parsing sentences that demand a deeper grasp of idiomatic nuance. [1]
A pivotal hindrance for LLMs lies in their absence of embodied experience, a factor the philosopher Hubert Dreyfus highlighted as crucial for the development of common sense in humans [2]. Unlike humans, who engage with the physical world through a rich palette of sensory experiences, such as visual, auditory, and tactile stimuli, LLMs operate without sensory perception. This disembodied state restricts their capacity to learn the subtleties inherent to commonsense reasoning [3].
Dreyfus also critiqued AI's inability to handle context-sensitive reasoning, a criticism that remains relevant to today's LLMs [4]. Correct reasoning is deeply intertwined with the ability to understand the often implicit context in which a statement is made.
Visual-spatial reasoning entails the ability to mentally visualise objects and understand their relationships in space. LLMs lack fundamental spatial awareness, so explaining the steps needed to navigate from one point to another in physical space, or understanding the spatial configuration of objects, remains a complex challenge for these models and exposes a significant gap in a vital area of human intelligence. [5]
LLMs exhibit fragility in simple mathematical reasoning, with word-based tasks that involve counting to ten often posing a significant challenge. While they can often provide correct answers to sophisticated mathematical queries, they fundamentally lack a rules-based counting system and must outsource their calculations to other tooling, such as computer code or a calculator. [6]
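To illustrate the contrast, the rules-based counting that trips these models up is trivial for deterministic code. The sketch below is purely illustrative and is not part of the benchmark; the function names and example strings are our own.

```python
# Illustrative only: deterministic counting of the kind LLMs often get wrong.

def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

def count_words(sentence: str) -> int:
    """Count whitespace-separated words in a sentence."""
    return len(sentence.split())

if __name__ == "__main__":
    print(count_letter("strawberry", "r"))                       # 3
    print(count_words("How many words are in this sentence?"))   # 7
```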
LLMs are particularly vulnerable to propagating and reinforcing inaccuracies found within their training data, including scientific misconceptions or outdated information commonly perpetuated online. An LLM's approach to generating content is heavily reliant on the frequency and presentation of information encountered during its training, which can lead to the uncritical replication of errors. This propensity not only highlights the limitations in their comprehension abilities but also underscores the importance of curating and updating the data these models are trained on to mitigate the dissemination of incorrect or misleading information. [7]
Understanding and interpreting relationships between entities, whether temporal, causal, or conceptual, is another area where LLMs face difficulties. Their interpretations often lack the depth and nuance of human understanding, as solving relational problems frequently requires an intuition that LLMs inherently lack. [8]
LLMs are trained on a vast array of knowledge; however, this training approach does not guarantee proficiency in logical reasoning at inference time. Previous research has explored the logical capabilities of LLMs with mixed findings, suggesting that LLMs can mimic reasoning up to a certain level but lack the reliability of human-like reasoning [9].
Overfitting is a well-documented phenomenon in machine learning, where models excessively adapt to the idiosyncrasies of the training data at the expense of broader generalisation. The prevailing view is that pre-trained models should excel at interpolating within the bounds of their training data, but that extrapolating beyond those bounds is considerably more difficult [10].
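The interpolation-versus-extrapolation distinction can be made concrete with a toy curve-fitting example. The sketch below is illustrative only and makes no claims about LLM internals: a polynomial fitted to data on a narrow interval tracks the target function well inside that interval but diverges sharply outside it.

```python
# Toy illustration (not LLM-specific): a model fit within a narrow range
# interpolates well but degrades sharply when asked to extrapolate.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 3, 30)
y_train = np.sin(x_train) + rng.normal(scale=0.05, size=x_train.size)

# Fit a polynomial to noisy samples of sin(x) on [0, 3].
coeffs = np.polyfit(x_train, y_train, deg=5)

x_interp = np.linspace(0, 3, 200)   # inside the training bounds
x_extrap = np.linspace(3, 6, 200)   # outside the training bounds

err_interp = np.abs(np.polyval(coeffs, x_interp) - np.sin(x_interp)).mean()
err_extrap = np.abs(np.polyval(coeffs, x_extrap) - np.sin(x_extrap)).mean()

print(f"mean error inside training range:  {err_interp:.3f}")   # small
print(f"mean error outside training range: {err_extrap:.3f}")   # much larger
```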
Our research evaluates LLM performance across these distinct question categories, each designed to test specific reasoning capabilities.
| Question Type | Description |
|---|---|
| Puzzle | Logic-type puzzles that mimic the structure of popular questions found online but differ significantly in one or more critical aspects that make the questions much easier for humans. |
| Spatial | Requires visualising the arrangement or relative positions of objects in space, such as determining the order or position of items or simple navigation. |
| Relational | Involves understanding and inferring relationships or hierarchies between objects, concepts, or entities based on provided information. |
| Counting | Simple numerical calculations such as counting to a maximum of ten or understanding quantities. |
| Linguistic | Tests the understanding and use of language, including forming sentences with specific constraints or identifying unique characteristics of words and phrases. |
| Popular science | Straightforward questions that test for common scientific and mathematical misconceptions. |
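For a concrete sense of how the per-question pass rates in the figures above could be derived, the following is a minimal sketch, assuming each model attempts each question several times and a pass rate is simply the fraction of attempts marked correct. The data, identifiers, and helper function are hypothetical and not drawn from the benchmark.

```python
# Hypothetical sketch: per-question pass rates from repeated attempts.
# The rows below are invented for illustration; they are not benchmark data.
from collections import defaultdict

# (model, question_id, category, is_correct) for each attempt
results = [
    ("model-a", "Q1", "Puzzle",   True),
    ("model-a", "Q1", "Puzzle",   False),
    ("model-b", "Q1", "Puzzle",   False),
    ("model-a", "Q2", "Counting", True),
    ("model-b", "Q2", "Counting", True),
]

def pass_rates(rows):
    """Return {question_id: fraction of attempts answered correctly} across all models."""
    correct, total = defaultdict(int), defaultdict(int)
    for _model, qid, _category, ok in rows:
        total[qid] += 1
        correct[qid] += int(ok)
    return {qid: correct[qid] / total[qid] for qid in total}

# Ordered from hardest (lowest pass rate) to easiest, as in the first figure.
for qid, rate in sorted(pass_rates(results).items(), key=lambda kv: kv[1]):
    print(f"{qid}: {rate:.0%}")
```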
© 2025 | Research by Sean Williams and James Huckle