The Five Fundamental Limitations of LLMs at Scale
And—on thinking critically about AI propaganda
The core insight of On the Fundamental Limits of LLMs at Scale is this: certain failures of LLMs are mathematically guaranteed, no matter how much money, data, time, and brainpower you throw at them. Just as we cannot create a perpetual motion machine or divide by zero, we cannot build LLMs that transcend these five constraints: hallucination, context compression, reasoning degradation, retrieval fragility, and multimodal misalignment.
The authors use a proof-informed framework to demonstrate that five fundamental limitations are intrinsic to how language models work—not as engineering challenges to be overcome, but as theoretical impossibilities. Understanding these limits gives us a peek into the future of LLMs by helping us understand where we can expect to see leaps and bounds in the capabilities of LLMs—and where we won’t.
And—this is what I consider to be the single most important question raised by the authors: why do current benchmarks suggest continuous progress despite these intrinsic ceilings?
“Having established that LLM failures stem from five fundamental theoretical limits, namely hallucination, context compression, reasoning degradation, retrieval fragility, and multimodal misalignment, a natural question emerges - why do current benchmarks suggest continuous progress despite these intrinsic ceilings? […] contemporary evaluation practices systematically obscure these limits rather than measure them. Data contamination inflates scores by rewarding memorization over reasoning; judge bias incentivizes confident fabrication aligned with evaluator priors; compute-agnostic metrics hide the cost of marginal gains; and evaluation instability masks the saturation of genuine capability. Together, these artifacts create an illusion of progress that conflates benchmark score increases with fundamental capability advances.”
—Mohsin, Muhammad Ahmed, et al. “On the Fundamental Limits of LLMs at Scale.” arXiv preprint arXiv:2511.12869 (2025).
Let me tell you something that drives me nuts about academia’s spotless reputation in the public eye. Obviously, research coming out of big tech companies is AI propaganda (including this paper). But, academia and other prestigious research institutions are just as much part of the machinery of capitalism as anything else. We are all beholden to our funders. It is all one system.
We’ll come back to that. First, let’s dig into what each of these five fundamental limits of LLMs is and what it means for how we can expect AI products to be designed and deployed.
The Five Fundamental Limits
1. Hallucination
The paper proves that hallucination is theoretically inevitable. LLMs hallucinate—they confidently generate false information as if it were fact. While this is exacerbated by systematic flaws in training data, it happens not because of poor training, but because of a mathematical principle called diagonalization. In essence, for any model of a given size and design, there exist inputs where the model is doomed to fail. No amount of scaling or fine-tuning can eliminate this.
Product implications: Retrieval-Augmented Generation (RAG) has become a go-to strategy for reducing hallucinations, and we can expect to see more of it. RAG pairs a language model with a searchable knowledge base—ideally peer-reviewed or otherwise verified material, but not always. Rather than relying solely on what the model learned during training, a RAG system takes your query, searches the knowledge base for relevant material, and feeds that material into the prompt alongside your query. The model then generates its response informed by this retrieved context. This approach lets models access current information and specialized expertise they weren’t trained on, boosting accuracy and usefulness while avoiding the expense of retraining. However, as the authors note, the retrieval process has inherent limits of its own, which are addressed by the fourth fundamental limit—retrieval fragility.
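The retrieve-then-prompt loop can be sketched in a few lines of Python. This is a toy keyword-overlap retriever over an in-memory knowledge base; every name here is hypothetical, not the paper's or any product's API:

```python
# Toy sketch of the RAG loop: retrieve relevant documents,
# then feed them into the prompt alongside the user's query.

KNOWLEDGE_BASE = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the tallest mountain above sea level.",
]

def retrieve(query, k=2):
    """Rank documents by naive word overlap with the query (a stand-in
    for real vector search) and return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query):
    """Assemble the final prompt: retrieved context plus the query."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_prompt("What is the capital of France?"))
```

A production system would swap the word-overlap scoring for embedding similarity and send the assembled prompt to a model, but the shape of the pipeline is the same.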
Furthermore, citation features that show you where a model’s claims come from represent another acknowledgment of this hallucination limitation. Rather than trusting the model blindly, verification tools will increasingly be built into AI workflows—as will disclaimers acknowledging the possibility of inaccuracy. (This is of course understated. Inaccuracy is not merely possible; it is inevitable.)

Another implication is that we can expect a flurry of products coming out of Silicon Valley for use-cases where hallucination is low-cost or easily caught—brainstorming, drafting, explaining concepts.
Products designed for high-stakes claims (medical advice, legal interpretation, historical facts), I predict, will be a lot less generative-AI-powered than they’d seem at first blush. If you slap a thin veneer of LLM sprucing-up on top of what is essentially a search engine over a database of peer-reviewed or otherwise verifiably factual knowledge, you get a nice user experience that can be marketed as ChatGPT for doctors, or for lawyers, etc., while building in guardrails against hallucinations. I expect we’ll see more of that.
2. Context Compression
It is a fundamental limit of LLMs that an LLM’s effective context is significantly smaller than its nominal capacity.
While LLMs can accept long contexts—documents of thousands or even millions of tokens—they don’t actually use all that information equally. The way attention mechanisms work, combined with the limitations of how positions are encoded, causes information to blur and compress under the surface. Distant or redundant information gets diluted. Information at the edges of the context gets crowded out by the sheer mathematical weight of the tokens in the middle.
The inability to maintain attention-to-detail across a large context, the paper theoretically proves, is an inevitable consequence of the architecture of LLMs.
Product implications: Products are shifting toward multiple retrieval passes rather than stuffing everything into one prompt. Summarization features help compress large documents into essential points before feeding them to the model.
Practically, as an individual user, this means it would behoove you to organize your context strategically. Place the most important information first and last in your prompt. Break large problems into smaller chunks rather than dumping all context at once. Use step-by-step reasoning to force the model to work through information methodically, rather than relying on it to absorb and integrate massive volumes of text passively.
I expect to see LLM products scaffold this process more and more. In the meantime—do it yourself.
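One way to act on this advice in code. This is a hypothetical helper, not anything from the paper: rank your material by importance, then place the top two items at the edges of the prompt, where models attend best.

```python
# Hypothetical sketch: put the highest-priority context at the edges of the
# prompt (first and last positions) and lower-priority material in the middle.

def arrange_context(chunks):
    """chunks: list of (priority, text) pairs; higher priority = more important.
    Returns texts ordered so the two most important land first and last."""
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    texts = [t for _, t in ranked]
    if len(texts) < 3:
        return texts
    # Most important first, second most important last, the rest in between.
    return [texts[0]] + texts[2:] + [texts[1]]

chunks = [(1, "background"), (3, "key requirement"), (2, "constraints")]
print(arrange_context(chunks))  # ['key requirement', 'background', 'constraints']
```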
3. Reasoning Degradation
LLMs are trained primarily on pattern completion—predicting the next word based on what came before. This training objective is fundamentally different from reasoning, which requires working backward from goals, testing hypotheses, and correcting course. As tasks become more complex, requiring genuine multi-step inference rather than recognizing familiar patterns, model performance degrades significantly. Scaling helps, but it cannot overcome the mismatch between how models are trained and what reasoning actually requires.
This is a bigger deal to the whole AI Revolution than you probably realize because of its implications about AGI. I’m not saying that Artificial General Intelligence (AGI) is impossible. I don’t know that. But LLMs—or even their derivatives, Large Reasoning Models (LRMs)—will not be the foundation of anything like AGI, because of exactly this limitation: LLMs are not built to reason. They are literally unreasonable. LLMs are like politicians. They’re language machines. Masters of making us believe they’re on our side, fighting for us, working hard, connecting with us. But it’s just the smoke and mirrors of a deft command of language. But I digress. Or maybe this is the most important point. In any case—moving on.
Product implications: We’re beginning to see systems that break reasoning into explicit steps. Chain-of-thought prompting (asking models to “show your work”) has become standard, and while this strategy has been shown to improve the ability of large language models to perform complex reasoning, their ability to reason has limits and vulnerabilities. There’s active research and real-world adoption of systems that combine neural and symbolic approaches, pairing engines better suited to rigorous inference with the generative strengths of LLMs. But the point is that LLMs specifically are not capable of reasoning.
Always ask models to explain their “reasoning” step-by-step. For complex problems, break them into smaller logical components. Use multiple passes: first generate candidate approaches, then evaluate each. When stakes are high, use the model to generate possibilities and then apply human judgment or external verification. Recognize that the model is better at generating plausible-sounding answers than at rigorous logical derivation.
As stated above: as this is a known deficit of LLMs, you can expect to see the process of breaking down large problems into smaller, more steerable logical components to be designed into LLM products more and more. In the meantime—do it yourself.
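The generate-then-evaluate pattern described above can be sketched as two passes over a stubbed model call. Everything here is hypothetical; `call_model` stands in for any real LLM API:

```python
# Hypothetical sketch of the generate-then-evaluate pattern: one pass proposes
# candidate approaches, a second pass scores them, and the best one survives.

def call_model(prompt):
    # Stub: a real implementation would call an LLM API here.
    return f"response to: {prompt}"

def solve_with_passes(problem, n_candidates=3, score=len):
    # Pass 1: generate several candidate approaches, each with explicit steps.
    candidates = [
        call_model(f"Approach #{i + 1} to: {problem}. Show your reasoning step by step.")
        for i in range(n_candidates)
    ]
    # Pass 2: evaluate each candidate. `score` is a trivial stand-in here;
    # in practice this would be another model call or an external verifier.
    return max(candidates, key=score)

print(solve_with_passes("schedule three overlapping meetings"))
```

When stakes are high, the scoring pass is exactly where human judgment or external verification belongs.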
4. Retrieval Fragility
This limitation is somewhat redundant with other aforementioned limitations, but it specifically addresses how Retrieval-Augmented Generation (RAG) systems—an iteration on LLMs meant to ameliorate some of their weaknesses—come with additional limitations of their own.
When you use an LLM with retrieval—where it searches a knowledge base to find relevant information before answering your question—you’d think more information would always be better. But that’s not how it works because the retrieved information is then competing with the original prompt and the output for a finite token budget. Let’s pause to go over some critical vocabulary:
A token is the smallest unit of language an LLM processes. A token can represent a character, part of a word, a whole word, or even a short phrase.
The context window is the amount of text, in tokens, that the model can consider at any one time. The context window defines the maximum number of tokens an LLM can process in a single request, including both the input (your prompt plus retrieved information) and the output (the generated text). The context window is a hard limit built into the model’s architecture.
Token budget is how many tokens you’re actually allocating or planning to use within that context window for a specific task. It’s more flexible and strategic.
Thus, the retrieved context not only supplements the original prompt, but it competes with it; what’s more, it competes with the output, too, for token budget. This creates a fundamental tension.
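The tension is plain arithmetic. With illustrative numbers (not from the paper), the budget left for retrieved context is whatever the window holds after the prompt and the reserved output are subtracted:

```python
# Toy arithmetic for the token-budget tension: retrieved context competes
# with the prompt and the reserved output for one fixed context window.

CONTEXT_WINDOW = 8192   # hard architectural limit, in tokens (illustrative)
PROMPT_TOKENS = 500     # the user's query and instructions
OUTPUT_BUDGET = 1024    # tokens reserved for the model's answer

retrieval_budget = CONTEXT_WINDOW - PROMPT_TOKENS - OUTPUT_BUDGET
print(retrieval_budget)  # 6668 tokens left for retrieved context

chunk_size = 512
print(retrieval_budget // chunk_size)  # 13 whole chunks fit; the rest is lost
```

Retrieve a fourteenth relevant chunk and something else must be evicted: that is the fundamental tension in miniature.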
The paper discusses workarounds like hierarchical retrieval (intelligently merging adjacent chunks) or chunk-free retrieval (retrieving full documents instead of fragments). These help, but they don’t solve the core problem: you still have a finite token budget. Even if you retrieve perfectly coherent information, you still have to serialize it into a limited context window, which means some information must be lost.
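The "merging adjacent chunks" idea reduces to interval coalescing. This toy helper is a hypothetical illustration, not the paper's algorithm: chunks that sit next to each other in the source document are fused into one coherent span before serialization.

```python
# Hypothetical sketch of hierarchical chunk merging: fuse retrieved chunks
# that are neighbors in the source document into contiguous runs, so the
# model sees coherent spans instead of fragments.

def merge_adjacent(chunk_ids):
    """chunk_ids: positions of retrieved chunks in the source document.
    Returns (start, end) ranges covering each contiguous run."""
    runs = []
    for cid in sorted(chunk_ids):
        if runs and cid == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], cid)  # extend the current run
        else:
            runs.append((cid, cid))        # start a new run
    return runs

print(merge_adjacent([4, 2, 3, 9]))  # [(2, 4), (9, 9)]
```

Note that merging helps coherence but not capacity: the fused spans still have to fit the same finite token budget.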
In simple scenarios—a straightforward factual question—retrieval works reasonably well. You ask “What is the capital of France?” and relevant documents surface easily.
But in complex scenarios, multiple failure modes interact. The harder and more complex the problem, the more context you need to solve it well. But the token budget constraint and fragmentation problems get worse as you try to retrieve more and more interconnected information. You can improve retrieval algorithms, but you can’t escape the fundamental tension between token budgets, semantic fragmentation, and the non-linear nature of real knowledge.
Furthermore, LLMs are surprisingly vulnerable to imperfect retrieval: they are highly sensitive to context that is irrelevant to the query, and easily distracted by it. The LLM’s attention mechanism is, by design, optimized for coverage, not discrimination. It has no immune-system-like mechanism to bias it against flawed, contradictory, or incomplete information. Makes sense, right? This again goes back to the fact that it’s a language machine, not a reasoning machine. Even their spuriously named derivatives, Large Reasoning Models (LRMs), are first and foremost language machines, not reasoning machines.1
Product implications: Products are layering retrieval systems with better ranking algorithms, relevance feedback, and iterative refinement. Some systems now support multi-round retrieval, where the model can ask clarifying questions or refine its search strategy. Caching and memory features help maintain consistency across long interactions. We will continue to see more and better workarounds such as these.
5. Multimodal Misalignment
As LLMs are extended to handle images, video, and other modalities, a new problem emerges. Different modalities encode information in fundamentally different ways. A vision system might extract spatial relationships and objects, while language encodes concepts and relationships linearly. These representations don’t naturally align, and training on both simultaneously encourages the model to learn shallow correlations rather than deep multimodal understanding. The paper acknowledges the hope that giving LLMs access to visual information may somehow improve performance, but that hope has not been borne out: “Apparent benchmark gains often mask persistent brittleness, as visual inputs introduce new bottlenecks while preserving pretrained linguistic biases.”
Product implications: Multimodal products are becoming more cautious about capability claims. We’re seeing the emergence of systems that modularize modalities—using specialized vision models and language models that interact explicitly, rather than trying to merge them into one end-to-end system. Better benchmarks for multimodal tasks are emerging, along with more honest documentation of failure cases.
Outro
Think of all of the pieces of knowledge you draw from as you solve a problem. Context windows will continue to get bigger, but they will always be finite. LLMs will never be able to do what you can do—draw from everything you’ve ever experienced that sits in your memory, waiting to be conjured. Page through countless books, hop across the internet, and string together patterns you see running through it all. Your mind, unlike an LLM’s, is plastic and limitless.
why do current benchmarks suggest continuous progress despite these intrinsic ceilings?
Why, do you think?
You, who, like an LLM, can hallucinate. Be careful of that.
You, who, at times, ingest information unthinkingly, like an LLM, forgetting to reason about its trustworthiness, its veracity, its completeness. Its bias.
You.
Why, do you think, current benchmarks suggest continuous progress despite these intrinsic ceilings?
“Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget.” Shojaee, Parshin, et al. “The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity.” arXiv preprint arXiv:2506.06941 (2025).


