RAG or Not RAG: Why Long Context Models Make RAG Obsolete

The RAG Myth, or How Companies Tried to Give Their AI a Memory

For nearly two years, any discussion about artificial intelligence in business started with the same question: “We’re doing RAG, right?”

This phrase, which became a ritual in innovation departments, summed up a collective obsession: providing memory to large language models (LLMs) incapable of reading more than a handful of pages at a time.

When it launched in March 2023, GPT-4 could process only about 8,000 tokens by default, or roughly twelve pages. In other words, almost nothing. For companies wanting to leverage their annual reports, document repositories, or legal archives, this limitation made AI practically blind.

The solution was then to graft on an external memory: Retrieval-Augmented Generation, or RAG.

The principle was ingenious. Documents were broken into small chunks, embeddings were created to represent their meaning, then, upon a query, the system retrieved relevant passages and injected them into the language model to produce an enriched response.
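
The four steps above (chunk, embed, retrieve, inject) can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the function names are hypothetical, and the "embedding" is a simple bag-of-words count standing in for a real embedding model and vector database.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Inject the retrieved passages into the prompt sent to the model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Invented sample text, for illustration only.
document = (
    "Revenue grew eight percent in 2023 driven by the services division. "
    "The board approved a dividend of two euros per share. "
    "Headcount remained stable at four thousand employees worldwide."
)
query = "What dividend per share did the board approve?"
chunks = chunk(document)
print(build_prompt(query, retrieve(query, chunks, k=1)))
```

Even in this tiny example, the fixed-size chunks cut across sentence boundaries: the passage that answers the question also carries the first words of an unrelated sentence.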

In short: brilliant tinkering, but tinkering nonetheless.

RAG, or the Artificial Memory of the Early Days

RAG was born from constraint. Its purpose: to compensate for the weak reading capacity of early models. In theory, the idea worked: connecting a semantic search engine to a generative model. In practice, it meant assembling a complex, fragile system that was often expensive to maintain.

Three major limitations quickly emerged:

Technical complexity: chunking, vectorization, re-ranking, maintaining vector databases.
Loss of coherence: fragmented texts caused the continuity of reasoning to be lost.
Imprecise search: the semantic approach, inherently fuzzy, could betray the rigor expected in legal, medical, or financial contexts.

In short, RAG created an illusion: it gave the impression of an “expert” AI, but remained dependent on a set of scripts and often cobbled-together pipelines. As an Elastic engineer wrote in 2024, “RAG was a bit like teaching a parrot to look up its notes before speaking.”

The Arrival of Giant Context Windows: A Silent Shift

Everything changed from late 2023 onward.

GPT-4 Turbo, Claude, Gemini, and Grok extended their context windows to spectacular dimensions:

Claude: 200,000 tokens (about 300 pages at the rate used above),
Gemini: 1 million,
Grok: 2 million,
and certain prototypes already reach 10 million tokens, according to IBM Research.
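
Page equivalents for these window sizes are only orders of magnitude, since they depend on how many words fit in a token and on a page. The sketch below makes the conversion explicit under two stated assumptions (about 0.75 words per token and 500 words per page, both rough rules of thumb for English prose); with other assumptions the figures can shift by a third or more.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough average for English prose
WORDS_PER_PAGE = 500    # assumption: one dense, single-spaced page


def tokens_to_pages(tokens: int) -> int:
    """Convert a token count into an approximate page count."""
    return round(tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE)


# Window sizes quoted in the text.
for window in (8_000, 200_000, 1_000_000, 2_000_000):
    print(f"{window:>9,} tokens is roughly {tokens_to_pages(window):,} pages")
```

Under these assumptions, 8,000 tokens come to about 12 pages and 200,000 tokens to about 300.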

This evolution marks a turning point: LLMs no longer just “memorize”; they read and connect.

As IBM’s article (“Larger Context Windows”) explained, the goal is no longer simply to expand memory, but to increase reasoning capacity over long corpora while preserving logical coherence.

In other words, models don’t just remember words; they understand text structure, detect links between sections, and follow internal references like a researcher leafing through a report.

Claude Code: The Symbol of a Paradigm Shift

Anthropic led the way with Claude Code.

Claude Code uses no RAG pipeline. It relies on ordinary search tools, including the venerable grep, which dates back to 1973, to navigate files, follow references, and understand code much as a human developer would.

This approach demonstrates that RAG is no longer indispensable.

With an expanded context window, AI can read a complete set of files, identify dependencies, and provide a coherent response without going through a vector database.
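
To make the contrast with a vector database concrete, here is a minimal sketch of grep-style navigation in Python. It is not Claude Code's actual implementation: the `grep` helper and the two sample files are hypothetical, and the point is only that following a reference across files needs nothing more than exact pattern matching.

```python
import re
import tempfile
from pathlib import Path


def grep(pattern: str, root: Path, glob: str = "*.py") -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for every line matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob(glob)):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits


# Build a tiny throwaway project so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "billing.py").write_text(
        "def compute_invoice(order):\n    return order.total * 1.2\n", encoding="utf-8"
    )
    (root / "api.py").write_text(
        "from billing import compute_invoice\n\nprint(compute_invoice)\n", encoding="utf-8"
    )
    # Follow a reference: where is compute_invoice defined, and who uses it?
    matches = grep(r"\bcompute_invoice\b", root)

for path, number, line in matches:
    print(f"{path}:{number}: {line}")
```

The matching lines can then be placed in the model's context in full, with no chunking or embedding step in between.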

It’s a cultural shift: we’re no longer trying to “retrieve” information, but to reason from a coherent whole.

The Persistent Limitations of RAG

A research paper published on arXiv in July 2024 (“RAGs vs LLMs with Long Context Windows”) synthesizes results from several comparative experiments.

On analytical reading tasks (summarization, complex QA, document synthesis), long-context models outperform RAG pipelines in 80% of cases, especially when the corpus has a clear logical structure.

However, RAG retains value in three situations:

Constrained environments: when computational cost prevents using large window models.
Dynamic sources: when data changes continuously (news, e-commerce).
Need for multi-source indexing: to explore multiple types of heterogeneous content (PDFs, emails, ticket databases).

But as soon as large-scale contextual understanding is involved, long context models gain the advantage.

From Search to Action: The End of “Plumbing”

Elastic’s research teams summarize this turning point: “We’ve moved from an era of pipelines to an era of direct understanding.”

In other words, engineers spend less time assembling plumbing (chunking, vectorization, API calls) and more time thinking about business value: how to use AI to understand, anticipate, decide.

In the old world, RAG architects juggled hundreds of parameters: chunk size, vector distance, scoring, contextual memory…

In the new one, AI can ingest a complete annual report, detect inconsistencies, formulate a diagnosis, and propose an action plan.

The difference is qualitative: we’re moving from an augmented search engine to a complete cognitive assistant.

What the Rise of Long Context Changes

IBM Research’s article highlights three major impacts:

Cognitive autonomy of AI: long-context models can reason without external support, reducing dependence on hybrid architectures.
Emergence of new interfaces: search becomes conversational; users explore knowledge rather than query it.
Simplified deployments: less infrastructure, less maintenance, fewer retrieval error risks.

These gains come with challenges, however: high computational cost, increased latency, and the need to make selective attention mechanisms reliable (to prevent the model from “getting lost” in a million tokens).

But the trend is clear: AI is learning to manage context at human scale.

RAG and Long Context: Toward Reasoned Coexistence

Should RAG be buried? Not yet.

As is often the case in technology, the disruption is gradual rather than sudden.

Hybrid RAG + long context architectures are gaining popularity. They combine the best of both worlds:

RAG, for quickly filtering and indexing large document repositories.
Long context, for deep reasoning on relevant excerpts.

IBM already mentions Context-Aware RAGs: systems capable of dynamically adjusting context size based on need.

The future of enterprise AI will therefore not be a binary choice between RAG and long context, but an intelligent orchestration of both approaches depending on the task, volume, and energy cost.
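
One way to picture this orchestration is a small routing function: send the whole corpus when it fits in the window, otherwise rank documents and keep the best ones, in full, up to a token budget. The sketch below is illustrative only. The names `Doc`, `estimate_tokens`, `score`, and `select_context` are hypothetical, the four-characters-per-token estimate is an assumption, and the word-overlap score stands in for a real retriever.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    text: str


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption: about 4 characters per token in English)."""
    return max(1, len(text) // 4)


def score(query: str, doc: Doc) -> int:
    """Toy relevance score: number of distinct query words found in the document."""
    words = set(doc.text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)


def select_context(query: str, docs: list[Doc], budget: int) -> list[Doc]:
    """Send everything if it fits; otherwise keep the best-ranked whole documents."""
    if sum(estimate_tokens(d.text) for d in docs) <= budget:
        return docs  # long-context path: no retrieval step at all
    selected, used = [], 0
    for doc in sorted(docs, key=lambda d: score(query, d), reverse=True):
        cost = estimate_tokens(doc.text)
        if used + cost <= budget:  # retrieval path: filter first, then read in full
            selected.append(doc)
            used += cost
    return selected


# Invented corpus, for illustration only.
docs = [
    Doc("audit", "audit findings on supplier contracts " * 50),
    Doc("menu", "cafeteria menu for the week " * 50),
    Doc("policy", "supplier contracts policy and approval thresholds " * 50),
]
query = "supplier contracts audit"
print([d.name for d in select_context(query, docs, budget=2000)])
print([d.name for d in select_context(query, docs, budget=1100)])
```

With the larger budget, all three documents are passed through untouched. With the smaller one, the irrelevant document is dropped, but the two that remain are still read whole rather than in fragments.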

Implications for Businesses

For innovation and IT leadership, the change is strategic.

RAG projects, often launched at great expense, will need rethinking: less pipeline, more understanding.

Three major developments are emerging:

Reduction of technical debt: less dependence on third-party components (vector databases, ranking tools).
Strengthening sovereignty: data remains within the company’s perimeter, without outsourcing to external search systems.
New use cases: analysis of complete corpora (audits, diagnostics, compliance), creation of knowledge copilots capable of interpreting business documentation without RAG.

In the short term, this transformation will disrupt the business models of vendors specializing in RAG pipelines. In the long term, it redirects R&D toward more contextual, more explainable AI that’s closer to human reasoning.

And Tomorrow? Toward Augmented Cognition

Researchers at MIT and Anthropic predict a convergence between long contextual memory and incremental reasoning: AI systems will no longer need to retrieve fragments; they will remember past interactions in order to reason over time.

We’ll thus move from retrieval-augmented generation to reasoning-augmented cognition: systems capable of building dynamic understanding of the world.

RAG will then have played its role: a transitional stage in the quest for AI capable not only of retrieving, but of thinking.

From RAG to Reason

RAG was a brilliant invention, born from technical constraint.

But technological innovation advances quickly, and long context models now redefine the rules of the game.

As an Anthropic engineer recently summarized:

“RAG was the crutch for short-memory AI. Today, they walk on their own.”

The future of AI will no longer consist of retrieving information, but understanding it in all its complexity.

And this shift, silent but radical, could well mark the end of an era—that of ingenious tinkering, replaced by the advent of true artificial cognition.
