Retrieval-Augmented Generation (RAG) systems have become a cornerstone in integrating generative AI with real-time enterprise data. By enhancing large language models (LLMs) with external knowledge sources, RAG systems enable more accurate, contextually relevant responses and substantially improve how information is retrieved from both structured and unstructured enterprise data.
RAG systems augment the capabilities of LLMs by incorporating external information through a structured process: documents are split into chunks, each chunk is converted into a vector embedding, and those embeddings are stored in a vector database.
When a user submits a query, it's likewise converted into a vector embedding and compared against the stored chunk embeddings using similarity search algorithms. The most relevant chunks are retrieved and used to augment the user's prompt before the LLM generates a response. This approach enhances the accuracy and contextual relevance of AI outputs, making it particularly valuable for enterprises that need to interact with vast amounts of data efficiently.
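To make that flow concrete, here is a minimal sketch of the retrieve-and-augment loop. It assumes the sentence-transformers library; the model name, sample chunks, and prompt template are illustrative choices, not part of any specific RAG product:

```python
# Minimal sketch of the retrieve-and-augment flow described above.
# Assumes: pip install sentence-transformers numpy; the model choice is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

# In a real system these chunks would come from a document ingestion pipeline.
chunks = [
    "The company's revenue grew by 3% over the previous quarter.",
    "Operating expenses decreased by 5% year over year.",
    "The new product line launches in Q4.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Embed the query, score it against stored chunk vectors, return the best chunks."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector          # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

query = "What was the revenue growth for ACME Corp in Q2 2023?"
context = "\n".join(retrieve(query))
augmented_prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `augmented_prompt` is then sent to the LLM of your choice.
```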
It's important to note, however, that this type of RAG system, in which a vector database stores and retrieves the data, is among the most popular implementations but not the only one. Other RAG systems use knowledge graphs to store and retrieve information, for instance. Knowledge graphs excel at representing relationships between data points explicitly, which can be particularly useful in enterprise contexts where structured data with clear relationships between entities is more valuable than semantic similarity.
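For contrast, a graph-based retriever answers by traversing explicit relationships rather than by comparing embeddings. A toy sketch using networkx, with entities and relations invented purely for illustration:

```python
# Toy illustration of graph-based retrieval: relationships are stored explicitly
# and queried by traversal instead of by vector similarity. Entities and relations are invented.
import networkx as nx

graph = nx.MultiDiGraph()
graph.add_edge("ACME Corp", "Q2 2023 report", relation="filed")
graph.add_edge("Q2 2023 report", "revenue growth: 3%", relation="states")
graph.add_edge("XYZ Corp", "Q2 2023 report (XYZ)", relation="filed")

def facts_about(entity: str) -> list[str]:
    """Walk outgoing edges from an entity and return human-readable facts."""
    return [
        f"{entity} --{data['relation']}--> {target}"
        for _, target, data in graph.out_edges(entity, data=True)
    ]

print(facts_about("ACME Corp"))  # ['ACME Corp --filed--> Q2 2023 report']
```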
Understanding the various types of search methodologies is crucial for optimizing RAG systems. Each search type offers unique advantages and limitations that affect how effectively a system can retrieve and interpret data. In complex enterprise environments, where data is vast and varied, leveraging the appropriate search technique can significantly impact the relevance and accuracy of information retrieval. Let's explore the key search types and their roles in RAG systems.
A significant issue with traditional RAG systems is that individual text chunks often lack sufficient context for accurate retrieval and understanding. When chunks are too small or stripped of their surrounding information, they may not provide enough detail to be useful in isolation.
Imagine an enterprise knowledge base containing numerous financial reports from different companies. A user asks, "What was the revenue growth for ACME Corp in Q2 2023?"
A relevant chunk in the database might read: "The company's revenue grew by 3% over the previous quarter."
Using similarity search, the system compares the embedding of the user's query against the stored chunk embeddings to find matches. However, this chunk doesn't specify which company or quarter it refers to. In a database filled with similar reports from various companies, the system might just as easily retrieve chunks from other organizations like "XYZ Corp" or "Global Tech Inc." The similar language across these documents can cause the system to retrieve irrelevant information, leading to inaccurate or misleading responses.
This problem arises because the chunks lack specific context, making it difficult for the similarity search algorithm to distinguish between them based solely on vector embeddings.
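The failure mode is easy to reproduce: embed two near-identical chunks from different reports and compare their similarity to the query. This reuses the same embedding setup as the sketch above, and the chunk texts are invented:

```python
# Two near-identical chunks from different reports score almost the same against
# the query, so the retriever has no reliable way to pick the right company.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "What was the revenue growth for ACME Corp in Q2 2023?"
acme_chunk = "The company's revenue grew by 3% over the previous quarter."
xyz_chunk = "The company's revenue grew by 4% over the previous quarter."

q, a, x = model.encode([query, acme_chunk, xyz_chunk], normalize_embeddings=True)
print(f"similarity to ACME chunk: {q @ a:.3f}")
print(f"similarity to XYZ chunk:  {q @ x:.3f}")  # nearly identical scores
```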
To address these challenges, Anthropic introduced the concept of contextual retrieval. Instead of relying on isolated chunks that may lack crucial information, they propose prepending chunk-specific explanatory context to each chunk before it is embedded and stored in the vector database.
But what exactly are contextualized chunks? They are chunks enriched with additional information that clarifies their content and origin. This extra context turns ambiguous chunks into more detailed, self-contained pieces of information, which can make the difference between a successful retrieval and a failed one.
In their blog post [1], Anthropic not only introduced the implementation method they used for contextual retrieval but also added it to their cookbook. Instead of manually annotating the chunks, they found another use case for their LLM, Claude: using it to automatically generate the necessary context for each chunk, streamlining the process. By providing Claude with a prompt and the entire document from which the chunks were derived, they instructed it to create a concise contextual summary (usually 50-100 tokens) for each chunk based on the overall document.
By adding this context, the chunk becomes self-explanatory, enabling the retrieval system to more accurately match it with the user's query.
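A minimal sketch of what this context generation could look like with the Anthropic Python SDK follows; the prompt wording and model name are illustrative assumptions, not Anthropic's exact implementation:

```python
# Sketch: ask Claude to write a short situating context for each chunk, then
# prepend it before embedding. Prompt wording and model name are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(document: str, chunk: str) -> str:
    """Return the chunk with a short, document-aware context prepended."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": (
                f"<document>\n{document}\n</document>\n"
                f"Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
                "Write a short (50-100 token) context that situates this chunk "
                "within the document, to improve search retrieval. "
                "Answer with the context only."
            ),
        }],
    )
    context = response.content[0].text
    return f"{context}\n{chunk}"  # contextualized chunk, ready to embed
```

The contextualized string is what gets embedded and stored, so at query time the retriever sees both the original wording and the situating details such as company name and reporting period.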
The approach Anthropic demonstrates goes beyond traditional RAG implementations by combining similarity search on contextualized vector embeddings with BM25, a ranking function that uses lexical matching to find precise word or phrase matches. BM25 builds upon TF-IDF (Term Frequency-Inverse Document Frequency), which measures how important a word is to a document. While BM25 is not typically part of standard RAG systems, integrating it here provides a hybrid search capability that leverages both semantic understanding and exact keyword matching.
In their tests, they retrieved the top 20 chunks (the value of k can be freely chosen) using this combined method, which overcomes the limitations of relying on similarity search alone. While embedding models are effective at capturing semantic relationships, they can overlook important exact matches. By integrating BM25's lexical search, Anthropic's method ensures both precise matching and deeper semantic understanding.
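A simplified sketch of such a hybrid ranker is shown below, using the rank_bm25 package for the lexical side. The 50/50 score fusion is an illustrative choice for brevity, not Anthropic's exact fusion method:

```python
# Hybrid retrieval sketch: combine semantic similarity with BM25 lexical scores.
# The 50/50 weighting is illustrative; rank-based fusion schemes are common as well.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = [
    "ACME Corp Q2 2023: the company's revenue grew by 3% over the previous quarter.",
    "XYZ Corp Q2 2023: the company's revenue grew by 4% over the previous quarter.",
    "Global Tech Inc announced a new data center in 2022.",
]

# Lexical index (BM25 works on tokenized text)
bm25 = BM25Okapi([c.lower().split() for c in chunks])

# Semantic index
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def hybrid_search(query: str, top_k: int = 2) -> list[str]:
    lexical = np.array(bm25.get_scores(query.lower().split()))
    semantic = chunk_vectors @ model.encode([query], normalize_embeddings=True)[0]
    # Normalize each score list to [0, 1] before mixing so neither side dominates.
    lexical = (lexical - lexical.min()) / (lexical.max() - lexical.min() + 1e-9)
    semantic = (semantic - semantic.min()) / (semantic.max() - semantic.min() + 1e-9)
    combined = 0.5 * lexical + 0.5 * semantic
    return [chunks[i] for i in np.argsort(combined)[::-1][:top_k]]

print(hybrid_search("ACME Corp revenue growth Q2 2023"))
```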
Anthropic conducted comparative tests to evaluate the effectiveness of their contextual retrieval approach and here is the summary of the results:
While Anthropic's contextual retrieval approach enhances accuracy, it can increase computational costs due to the additional context generation. To mitigate this, they use Claude's recently released prompt caching feature. Prompt caching allows, in this case, the document connected to the chunks being contextualized to be loaded into the cache, so the model doesn't need to reprocess the entire document for every chunk. This significantly reduces the cost per token when processing large documents.
It's important to note that prompt caching is a separate optimization from the contextualized chunking approach. While they can be used together for cost savings, each can function independently. Prompt caching alone helps reduce costs in scenarios where the same documents are frequently processed, making it a valuable tool even without contextualized chunks.
At the moment, the prompt caching feature is available on Anthropic's primary API and is expected to roll out soon to partner environments like AWS Bedrock and GCP Vertex AI. This feature helps enterprises manage computational costs effectively.
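A rough sketch of how the document can be cached across many chunk-contextualization calls is shown below. The header and parameter names follow the Anthropic SDK's prompt caching beta at the time of writing and may change; the prompt text is again an assumption:

```python
# Sketch: mark the (large) document as cacheable so repeated contextualization calls
# for chunks of the same document reuse the cached prefix instead of re-processing it.
# Beta header and parameter names may change over time; prompt text is illustrative.
import anthropic

client = anthropic.Anthropic()

def contextualize_with_cache(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=[{
            "type": "text",
            "text": f"<document>\n{document}\n</document>",
            "cache_control": {"type": "ephemeral"},  # the document prefix gets cached
        }],
        messages=[{
            "role": "user",
            "content": (
                f"Here is a chunk from the document above:\n<chunk>\n{chunk}\n</chunk>\n"
                "Write a short context that situates this chunk within the document."
            ),
        }],
    )
    return f"{response.content[0].text}\n{chunk}"
```

Because the cached document prefix is identical for every chunk of the same document, only the short per-chunk suffix is processed at full cost on subsequent calls.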
For enterprises, the ability to retrieve accurate and contextually relevant information is critical, especially in scenarios involving large knowledge bases like ticketing systems or helpdesk platforms. In these environments, users often submit queries that are inconsistent or lack precision. They might ask, "How do I reset my password for the HR portal?" or "What's the status of my last support ticket?"
Implementing Anthropic's contextual retrieval approach in such systems offers tangible benefits:
For example, in a helpdesk scenario, a user's query about a software error can be matched with the precise solution from a technical manual, even if the user's description doesn't use the exact terminology found in the documentation. The contextualized chunks ensure that relevant solutions are retrieved, enhancing the effectiveness and reliability of the support provided.
Beyond Anthropic's contextual retrieval, several other optimization strategies can enhance RAG systems:
Innovations in RAG optimization, like Anthropic's contextual retrieval approach, are crucial for solving real business problems at scale. Enterprises often grapple with fragmented data sources, inconsistent user queries, and the need for efficient, accurate information retrieval.
By enhancing RAG systems to handle these challenges, businesses can:
Investing in RAG optimization transforms genAI apps from experimental tools into practical, scalable solutions. For enterprises, this means leveraging the full potential of AI to drive growth, streamline operations, solve real problems, and deliver measurable outcomes in a competitive business landscape.
[1] - https://www.anthropic.com/news/contextual-retrieval