An Introduction to RAG Evaluation
Plus, a hands-on notebook that shows you how to use ragas to compute common RAG evaluation metrics
📏 Evaluation gives you concrete numbers that tell you how accurate the system is, how relevant its answers are, and how well it's working overall.
🔍 With an evaluation system in place, you can compare different models, prompts, and contexts to determine what works best.
📈 Regular evaluation will help you assess the quality of your RAG pipeline over time.
Without checking in on the RAG system's performance, it's difficult to know what needs fixing and how to improve it.
Recall the basic steps in creating a Retrieval Augmented Generation (RAG) system:
Create an Index
Retrieve relevant context from the index that is similar to the query
Generate a response by injecting the retrieved context into a prompt and sending that to an LLM
Diagram illustrating the components of a RAG system, including the retriever and generator processes. Source: AI Makerspace
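To make these steps concrete, here's a minimal sketch of a bare-bones RAG pipeline in Python. Everything here (the embed function, vector_db object, and llm callable) is a hypothetical stand-in, not any particular library's API:

```python
# A minimal RAG sketch. `embed`, `vector_db`, and `llm` are hypothetical
# stand-ins for your embedding model, vector store, and LLM of choice.

def retrieve(query: str, vector_db, embed, k: int = 3) -> list[str]:
    """Fetch the k indexed chunks most similar to the query."""
    query_vector = embed(query)
    return vector_db.search(query_vector, top_k=k)

def generate(query: str, contexts: list[str], llm) -> str:
    """Inject the retrieved context into a prompt and send it to the LLM."""
    context_block = "\n".join(contexts)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)

def rag_answer(query: str, vector_db, embed, llm) -> str:
    return generate(query, retrieve(query, vector_db, embed), llm)
```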
Take a moment to pause and think about the two primary components of this system.
Evaluating Different RAG Components
A RAG pipeline has two main parts: retrieval and generation.
🔍 Retrieval component
This fetches external context from the vector database.
If the retriever makes mistakes, those mistakes will carry over to the generator. The retrieved sources must be relevant to the user's query.
So you need a way to measure how closely the retrieved context matches what the user is asking about.
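As a rough first pass (this is not how ragas computes its metrics), you can score this with embedding cosine similarity. This sketch assumes the sentence-transformers package; the model name is just one common choice:

```python
# A rough proxy for context relevance: cosine similarity between the
# query embedding and each retrieved chunk's embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What metrics does ragas compute?"
contexts = [
    "ragas computes faithfulness, answer relevancy, and context recall.",
    "Vector databases store embeddings for fast similarity search.",
]

query_emb = model.encode(query, convert_to_tensor=True)
context_embs = model.encode(contexts, convert_to_tensor=True)
scores = util.cos_sim(query_emb, context_embs)[0]  # shape: (len(contexts),)

for context, score in zip(contexts, scores):
    print(f"{score.item():.2f}  {context}")
```

A chunk that scores much lower than its neighbors is a candidate for noise that the retriever shouldn't have surfaced.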
🤖 Generation component
This combines a prompt template, the user's question, and the context it retrieved to generate an answer using an LLM.
You want the generated response to be grounded in the retrieved context, relevant to the user's query, and adhere to any guidelines you have in place.
You need a way to measure the response's relevance to the retrieved context, as well as the coherence and fluency of the generated text.
We need to check how well these components work independently and together.
Aspects of Evaluation
Regarding retrieval and generation, there are two main things to measure: quality and ability.
📊 Measuring Quality
Quality is measured via relevance and faithfulness.
🎯 Relevance
Relevance should be measured for the retrieved context and the generated response.
The context retrieved should be precise. The more relevant each bit of context is, the better the generation will be.
The generated answers should be directly relevant to the user's query.
🤝 Faithfulness
This ensures the answers the system generates are faithful to the context it retrieved. The generated response shouldn't have any contradictions or inconsistencies.
🏹 Measuring Ability
🔇 Noise Robustness: This checks how well the model handles noisy context, that is, context that relates to the question but doesn't contain useful information.
🙅‍♂️ Negative Rejection: This looks at how well the model knows when to say, "I don't know." If the retrieved context doesn't have the info needed to answer the question, it should not try to answer.
🧩 Information Integration: This tests how good the model is at putting together context from multiple documents. This is important for handling complex questions that need context from different sources.
🔮 Counterfactual Robustness: This checks if the model can spot and ignore things in the retrieved context that it knows are wrong, even if it's told there might be some misinformation in them.
When evaluating retrieval quality, context relevance and noise robustness are the key considerations; when evaluating generation quality, look at answer faithfulness, answer relevance, negative rejection, information integration, and counterfactual robustness.
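A practical way to probe these abilities is to hand-build small test cases that target each one. The questions, contexts, and expected behaviors below are invented purely for illustration:

```python
# Hand-rolled probes for two of the abilities above. The questions,
# contexts, and expected behaviors are made up for this sketch.
ability_tests = [
    {
        "ability": "noise_robustness",
        "question": "When was ragas first released?",
        "contexts": [
            "ragas is a framework for evaluating RAG pipelines.",   # related, no date
            "Evaluation metrics include faithfulness and recall.",  # noise
        ],
        "expect": "answer acknowledges the release date is not in the context",
    },
    {
        "ability": "negative_rejection",
        "question": "What is the capital of Atlantis?",
        "contexts": ["Atlantis is a legendary island from Plato's dialogues."],
        "expect": "model declines to answer rather than inventing a capital",
    },
]

# Feed each case through your pipeline and check the answer against `expect`,
# either manually or with an LLM judge (more on that below).
```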
📐 Evaluation Metrics
A robust RAG evaluation requires carefully measuring the retriever and generator performance.
It's important to use the right metrics and construct datasets that match how the system will be used in the real world. A close look at each component is the key to figuring out where RAG systems need to improve.
🤝 Faithfulness: This examines how factually accurate the answer is based on the context the retriever found. We want to ensure the LLM isn't hallucinating (making up plausible-sounding text that is inaccurate). A formula sketch follows this list.
🔍 Answer Relevancy: This measures how relevant the answer is to the user's query, given the retrieved context, and whether it covers what the question is asking for.
🪡 Context Precision: This evaluates whether all the ground-truth relevant items present in the retrieved contexts are ranked appropriately, with the most relevant chunks at the top.
✅ Answer Correctness: This measures how accurate the generated answer is compared to the ground truth.
📊 Context Recall: This is the proportion of relevant context chunks the retriever manages to retrieve, out of the total number of relevant chunks.
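Roughly, here's how the first and last of these are defined (a simplified paraphrase of the ragas docs; both scores range from 0 to 1, higher is better):

```latex
% Simplified definitions, paraphrasing the ragas docs.
\text{faithfulness} =
  \frac{\left|\,\text{answer claims supported by the retrieved context}\,\right|}
       {\left|\,\text{total claims in the answer}\,\right|}

\text{context\_recall} =
  \frac{\left|\,\text{ground-truth sentences attributable to the retrieved context}\,\right|}
       {\left|\,\text{total sentences in the ground truth}\,\right|}
```

In both cases, an LLM does the claim/sentence extraction and the support check, which is why these metrics need an LLM behind them.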
Before computing evaluation metrics, you need to make sure:
📂 The evaluation datasets cover many question types, complexities, and domains. This helps assess the RAG system's performance across different scenarios and use cases.
🌍 The datasets closely resemble real-world queries and information needs. This provides a more accurate assessment of how the RAG system would perform in practical applications.
👨🏽‍⚖️ LLM-as-a-Judge
LLM-as-a-Judge is a technique that uses an LLM to automatically evaluate the outputs of other LLMs in a way that aims to approximate human judgment.
The key idea is to leverage strong LLMs to assess generated text that would normally require human evaluation, such as the metrics mentioned above. Using LLMs as judges is a scalable and cost-effective way to evaluate RAG systems. It enables faster benchmarking and iteration cycles during AI development.
However, LLM judges have some limitations and biases that need to be carefully addressed, such as:
Position bias: Favoring responses based on where they appear (e.g., the first answer in a pairwise comparison)
Verbosity bias: Preferring longer outputs
Limited reasoning capabilities for complex topics like math
Despite these challenges, studies have shown that with proper prompt engineering and debiasing techniques, strong LLM judges like GPT-4 can achieve high agreement (>80%) with human preferences and expert ratings across different benchmarks. LLM-as-a-Judge is a promising complement to human evaluation for assessing open-ended AI tasks at scale.
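As a minimal sketch of the idea, here's a single-metric judge using the openai Python client. The rubric, 1-5 scale, and model choice are my own illustrative assumptions, not a standard:

```python
# Minimal LLM-as-a-Judge sketch using the openai client.
# The rubric, scale, and model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a RAG system's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

On a scale of 1-5, how faithful is the answer to the retrieved context?
Reply with only the number."""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer
            ),
        }],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())
```

In practice, you'd average scores over many samples and sanity-check the rubric against a handful of human-graded examples before trusting it.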
ragas
We'll make use of ragas as our RAG evaluation framework. To compute RAG evaluation metrics, you need to build an evaluation dataset with the following features:
❓ question: This is the user's query that goes into the RAG pipeline.
🤖 answer: This is what the LLM generates as a response to the question. It's the final output you want to evaluate.
📚 contexts: These are the retrieved contexts the retriever pulls from external sources to help it answer the question.
✅ ground_truths: This is the "correct answer" to the question, provided by human annotators or generated via a powerful LLM. It's the gold-standard answer to which you'll compare any iteration of your RAG system's answer. This is only needed for measuring the context_recall metric.
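Putting those four features together, a minimal ragas run looks roughly like this (metric names and the evaluate signature follow the ragas version current at the time of writing and may differ in later releases; the sample row is made up):

```python
# A toy evaluation dataset with the four features described above.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

data = {
    "question": ["What does the retriever in a RAG system do?"],
    "answer": ["It fetches context relevant to the query from a vector database."],
    "contexts": [[
        "The retrieval component fetches external context from the vector database."
    ]],
    "ground_truths": [[
        "The retriever pulls relevant context from an external index."
    ]],
}

dataset = Dataset.from_dict(data)

# ragas calls an LLM to score these metrics, so OPENAI_API_KEY must be set.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # dict-like mapping of metric name to score
```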
👨🏽‍💻 Code
The notebooks below will get you hands-on with ragas and explain how each metric is calculated.
Enjoy!
Thanks for reading!
If you found the newsletter helpful, please do me a favour and smash that like button!
Also, remember to share it with your network (whether via re-stack, tweet, or LinkedIn post) so that they can benefit from it, too.
🤗 This small gesture and your support mean far more to me than you can ever imagine!
Cheers,
Harpreet