Measuring retrieval and response quality in RAG-based LLMs
Common Failures in RAG-based LLM apps
RAG-based LLM apps are powerful, but they almost always have kinks and imperfections to iron out.
Here are some common ones:
Bad retrieval
- Retrieved information is not aligned with the ground truth (Context Recall, sketched with a toy example below this list)
- Relevant chunks are retrieved, but they are not ranked highly (Context Precision)
- Retrieved context doesn't contain enough information to answer the query (Context Sufficiency)
- Retrieved information is not relevant to the query (Context Relevancy)
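To make the retrieval metrics concrete, here is a deliberately simplified sketch of the idea behind Context Recall: what fraction of the ground-truth statements can actually be found in the retrieved context. The function below is a toy stand-in that uses string matching; the real evaluators (like the Ragas ones used later in this guide) use an LLM judge, and the function name and matching logic here are purely illustrative.

```python
# Toy illustration of Context Recall: what fraction of ground-truth statements
# are supported by the retrieved context? (Real evaluators use an LLM judge,
# not substring matching; this only shows the shape of the metric.)
from typing import List

def toy_context_recall(ground_truth_statements: List[str], retrieved_chunks: List[str]) -> float:
    context = " ".join(retrieved_chunks).lower()
    supported = [s for s in ground_truth_statements if s.lower() in context]
    return len(supported) / len(ground_truth_statements) if ground_truth_statements else 0.0

print(toy_context_recall(
    ground_truth_statements=["Y Combinator was founded in 2005"],
    retrieved_chunks=["Y Combinator was founded in 2005 by Paul Graham and others."],
))  # 1.0 -> the retrieved context covers the ground truth
```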
Bad outputs
- Response says something that cannot be inferred from the context (Faithfulness)
- Response contains sentences that are not grounded in the context (Groundedness, sketched with a toy example below this list)
- Conversation / chat has messages that are not coherent given the previous messages (Conversation Coherence)
- Some other criteria... (Custom Evaluation)
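Similarly, Groundedness asks how much of the response is actually backed by the retrieved context, typically judged sentence by sentence. The snippet below is again only a toy word-overlap stand-in for the LLM-based evaluators used later; the names and the matching rule are illustrative, not the library's implementation.

```python
# Toy illustration of Groundedness: what fraction of response sentences are
# backed by the retrieved context? (Real evaluators judge support with an LLM;
# the word-overlap rule here is only illustrative.)
from typing import List

def toy_groundedness(response_sentences: List[str], retrieved_chunks: List[str]) -> float:
    context_words = set(" ".join(retrieved_chunks).lower().split())
    grounded = [
        s for s in response_sentences
        if set(s.lower().split()) <= context_words  # every word of the sentence appears in the context
    ]
    return len(grounded) / len(response_sentences) if response_sentences else 0.0

print(toy_groundedness(
    response_sentences=["yc was founded in 2005", "yc is based on mars"],
    retrieved_chunks=["yc was founded in 2005 and is based in san francisco"],
))  # 0.5 -> the second sentence is not grounded in the context
```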
How to detect such issues
Just plug in the evaluators you need and run the evals on your dataset.
from athina.evals import RagasContextPrecision, RagasAnswerCorrectness, RagasContextRelevancy, RagasContextRecall, RagasFaithfulness, Groundedness
import os
from athina import evals
from athina.loaders import Loader
from athina.keys import OpenAiApiKey
from athina.runner.run import EvalRunner
from athina.datasets import yc_query_mini
import pandas as pd
from dotenv import load_dotenv
load_dotenv()
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
# Load a dataset from list of dicts
raw_data = yc_query_mini.data
dataset = Loader().load_dict(raw_data)
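# Note: yc_query_mini is a small sample dataset that ships with athina. To evaluate your
# own app, pass your own list of dicts to Loader().load_dict(...). The keys below mirror
# the sample dataset's RAG schema and are illustrative; adjust them to whatever your
# loader expects.
# raw_data = [
#     {
#         "query": "Who founded Y Combinator?",
#         "context": ["Y Combinator was founded in 2005 by Paul Graham, Jessica Livingston, Trevor Blackwell and Robert Morris."],
#         "response": "Y Combinator was founded by Paul Graham, Jessica Livingston, Trevor Blackwell and Robert Morris.",
#         "expected_response": "Paul Graham, Jessica Livingston, Trevor Blackwell and Robert Morris founded Y Combinator.",
#     }
# ]
# dataset = Loader().load_dict(raw_data)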
# View dataset in a dataframe
pd.DataFrame(dataset)
# Define evaluation suite
model = "gpt-4-turbo-preview"
eval_suite = [
    evals.RagasAnswerCorrectness(model=model),
    evals.RagasContextPrecision(model=model),
    evals.RagasContextRelevancy(model=model),
    evals.RagasContextRecall(model=model),
    evals.ContextContainsEnoughInformation(model=model),
    evals.RagasFaithfulness(model=model),
    evals.Faithfulness(model=model),
    evals.Groundedness(model=model),
    evals.DoesResponseAnswerQuery(model=model)
]
# Run the evaluation suite
batch_eval_result = EvalRunner.run_suite(
    evals=eval_suite,
    data=dataset,
    max_parallel_evals=8
)
batch_eval_result
You can run these evaluations in a Python notebook and view the results in a dataframe. See the Example Notebook on GitHub.
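If batch_eval_result comes back as a list of per-datapoint records, pandas can tabulate it directly. The exact return shape depends on your athina version, so treat the snippet below as an assumption and inspect the object first if it doesn't match.

```python
# Assumption: batch_eval_result is a list of dict-like records, one per datapoint/eval.
# If your athina version returns a different structure, inspect it first with print(batch_eval_result).
import pandas as pd

df = pd.DataFrame(batch_eval_result)
df.head()
```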