Measuring retrieval and response quality in RAG-based LLMs

Common Failures in RAG-based LLM apps

RAG-based LLM apps are great, but there are always a lot of kinks and imperfections to iron out.

Here are some common ones:

Bad retrieval: the retriever returns chunks that are irrelevant, incomplete, or missing the information needed to answer the query.

Bad outputs: the model hallucinates, contradicts the retrieved context, or fails to actually answer the user's question.
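
To make these failure modes concrete, here is a hypothetical datapoint (the field names are illustrative, not a required schema) where the retrieved context misses the answer and the response hallucinates one anyway:

# A hypothetical RAG datapoint illustrating both failure modes
datapoint = {
    "query": "What year was the company founded?",
    # Bad retrieval: the chunk is topically related but does not contain the founding year
    "context": ["The company is headquartered in Berlin and employs around 200 people."],
    # Bad output: the response asserts a year that is not supported by the retrieved context
    "response": "The company was founded in 2012.",
}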

How to detect such issues

Just plug in the evaluators you need and run the evals on your dataset:

import os
from athina import evals
from athina.loaders import Loader
from athina.keys import OpenAiApiKey
from athina.runner.run import EvalRunner
from athina.datasets import yc_query_mini
import pandas as pd
 
from dotenv import load_dotenv
load_dotenv()
 
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
 
# Load a dataset from list of dicts
raw_data = yc_query_mini.data
dataset = Loader().load_dict(raw_data)
 
# View dataset in a dataframe
pd.DataFrame(dataset)
 
# Define evaluation suite
model = "gpt-4-turbo-preview"
eval_suite = [
    evals.RagasAnswerCorrectness(model=model),
    evals.RagasContextPrecision(model=model),
    evals.RagasContextRelevancy(model=model),
    evals.RagasContextRecall(model=model),
    evals.ContextContainsEnoughInformation(model=model),
    evals.RagasFaithfulness(model=model),
    evals.Faithfulness(model=model),
    evals.Groundedness(model=model),
    evals.DoesResponseAnswerQuery(model=model)
]
 
# Run the evaluation suite
batch_eval_result = EvalRunner.run_suite(
    evals=eval_suite,
    data=dataset,
    max_parallel_evals=8
)
batch_eval_result
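
If you want to evaluate your own app's data instead of the bundled sample dataset, here is a minimal sketch, assuming your records follow the same query / context / response / expected_response shape as yc_query_mini:

# Sketch: build a dataset from your own RAG logs instead of the sample data.
# Assumes each record carries the query, the retrieved context, the generated
# response, and (optionally) a reference answer for correctness-style evals.
my_raw_data = [
    {
        "query": "What does the company do?",
        "context": ["The company builds evaluation tooling for LLM applications."],
        "response": "It builds evaluation tooling for LLM applications.",
        "expected_response": "The company makes evaluation tools for LLM apps.",
    },
]
my_dataset = Loader().load_dict(my_raw_data)
 
# Run the same suite on it:
# EvalRunner.run_suite(evals=eval_suite, data=my_dataset, max_parallel_evals=8)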

You can run these evaluations in a Python notebook and view the results in a dataframe: Example Notebook on GitHub