
Running Evals

💡 The easiest way to get started is to follow this notebook.

1. Configure API Keys

Evals use OpenAI, so you need to configure your OpenAI API key.

If you wish to view your results on Athina Develop and keep a historical record of the prompts and experiments you run during development, you will also need an Athina API key.

import os
 
from athina.keys import AthinaApiKey, OpenAiApiKey
 
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))  # optional

2. Load your dataset

Loading a dataset is straightforward: we support JSON and CSV formats.

from athina.loaders import RagLoader
 
# Load the data from CSV, JSON, Athina, or a Python dictionary.
# json_file is the path to your JSON dataset file.
dataset = RagLoader().load_json(json_file)
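For reference, here is a minimal sketch of what a RAG dataset might look like when loaded from a Python dictionary. The sample records are illustrative, and the load_dict call assumes the dictionary loader follows the same pattern as load_json.

from athina.loaders import RagLoader
 
# Illustrative records only: each datapoint carries the query, the retrieved
# context, and the model's response (the fields the RAG evals expect).
raw_data = [
    {
        "query": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
    },
]
 
# Assumed to mirror load_json; use load_json or load_csv for file-based data.
dataset = RagLoader().load_dict(raw_data)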

3. Run an eval on a dataset

Running evals on a batch of datapoints is the most effective way to rapidly iterate as you're developing your model.

from athina.evals import ContextContainsEnoughInformation
 
# Run the ContextContainsEnoughInformation evaluator on the dataset
ContextContainsEnoughInformation(
    model="gpt-4-1106-preview",
    max_parallel_evals=5, # optional, speeds up evals
).run_batch(dataset).to_df()

Your results will be returned as a pandas dataframe, with one row per datapoint in your dataset.
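If you want to keep the results around, you can capture the dataframe and write it out. A minimal sketch, repeating the call above with the result assigned to a variable; the filename is hypothetical and the export uses the standard pandas API.

# Capture the eval results as a dataframe.
results_df = ContextContainsEnoughInformation(
    model="gpt-4-1106-preview",
    max_parallel_evals=5,
).run_batch(dataset).to_df()
 
# Persist the eval run locally for later comparison (hypothetical filename).
results_df.to_csv("eval_results.csv", index=False)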


How do I know which fields I need in my dataset?

For the RAG Evals, we need 3 fields: query, context, and response.

For these evals, you should use the RagLoader to load your data. This will ensure the data is in the right format for evals.

Every evaluator has a REQUIRED_ARGS property that defines the parameters it expects.

If you pass the wrong parameters, the evaluator will raise a ValueError telling you which parameters are missing.

For example, the Faithfulness evaluator expects the response and context fields.
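As a quick sanity check, you can see this behavior by calling an evaluator without one of its required arguments. A minimal sketch, assuming the single-datapoint run() interface shown in the next section; the exact error message text will vary.

from athina.evals import Faithfulness
 
try:
    # Deliberately omit the required "context" argument.
    Faithfulness().run(response="The capital of France is Paris.")
except ValueError as e:
    # The error lists the parameters the evaluator still needs.
    print(e)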


Run an eval on a single datapoint

Running an eval on a single datapoint is very simple.

This might be useful if you are trying to run the eval immediately after inference.

from athina.evals import DoesResponseAnswerQuery
 
# Run the answer relevance evaluator.
# Checks if the LLM response answers the user query sufficiently.
DoesResponseAnswerQuery().run(query=query, response=response)
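Wired into an inference step, it might look like the sketch below. The llm_app() helper is hypothetical and stands in for your own inference code.

from athina.evals import DoesResponseAnswerQuery
 
def llm_app(user_query: str) -> str:
    # Placeholder for your own inference code (e.g. a chat completion call).
    return "The capital of France is Paris."
 
# Generate a response, then evaluate it immediately.
query = "What is the capital of France?"
response = llm_app(query)
 
result = DoesResponseAnswerQuery().run(query=query, response=response)
print(result)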

Here's a notebook you can use to get started.