RAG evals
These evals are useful for most RAG-style applications.
They check for three things:
- Context Contains Enough Information: Does the retrieved context contain enough information to answer the query?
- Faithfulness: Is the response faithful to the retrieved context? (Unfaithful responses are correlated with hallucinations.)
- Does Response Answer Query: Does the response answer the user's query? Checks for relevance and completeness of the answer.
Context Contains Enough Information
Docs | Github
- Query: How much equity does Y Combinator take?
- Retrieved Context: YC invests $500,000 in 200 startups twice a year.
Eval Result
- Result: Fail
- Explanation: The context mentions that YC invests $500,000 but it does not mention how much equity they take, which is what the query is asking about.
One of the most common causes of a bad output is bad input. For RAG applications, this usually means a bad retrieval.
Typically, retrieval is done with a cosine similarity search against the user's query.
However, similar ≠ relevant.
Often, the retrieved data is simply not relevant to the user's query.
Sometimes it is relevant, but still does not contain the answer to the query.
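As a rough illustration, here is roughly what that similarity search looks like, assuming OpenAI embeddings; the model name, chunk list, and helper names are placeholders for whatever your retrieval stack actually uses:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    # Embedding model name is just an example; swap in your own.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def top_k_by_cosine(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        scored.append((float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c))), chunk))
    # Highest cosine similarity first. Note: "most similar" does not guarantee
    # the chunk is relevant, or that it actually contains the answer.
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]
```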
We use an LLM grader (GPT-4) to determine whether the retrieved data is relevant and contains enough information to answer the query.
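Below is a minimal sketch of such a grader, assuming the OpenAI chat API; the prompt wording, function name, and JSON output shape are illustrative choices, not the eval library's actual implementation:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONTEXT_SUFFICIENCY_PROMPT = """You are grading a RAG retrieval step.

Query: {query}
Retrieved context: {context}

Does the retrieved context contain enough information to answer the query?
Respond with JSON only: {{"result": "Pass" | "Fail", "explanation": "<one sentence>"}}"""

def context_contains_enough_information(query: str, context: str) -> dict:
    """LLM-graded check: can the query be answered from the retrieved context alone?"""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": CONTEXT_SUFFICIENCY_PROMPT.format(query=query, context=context)}],
    )
    # A production grader would parse the model output more defensively than this.
    return json.loads(completion.choices[0].message.content)

# The YC example from above: the context covers investment size, not equity -> expect a Fail.
print(context_contains_enough_information(
    query="How much equity does Y Combinator take?",
    context="YC invests $500,000 in 200 startups twice a year.",
))
```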
Faithfulness
Docs | Github
- Context: YC invests $500,000 in 200 startups twice a year.
- Response: YC takes 5-7% equity.
Eval Result
- Result: Fail
- Explanation: The response mentions that YC takes 5-7% equity, but this is not mentioned anywhere in the context.
Another common problem with RAG applications is a response that is not "faithful" to the context.
This is often the cause of hallucinations:
the LLM falls back on its pretrained knowledge to generate an answer.
But for most RAG apps, you want to constrain it to the context you provide (since you know that to be true).
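A faithfulness grader can be sketched in the same style; again, the prompt and output format here are illustrative assumptions, not the library's implementation:

```python
import json
from openai import OpenAI

client = OpenAI()

FAITHFULNESS_PROMPT = """You are grading a RAG response for faithfulness.

Context: {context}
Response: {response}

Is every claim in the response supported by the context? Claims that rely on
outside knowledge count as unfaithful, even if they happen to be true.
Respond with JSON only: {{"result": "Pass" | "Fail", "explanation": "<one sentence>"}}"""

def faithfulness(context: str, response: str) -> dict:
    """LLM-graded check: is the response supported by the provided context?"""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": FAITHFULNESS_PROMPT.format(context=context, response=response)}],
    )
    return json.loads(completion.choices[0].message.content)

# The example above: "5-7% equity" may be true, but it is not in the context -> expect a Fail.
print(faithfulness(
    context="YC invests $500,000 in 200 startups twice a year.",
    response="YC takes 5-7% equity.",
))
```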
Does Response Answer Query
Docs | Github
- Query: Which spaceship landed on the moon first?
- Response: Neil Armstrong was the first man to set foot on the moon in 1969.
Eval Result
- Result: Fail
- Explanation: The query is asking which spaceship landed on the moon first, but the response only mentions the name of the astronaut, and does not say anything about the name of the spaceship.
This is a good eval for nearly any Q&A-type application. It can help you catch cases where:
- The response is irrelevant or tangential to the query.
- The response does not sufficiently answer the query.
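A minimal sketch of an answer-relevance grader follows; as with the earlier sketches, the prompt, function name, and JSON shape are assumptions for illustration only:

```python
import json
from openai import OpenAI

client = OpenAI()

ANSWERS_QUERY_PROMPT = """You are grading whether a response answers the user's query.

Query: {query}
Response: {response}

Does the response directly and sufficiently answer the query? A response that is
merely tangential, or relevant but incomplete, should fail.
Respond with JSON only: {{"result": "Pass" | "Fail", "explanation": "<one sentence>"}}"""

def does_response_answer_query(query: str, response: str) -> dict:
    """LLM-graded check: does the response actually answer what was asked?"""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": ANSWERS_QUERY_PROMPT.format(query=query, response=response)}],
    )
    return json.loads(completion.choices[0].message.content)

# The moon-landing example above: the astronaut is named, the spaceship is not -> expect a Fail.
print(does_response_answer_query(
    query="Which spaceship landed on the moon first?",
    response="Neil Armstrong was the first man to set foot on the moon in 1969.",
))
```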