Challenges with running LLM evaluation in production at scale
We've spent much of the last year building an orchestration layer for production evaluation of LLM pipelines.
This is a simplified view of the architecture we had to build out to support running evals in production at scale.
Major Challenges
In that time, we learned that there are a LOT of challenges when trying to evaluate LLM outputs in production.
No ground truth in production
Unlike your test dataset in development, production logs don't include any ground truth.
Solution:
- You have to use creative techniques (often using another LLM) to evaluate retrievals and responses without ground truth.
- Keep up with the latest and greatest research techniques to add more evaluation metrics / improve reliability
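One common reference-free technique is "LLM-as-judge": prompt a second model to grade the response. A minimal sketch below, where `call_llm` is a hypothetical stub — in practice you would swap in your provider's SDK (OpenAI, Gemini, etc.):

```python
# Reference-free "LLM-as-judge" eval sketch. `call_llm` is a stand-in
# for a real chat-completion API call.

JUDGE_PROMPT = """You are grading an answer for faithfulness to the context.
Context: {context}
Answer: {answer}
Reply with exactly one word: Pass or Fail."""


def call_llm(prompt: str) -> str:
    # Hypothetical stub: a real implementation would call a model API here.
    return "Pass"


def judge_faithfulness(context: str, answer: str) -> bool:
    """Return True if the judge model says the answer is faithful."""
    verdict = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return verdict.strip().lower().startswith("pass")
```

The parsing step matters: constraining the judge to a one-word verdict makes the result machine-readable without ground truth labels.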
Cost Management
If you need to use an LLM for evaluation, it can get pretty expensive. Imagine running 5-10 evaluations per production log. The evaluation costs could be higher than the actual task costs!
Solution: Implement sampling + cost tracking mechanism
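A sampling + cost-tracking mechanism can be as simple as a probabilistic gate with a budget cap. This is an illustrative sketch (the class and its parameters are ours, not a specific library API):

```python
import random


class SampledEvalRunner:
    """Run an eval on only a fraction of production logs, and stop
    running once a spending budget is exhausted."""

    def __init__(self, sample_rate: float, cost_per_eval: float, budget: float):
        self.sample_rate = sample_rate      # fraction of logs to evaluate, 0..1
        self.cost_per_eval = cost_per_eval  # estimated $ per eval run
        self.budget = budget                # total $ allowed
        self.spent = 0.0

    def should_run(self) -> bool:
        # Gate on both remaining budget and random sampling.
        within_budget = self.spent + self.cost_per_eval <= self.budget
        return within_budget and random.random() < self.sample_rate

    def record_run(self) -> None:
        self.spent += self.cost_per_eval
```

Running 10% of logs through a $0.01 eval instead of 100% cuts the eval bill by 10x while still giving a statistically useful signal.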
Automation
Needless to say, running evals in production should be automated and continuous. That poses a number of challenges at scale.
This means:
- You need to scale your evaluation infrastructure to meet your logging throughput
- You need a way to configure evals and store configuration
- You need a way to select which evals should run on which prompts
- You need mechanisms to handle rate limiting
- You need the eval to be run using swappable models / providers
- You need a way to run a newly configured evaluation against old logs
Solution: Build an orchestration layer for evaluation
Athina's eval orchestration layer manages eval configurations, sampling, filtering, deduping, rate limiting, switching between different model providers, alerting, and calculating granular analytics to provide a complete evaluation platform.
You can run Evals during development, in CI / CD, as real-time guardrails, or continuously in production.
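The "which evals run on which prompts" piece of the orchestration layer boils down to stored configuration plus a lookup at log time. A toy sketch under assumed names (the prompt slugs and eval names here are illustrative):

```python
# Stored eval configuration: maps a prompt slug to the evals that
# should run against its logs. In a real system this would live in a
# database and be editable from a UI.
EVAL_CONFIG = {
    "rag_answer": ["faithfulness", "context_relevance"],
    "summarizer": ["conciseness"],
}


def evals_for(prompt_slug: str) -> list[str]:
    """Select which evals to run for a given prompt's log."""
    return EVAL_CONFIG.get(prompt_slug, [])
```

The same config store is what lets you re-run a newly configured evaluation against old logs: backfilling is just iterating historical logs through the same lookup.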
Support for different models, architectures, and traces
Say your team wants to switch from OpenAI to Gemini.
Suppose you add a new step to your LLM pipeline.
Maybe you're building an agent, and need to support complex traces?
Maybe you switched from Langchain to Llama Index?
Maybe you're building a chat application and need special evals for that?
Can your logging and evaluation infrastructure support this?
Solution: You need a normalization layer that is separate from your evaluation infrastructure.
Inspect and debug complex traces and chats
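A normalization layer converts provider-specific log shapes into one internal schema that the evaluators consume. A minimal sketch, assuming a simplified OpenAI chat-completions response shape (the `NormalizedCall` type is ours):

```python
from dataclasses import dataclass


@dataclass
class NormalizedCall:
    """Provider-agnostic view of one LLM call that evals consume."""
    input_text: str
    output_text: str


def normalize_openai(log: dict) -> NormalizedCall:
    # Simplified OpenAI chat-completions shape: last user message in,
    # first choice's message out.
    return NormalizedCall(
        input_text=log["messages"][-1]["content"],
        output_text=log["choices"][0]["message"]["content"],
    )


def normalize_generic(prompt: str, completion: str) -> NormalizedCall:
    # Fallback adapter for frameworks that hand you raw strings.
    return NormalizedCall(prompt, completion)
```

Because evals only ever see `NormalizedCall`, switching from OpenAI to Gemini, or from Langchain to Llama Index, means writing one new adapter rather than touching every eval.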
Interpretation & Analytics
What do you do with the eval metrics that were calculated? Ideally, you want to be able to:
- Measure overall app performance.
- Measure retrieval quality
- Measure usage like token counts, cost, response times
- Measure safety issues like PII leakage or prompt injection attacks.
- Measure changes over time
- Measure distributions of eval scores (p5, p25, p50, p75, p95, etc)
- Segment the metrics by prompt, model, topic or customer ID
Solution: Build an analytics engine that can segment the data, compute these metrics and render them on a dashboard with filter options.
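The core of such an engine is percentile computation and segmentation over eval scores. A small self-contained sketch (nearest-rank percentile; function names are ours):

```python
from collections import defaultdict


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (0 <= p <= 100) over a sorted copy."""
    s = sorted(values)
    idx = min(int(round(p / 100 * (len(s) - 1))), len(s) - 1)
    return s[idx]


def segment_scores(rows: list[dict], key: str) -> dict:
    """Group eval scores by a segment key: prompt, model, topic, customer ID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["score"])
    return dict(groups)
```

With these two pieces you can render, say, p50/p95 faithfulness per model on a dashboard, and watch the distribution drift over time.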
Observability & Alerts
Of course, along with all this, you will also want to be able to:
- Manually inspect the traces
- Manually annotate the traces individually
- Consolidate online and offline eval metrics
- Configure alerts to PagerDuty or Slack when failures increase
- Export the data
- Connect to the logs via API / GraphQL
Solution: Build LLM observability platform
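The alerting piece reduces to a threshold check over recent eval results. A sketch with an injectable `notify` callback — in production that callback would post to a Slack or PagerDuty webhook (the function names here are illustrative):

```python
def failure_rate(results: list[bool]) -> float:
    """Fraction of eval results that failed (False = failure)."""
    return sum(1 for r in results if not r) / len(results)


def maybe_alert(results: list[bool], threshold: float = 0.2, notify=print) -> bool:
    """Fire an alert if the failure rate exceeds the threshold."""
    rate = failure_rate(results)
    if rate > threshold:
        notify(f"Eval failure rate {rate:.0%} exceeded threshold {threshold:.0%}")
        return True
    return False
```

Keeping `notify` pluggable means the same check can print locally, page on-call, or post to a channel without changing the alerting logic.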
Collaboration
The tool you use should also support collaboration features, so teammates can review, annotate, and share eval results together.
Solution: Build team features, access controls and separation of workspaces.
👋 Athina
We spent a lot of time working through these problems so you don't need a dedicated team for this. You can see a demo video here.
Website: Athina AI (try our sandbox).
Sign Up for Athina.
Github: Run any of our 40+ open source evaluations using our Python SDK to measure your LLM app.