
LLM Development Workflows

At some point in your AI product development lifecycle, you will need to improve the performance of your model.

For Generative AI models, improving performance systematically is difficult because you don't have a quantitative way to measure results.


Demo Stage: The Inspect Workflow 🔎

Manual Inspect Workflow

  • Run prompt on single datapoint
  • Inspect the response manually
  • Change prompt / datapoint and repeat
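
For illustration, here is a minimal sketch of that loop, assuming the OpenAI Python client; the model name, prompt template, and datapoint are placeholders.

```python
# Minimal sketch of the inspect loop, assuming the OpenAI Python client.
# The model name, prompt template, and datapoint are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt_template = "Summarize the following support ticket in one sentence:\n\n{ticket}"
datapoint = "My order #1234 arrived with a cracked screen and I need a replacement."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt_template.format(ticket=datapoint)}],
)

# Inspect the response manually, tweak the prompt or datapoint, and rerun.
print(response.choices[0].message.content)
```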

Usually, people have a workflow like this during their initial prototyping phase.

This workflow is fine for getting an initial demo ready, but it does not hold up well beyond that stage.


MVP Stage: The Eyeball Workflow 👁️👁️

This workflow is similar to the previous one, but instead of running one datapoint at a time, you run many datapoints together.

However, you still don't have ground truth data (the ideal response from the LLM), so there's nothing to compare against.

Eyeball Workflow

  • Run prompt on dataset with multiple datapoints
  • Put outputs onto a spreadsheet / CSV
  • Manually review (eyeball) the responses for each
  • Repeat
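
A rough sketch of this batch-and-export step, again assuming the OpenAI Python client plus pandas for the CSV; the dataset and prompt are placeholders.

```python
# Sketch of the eyeball loop: run a prompt over a small dataset and dump to CSV.
# Assumes the OpenAI Python client and pandas; dataset and prompt are placeholders.
import pandas as pd
from openai import OpenAI

client = OpenAI()

prompt_template = "Summarize the following support ticket in one sentence:\n\n{ticket}"
datapoints = [
    "My order #1234 arrived with a cracked screen and I need a replacement.",
    "I was charged twice for my subscription this month.",
    "How do I reset my password?",
]

rows = []
for ticket in datapoints:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_template.format(ticket=ticket)}],
    )
    rows.append({"input": ticket, "output": response.choices[0].message.content})

# Dump everything to a CSV and eyeball the responses row by row.
pd.DataFrame(rows).to_csv("outputs.csv", index=False)
```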

This workflow is fine pre-MVP, but it is not great for iteration.

🚫 Why doesn't this workflow work for rapid iteration?

  • Inspecting generations on a dataset is manual and time-consuming (even if the dataset is small!)
  • You don't have quantitative metrics
  • You have to manually maintain a historical record of the prompts you've run
  • You don't have a system to compare the outputs of prompt A vs prompt B

Iteration Stage: The Golden Dataset Workflow 🌟🌟

You now have a golden dataset containing your datapoints and their ideal responses.

You can now set up some basic evals.

Great! Now you actually have a way to improve performance systematically.

The workflow looks something like this:

Iteration Workflow

  • Create golden dataset (multiple datapoints with expected responses)
  • Run prompt on test dataset
  • Option 1: Manual Review
    • Put outputs onto a spreadsheet / CSV
    • Manually compare LLM responses against expected responses
  • Option 2: Evaluators
    • Create evaluators to compare the LLM response against the expected response (a rough sketch follows this list)
      • But what metrics to use? How to compare 2 pieces of unstructured text?
    • Build internal tooling to:
      • run these evaluators and score the responses
      • track the history of runs
      • provide a UI to review results

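For illustration, the evaluator option (Option 2) might look roughly like the sketch below, assuming the OpenAI Python client. The golden dataset, prompt, and exact-match check are placeholders; real evaluators for unstructured text are usually fuzzier (similarity scores, LLM-as-judge, etc.).

```python
# Sketch of Option 2: run a prompt over a golden dataset and score each response
# with a simple evaluator. Assumes the OpenAI Python client; the dataset, prompt,
# and exact-match evaluator are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

golden_dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]

def run_prompt(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Answer concisely: {question}"}],
    )
    return response.choices[0].message.content.strip()

def exact_match(response: str, expected: str) -> bool:
    # Crude evaluator: case-insensitive substring match against the expected answer.
    return expected.lower() in response.lower()

results = []
for dp in golden_dataset:
    response = run_prompt(dp["input"])
    results.append({
        "input": dp["input"],
        "response": response,
        "expected": dp["expected"],
        "passed": exact_match(response, dp["expected"]),
    })

score = sum(r["passed"] for r in results) / len(results)
print(f"Passed {score:.0%} of the golden dataset")
```
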
This is actually a good workflow for all stages.

🚫 What are the downsides of this workflow?

  • Difficult and time-consuming to create good evals
  • You need to create lots of internal tooling
  • Does not capture data variations between your golden dataset and production data
  • You still have to manually maintain a historical record of the prompts you've run
  • Requires a mix of manual review + eval metrics

⛭ Enter the Athina Workflow... 🪄

Athina's workflow is designed for users at any stage of the AI product development lifecycle.

Athina Monitor: Demo / MVP / Production Stage

Setup time: < 5 mins

  • Run your inferences, and log data to Athina Monitor.
  • View the results on a dashboard.
    • Preserve historical data including prompt, response, cost, token usage and latency (+ more)
    • UI to manually grade your responses with 👍 / 👎

This works for a single datapoint or multiple datapoints; a sketch of the logging step is below.
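
As a rough sketch, the logging step might look like the code below. The `log_to_athina_monitor` helper is a hypothetical placeholder, not Athina's actual SDK; see the Athina docs for the real interface. The rest assumes the OpenAI Python client.

```python
# Sketch of logging a single inference to Athina Monitor.
# `log_to_athina_monitor` is a hypothetical placeholder -- swap in the real
# Athina SDK / API call from the Athina docs. OpenAI client usage is standard.
import time
from openai import OpenAI

client = OpenAI()

def log_to_athina_monitor(record: dict) -> None:
    # Placeholder: in a real setup this would call the Athina SDK or API.
    print("Would log to Athina Monitor:", record)

prompt = "Summarize the following support ticket in one sentence:\n\nMy order arrived damaged."

start = time.time()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
latency_ms = (time.time() - start) * 1000

log_to_athina_monitor({
    "prompt": prompt,
    "response": response.choices[0].message.content,
    "prompt_tokens": response.usage.prompt_tokens,
    "completion_tokens": response.usage.completion_tokens,
    "latency_ms": round(latency_ms),
    "model": "gpt-4o-mini",
})
```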

Athina Evaluate: Development / Iteration Stage

Setup time: 2 mins

Now that you're focused on improving model performance, here's how you can do it:

  • Configure experiments and run evaluations programmatically
  • Run preset evals or create your own custom eval
  • Eval results are automatically logged to Athina Develop
  • Works in a Python notebook, but you can also view the results on a dashboard.
  • Also preserves historical data including prompt, response, datapoints, eval metrics (+ more)
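
To make the experiment step concrete, here is a framework-agnostic sketch of comparing two prompt variants against a golden dataset with a simple evaluator, assuming the OpenAI Python client. The prompts, dataset, and evaluator are illustrative placeholders; with Athina Evaluate, preset evals, result logging, and the dashboard replace the homegrown scoring here.

```python
# Framework-agnostic sketch of a programmatic experiment: run two prompt
# variants over a golden dataset and score each with a simple evaluator.
# Prompts, dataset, and evaluator are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

prompt_variants = {
    "A": "Answer the question concisely: {question}",
    "B": "You are a precise assistant. Answer in as few words as possible: {question}",
}

golden_dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'Pride and Prejudice'?", "expected": "Jane Austen"},
]

def contains_expected(response: str, expected: str) -> bool:
    # Crude evaluator: case-insensitive substring match against the expected answer.
    return expected.lower() in response.lower()

for name, template in prompt_variants.items():
    passed = 0
    for dp in golden_dataset:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": template.format(question=dp["question"])}],
        )
        if contains_expected(response.choices[0].message.content, dp["expected"]):
            passed += 1
    print(f"Prompt {name}: {passed}/{len(golden_dataset)} passed")
```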