How can I improve the performance / reliability of my evals?
LLM-graded evals will never be perfect, but here are some things you can do to improve their performance and reduce flakiness.
1. Use GPT-4 (especially if your eval task requires reasoning capabilities)
gpt-4 will perform much better than GPT-3.5 if your eval task is complex.
For simple tasks, you can use gpt-3.5-turbo or sometimes an even cheaper model.
2. Run the evals multiple times
Running each eval multiple times and taking a majority vote, or discarding inconsistent results, will mitigate flakiness.
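The idea can be sketched in a few lines of Python. This is a minimal illustration, not a library API: `majority_vote`, `run_eval`, `n_runs`, and `min_agreement` are hypothetical names, and `run_eval` stands in for whatever function calls your LLM grader and returns a verdict string.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(
    run_eval: Callable[[], str],
    n_runs: int = 5,
    min_agreement: float = 0.6,
) -> Optional[str]:
    """Run a grader n_runs times and return the most common verdict.

    Returns None when the winning verdict's share of the runs is below
    min_agreement, so the caller can discard the inconsistent result.
    """
    verdicts = [run_eval() for _ in range(n_runs)]
    verdict, count = Counter(verdicts).most_common(1)[0]
    if count / n_runs < min_agreement:
        return None  # too inconsistent to trust
    return verdict


# Stand-in for a flaky LLM grader: replays canned verdicts (assumption for the demo).
canned = iter(["pass", "fail", "pass", "pass", "pass"])
result = majority_vote(lambda: next(canned), n_runs=5)
```

An odd `n_runs` avoids ties for binary verdicts. Raising `min_agreement` trades coverage for confidence: more results are discarded, but the ones kept are more consistent.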
3. Provide custom examples
Providing a few custom few-shot examples suited to your use case is likely to improve the performance of your evals further.
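As a rough sketch of what this can look like, the snippet below assembles a grading prompt from a handful of graded examples. The `FewShotExample` structure, the `build_grading_prompt` helper, the prompt wording, and the refund-policy examples are all made up for illustration; adapt them to the format your eval expects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FewShotExample:
    """One graded example shown to the grader model (hypothetical structure)."""
    response: str
    verdict: str  # e.g. "pass" or "fail"
    reason: str


def build_grading_prompt(
    criteria: str, examples: List[FewShotExample], response: str
) -> str:
    """Assemble a prompt: criteria, then worked examples, then the response to grade."""
    parts = [f"Grading criteria: {criteria}", ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i}",
            f"Response: {ex.response}",
            f"Verdict: {ex.verdict}",
            f"Reason: {ex.reason}",
            "",
        ]
    # Leave the final verdict open for the grader model to complete.
    parts += ["Now grade this response.", f"Response: {response}", "Verdict:"]
    return "\n".join(parts)


# Illustrative examples only; use cases drawn from your own data.
examples = [
    FewShotExample("Refunds are available within 30 days.", "pass",
                   "Matches the policy document."),
    FewShotExample("Refunds are available any time.", "fail",
                   "Contradicts the 30-day limit."),
]
prompt = build_grading_prompt(
    "The response must agree with the refund policy.",
    examples,
    "You can get a refund within 30 days.",
)
```

Including both passing and failing examples, each with a short reason, gives the grader a clearer picture of where the boundary lies than positive examples alone.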
4. Set up custom evals
A completely custom eval is likely the best way to tailor grading to your use case.
5. Contact Us
Email us at hello@athina.ai for help setting up a high-performing custom eval suite.