Setting up an evals test harness

26-08-2025

https://x.com/eugeneyan/status/1960148508495020234
set up evals + experiment harness
- easy to tweak config and prompts
- need to look at raw data and justify what you’re doing
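
A minimal sketch of what "easy to tweak config and prompts" could look like, assuming each experiment is a small immutable config object. `ExperimentConfig`, its fields, and the placeholder model name are all hypothetical and not from the original thread.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment config: everything you might tweak in one place."""
    name: str
    model: str            # placeholder identifier, not a real model name
    temperature: float
    prompt_template: str  # uses str.format-style fields, e.g. {text}

    def render(self, **fields: str) -> str:
        # Fill the prompt template with the fields of one example.
        return self.prompt_template.format(**fields)


def config_record(cfg: ExperimentConfig) -> str:
    # Serialise the config so it can be logged next to each trace.
    return json.dumps(asdict(cfg), sort_keys=True)


baseline = ExperimentConfig(
    name="baseline",
    model="placeholder-model",
    temperature=0.0,
    prompt_template="Summarise the following text in one sentence:\n{text}",
)

# A variant changes only the fields being tested; everything else is inherited.
variant = replace(
    baseline,
    name="terse-prompt",
    prompt_template="One-sentence summary:\n{text}",
)

print(variant.render(text="Evals need raw data."))
print(config_record(variant))
```

Logging the serialised config with every trace is one way to keep the raw data tied to the exact prompt that produced it.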
workflow
- log traces -> annotate a couple hundred examples -> align LLM evaluators to ground truth -> use LLM evaluators to scale and get numbers -> visualise in a spreadsheet (make things nice to look at)
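
The "align to ground truth" and "visualise in spreadsheet" steps could be sketched as below. This assumes binary pass/fail labels and uses Cohen's kappa as the alignment metric; the thread does not name a metric, so that choice, the function names, and the toy labels are all illustrative assumptions.

```python
import csv
import io
from collections import Counter


def agreement(human: list[str], llm: list[str]) -> float:
    # Fraction of examples where the LLM evaluator matches the human label.
    return sum(h == l for h, l in zip(human, llm)) / len(human)


def cohen_kappa(human: list[str], llm: list[str]) -> float:
    # Agreement corrected for what two raters would agree on by chance.
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = agreement(human, llm)
    h_counts, l_counts = Counter(human), Counter(llm)
    labels = set(human) | set(llm)
    expected = sum(h_counts[c] * l_counts[c] for c in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def to_csv(trace_ids: list[str], human: list[str], llm: list[str]) -> str:
    # One row per trace so disagreements are easy to filter in a spreadsheet.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["trace_id", "human", "llm", "agree"])
    for tid, h, l in zip(trace_ids, human, llm):
        writer.writerow([tid, h, l, h == l])
    return buf.getvalue()


# Toy labels for illustration only, not real annotation data.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
llm_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
ids = [f"trace-{i}" for i in range(len(human_labels))]

print(f"agreement: {agreement(human_labels, llm_labels):.2f}")
print(f"kappa:     {cohen_kappa(human_labels, llm_labels):.2f}")
print(to_csv(ids, human_labels, llm_labels))
```

Keeping the per-trace rows alongside the summary numbers makes it easy to go back to the raw disagreements, which is the point of annotating by hand first.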