Setting up an evals test harness
26-08-2025
- https://x.com/eugeneyan/status/1960148508495020234
- set up evals + experiment harness
- easy to tweak config and prompts
- need to look at raw data and justify what you're doing
- workflow
- log traces -> annotate a couple hundred examples -> align LLM-evaluators to ground truth -> use the LLM-evaluators to scale up and get numbers -> visualise in a spreadsheet (make things nice to look at)
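
Rough sketch of the "align LLM-evaluators to ground truth" and "visualise in spreadsheet" steps, assuming binary pass/fail labels. All names here (`Example`, `alignment_metrics`, `to_csv`) are made up for illustration, and Cohen's kappa is my own choice of agreement metric; the linked post does not prescribe one.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Example:
    """One annotated trace (hypothetical schema, not from the source)."""
    trace_id: str
    human_label: bool      # ground-truth annotation, True = pass
    evaluator_label: bool  # LLM-evaluator verdict on the same trace


def alignment_metrics(examples):
    """Compare evaluator verdicts against human labels."""
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one annotated example")

    # Confusion-matrix counts, treating the human label as ground truth.
    tp = sum(e.human_label and e.evaluator_label for e in examples)
    tn = sum(not e.human_label and not e.evaluator_label for e in examples)
    fp = sum(not e.human_label and e.evaluator_label for e in examples)
    fn = sum(e.human_label and not e.evaluator_label for e in examples)

    observed = (tp + tn) / n
    # Agreement expected by chance, from each rater's marginal pass/fail rates.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    return {
        "n": n,
        "agreement": observed,
        "cohen_kappa": kappa,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def to_csv(examples):
    """Dump per-example rows as CSV text, ready to paste into a spreadsheet."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["trace_id", "human_label", "evaluator_label", "agree"])
    for e in examples:
        writer.writerow(
            [e.trace_id, e.human_label, e.evaluator_label,
             e.human_label == e.evaluator_label]
        )
    return buf.getvalue()
```

The `agree` column makes it easy to filter the spreadsheet down to disagreements, which are the raw rows worth reading before tweaking the evaluator prompt.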