Setting up an evals test harness

26-08-2025

  • https://x.com/eugeneyan/status/1960148508495020234
  • set up evals + experiment harness
    • easy to tweak config and prompts
    • need to look at raw data and justify what you're doing
  • workflow
    • log traces -> annotate a couple hundred examples -> align LLM-evaluators to ground truth -> use LLM-evaluators to scale and get numbers -> visualise in spreadsheet (make things nice to look at)
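
A rough sketch of the "align LLM-evaluators to ground truth" step, not taken from the linked thread. It assumes each annotated trace has a human label and an LLM-evaluator label, both "pass" or "fail"; the function name `agreement_report`, the label values, and the sample data are all made up for illustration. It reports raw agreement, Cohen's kappa (agreement corrected for chance), and precision/recall on the "fail" class, which are plausible numbers to check before trusting the evaluator at scale.

```python
from collections import Counter


def agreement_report(human, llm, positive="fail"):
    """Compare LLM-evaluator labels against human annotations (ground truth).

    `human` and `llm` are equal-length lists of labels for the same traces.
    `positive` is the label treated as the class of interest (hypothetical
    default: "fail", i.e. the defects we want the evaluator to catch).
    """
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(human)
    pairs = list(zip(human, llm))

    # Raw agreement: fraction of traces where both labels match.
    observed = sum(h == l for h, l in pairs) / n

    # Chance agreement, from each rater's own label frequencies.
    h_counts, l_counts = Counter(human), Counter(llm)
    expected = sum(h_counts[k] * l_counts[k] for k in h_counts) / (n * n)

    # Cohen's kappa; defined as 1.0 when both raters use a single shared label.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    # Precision/recall of the LLM-evaluator on the positive class.
    tp = sum(h == positive and l == positive for h, l in pairs)
    fp = sum(h != positive and l == positive for h, l in pairs)
    fn = sum(h == positive and l != positive for h, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    return {
        "n": n,
        "agreement": observed,
        "kappa": kappa,
        "precision": precision,
        "recall": recall,
    }


# Made-up labels for eight traces, purely to show the call.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
llm_labels = ["pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

report = agreement_report(human_labels, llm_labels)
for name, value in report.items():
    print(f"{name}: {value}")
```

The rows where the two labels disagree are the ones worth reading by hand, since they show whether the evaluator prompt or the annotation guideline needs changing. The returned dict can be written out as one spreadsheet row per experiment config.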