Designing and evaluating metrics
27-08-2025
- https://medium.com/@seanjtaylor/designing-and-evaluating-metrics-5902ad6873bf
- measurement forms the basis of science
- investments in our ability to capture data and measure outcomes often precede step-function changes in our understanding of the world and the ability to better solve problems
- five properties of metrics
- cost
- you can measure anything if willing to pay an arbitrary cost (money, time, resources, technical debt)
- simplicity
- the worst metric is one that people mistrust, second-guess, or ignore
- faithfulness
- measurements may fail to accurately represent the thing you care about
- metrics without construct validity measure the wrong thing (human-labelled data can be misleading as different people make different observations)
- measures with sampling bias measure it for the wrong set of units (e.g. people, items, events, etc)
- precision
- transformations (taking logs, winsorising, variance stabilising, discretising continuous outcomes)
- normalisations (i.e. if both numerator and denominator are skewed, their ratio will be less noisy)
- summing or averaging (especially for a few uncorrelated ways of measuring the same thing)
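a minimal sketch of two of these variance-reduction transformations (winsorising and a log transform) on a simulated heavy-tailed outcome; the data and function names are illustrative, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated heavy-tailed outcome, e.g. revenue per user.
revenue = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

def winsorize(x, upper_pct=99):
    """Cap values above the given percentile to tame outliers."""
    cap = np.percentile(x, upper_pct)
    return np.minimum(x, cap)

wins_revenue = winsorize(revenue)   # winsorising caps extreme values
log_revenue = np.log1p(revenue)     # log transform compresses the tail

# Both transformations shrink the standard deviation relative to the
# raw metric, which tightens confidence intervals in experiments.
print(np.std(revenue), np.std(wins_revenue), np.std(log_revenue))
```

the precision gain is not free: both transformations change what the metric means (a capped or logged revenue is no longer revenue), which is the faithfulness trade-off the notes mention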
- causal proximity
- when causal proximity is low you are unlikely to move the metric with your changes because a long sequence of outcomes must occur first (low causal proximity makes metrics like profit or revenue very ineffective)
- prefer metrics with high causal proximity, and describe a theory of change that links your actions to the desired outcomes (sacrificing faithfulness)
- metric design
- proxy metrics
- acknowledge that these may be things we don't directly care about, but for which we can detect effects
- surrogate metrics
- estimates of long-term outcomes from short-term metrics
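a minimal sketch of a surrogate metric under my own assumptions (fit a model on historical units where both short- and long-term outcomes were observed, then use its predictions as the metric in new experiments; the feature names and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Historical units where we observed both short-term metrics and the
# long-term outcome (e.g. 12-month value).
n = 5_000
week1_visits = rng.poisson(5, size=n).astype(float)
week1_purchases = rng.poisson(1, size=n).astype(float)
long_term_value = (2.0 * week1_visits + 5.0 * week1_purchases
                   + rng.normal(0, 3, size=n))

# Fit the surrogate: a linear model from short-term metrics to outcome.
X = np.column_stack([np.ones(n), week1_visits, week1_purchases])
coef, *_ = np.linalg.lstsq(X, long_term_value, rcond=None)

def surrogate(visits, purchases):
    """Predicted long-term outcome from short-term metrics."""
    return coef[0] + coef[1] * visits + coef[2] * purchases

# In a new experiment, score units with the surrogate instead of
# waiting months for the true long-term outcome.
print(surrogate(6.0, 2.0))
```

the surrogate inherits the causal-proximity benefit of the short-term metrics while standing in for the long-term outcome we actually care about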
- metric design is iterative and cross-functional
- don't just pick metrics because they are cheap or convenient
- people believe metrics when a small number of examples agree with their intuitions; seeing the metric move in the expected direction for known-good or known-bad changes helps build initial trust
- bad metrics should be excluded from experimental results (they reduce the signal-to-noise ratio)
- for many metrics there is a point of saturation, keep Goodhart's Law in mind ("When a measure becomes a target, it ceases to be a good measure") - https://en.wikipedia.org/wiki/Goodhart%27s_law