I have been reading the excellent book by John Adams titled Risk (link). This is a geographer’s treatment of a subject that is a staple of mathematics, particularly probability theory. A mathematical treatment creates objects called probability distributions, which are then taken as complete representations of risk. Adams challenges that construct, bringing a social scientist’s sensitivity to the table. In particular, he points out how the mathematics of risk is undermined by measurement issues (i.e., data issues) and statistical issues. He is not invalidating the math, just pointing out large cracks that are often ignored.
I will provide a more comprehensive review of the book eventually. I’m very excited by Chapter 5, titled “Measuring Risk”, and specifically the example of “traffic black spots”. This example is very instructive for anyone who is interested in the practical implications and interpretation of risk measures.
The post got long so I have split it into two parts. The second part will be posted on Monday, and it concerns a delicious bit of analysis related to traffic black spots.
Here are some key sentences related to the quality of the collected data:
- The connection between potentially fatal events [accidents] and actual fatal events is rendered extremely tenuous by the phenomenon of risk compensation…. avoiding action is taken, with the result that there are insufficient fatal accidents to produce a pattern that can serve as a reliable guide to the effect of specific safety interventions. As a consequence, safety planners seek out other measures of risk, usually – in ascending order of numbers but in decreasing order of severity – they are injury, morbidity, property damage and near misses.
- It is much easier to achieve “statistical significance” in studies of accident causation if one uses large numbers rather than small. Safety researchers therefore have an understandable preference for non-fatal accident or incident data over fatality data… in exercising this preference, they usually assume that fatalities and non-fatal incidents will be consistent proportions of total casualties.
- [Evoking some data from London] The correlation between fatality rates and injury rates is very weak. Is the weak correlation real or simply a recording phenomenon? How many injuries equal one life?
- Uncertainty in the data increases as the severity of the injury decreases. The fatality statistics are almost certainly the most accurate and reliable of the road accident statistics… the categorization and recording of injuries is generally not informed by any evidence from a medical examination…. [A British Medical Association report said that] only one in four casualties classified as seriously injured are, in fact, seriously injured and many of those classified as slightly injured are in fact seriously injured…. some 30 percent of traffic accident casualties seen in hospital are not reported to the police, and that at least 70 percent of cyclist casualties go unreported.
This last point is widely applicable in the data science/analytics world. We often have large amounts of unreliable data, and small amounts of more reliable data. What we hope to have – and we don’t – is large amounts of reliable data. A lot of bad analysis results from assuming that a large data set is also a reliable one.
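To make the trade-off concrete, here is a minimal sketch in Python. Every number in it is hypothetical and chosen only for illustration: a "true" injury rate of 10 percent, a large data set of 100,000 records that misses 30 percent of cases (loosely echoing the under-reporting figure quoted above), and a small, accurately recorded sample of 500. It uses the textbook normal-approximation confidence interval for a proportion, and it plugs in expected values rather than simulating random draws.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def covers(ci, value):
    """Return True if the interval contains the given value."""
    lo, hi = ci
    return lo <= value <= hi


# Hypothetical numbers, for illustration only (not from the book).
TRUE_RATE = 0.10        # assumed true injury rate
UNDER_REPORTING = 0.30  # assumed share of cases missing from the large data set

# Large but unreliable: many records, but 30% of cases are never recorded,
# so the rate we expect to observe is biased downward.
observed_rate = TRUE_RATE * (1 - UNDER_REPORTING)
large_ci = proportion_ci(observed_rate, n=100_000)

# Small but reliable: few records, each one recorded correctly.
small_ci = proportion_ci(TRUE_RATE, n=500)

for name, ci in [("large, unreliable", large_ci), ("small, reliable", small_ci)]:
    print(f"{name}: ({ci[0]:.4f}, {ci[1]:.4f}) covers truth: {covers(ci, TRUE_RATE)}")
```

Under these assumptions, the large data set produces a very narrow interval that misses the true rate entirely, while the small one produces a wide interval that contains it. More records shrink the sampling error but do nothing about the recording error, which is why sheer volume can make a biased estimate look precise.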