Bayesian statistics

Less uncertain predictions

Ultrasound time-of-flight C-scan of the delaminations formed by a 12J impact on a crossply laminate (top) and the corresponding surface strain field (bottom).

Here is a challenge for you: overall this blog has a readability index of 8.6 using the Flesch Kincaid Grades, which means it should be easily understood by 14-15 year olds.  However, my editor didn’t understand the first draft of the post below and so I have revised it; but it still scores 15 using Flesch Kincaid!  So, it might require the formation of some larger scale neuronal assemblies in your brain [see my post entitled ‘Digital Hive Mind’ on November 30th, 2016].

I wrote a couple of weeks ago about guessing the weight of a reader.  I used some national statistics and suggested how they could be updated using real data about readers’ weights with the help of Bayesian statistics [see my post entitled ‘Uncertainty about Bayesian statistics’ on July 5th, 2017].  It was an attempt to shed light on the topic of Bayesian statistics, which tends to be obscure or unknown.  I was stimulated by our own research using Bayesian statistics to predict the likelihood of failure in damaged components manufactured using composite material, such as carbon-fibre laminates used in the aerospace industry.  We are interested in the maximum load that can be carried by a carbon-fibre laminate after it has sustained some impact damage, such as might occur to an aircraft wing-skin that is hit by debris from the runway during take-off, which was the cause of the Concorde crash in Paris on July 25th, 2000.  The maximum safe load of the carbon-fibre laminate varies with the energy of the impact, as well as with the discrepancies introduced during its manufacture.  These multiple variables make our analysis more involved than I described for readers’ weights.  However, we have shown that the remaining strength of a damaged laminate can be more reliably predicted from measurements of the change in the strain pattern around the damage than from direct measurements of the damage, for instance, using ultrasound.

This might seem to be a counter-intuitive result.  However, it occurs because the failure of the laminate is driven by the energy available to create new surfaces as it fractures [see my blog on Griffith fracture on April 26th, 2017], and the strain pattern provides more information about the energy distribution than does the extent of the existing damage.  Why is this important?  Well, it offers a potentially more reliable approach to inspecting aircraft that could reduce operating costs and increase safety.

If you have stayed with me to the end, then well done!  If you want to read more, then see: Christian WJR, Patterson EA & DiazDelaO FA, Robust empirical predictions of residual performance of damaged composites with quantified uncertainties, J. Nondestruct. Eval. 36:36, 2017 (doi: 10.1007/s10921-017-0416-6).

Uncertainty about Bayesian methods

I have written before about why people find thermodynamics so hard [see my post entitled ‘Why is thermodynamics so hard?’ on February 11th, 2015] so I think it is time to mention another subject that causes difficulty: statistics.  I am worried that just mentioning the word ‘statistics’ will cause people to stop reading, such is its reputation.  Statistics is used to describe phenomena that do not have single values, like the height or weight of my readers.  I would expect the weights of my readers to follow a normal distribution, that is they form a bell-shaped graph when the number of readers at each value of weight is plotted as a vertical bar from a horizontal axis representing weight.  In other words, weight is plotted along the x-axis and frequency on the y-axis, as in the diagram.

The normal distribution has dominated statistical practice and theory since its equation was first published by De Moivre in 1733.  The mean or average value corresponds to the peak in the bell-shaped curve and the standard deviation describes the shape of the bell, basically how fat the bell is.  That’s why we learn to calculate the mean and standard deviation in elementary statistics classes, although often no one tells us this or we quickly forget it.

If all of you told me your weight then I could plot the frequency distribution described above.  And, if I divided the y-axis, the frequency values, by the total number of readers who sent me weight information (and, strictly, by the width of each bar) then the graph would become a probability density distribution [see my post entitled ‘Wind power’ on August 7th, 2013].  It would tell me how probable it is that the reader I met last week had a weight close to 70.2kg: the higher the bell-shaped curve at 70.2kg, the more probable a weight near that value.  The most likely weight would correspond to the peak value in the curve.
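For readers who prefer to see the arithmetic, here is a minimal sketch in Python of turning counts into probabilities.  The weights are simulated rather than real, the spread of 13kg and the 5kg bar width are my assumptions for illustration, and the function name histogram is hypothetical.

```python
import random
from collections import Counter


def histogram(weights, bin_width):
    """Group weights into bars and return {bar_start: (count, probability, density)}."""
    # Each weight is assigned to the bar whose left edge is the nearest multiple
    # of bin_width at or below it.
    counts = Counter(int(w // bin_width) * bin_width for w in weights)
    total = len(weights)
    return {
        start: (n, n / total, n / (total * bin_width))
        for start, n in sorted(counts.items())
    }


random.seed(1)
# Hypothetical data: 1000 simulated reader weights in kg.
# The mean of 70.2 comes from the post; the spread of 13 is an assumed value.
weights = [random.gauss(70.2, 13.0) for _ in range(1000)]

bins = histogram(weights, bin_width=5)
for start, (count, probability, density) in bins.items():
    print(f"{start:>4}-{start + 5:<4} kg  count={count:<4} "
          f"probability={probability:.3f}  density={density:.4f} per kg")
```

The probability column adds up to one, and multiplying each density by the bar width gives back the same probabilities, which is all that dividing by the number of readers achieves.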

However, I don’t need any of you to send me your weights to be reasonably confident that the weight of the reader I talked to last week was 70.2kg!  I cannot be certain about it but the probability is high.  The reader was female and lived in the UK and according to the Office for National Statistics (ONS) the average weight of women in the UK is 70.2kg – so it is reasonable to assume that the peak in the bell-shaped curve for my female UK readers will coincide with the national distribution, which makes 70.2kg the most probable weight of the reader I met last week.

However, guessing the weight of a reader becomes more difficult if I don’t know where they live or I can’t access national statistics.  The Reverend Thomas Bayes (1701-1761) comes to the rescue with the rule named after him.  In Bayesian statistics, I can assume that the probability density distribution of readers’ weight is the same as for the UK population and when I receive some information about your weights then I can update this probability distribution to better describe the true distribution.  I can update as often as I like and use information about the quality of the new data to control its influence on the updated distribution.  If you have got this far then we have both done well; and, I am not going to lose you now by expressing Bayes’ law in terms of probability, or talking about prior (that’s my initial guess using national statistics) or posterior (that’s the updated one) distributions; because I think the opaque language is one of the reasons that the use of Bayesian statistics has not become widespread.
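For anyone who would rather see the updating in action than read the algebra, here is a minimal sketch in Python.  It assumes the simplest textbook case, in which both my initial guess and each new piece of data are described by normal distributions; that is my choice for illustration and not the method used in our research.  The function name update_normal, the initial spread of 5kg and the reported weights are all hypothetical.

```python
def update_normal(prior_mean, prior_sd, observation, obs_sd):
    """Combine a normal initial guess with one observation of known quality.

    obs_sd expresses how much the observation is trusted: a large value
    (poor quality data) gives the observation little influence.
    """
    prior_weighting = 1.0 / prior_sd ** 2
    obs_weighting = 1.0 / obs_sd ** 2
    total_weighting = prior_weighting + obs_weighting
    # The updated mean is an average of the guess and the data,
    # weighted by how certain each one is.
    updated_mean = (prior_mean * prior_weighting
                    + observation * obs_weighting) / total_weighting
    updated_sd = total_weighting ** -0.5
    return updated_mean, updated_sd


# Initial guess: the national average of 70.2kg from the post.
# The spread of 5kg is an assumed value.
mean, sd = 70.2, 5.0

# Hypothetical reports: (weight in kg, trust in the report as a standard deviation).
reports = [(64.0, 3.0), (66.5, 3.0), (90.0, 15.0)]

for weight, trust_sd in reports:
    mean, sd = update_normal(mean, sd, weight, trust_sd)
    print(f"after a report of {weight}kg: best guess {mean:.1f}kg, "
          f"give or take {sd:.1f}kg")
```

Each pass through the loop uses the previous result as the new starting point, which is what updating as often as I like amounts to.  The third report is far from the others but is given a large standard deviation, so it moves the best guess much less than a trusted report would.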

By the way, I can never be certain about your weight, even if you tell me directly, because I don’t know whether your scales are accurate and whether you are telling the truth!  But that’s a whole different issue!