Tuesday, March 17, 2020

Like Beer? Thank a Statistics Student

Today is St. Patrick's Day, a time of shamrock ☘☘☘☘ shenanigans in the U.S. often accompanied by consuming large quantities of beer. Though most of us are practicing social distancing today because of COVID-19 (as recommended by the CDC), it's likely that many of us will still imbibe some alcohol tonight from the comfort of our sequestered homes. It is also likely that if you are doing this, you have a favorite beer. Every time you order it, you expect it to taste the same. It's a brand, after all! Its particular flavor is exactly why you bought that type over the competitors.

If you even casually enjoy beer, you should thank your nearest statistician. Without the work of one in particular, every beer you'd ever taste, even from the same brewery, would taste different.

To explain the story, I first need to explain one of the basic concepts of statistics: the normal curve. A normal curve shows how likely each possible outcome of a measurement is. Say that you have 10,000 of the same-sized jars. You fill up each of these with jellybeans, and then you want to know how evenly you divided the total number of jellybeans among the jars. Each jar's count is one data point, so eventually you'll have 10,000 data points. If you made a graph of the number of times a particular number of jellybeans was counted, you should end up with a graph that looks like a steep hill, often called a bell curve. That's the normal curve.
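The jellybean experiment is easy to try in code. Here is a minimal Python sketch, assuming (hypothetically) that 400 beans are aimed at each jar and each one lands inside with a 50/50 chance. The function names, the 400, and the bin width are all made up for illustration.

```python
import random
from collections import Counter


def fill_jars(num_jars, attempts=400, seed=17):
    """Return the jellybean count for each simulated jar."""
    rng = random.Random(seed)
    # getrandbits(attempts) acts as a row of fair coin flips; counting the
    # 1s gives how many of the beans aimed at the jar actually landed in it.
    return [bin(rng.getrandbits(attempts)).count("1") for _ in range(num_jars)]


def tally(counts):
    """Map each jellybean count to the number of jars with that count."""
    return Counter(counts)


def text_histogram(histogram, bin_width=5, scale=50):
    """Group counts into bins and draw one bar of '#' characters per bin."""
    bins = Counter()
    for count, jars in histogram.items():
        bins[(count // bin_width) * bin_width] += jars
    lines = []
    for start in sorted(bins):
        bar = "#" * (bins[start] // scale)
        lines.append(f"{start:4d}-{start + bin_width - 1:<4d} {bar}")
    return lines


counts = fill_jars(10_000)
for line in text_histogram(tally(counts)):
    print(line)
```

Printed sideways, the bars should rise to a peak near 200 beans and fall away on both sides, which is the hill shape described above.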

A visualization of the normal curve. From:
https://www.youtube.com/watch?v=McSFVzc8Swk
A standard normal curve has a mean of 0 and a standard deviation of 1. The mean is the average value, which for a normal curve is also the most likely outcome, and it sits directly underneath the peak of the curve. The standard deviation is a measure of how wide the curve is. To give some visual examples, I've generated a few curves with different means (mu) and standard deviations (sigma) using MATLAB 2019 (below).

Some normal curves with different values for mu and sigma,
generated using MATLAB 2019.
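For anyone who wants to play with mu and sigma without MATLAB, here is a small Python sketch of the standard formula for the height of a normal curve. The (mu, sigma) pairs are picked for illustration and are not necessarily the ones plotted in the figure.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma  # distance from the mean, in standard deviations
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Illustrative parameter choices (an assumption, not taken from the figure).
for mu, sigma in [(0, 1), (0, 2), (3, 1), (3, 0.5)]:
    # Evaluate the curve at the mean and at 1 and 2 standard deviations out.
    xs = [mu + sigma * k for k in (-2, -1, 0, 1, 2)]
    heights = ", ".join(f"{normal_pdf(x, mu, sigma):.3f}" for x in xs)
    print(f"mu={mu}, sigma={sigma}: {heights}")
```

Changing mu slides the peak left or right without changing its shape. Doubling sigma makes the curve twice as wide and half as tall, since the total area under it always stays at 1.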
In a perfect world, every experiment would produce a clean normal curve. Sadly, we do not live in a perfect world. One rule of statistics is that if a sample is large enough (for example, 10,000 jellybean jars), a graph of the data settles into the shape of the underlying curve, while a smaller sample from the same source can look ragged and misleading. I don't know about you, but I don't have the time to count that many jellybeans, even if I am alone in my apartment for a week. To keep ourselves from needing to run thousands of experiments, which wastes time, money, and resources, statisticians needed to find a way to estimate the underlying normal curve from a smaller data set.
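To get a feel for how much less trustworthy a small sample is, here is a rough Python sketch. It repeatedly draws samples from a made-up normal curve (mean 200, standard deviation 10, both assumptions for illustration) and measures how much the sample average bounces around for different sample sizes.

```python
import random
import statistics


def sample_means(sample_size, repeats=2000, mu=200.0, sigma=10.0, seed=3):
    """Average of `sample_size` draws, repeated `repeats` times."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


for n in (3, 30, 300):
    spread = statistics.stdev(sample_means(n))
    print(f"sample size {n:3d}: averages vary with standard deviation {spread:.2f}")
```

In theory the spread of the averages is sigma divided by the square root of the sample size, so an average of 3 measurements wobbles about ten times as much as an average of 300.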

iStock.com/Nagalski
This is where the genius of a beer-maker comes in. In 1899, William Sealy Gosset was a recent natural sciences and mathematics graduate of the University of Oxford. He had just been hired at the Guinness brewery in Dublin, Ireland, as part of their initiative to develop rigorous testing protocols for their beers and solve other industrial problems. Gosset's strong background in biochemistry and statistics meant he was well-suited for these tasks. One major problem at Guinness was that the beer was inconsistent in flavor. Some batches were so terrible they simply could not be sold. Quality control was non-existent at the time, and this was costing the company money. How could this be fixed?
Photograph of British statistician William Sealy
Gosset, taken in 1908. Public domain, courtesy of
Wikipedia.

Gosset was the one who solved it, though no one knew it was him for a while. He was interested in estimating the mean and spread of a normal curve even when sample sizes were very small, sometimes as few as 3! Over years of work he identified patterns and studied the work of other statisticians, Karl Pearson being one of them (if Pearson's name sounds familiar, you've probably heard of him in the context of correlation, since the most widely used correlation coefficient bears his name). Gosset eventually came up with a specific adaptation of the normal curve equations for small sample sizes. Today, this adaptation is known as Student's t-distribution, and the test built on it is called Student's t-test.
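Here is a hedged Python sketch of why Gosset's adjustment matters. It computes the usual t statistic (sample mean minus claimed mean, divided by the sample standard deviation over the square root of the sample size) for many samples of just 3 measurements. It then checks how often the result crosses two cutoffs: 1.96, the familiar 5% cutoff for a normal curve, and 4.303, the tabulated 5% cutoff for Student's t with samples of size 3. The function names and the simulation setup are illustrative, not Gosset's own procedure.

```python
import math
import random
import statistics


def t_statistic(sample, claimed_mean):
    """How many standard errors the sample mean is from the claimed mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two measurements")
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.fmean(sample) - claimed_mean) / standard_error


def simulate_t_values(sample_size=3, trials=20_000, seed=1908):
    """t statistics for many small samples where the claimed mean is true."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        values.append(t_statistic(sample, 0.0))
    return values


def share_beyond(t_values, cutoff):
    """Fraction of t statistics whose absolute value exceeds the cutoff."""
    return sum(1 for t in t_values if abs(t) > cutoff) / len(t_values)


NORMAL_CUTOFF = 1.96   # two-sided 5% cutoff for a normal curve
T_CUTOFF_N3 = 4.303    # two-sided 5% cutoff for Student's t, sample size 3

t_values = simulate_t_values()
print("false alarms using the normal cutoff:", share_beyond(t_values, NORMAL_CUTOFF))
print("false alarms using the t cutoff:     ", share_beyond(t_values, T_CUTOFF_N3))
```

Nothing is actually wrong with any of these simulated batches, yet with only 3 measurements the normal cutoff should raise a false alarm far more often than the intended 5% (in theory close to 19%). The t cutoff should bring the rate back to about 5%.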

Gosset finally published his work in 1908 in the journal Biometrika under the pseudonym Student. Why not publish under his own name to receive credit for it? Under Guinness' rules, no one could publish their research; industry secrets and all, of course. Gosset was forced to publish the research paper under a false name. He then spent the rest of his 38-year career at Guinness, creating rigid quality control standards and brewing the dark stouts enjoyed today.

Happy St. Patrick's Day! Stay safe and healthy out there, and may the road rise to meet you. ☘☘☘☘

Sources:
https://en.wikipedia.org/wiki/Student%27s_t-distribution#History_and_etymology

Salsburg, David. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. Henry Holt and Co. New York City, NY, USA. 2002. ISBN 13: 978-0805071344.