Origins of the Normal Distribution and its relevance in the 21st Century
Get an intuitive grounding in the Normal Distribution. Glossed over by most ML courses.
This article will help you build an intuition for the Normal distribution. You will start to get a sense of where it is appropriate and where it is not. If probability distributions are a completely new concept for you, check out my Data Literacy for Leaders course.
Remember to share this article. You always look good when you share articles that explain mathematical concepts in a highly intuitive way.
19th Century Scenario
The year is 1801 and you are an astronomer. You measure the positions of stars and planets in the night sky. But you have a problem: you get a different reading each time you measure the position of a star or a planet. For example, you record the position of Mars. Then, straight away, you go to double-check the position of Mars and you get a different reading. Repeated observations under the same conditions yield different results. Astronomers have been grappling with this problem for hundreds of years.
By 1801, astronomers have realised that the variations in their measurements are caused by random measurement errors. The errors come from a combination of varying observing conditions, imperfections in their instruments, and wobbles in the observer. Each star and planet has a "true" position relative to the Earth, but what we observe is that true position plus a measurement error:
[observed position] = [true position] + [measurement error]
There are two other important facts about these measurement errors. Small errors are more likely than large errors. And an error of +x (recording [true position + x]) is just as likely as an error of -x (recording [true position - x]). The key takeaway is that the errors are "small" and "unbiased".
Note the distinction between errors and mistakes. "Errors" are an inherent part of the measurement process. Mistakes, on the other hand, are when someone doesn't do their job properly.
How do we estimate the true positions?
In the early 1800s, Carl Friedrich Gauss showed that errors in astronomical observations follow a Normal (AKA Gaussian) distribution.
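Gauss's insight justifies the astronomer's instinct to average repeated readings: if the errors are small, unbiased, and Normal, the mean of many observations homes in on the true position. Here is a minimal simulation sketch; the "true" position and the error scale are invented numbers purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

true_position = 14.57  # hypothetical "true" position (invented number)
errors = rng.normal(loc=0.0, scale=0.3, size=200)  # small, unbiased errors
observations = true_position + errors  # [observed] = [true] + [error]

# Averaging many noisy observations recovers the true position
estimate = observations.mean()
print(f"estimate = {estimate:.3f}, true = {true_position}")
```

With 200 observations, the standard error of the mean is roughly 0.3 / sqrt(200), so the estimate lands much closer to the true position than any single reading.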
A century earlier, in the early 1700s, Abraham De Moivre had discovered a discretised version of the Normal curve while calculating the probability of getting n heads in N coin tosses.
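De Moivre's observation is easy to reproduce: the exact binomial probability of n heads in N fair-coin tosses is closely tracked by a Normal curve with mean N/2 and standard deviation sqrt(N)/2. A quick sketch using only the standard library:

```python
import math

N = 100  # number of coin tosses

def binom_pmf(n):
    """Exact probability of n heads in N fair tosses."""
    return math.comb(N, n) * 0.5 ** N

def normal_approx(n):
    """De Moivre's Normal approximation: mean N/2, std sqrt(N)/2."""
    mu, sigma = N / 2, math.sqrt(N) / 2
    return math.exp(-((n - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

for n in (40, 50, 60):
    print(n, round(binom_pmf(n), 5), round(normal_approx(n), 5))
```

For N = 100 the two curves already agree to three decimal places near the centre.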
The Central Limit Theorem states that a sum of individual random deviations will be approximately Gaussian - as long as the assumptions of that particular version of the theorem hold. The individual deviations do not have to be Gaussian themselves. As a general rule, each deviation needs to be "small" relative to the others: Uniformly distributed deviations are fine; Pareto distributed deviations are not. Each little deviation that makes up the total measurement error in a single astronomical observation meets those conditions. Therefore the total error in a single astronomical observation is Gaussian.
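A small simulation, using made-up Uniform and Pareto deviations, shows the contrast. Sums of many small Uniform deviations produce a symmetric, Gaussian-looking distribution; with Pareto deviations, a single huge shock can dominate the whole sum, so the result stays heavily skewed:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_skewness(x):
    """Third standardised moment: near 0 for a Gaussian, large for skewed data."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# 10,000 repetitions of summing 1,000 individual deviations
sums_uniform = rng.uniform(-0.5, 0.5, size=(10_000, 1_000)).sum(axis=1)
sums_pareto = rng.pareto(1.5, size=(10_000, 1_000)).sum(axis=1)

print(sample_skewness(sums_uniform))  # near 0: the sums look Gaussian
print(sample_skewness(sums_pareto))   # strongly positive: still heavy-tailed
```

The Uniform case satisfies the "small relative to the others" condition; the Pareto case (here with shape 1.5, which has infinite variance) does not, so no amount of summing washes the skew out.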
Likewise, consider the idea behind Brownian motion. The total displacement, up to time T, of a particle suspended in a liquid is Gaussian. The particle knows nothing about the Gaussian distribution, but its many tiny individual deviations meet the conditions of the Central Limit Theorem.
How does this apply to the 21st Century?
Now you have some intuition about where the Central Limit Theorem can apply, and where it cannot. The next time someone in the office mentions "three standard deviations", you will have an intuition as to whether the data is actually Gaussian - and whether their "three standard deviations" heuristic is appropriate.
The popular heuristic is to label a data point as an outlier if it is “three standard deviations away from the mean”. That is fine when the data is Gaussian. For example, the final dimensions of manufactured parts are likely to be Gaussian. Hence “six sigma” makes sense for manufacturing.
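The heuristic itself is simple to implement. In this sketch the "shaft diameter" figures are simulated rather than real manufacturing data, with one defective part injected by hand:

```python
import numpy as np

def three_sigma_outliers(x):
    """Flag points more than three standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > 3

# Simulated shaft diameters in mm: a tight Gaussian process...
rng = np.random.default_rng(7)
diameters = rng.normal(25.0, 0.01, size=200)
diameters[-1] = 25.2  # ...plus one defective part injected by hand

flags = three_sigma_outliers(diameters)
print(flags.sum(), "outlier(s) flagged")
```

Because the underlying process really is Gaussian here, the rule cleanly isolates the one bad part - exactly the situation where "six sigma" thinking earns its keep.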
However, your employer's data is not guaranteed to be Gaussian - especially in business, away from the natural world. For example, consider the time that it takes to deliver a data science project. Are the deviations from the estimated time "small"? Some deviations are small: you might take a sick day. Some are quite "large": your initial approach might have been so incorrect that you have to build a completely different kind of model. Only by trying the initial, incorrect approach could you have learnt enough to know that it was incorrect.
Imagine that you are estimating how long a data science project will take. Are the deviations from the estimated time "unbiased"? Sometimes you discover that a task doesn't need to be completed, and you finish early. More often, you are dealing with unexpected data and platform issues that push the finish date out. The deviations are biased towards overruns.
In a correctly specified regression model, where all of the underlying drivers are included, Gaussian errors are a realistic assumption. This assumption becomes less realistic as more variables are omitted. Consider a time series model to predict the demand on infrastructure. It could be a power grid or it could be cloud resources used by a web application. It is likely that your model inputs will not capture all of the underlying drivers of demand. For example, large industrial consumers of electricity being paid to turn off. Or a sudden jump in web application demand due to a shout-out from a prominent influencer.
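This effect is easy to see in a simulation. Below, demand depends on one observed driver plus rare, large spikes (standing in for the influencer shout-out) that the model omits; all the numbers are invented. The omitted spikes end up in the residuals, which stop looking Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

observed_driver = rng.normal(100.0, 5.0, size=n)  # driver we include
# Omitted driver: rare demand spikes the model never sees
spikes = rng.binomial(1, 0.02, size=n) * 40.0

demand = 2.0 * observed_driver + spikes + rng.normal(0.0, 1.0, size=n)

# Fit demand using only the observed driver (intercept + slope)
X = np.column_stack([np.ones(n), observed_driver])
coef, *_ = np.linalg.lstsq(X, demand, rcond=None)
residuals = demand - X @ coef

# The spikes land in the residuals, which are now strongly right-skewed
z = (residuals - residuals.mean()) / residuals.std()
print("residual skewness:", (z ** 3).mean())
```

With the spike variable included in X, the residuals would be the Gaussian noise term alone; with it omitted, the "three standard deviations" logic applied to these residuals is on shaky ground.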
When the data is not Normally distributed, or Student-t distributed with three or more degrees of freedom, does the sample standard deviation actually mean anything? When all you have is a hammer, everything looks like a nail.
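One way to see the problem, again with simulated data: for a Gaussian sample, the sample standard deviation settles down as the sample grows; for a fat-tailed Pareto sample (shape 1.5, so the theoretical variance is infinite), it never does:

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = (1_000, 100_000, 10_000_000)

# Gaussian: the sample standard deviation converges to the true value (1.0)
normal_stds = [rng.normal(0.0, 1.0, size=n).std() for n in sizes]

# Pareto, shape 1.5: the theoretical variance is infinite,
# so the sample standard deviation never settles
pareto_stds = [rng.pareto(1.5, size=n).std() for n in sizes]

for n, g, p in zip(sizes, normal_stds, pareto_stds):
    print(f"n={n:>10,}  normal std={g:6.3f}  pareto std={p:8.2f}")
```

The Gaussian column barely moves; the Pareto column is dominated by whichever extreme observation happened to land in the sample, which is exactly why quoting "three standard deviations" for such data is close to meaningless.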
The next article explores fat-tailed distributions. We had to start here before diving deeper. Subscribers will receive all future updates directly to their inbox.
Do you have an advantage in your career?
Greater knowledge leads to greater career wins, and data science is a career where you must always keep learning. This newsletter, Professional Data Science, is a source of knowledge that will supercharge your career, because it is written by an experienced industry professional. As a subscriber, you will stay ahead of the pack by leveraging my experience.
This article might be the first time that you have seen this newsletter. You might be wondering what the broader newsletter is about. Follow this link to read how you can get an edge in your career.
Another career-boosting tactic is sharing interesting articles with colleagues whom you like. Strengthen your relationships by sharing useful information - and they might reciprocate. More importantly, they will think more highly of you for seeking out such a unique and useful article.
Further Reading
Stahl, S., The Evolution of the Normal Distribution, Mathematics Magazine, 1996
Lane, D., History of the Normal Distribution, https://onlinestatbook.com/2/normal_distribution/history_normal.html