The normal distribution is also called the Gaussian distribution or the Laplace-Gauss distribution.
Normal Distribution with Python Example
The normal distribution is a symmetric distribution in which most observations cluster around a central peak, the mean of the distribution. The standard deviation measures how widely the observations spread around the mean. Values near the mean are more likely than values far away from it. The normal distribution is a continuous probability distribution, and its two parameters, the mean and the standard deviation, fully determine the shape of its curve. The area under the curve between two points represents the probability that a value falls between those two points.
Here are some of the properties of a normally distributed population:
- The distribution is symmetric about the mean, so it cannot be used to model skewed data.
- The mean, median and mode of a normal distribution are equal.
- Half of the population is less than the mean and half is greater than the mean.
- The empirical rule of the normal distribution states that about 68% of the observations fall within +/- 1 standard deviation of the mean, about 95% fall within +/- 2 standard deviations of the mean and about 99.7% fall within +/- 3 standard deviations of the mean.
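The empirical rule can be checked numerically. Here is a minimal sketch, assuming SciPy is installed; the variable names `dist` and `coverage` are illustrative only. It uses the cumulative distribution function to measure how much probability lies within 1, 2 and 3 standard deviations of the mean.

```python
from scipy import stats

# Standard normal distribution (mean 0, standard deviation 1);
# the percentages are the same for any mean and standard deviation
dist = stats.norm(0, 1)

# Probability of falling within k standard deviations of the mean
coverage = {k: dist.cdf(k) - dist.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within +/- {k} standard deviation(s): {p:.4f}")
# within +/- 1 standard deviation(s): 0.6827
# within +/- 2 standard deviation(s): 0.9545
# within +/- 3 standard deviation(s): 0.9973
```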
Here is the probability density function for the normal distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

In the above function, μ represents the mean and σ represents the standard deviation. Given a value of the random variable x, the function returns the probability density at that value. Probabilities are obtained as areas under this curve.
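One point worth illustrating is that the density function returns a density, not a probability. The sketch below assumes a deliberately narrow distribution (mean 0, standard deviation 0.1, chosen only for illustration) to show that a density value can exceed 1, while the probability of a small interval is roughly the density multiplied by the interval width.

```python
from scipy import stats

# A narrow normal distribution (illustrative parameters)
narrow = stats.norm(0, 0.1)

# Density at the mean: larger than 1, so it cannot be a probability
density_at_mean = narrow.pdf(0.0)

# Probability of a small interval centred on the mean
width = 0.01
prob_interval = narrow.cdf(width / 2) - narrow.cdf(-width / 2)

# For a small interval, probability is approximately density * width
approx = density_at_mean * width

print(density_at_mean)
print(prob_interval, approx)
```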
Here is a sample plot of a normal distribution with a mean of 5 and a standard deviation of 10. The plot is created for values of the random variable between -100 and 100.
The following code can be used to generate the above normal distribution plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Create a normal distribution with a mean of 5 and a standard deviation of 10
mu = 5
std = 10
dist = stats.norm(mu, std)

# Generate 1000 evenly spaced values between -100 and 100
x = np.linspace(-100, 100, 1000)

# Plot the probability density for the values of the random variable
# (the x-axis is limited to -60..60 so the curve is easier to read)
plt.figure(figsize=(7.5, 7.5))
plt.plot(x, dist.pdf(x))
plt.xlim(-60, 60)
plt.title('Normal Distribution (Mean = 5, STD = 10)', fontsize=15)
plt.xlabel('Values of Random Variable X', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.show()
```
Several normal distributions with different means and standard deviations can be drawn on the same plot. The following code can be used to create such a plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Values of the random variable
x = np.linspace(-10, 10, 100)

plt.figure(figsize=(7.5, 7.5))

# Normal distribution with mean 0 and standard deviation 1
plt.plot(x, stats.norm(0, 1).pdf(x))

# Normal distribution with mean 1 and standard deviation 0.75
plt.plot(x, stats.norm(1, 0.75).pdf(x))

# Normal distribution with mean 2 and standard deviation 1.5
plt.plot(x, stats.norm(2, 1.5).pdf(x))

plt.xlim(-10, 10)
plt.title('Normal Distribution', fontsize=15)
plt.xlabel('Values of Random Variable X', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.show()
```
Standard Normal Distribution with Python Example
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
Here is the Python code for plotting the standard normal distribution. Pay attention to the following in the code given below:
- The SciPy stats module is used to create an instance of the standard normal distribution with a mean of 0 and a standard deviation of 1 (stats.norm).
- The probability density function pdf() is invoked on that instance to evaluate the density at different values of the random variable.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Create a standard normal distribution with a mean of 0 and a standard deviation of 1
mu = 0
std = 1
snd = stats.norm(mu, std)

# Generate 100 evenly spaced values between -5 and 5
x = np.linspace(-5, 5, 100)

# Plot the standard normal density for the values of the random variable
# falling in the range -5 to 5
plt.figure(figsize=(7.5, 7.5))
plt.plot(x, snd.pdf(x))
plt.xlim(-5, 5)
plt.title('Normal Distribution', fontsize=15)
plt.xlabel('Values of Random Variable X', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.show()
```
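As a side note, stats.norm can be used in two ways: by creating an instance with fixed parameters (often called a "frozen" distribution) or by passing loc and scale on every call. The short sketch below, with illustrative variable names, suggests that both give the same density values.

```python
import numpy as np
from scipy import stats

x = np.linspace(-5, 5, 100)

# Option 1: fix the parameters once, then call methods on the instance
frozen = stats.norm(loc=0, scale=1)
pdf_frozen = frozen.pdf(x)

# Option 2: pass the parameters on every call
pdf_direct = stats.norm.pdf(x, loc=0, scale=1)

print(np.allclose(pdf_frozen, pdf_direct))

# The instance also reports the parameters it was created with
print(frozen.mean(), frozen.std())
```

The instance form is convenient when the same distribution is queried several times, for example for both pdf() and cdf().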
Conclusions
Here is a summary of what you learned in this post about the normal distribution:
- The normal distribution is a symmetric probability distribution with half of the observations on either side of the mean.
- The parameters that determine the shape and probabilities of the normal distribution are the mean and the standard deviation.
- The Python SciPy stats module can be used to create a normal distribution with given mean and standard deviation parameters using stats.norm.
- The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
- In a normal distribution, about 68% of observations lie within 1 standard deviation, about 95% lie within 2 standard deviations and about 99.7% lie within 3 standard deviations of the mean.
Even if you are not in the field of statistics, you have probably come across the term “Normal Distribution”.
A probability distribution is a statistical function that describes the likelihood of each possible value that a random variable can take. In other words, it describes the range of values a variable can take, and how likely each one is, when we draw values from it at random.
A probability distribution can be discrete or continuous.
Suppose that in a city the heights of adults aged 20-30 years range from 4.5 ft to 7 ft.
If we were asked to pick one adult at random and say what his or her height would be (assuming gender does not affect height), there is no way to know the answer in advance. But if we have the distribution of heights of adults in the city, we can bet on the most probable outcome.
What is Normal Distribution?
A normal distribution is also known as a Gaussian distribution or, more famously, the bell curve. People use these terms interchangeably, and they all mean the same thing. It is a continuous probability distribution.
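The probability density function (pdf) for the normal distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

where μ = mean, σ = standard deviation, and x = input value.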
Terminology:
- Mean – The mean is the usual average: the sum of all values divided by the number of values.
- Standard Deviation – Standard deviation tells us how “spread out” the data is. It is a measure of how far the observed values typically lie from the mean.
The formula may look daunting, but it is very simple.
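To make the two terms concrete, here is a small sketch that computes the mean and standard deviation step by step and compares them with NumPy's built-in functions. The `heights` values are made up for illustration, and the population form of the standard deviation (dividing by n) is assumed, which is also NumPy's default.

```python
import numpy as np

# Hypothetical heights in feet (made-up values for illustration)
heights = np.array([5.1, 5.4, 4.9, 5.8, 5.3])

# Mean: sum of all values divided by the number of values
mean = heights.sum() / heights.size

# Standard deviation: square root of the average squared distance from the mean
variance = ((heights - mean) ** 2).sum() / heights.size
sd = np.sqrt(variance)

print(mean, sd)

# The same quantities using NumPy's helpers
print(np.mean(heights), np.std(heights))
```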
1. Example Implementation of Normal Distribution
Let’s have a look at the code below. We’ll use numpy and matplotlib for this demonstration:
```python
# Importing required libraries
import numpy as np
import matplotlib.pyplot as plt

# Creating a series of 200 evenly spaced data points in the range 1-50
x = np.linspace(1, 50, 200)

# Normal probability density function
def normal_dist(x, mean, sd):
    prob_density = 1 / (sd * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mean) / sd) ** 2)
    return prob_density

# Calculate the mean and standard deviation of the data
mean = np.mean(x)
sd = np.std(x)

# Apply the function to the data
pdf = normal_dist(x, mean, sd)

# Plotting the results
plt.plot(x, pdf, color='red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.show()
```
2. Properties of Normal Distribution
The normal distribution density function accepts a data point along with a mean and a standard deviation, and returns a value that we call the probability density.
We can alter the position and shape of the bell curve by changing the mean and standard deviation.
Changing the mean shifts the curve so that it is centered on the new mean. In other words, the mean controls the position of the curve while its shape remains intact.
The shape of the curve is controlled by the standard deviation. A smaller standard deviation results in a narrower, taller curve, while a larger value results in a wider, flatter curve.
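The effect of the two parameters can be seen without plotting. The sketch below uses a hypothetical helper named `peak` that reports where the density is highest and how high it gets, for a few illustrative parameter choices.

```python
import numpy as np
from scipy.stats import norm

# Grid of x values on which the densities are evaluated
x = np.linspace(-10, 10, 2001)

def peak(mean, sd):
    """Return (location, height) of the highest point of the density on the grid."""
    density = norm.pdf(x, loc=mean, scale=sd)
    i = np.argmax(density)
    return x[i], density[i]

base = peak(0, 1)      # reference curve
shifted = peak(2, 1)   # same spread, different mean
wider = peak(0, 2)     # same mean, larger standard deviation

print("base   :", base)
print("shifted:", shifted)
print("wider  :", wider)
```

Changing the mean moves the peak from 0 to 2 without changing its height, while doubling the standard deviation halves the height because the same total area is spread over a wider range.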
Some useful properties of a normal distribution:
- The mean, mode, and median are all equal.
- The total area under the curve is equal to 1.
- The curve is symmetric around the mean.
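These three properties can be checked numerically. The following is a rough sketch, assuming an arbitrary mean of 1 and standard deviation of 2 (values chosen only for illustration). It uses scipy.integrate.quad for the area and the ppf() method, the inverse of the CDF, for the median.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Illustrative parameters (assumed for this sketch)
mean, sd = 1.0, 2.0
dist = norm(loc=mean, scale=sd)

# Total area under the density curve, integrated numerically
total_area, _ = integrate.quad(dist.pdf, -np.inf, np.inf)

# The median is the 50th percentile
median = dist.ppf(0.5)

# The mode is where the density peaks; locate it on a fine grid
grid = np.linspace(mean - 5 * sd, mean + 5 * sd, 10_001)
mode = grid[np.argmax(dist.pdf(grid))]

# Symmetry: equal densities at equal distances on both sides of the mean
offsets = np.array([0.5, 1.0, 2.5])
symmetric = np.allclose(dist.pdf(mean - offsets), dist.pdf(mean + offsets))

print(total_area, median, mode, symmetric)
```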
The empirical rule tells us that, approximately:
- 68% of the data falls within one standard deviation of the mean.
- 95% of the data falls within two standard deviations of the mean.
- 99.7% of the data falls within three standard deviations of the mean.
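The empirical rule can also be checked on simulated data. The sketch below draws samples with NumPy's random generator; the seed, the sample size, and the mean of 5.3 and standard deviation of 1 are arbitrary choices for illustration. Because the data is random, the fractions will only be close to 68%, 95% and 99.7%, not exactly equal.

```python
import numpy as np

# Draw simulated data from a normal distribution (illustrative parameters)
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=5.3, scale=1.0, size=100_000)

# Estimate the mean and standard deviation from the data itself
mean = samples.mean()
sd = samples.std()

# Fraction of samples within 1, 2 and 3 standard deviations of the mean
fractions = {k: np.mean(np.abs(samples - mean) <= k * sd) for k in (1, 2, 3)}

for k, frac in fractions.items():
    print(f"within {k} standard deviation(s): {frac:.3f}")
```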
It is one of the most important distributions in all of statistics. The normal distribution is remarkable because many naturally occurring phenomena approximately follow it. For example, blood pressure, IQ scores, and heights are approximately normally distributed.
Calculating Probabilities with Normal Distribution
To find the probability of a value occurring within a range in a normal distribution, we just need to find the area under the curve in that range, i.e. we need to integrate the density function over that range.
Since the normal distribution is a continuous distribution, probabilities correspond to areas under the curve rather than to the height of the curve at a single point.
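The idea of "area under the curve" can be sketched with a simple trapezoid sum. The range from -1 to 2 and the standard normal parameters below are assumed purely for illustration; the result is compared with the difference of two cumulative probabilities.

```python
import numpy as np
from scipy.stats import norm

dist = norm(loc=0, scale=1)

# Range over which we want the probability (illustrative values)
a, b = -1.0, 2.0

# Approximate the area under the density curve with thin trapezoids
x = np.linspace(a, b, 10_001)
y = dist.pdf(x)
area = np.sum((y[:-1] + y[1:]) / 2 * np.diff(x))

# The same probability from the cumulative distribution function
exact = dist.cdf(b) - dist.cdf(a)

print(area, exact)
```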
Before getting into the details, let’s first see what a standard normal distribution is.
A standard normal distribution is a normal distribution with mean = 0 and standard deviation = 1. Any value x from a normal distribution can be converted to the standard normal scale with:
z = (x - μ) / σ
The z value above is also known as a z-score. A z-score tells you how many standard deviations a data point is from the mean.
If we intend to calculate the probabilities manually, we need to look up our z-value in a z-table to find the cumulative probability. Python provides modules that do this work for us. Let’s get into it.
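As a first taste, the z-table lookup can be sketched in a few lines. The mean, standard deviation and x value below are assumed for illustration. The sketch converts x to a z-score, looks up the cumulative probability on the standard normal distribution, and compares it with asking the original distribution directly.

```python
from scipy.stats import norm

# Illustrative parameters and data point (assumed values)
mu, sigma = 5.3, 1.0
x = 4.5

# Convert x to a z-score: the number of standard deviations from the mean
z = (x - mu) / sigma

# "Z-table lookup": cumulative probability on the standard normal distribution
p_from_z = norm.cdf(z)

# The same probability computed directly on the original scale
p_direct = norm.cdf(x, loc=mu, scale=sigma)

print(z, p_from_z, p_direct)
```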
1. Creating the Normal Curve
We’ll use norm from scipy.stats to calculate probabilities from the normal distribution.
Suppose we have data on the heights of adults in a town, the data follows a normal distribution, and the sample is large enough to estimate a mean of 5.3 ft and a standard deviation of 1 ft.
This information is sufficient to make a normal curve.
```python
# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# Creating the distribution
data = np.arange(1, 10, 0.01)
pdf = norm.pdf(data, loc=5.3, scale=1)

# Visualizing the distribution
# (x and y are passed as keywords, which recent seaborn versions require)
sb.set_style('whitegrid')
sb.lineplot(x=data, y=pdf, color='black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
plt.show()
```
The norm.pdf() method takes loc and scale along with the data as input arguments and returns the probability density values. loc is the mean and scale is the standard deviation of the distribution. The code is similar to what we created in the prior section but much shorter.
2. Calculating Probability of Specific Data Occurrence
Now, if we were asked to pick one person randomly from this distribution, what is the probability that the height of the person will be smaller than 4.5 ft?
The area under the curve to the left of 4.5 ft is the probability that the height of a person chosen randomly from the distribution will be smaller than 4.5 ft. Let’s see how we can calculate this in Python.
The area under the curve is simply the integral of the density function from -∞ to 4.5.
```python
norm(loc=5.3, scale=1).cdf(4.5)
```

0.211855 or 21.19 %
The single line of code above shows that if a person is chosen randomly from a normal distribution with a mean of 5.3 and a standard deviation of 1, there is a 21.19% chance that the height of the person will be below 4.5 ft.
We create a norm distribution object with the mean and standard deviation, then call its .cdf() method, passing the value up to which we need the cumulative probability. The cumulative distribution function (CDF) gives the cumulative probability for a given x-value.
Cumulative probability value from -∞ to ∞ will be equal to 1.
Now suppose we are again asked to pick one person randomly from this distribution. What is the probability that the height of the person will be between 4.5 ft and 6.5 ft?

```python
cdf_upper_limit = norm(loc=5.3, scale=1).cdf(6.5)
cdf_lower_limit = norm(loc=5.3, scale=1).cdf(4.5)

prob = cdf_upper_limit - cdf_lower_limit
print(prob)
```

0.673075 or 67.31 %
The above code first calculates the cumulative probability from -∞ to 6.5 and then the cumulative probability from -∞ to 4.5. If we subtract the CDF at 4.5 from the CDF at 6.5, the result is the area under the curve between the limits 4.5 and 6.5.
Now, what if we were asked about the probability that the height of a person chosen randomly will be above 6.5ft?
It’s simple, as we know the total area under the curve equals 1, and if we calculate the cumulative probability value from -∞ to 6.5 and subtract it from 1, the result will be the probability that the height of a person chosen randomly will be above 6.5ft.
```python
cdf_value = norm(loc=5.3, scale=1).cdf(6.5)
prob = 1 - cdf_value
print(prob)
```

0.115070 or 11.51 %
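SciPy distributions also provide a survival function, sf(), which returns 1 - cdf directly. A minimal sketch, reusing the same mean of 5.3 and standard deviation of 1 (the variable names are illustrative):

```python
from scipy.stats import norm

# Height distribution with the parameters used in this section
heights = norm(loc=5.3, scale=1)

# Probability of a height above 6.5 ft, computed two ways
p_complement = 1 - heights.cdf(6.5)
p_sf = heights.sf(6.5)

print(p_complement, p_sf)
```

Both approaches give the same answer here. sf() is mainly a convenience, and it can be more accurate far out in the upper tail, where 1 - cdf loses precision.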
That’s a lot to take in, but I encourage you to keep practicing this essential concept along with its implementation in Python.
The complete code from the above implementation:

```python
# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# Creating the distribution
data = np.arange(1, 10, 0.01)
pdf = norm.pdf(data, loc=5.3, scale=1)

# Probability of height to be under 4.5 ft.
prob_1 = norm(loc=5.3, scale=1).cdf(4.5)
print(prob_1)

# Probability that the height of the person will be between 4.5 and 6.5 ft.
cdf_upper_limit = norm(loc=5.3, scale=1).cdf(6.5)
cdf_lower_limit = norm(loc=5.3, scale=1).cdf(4.5)

prob_2 = cdf_upper_limit - cdf_lower_limit
print(prob_2)

# Probability that the height of a person chosen randomly will be above 6.5 ft
cdf_value = norm(loc=5.3, scale=1).cdf(6.5)
prob_3 = 1 - cdf_value
print(prob_3)
```