Box plots and Outlier Detection
- A box plot draws a box from the lower quartile (LQ) to the upper quartile (UQ), with the median marked inside the box.
- It is a graphical five-number summary of the data: minimum, LQ, median, UQ, maximum.
- It gives us an idea of how the data is distributed.
- It helps us spot outliers easily.
- 25% of the population lies below the first quartile (LQ).
- 75% of the population lies below the third quartile (UQ).
- If the box is pushed to one side and some values lie far away from it, that is a strong indication of outliers.
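The five numbers listed above can be computed directly. The following is a minimal sketch on a small, hypothetical sample (the values are made up for illustration and are not taken from the bank dataset). It assumes NumPy's default linear interpolation between data points.

```python
import numpy as np

# Hypothetical sample, used only for illustration.
values = np.array([5, 7, 8, 9, 10, 11, 12, 13, 15, 120])

# The 0th, 25th, 50th, 75th and 100th percentiles give the five-number summary.
minimum, lq, median, uq, maximum = np.percentile(values, [0, 25, 50, 75, 100])

five_number_summary = {
    "minimum": minimum,
    "LQ": lq,
    "median": median,
    "UQ": uq,
    "maximum": maximum,
}
print(five_number_summary)
```

The box in a box plot spans LQ to UQ, so for this sample it would be narrow and sit near the bottom of the axis, with the value 120 far above it.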

- A set of values lying far from the box is a strong indication of outliers.
- In this example the minimum is 5, the maximum is 120, and 75% of the values are less than 15.
- Yet some records reach 120, which strongly suggests outliers.
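One common way to turn this visual impression into a rule is the 1.5 × IQR convention, which is also matplotlib's default for deciding which points to draw beyond the whiskers. The sketch below applies it to a hypothetical sample built to resemble the example (minimum 5, maximum 120, 75% of values below 15). The data and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sample: 15 of the 20 values (75%) are below 15, the maximum is 120.
values = np.array([5, 6, 7, 8, 8, 9, 9, 10, 10, 11,
                   11, 12, 12, 13, 14, 16, 18, 95, 110, 120])

lq, uq = np.percentile(values, [25, 75])
iqr = uq - lq  # interquartile range: the height of the box

# Points beyond these fences are flagged as outliers under the 1.5 * IQR convention.
lower_fence = lq - 1.5 * iqr
upper_fence = uq + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

The 1.5 multiplier is a convention rather than a law. Whether a flagged value is a genuine error or just a rare but valid record still needs a look at the data.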

- Sometimes the outliers are so extreme that the box is squeezed into what looks like a horizontal line in the box plot.

Box plots and outlier detection in Python
In [30]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Box plot of the balance column of the bank DataFrame
plt.boxplot(bank.balance)
Out[30]:
{'boxes': [<matplotlib.lines.Line2D at 0xcbcd400>], 'caps': [<matplotlib.lines.Line2D at 0xcbdde10>, <matplotlib.lines.Line2D at 0xcbddf28>], 'fliers': [<matplotlib.lines.Line2D at 0xccc4f98>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0xccc4780>], 'whiskers': [<matplotlib.lines.Line2D at 0xcbcdda0>, <matplotlib.lines.Line2D at 0xcbcdeb8>]}
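The dictionary printed above is the return value of `plt.boxplot`: each key maps to a list of `Line2D` artists that make up the plot. As a rough sketch, the points drawn as outliers can be read back from the `'fliers'` entry. The sample below is hypothetical and stands in for `bank.balance`. The `Agg` backend line is only there so the sketch runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample standing in for bank.balance.
values = np.array([5, 6, 7, 8, 8, 9, 9, 10, 10, 11,
                   11, 12, 12, 13, 14, 16, 18, 95, 110, 120])

fig, ax = plt.subplots()
parts = ax.boxplot(values)  # same kind of dict as shown in Out[30]

# For a vertical box plot, the y-data of each artist holds the plotted values.
median_value = parts["medians"][0].get_ydata()[0]
flier_values = np.sort(parts["fliers"][0].get_ydata())          # points beyond the whiskers
whisker_ends = [w.get_ydata()[1] for w in parts["whiskers"]]    # lower end, upper end

print("median:", median_value)
print("whisker ends:", whisker_ends)
print("fliers:", flier_values)
```

In a notebook, ending the plotting line with a semicolon or assigning its result to a variable keeps this dictionary from being echoed as cell output.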

Practice: Box plots and outlier detection
- Dataset: “./Bank Marketing/bank_market.csv”
- Draw a box plot for the balance variable.
- Do you suspect any outliers in balance?
- Get the relevant percentiles and look at their distribution.
- Draw a box plot for the age variable.
- Do you suspect any outliers in age?
- Get the relevant percentiles and look at their distribution.
In [31]:
plt.boxplot(bank.balance)
Out[31]:
{'boxes': [<matplotlib.lines.Line2D at 0xcc78208>], 'caps': [<matplotlib.lines.Line2D at 0xcc7fc18>, <matplotlib.lines.Line2D at 0xcc7fd30>], 'fliers': [<matplotlib.lines.Line2D at 0xcc84da0>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0xcc84588>], 'whiskers': [<matplotlib.lines.Line2D at 0xcc78ba8>, <matplotlib.lines.Line2D at 0xcc78cc0>]}

Outliers are present in the balance variable.
In [32]:
# Get relevant percentiles and see their distribution
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[32]:
0.0     -8019.0
0.1         0.0
0.2        22.0
0.3       131.0
0.4       272.0
0.5       448.0
0.6       701.0
0.7      1126.0
0.8      1859.0
0.9      3574.0
1.0    102127.0
Name: balance, dtype: float64
In [33]:
# Draw a box plot for age variable
plt.boxplot(bank.age)
Out[33]:
{'boxes': [<matplotlib.lines.Line2D at 0xcf54470>], 'caps': [<matplotlib.lines.Line2D at 0xcf5be80>, <matplotlib.lines.Line2D at 0xcf5bf98>], 'fliers': [<matplotlib.lines.Line2D at 0xcf65748>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0xcf617f0>], 'whiskers': [<matplotlib.lines.Line2D at 0xcf54e10>, <matplotlib.lines.Line2D at 0xcf54f28>]}

No extreme outliers are apparent in the age variable.
In [34]:
# Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[34]:
0.0    18.0
0.1    29.0
0.2    32.0
0.3    34.0
0.4    36.0
0.5    39.0
0.6    42.0
0.7    46.0
0.8    51.0
0.9    56.0
1.0    95.0
Name: age, dtype: float64
The next post is about creating graphs in Python.