Outlier Detection using Boxplot in Python

Box plots and Outlier Detection

  • A box plot draws a box from the lower quartile (LQ) to the upper quartile (UQ), with the median marked inside.
  • It gives a graphical five-number summary of the data: minimum, LQ, median, UQ, and maximum.
  • It gives us an idea of how the data is distributed.
  • It helps us identify outliers easily.
  • 25% of the population is below the first quartile (LQ).
  • 75% of the population is below the third quartile (UQ).
  • If the box is pushed to one side and some values lie far away from it, that is a clear indication of outliers.
  • In this example the minimum is 5, the maximum is 120, and 75% of the values are less than 15.
  • Yet some records reach 120, which is a clear indication of outliers.
  • Sometimes the outliers are so extreme that the box appears as a horizontal line in the box plot.
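
To make the "far away from the box" idea concrete, here is a minimal sketch of the common 1.5 × IQR rule, which is also the default whisker rule in matplotlib box plots. The sample values are hypothetical, picked only to resemble the example above (minimum 5, maximum 120, 75% of values below 15). They are not taken from any real dataset.

```python
import numpy as np

# Hypothetical sample: most values are small, two are very large
values = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 90, 120])

# Quartiles and median (the middle three of the five-number summary)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Points beyond 1.5 * IQR from the box edges are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("min, LQ, median, UQ, max:", values.min(), q1, median, q3, values.max())
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

The multiplier 1.5 is a convention, not a law. A larger multiplier flags fewer points.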

Box plots and outlier detection in Python

In [30]:

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# `bank` is the Bank Marketing DataFrame, assumed to be loaded in an earlier cell
plt.boxplot(bank.balance)

Out[30]:

{'boxes': [<matplotlib.lines.Line2D at 0xcbcd400>],
 'caps': [<matplotlib.lines.Line2D at 0xcbdde10>,
  <matplotlib.lines.Line2D at 0xcbddf28>],
 'fliers': [<matplotlib.lines.Line2D at 0xccc4f98>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0xccc4780>],
 'whiskers': [<matplotlib.lines.Line2D at 0xcbcdda0>,
  <matplotlib.lines.Line2D at 0xcbcdeb8>]}
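
The dictionary returned by plt.boxplot is more than noise: its 'fliers' entry holds the points drawn beyond the whiskers. The sketch below reads those values back. It assumes a hypothetical sample array rather than the bank data, and it uses the "Agg" backend only so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample with two obviously large values
values = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 90, 120])

fig, ax = plt.subplots()
result = ax.boxplot(values)  # same dictionary as shown in the output above

# For a vertical box plot, the y-data of the flier line are the outlier values
flier_values = result["fliers"][0].get_ydata()
print("points drawn as outliers:", sorted(flier_values))

plt.close(fig)
```

With the real data, the same lookup on the result of plt.boxplot(bank.balance) should list the balances that the plot marks as outliers.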

Practice: Box plots and outlier detection

  • Dataset: “./Bank Marketing/bank_market.csv”
  • Draw a box plot for the balance variable.
  • Do you suspect any outliers in balance?
  • Get the relevant percentiles and look at their distribution.
  • Draw a box plot for the age variable.
  • Do you suspect any outliers in age?
  • Get the relevant percentiles and look at their distribution.

In [31]:

plt.boxplot(bank.balance)

Out[31]:

{'boxes': [<matplotlib.lines.Line2D at 0xcc78208>],
 'caps': [<matplotlib.lines.Line2D at 0xcc7fc18>,
  <matplotlib.lines.Line2D at 0xcc7fd30>],
 'fliers': [<matplotlib.lines.Line2D at 0xcc84da0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0xcc84588>],
 'whiskers': [<matplotlib.lines.Line2D at 0xcc78ba8>,
  <matplotlib.lines.Line2D at 0xcc78cc0>]}

Outliers are present in the balance variable.

In [32]:

# Get relevant percentiles and see their distribution
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[32]:

0.0     -8019.0
0.1         0.0
0.2        22.0
0.3       131.0
0.4       272.0
0.5       448.0
0.6       701.0
0.7      1126.0
0.8      1859.0
0.9      3574.0
1.0    102127.0
Name: balance, dtype: float64
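
The deciles show the long tails clearly: 90% of balances are at most 3574, yet the maximum is 102127 and the minimum is -8019. One way to turn that impression into a list of suspect records is the 1.5 × IQR rule. The helper below, iqr_outliers, is a hypothetical name introduced for this sketch. The sample Series is only a stand-in built from the eleven decile values printed above, not the real balance column.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` that fall outside the k * IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Stand-in data: the decile values from the output above, NOT the full column
sample = pd.Series([-8019, 0, 22, 131, 272, 448, 701, 1126, 1859, 3574, 102127])

flagged = iqr_outliers(sample)
print(flagged)
```

On the real data the call would be iqr_outliers(bank['balance']). The number of flagged rows will differ from this toy run.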

In [33]:

# Draw a box plot for age variable
plt.boxplot(bank.age)

Out[33]:

{'boxes': [<matplotlib.lines.Line2D at 0xcf54470>],
 'caps': [<matplotlib.lines.Line2D at 0xcf5be80>,
  <matplotlib.lines.Line2D at 0xcf5bf98>],
 'fliers': [<matplotlib.lines.Line2D at 0xcf65748>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0xcf617f0>],
 'whiskers': [<matplotlib.lines.Line2D at 0xcf54e10>,
  <matplotlib.lines.Line2D at 0xcf54f28>]}

No obvious outliers are present in the age variable.

In [34]:

# Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[34]:

0.0    18.0
0.1    29.0
0.2    32.0
0.3    34.0
0.4    36.0
0.5    39.0
0.6    42.0
0.7    46.0
0.8    51.0
0.9    56.0
1.0    95.0
Name: age, dtype: float64

The next post is about creating graphs in Python.
