What is an outlier?

An outlier is a data point that lies far from the other observations in a data set, that is, outside the overall distribution of the data.

What are the criteria to identify an outlier?

A data point that lies more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile.
A data point that lies more than 3 standard deviations from the mean. We can use a z score: if the z score falls outside of 3 standard deviation…

Why do outliers exist in a dataset?

An outlier could exist in a dataset due to:
Variability in the data
An experimental measurement error

What is the impact of an outlier?

An outlier can:
Cause serious issues for statistical analysis
Skew the data
Have a significant impact on the mean
Have a significant impact on the standard deviation
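
As a small illustration of the impact on the mean and standard deviation, the sketch below compares the mean, median and standard deviation of a short sample with and without one extreme value. The numbers are made up for this example and are not part of the dataset used later.

```python
import numpy as np

# Made-up sample for illustration only.
clean = np.array([10, 12, 11, 13, 12, 11, 10, 12])
# The same sample with a single extreme value appended.
with_outlier = np.append(clean, 100)

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print("{}: mean={:.2f}, median={:.2f}, std={:.2f}".format(
        name, np.mean(values), np.median(values), np.std(values)))
```

The median barely moves when the extreme value is added, while the mean and the standard deviation change substantially.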

How can we identify an outlier?

Using box plots
Using the z score
Using the interquartile range (IQR)

We will use a NumPy array of random integers as our dataset:

import numpy as np

# Fixing random state for reproducibility
np.random.seed(1)

# generate 100 random integers in the range 0 to 49
spread = np.random.randint(50, size=100)
print(spread)

[37 43 12  8  9 11  5 15  0 16  1 12  7 45  6 25 20 37 18 20 11 42 28 29
 14  4 23 23 41 49 30 32 22 13 41  9  7 22  1  0 17  8 24 13 47 42  8 30
  7  3  6 21 49  3  4 24 49 43 12 26 16 45 41 18 15  0  4 25 47 34 23  7
 26 25 40 22  9  3 39 23 36 27 37 19 38  8 32 34 10 23 15 47 23 25  7 28
 10 46 32 24]

The seed() function initializes the random number generator. The generator needs a starting value (the seed) to produce its sequence, and using the same seed reproduces the same sequence of numbers.
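
A minimal sketch of what seeding means in practice: resetting the generator to the same seed and drawing again gives the same values. The variable names `first` and `second` are chosen for this example only.

```python
import numpy as np

np.random.seed(1)
first = np.random.randint(50, size=5)

# Reset the generator to the same seed and draw again.
np.random.seed(1)
second = np.random.randint(50, size=5)

print(first)
print(second)
# The two draws are identical, so this prints True.
print(np.array_equal(first, second))
```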

Let’s introduce outliers in our dataset:

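import random

# 4 high outliers in the range 100 to 109
flier_high = np.random.randint(10, size=4) + 100
# 5 low outliers in the range -109 to -100
flier_low = (np.random.randint(10, size=5) + 100) * -1
dataset = np.concatenate((spread, flier_high, flier_low))
# random.shuffle uses Python's own generator, so np.random.seed does not fix this order
random.shuffle(dataset)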

Using Box Plot

import matplotlib.pyplot as plt

fig1, ax1 = plt.subplots()
ax1.set_title('Basic Plot')
ax1.boxplot(dataset)

{'whiskers': [<matplotlib.lines.Line2D object at 0x7f01447f3dd8>, <matplotlib.lines.Line2D object at 0x7f0144702470>], 'caps': [<matplotlib.lines.Line2D object at 0x7f01447027f0>, <matplotlib.lines.Line2D object at 0x7f0144702b70>], 'boxes': [<matplotlib.lines.Line2D object at 0x7f01446ecfd0>], 'medians': [<matplotlib.lines.Line2D object at 0x7f0144702ef0>], 'fliers': [<matplotlib.lines.Line2D object at 0x7f014470f2b0>], 'means': []}

plt.show()

Using Z score

Z score = (Observation - Mean) / Standard Deviation

We first import the libraries:

import numpy as np
import pandas as pd

Let’s take the dataset of numeric values and calculate:

the mean and standard deviation of all the values in the dataset
the z score for every data point in the dataset: if the absolute value of the z score is greater than 3, we classify that point as an outlier. Any point more than 3 standard deviations from the mean is an outlier.

threshold = 3
mean = np.mean(dataset)
std = np.std(dataset)
# z score of every data point
z = (dataset - mean) / std
# keep the points whose absolute z score exceeds the threshold
print(dataset[np.abs(z) > threshold])

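[-100 -108 -103 -106 -104]

Only the five low values are flagged here. The four high values introduced earlier do not exceed the threshold, because the outliers themselves shift the mean and inflate the standard deviation.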

import seaborn as sns
sns.set(style="whitegrid")
# box plot of the z scores
ax = sns.boxplot(x=z)
plt.show()
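
The z score steps can also be wrapped in a reusable function. The sketch below is one possible way to do it; the function name `zscore_outliers`, its `threshold` parameter and the `sample` data are assumptions made for this example and do not come from any library.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z score exceeds `threshold`.

    Hypothetical helper written for this example.
    """
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical, so no point can be an outlier.
        return values[:0]
    z = (values - values.mean()) / std
    return values[np.abs(z) > threshold]

# Made-up sample: 40 values close together and one extreme value.
sample = [10] * 20 + [11] * 20 + [200]
print(zscore_outliers(sample))
```

Taking the absolute value of the z score means that both very high and very low points are checked against the same threshold.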

Using IQR

The IQR tells us how spread out the middle 50% of the values are. It can be used to tell when a value is too far from the middle. An outlier is a point that falls more than 1.5 times the interquartile range above the third quartile or below the first quartile. We will use the same dataset.

Calculate the first quartile (q1) and the third quartile (q3)
Find the interquartile range: iqr = q3 - q1
Find the lower bound: q1 - 1.5 * iqr
Find the upper bound: q3 + 1.5 * iqr
Anything that lies outside the lower and upper bounds is an outlier

Let’s find the first quartile and third quartile:

q1, q3 = np.percentile(dataset, [25, 75])
print("First quartile: {}, third quartile: {}".format(q1, q3))

First quartile: 8.0, third quartile: 34.0

Find the IQR, which is the difference between the third and first quartiles:

iqr = q3 - q1
print("IQR is: {}".format(iqr))

IQR is: 26.0

Find the lower and upper bounds:

lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
print("The lower bound is {} and the upper bound is {}".format(lower_bound, upper_bound))

The lower bound is -31.0 and the upper bound is 73.0

# select the points below the lower bound or above the upper bound
print(dataset[(dataset < lower_bound) | (dataset > upper_bound)])

[-100  101  107  101 -108 -103 -106  107 -104]
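
The IQR steps can be collected into a single function. This is a minimal sketch under stated assumptions: the function name `iqr_outliers`, the multiplier parameter `k` and the `sample` data are made up for this example.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (outliers, lower_bound, upper_bound) using the IQR rule.

    Hypothetical helper written for this example; `k` is the multiplier
    applied to the interquartile range (1.5 in the text above).
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    mask = (values < lower_bound) | (values > upper_bound)
    return values[mask], lower_bound, upper_bound

# Made-up sample with one high and one low extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50, -40]
found, low, high = iqr_outliers(sample)
print("Bounds: {} to {}".format(low, high))
print("Outliers: {}".format(found))
```

Unlike the z score, the quartiles are barely affected by a few extreme values, which is why this rule flagged both the high and the low outliers in the dataset above.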