An outlier is a data point that lies far from the other observations in a dataset, that is, outside the overall distribution of the data.
Two common rules of thumb define "far": a data point that falls more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile, or a data point that lies more than 3 standard deviations from the mean. For the second rule we can use a z score: if the z score falls outside of 3 standard deviation…
An outlier could exist in a dataset due to:
* Variability in the data
* An experimental measurement error
We will use a NumPy array of random integers here.
import numpy as np
# Fixing random state for reproducibility
np.random.seed(1)
# generating some data
spread = np.random.randint(50, size = 100)
print(spread)
[37 43 12 8 9 11 5 15 0 16 1 12 7 45 6 25 20 37 18 20 11 42 28 29
14 4 23 23 41 49 30 32 22 13 41 9 7 22 1 0 17 8 24 13 47 42 8 30
7 3 6 21 49 3 4 24 49 43 12 26 16 45 41 18 15 0 4 25 47 34 23 7
26 25 40 22 9 3 39 23 36 27 37 19 38 8 32 34 10 23 15 47 23 25 7 28
10 46 32 24]
The seed() function initializes the random number generator. The generator needs a starting number (a seed value), and the same seed always yields the same sequence of random numbers, which makes the example reproducible.
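To illustrate what seeding does, here is a minimal sketch that reseeds the generator with the same value and draws twice. The names `first` and `second` are illustrative only and are not used elsewhere in this article.

```python
import numpy as np

# Seeding with the same value restarts the generator from the same state,
# so the two draws below produce identical arrays.
np.random.seed(1)
first = np.random.randint(50, size=5)

np.random.seed(1)
second = np.random.randint(50, size=5)

print(first)
print(second)
print(np.array_equal(first, second))  # True
```

Without the second call to `np.random.seed(1)`, the second draw would continue the sequence and would generally differ from the first.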
Let’s introduce outliers in our dataset:
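import random
# four high outliers in the range 100 to 109
flier_high = np.random.randint(10, size = 4) + 100
# five low outliers in the range -109 to -100
flier_low = (np.random.randint(10, size = 5) + 100) * -1
dataset = np.concatenate((spread, flier_high, flier_low))
# shuffle in place; Python's random module is not seeded here, so the order varies between runs
random.shuffle(dataset)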
import matplotlib.pyplot as plt
fig1, ax1 = plt.subplots()
ax1.set_title('Basic Plot')
ax1.boxplot(dataset)
[Figure: box plot of the dataset, titled "Basic Plot"]
Formula for Z score: z = (Observation - Mean) / Standard Deviation
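As a small worked example of the formula, the sketch below computes z scores for a hypothetical eight-value sample (not the article's dataset) whose mean and standard deviation come out to round numbers. It assumes the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

# Hypothetical sample chosen so the arithmetic is easy to check by hand.
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = np.mean(sample)  # 5.0
std = np.std(sample)    # 2.0 (population standard deviation, ddof=0)

# z score of each observation: distance from the mean in standard deviations
z = (sample - mean) / std
print(z)
```

The value 9 sits two standard deviations above the mean, so its z score is 2.0. With a threshold of 3, nothing in this sample would be flagged.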
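NumPy is already imported above, so no further libraries are needed.
Let’s define a function that takes a dataset of numeric values as an input argument. This function calculates the mean and standard deviation of the dataset, computes the z score of each observation, and returns the observations whose absolute z score exceeds the threshold:
def detect_outliers_zscore(data, threshold=3):
    # accept lists as well as NumPy arrays
    data = np.asarray(data)
    mean = np.mean(data)
    std = np.std(data)
    # z score of each observation
    z = (data - mean) / std
    # keep observations more than `threshold` standard deviations from the mean
    return data[np.abs(z) > threshold]
print(detect_outliers_zscore(dataset))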
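[-100 -108 -103 -106 -104]
Only the five low values are flagged. The four high values (100 to 109) are missed because the outliers themselves inflate the standard deviation, which keeps the z scores of the high values below the threshold of 3.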
The IQR measures how spread out the middle 50% of the values are, so it can be used to tell when a value is too far from the middle. Under this rule, an outlier is a point that falls more than 1.5 times the interquartile range above the third quartile or below the first quartile. We will use the same dataset.
Let’s find the first quartile and third quartile:
q1, q3= np.percentile(dataset,[25,75])
print("First quartile: {}, third quartile: {}".format(q1, q3))
First quartile: 8.0, third quartile: 34.0
Find the IQR, which is the difference between the third and first quartiles:
iqr = q3 - q1
print("IQR is: {}".format(iqr))
IQR is: 26.0
Find the lower and upper bounds
lower_bound = q1 -(1.5 * iqr)
upper_bound = q3 +(1.5 * iqr)
print("The lower bound is {} and the upper bound is {}".format(lower_bound, upper_bound))
The lower bound is -31.0 and the upper bound is 73.0
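Any value outside these bounds is an outlier:
print(dataset[(dataset < lower_bound) | (dataset > upper_bound)])
[-100 101 107 101 -108 -103 -106 107 -104]
This flags all nine injected values, whereas the z score threshold of 3 flagged only the five low ones.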
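The IQR steps above can be wrapped into one reusable function. The following is a minimal sketch, not a definitive implementation: the function name `iqr_outliers`, the parameter `k`, and the small `sample` array are assumptions introduced for illustration.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return the values of `data` outside [q1 - k*iqr, q3 + k*iqr]."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # boolean mask selects values below the lower or above the upper bound
    return data[(data < lower_bound) | (data > upper_bound)]

# Hypothetical sample: q1 = 11, q3 = 21, so the bounds are -4 and 36.
sample = np.array([8, 10, 12, 14, 16, 18, 20, 22, 34, 107, -100])
print(iqr_outliers(sample))  # flags 107 and -100
```

Calling `iqr_outliers(dataset)` on the article's dataset should reproduce the result above. Raising `k` (3 is a common choice for "extreme" outliers) makes the rule stricter.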