In a previous article we discussed the importance of data cleansing before starting any data science project. We described some steps that are common to any project, such as removing duplicates, correcting structural errors, dealing with missing data and handling outliers. In this article we will delve into the different techniques that exist for dealing with outliers.
As we already know, an outlier is a value that “escapes” the normal range of values of the variable under study. Outliers are generally produced by errors in the measurements, or they describe phenomena that do not represent the common functioning of what is being studied. It is important to first perform an analysis of the data before removing or replacing any outliers that may be found.
There are 2 very simple techniques that allow us to detect outliers for univariate cases: using interquartile range (IQR) and z-score. We will briefly explain what each one consists of.
The first method is a non-parametric method that has the interquartile range as its basis. The IQR is simply the difference between the third and the first quartile of the distribution of the data being studied. Recall that, to calculate the quartiles, the data are ordered from lowest to highest and split at the median (Q2): the first quartile, Q1, is the median of the lower half, and the third quartile, Q3, is the median of the upper half.
An outlier is then any value that falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1. The factor of 1.5 is used because it places these limits close to 3σ (assuming our data follow a Gaussian distribution). Strictly speaking, a factor of 1.5 is equivalent to about 2.7σ. A very good explanation of these values can be found here.
Example:
We have a list of numbers
n=(5, 7, 10, 15, 19, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25).
The numbers are already ordered from smallest to largest. We need to find the first and second half of the data set. For that, we calculate the median. There are 19 points, so the median is the 10th value, 23. This separates the set of points into two groups of 9 numbers each.
The first quartile is the median of the first (lower) group of points, Q1 = 19. In the same way, the median of the second (upper) group gives Q3 = 24. Thus, IQR = Q3 – Q1 = 5.
The lower bound is then Q1 – 1.5*IQR = 11.5, while the upper bound is given by Q3 + 1.5*IQR = 31.5.
Looking at the list of n numbers described above, there are only 3 outliers in this list: 5, 7 and 10. Visually, it makes sense. You can see that most of the numbers are in the range of 20 to 25, so 5, 7 and 10 would be a bit “out” of this range:
n=(5, 7, 10, 15, 19, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25).
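The worked example above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the helper name `iqr_bounds` is an assumption made for illustration. The quartiles are computed as the medians of the lower and upper halves, as in the example. Other quartile conventions, such as the default interpolation in NumPy, can give slightly different bounds.

```python
from statistics import median


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier limits using median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    # With an odd number of points the median itself is left out of both halves.
    upper_half = data[half + 1:] if len(data) % 2 else data[half:]
    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


n = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25]
low, high = iqr_bounds(n)
outliers = [x for x in n if x < low or x > high]
print(low, high, outliers)
```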
The second method, z-score, is a parametric method for detecting outliers. It assumes that the data follow a Gaussian distribution. The outliers are the points that lie in the tails of the distribution and, therefore, far from the mean. The z-score of each point is given by:
$$z_i = \frac{x_i - \mu}{\sigma}$$
where $x_i$ is point $i$, and $\mu$ and $\sigma$ are the mean and standard deviation of all points, respectively. An outlier is then any point that satisfies $|z_i| > z_{\text{threshold}}$. Typical values of $z_{\text{threshold}}$ are 2.5, 3 and 3.5. The exact choice will depend on the problem at hand, and how strict you want to be with the data.
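As an illustration, the z-score rule can be sketched with the standard library alone. The function name `zscore_outliers` is hypothetical, and the sketch assumes the population standard deviation and data that are not all identical (otherwise the standard deviation would be zero).

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [x for x in values if abs((x - mu) / sigma) > threshold]


n = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25]
flagged = zscore_outliers(n, threshold=2.5)
print(flagged)
```

On this list, a threshold of 2.5 flags only the value 5, which shows how strongly the result depends on the threshold chosen.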
For multivariate cases, other methods are available. One of them is based on the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. DBSCAN is a very robust method for segmentation, but it can also be used to detect outliers. Unlike K-means, DBSCAN does not force every point into a cluster. Therefore, once a neighborhood distance is set, the points that fall outside every cluster are considered outliers. An example of this can be seen here.
Another method is based on the Isolation Forest algorithm. The idea behind this algorithm is to isolate the points that are outliers. It relies on the fact that a “good” point is harder to separate from the rest of the data than an “outlier”, where the difficulty is measured as the number of splits needed to isolate the point. This number of splits is found as follows: a feature is selected at random, a split value is drawn at random between the minimum and maximum of that feature, and the process is repeated on the side containing the point until the point is isolated.
The figure below shows an example of how the algorithm would work (reference). On the left, there is a point that requires many cuts to be isolated, while on the right, the figure shows a point isolated with far fewer cuts. This example tells us that the point on the right is an outlier (very small number of cuts), while the point on the left is not (many cuts are needed to isolate it).
Another method is known as Local Outlier Factor (LOF). This is an unsupervised algorithm that compares the local density of a given point with that of its nearest neighbors. Points with a substantially lower density than their neighbors are considered outliers.
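The relationship between the number of cuts and "outlierness" can be illustrated with a toy, one-dimensional version of the splitting procedure. This is only a sketch of the idea, not the Isolation Forest implementation: the function names are hypothetical, the data are made up, and the values are assumed to be distinct.

```python
import random


def isolation_depth(data, target, rng):
    """Count the random splits needed until `target` is alone."""
    subset = list(data)
    depth = 0
    while len(subset) > 1:
        split = rng.uniform(min(subset), max(subset))
        # Keep only the side of the split that contains the target.
        if target < split:
            subset = [v for v in subset if v < split]
        else:
            subset = [v for v in subset if v >= split]
        depth += 1
    return depth


def mean_depth(data, target, trials=500, seed=0):
    """Average the isolation depth over many random splitting runs."""
    rng = random.Random(seed)
    total = sum(isolation_depth(data, target, rng) for _ in range(trials))
    return total / trials


data = [20, 21, 22, 23, 24, 25, 100]  # 100 sits far from the rest
depth_inlier = mean_depth(data, 22)
depth_outlier = mean_depth(data, 100)
print(depth_inlier, depth_outlier)
```

On average, the isolated value needs fewer cuts than a value inside the dense group, which is the signal the algorithm exploits.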
The Mahalanobis distance can also be used to detect outliers. It measures the distance between a point and a distribution, rather than between two points, as the Euclidean distance does. It is defined as:
$$D = \sqrt{(x - m)^{T} \, C^{-1} \, (x - m)}$$
where $D$ is the Mahalanobis distance, $x$ is the observation vector (a row in a dataset), $m$ is the vector of the mean values of the independent variables (the mean of each column) and $C^{-1}$ is the inverse of the covariance matrix of the independent variables.
Let us see with an example how these four techniques would be applied. For this, we will use the Boston house value dataset, which shipped with the Python package sklearn (it was removed in scikit-learn 1.2).
With the constructed dataset, we will build a function to calculate the Mahalanobis distance.
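A minimal sketch of such a function, assuming the data are available as a numeric NumPy array with one observation per row. The name `mahalanobis_distances` and the synthetic data (with one planted extreme row) are assumptions for illustration, not the dataset used in this article.

```python
import numpy as np


def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X to the column means."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Row-wise quadratic form: diff_i^T · C^-1 · diff_i
    d_squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(d_squared)


# Synthetic stand-in data: 200 rows, 3 columns, first row made extreme.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0] = [8.0, -8.0, 8.0]

distances = mahalanobis_distances(X)
print(distances.argmax(), distances.max())
```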
To obtain the cutoff distance above which a point is considered an outlier, it is usually assumed that the dataset follows a multivariate normal distribution with K variables, in which case the squared Mahalanobis distance follows a chi-square distribution with K degrees of freedom. We can assume a reasonable significance level (2.5%, 1%, 0.01%). In this case, we choose 0.001%. The corresponding chi-square quantile gives our cutoff value.
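One way to compute such a cutoff is with the chi-square quantile function in SciPy. This is a sketch under the normality assumption stated above; the helper name `mahalanobis_cutoff` and the values of `n_features` and `alpha` are illustrative assumptions, not the ones used in this article.

```python
from scipy.stats import chi2


def mahalanobis_cutoff(n_features, alpha):
    """Distance above which a point is flagged at significance level alpha."""
    # chi2.ppf returns the quantile of the squared distance, so take the root.
    return float(chi2.ppf(1.0 - alpha, df=n_features)) ** 0.5


cutoff = mahalanobis_cutoff(n_features=2, alpha=0.01)
print(cutoff)
```

Points whose Mahalanobis distance exceeds `cutoff` would then be counted as outliers.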
The result of this print is 17 points. This means that from the dataset, 17 points can be considered outliers. Let’s see the case with DBSCAN.
DBSCAN, in turn, takes two main parameters: min_samples, the minimum number of points required in a neighborhood for a point to count as a core point, and eps, the maximum distance between two points for them to be considered neighbors. eps is a very important parameter, because it influences the number of clusters that can be found. The result, clusters, holds a label for each point: values from 0 to n-1 identify the n clusters found, and -1 indicates that the point does not belong to any cluster. These are the outliers. In this case, the print results in 82.
For isolation forest, the variable contamination is the expected proportion of outliers in the data. For this case, we leave it as auto, i.e., the threshold is determined automatically. On the other hand, max_features is the number of features drawn to train each tree of the model.
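A minimal sketch of this step with scikit-learn. It uses synthetic data (two dense groups plus three isolated points) as a stand-in, and the values of `eps` and `min_samples` are assumptions chosen for that toy data, so the count it prints will not match the one reported for the article's dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in data: two dense blobs and three isolated points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
isolated = np.array([[10.0, -10.0], [-10.0, 10.0], [15.0, 15.0]])
X = np.vstack([blob_a, blob_b, isolated])

# Label -1 marks points that were not assigned to any cluster.
clusters = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_outliers = int((clusters == -1).sum())
print(n_outliers)
```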
The algorithm results in 1 for points that are not considered outliers, and -1 for those that are. In this case, 132 outliers are found.
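The call could look like the sketch below, again on synthetic stand-in data with three planted extreme points rather than the article's dataset. `random_state` is added here only to make the run repeatable, and `max_features=1.0` (use all features) is an illustrative choice.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: a Gaussian cloud plus three planted extremes.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0]])
X = np.vstack([inliers, planted])

model = IsolationForest(contamination="auto", max_features=1.0, random_state=0)
labels = model.fit_predict(X)  # 1 = not an outlier, -1 = outlier
n_outliers = int((labels == -1).sum())
print(n_outliers)
```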
Finally, in LOF, the contamination parameter has the same definition as for isolation forest, and n_neighbors is the number of neighbors used to calculate the density.
For this example, the result of the above code yields 51 outliers.
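A comparable sketch for LOF with scikit-learn, on synthetic stand-in data with three planted extreme points. The value `n_neighbors=20` is an assumption for illustration, and the resulting count is specific to this toy data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in data: a Gaussian cloud plus three planted extremes.
rng = np.random.default_rng(0)
cloud = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0]])
X = np.vstack([cloud, planted])

lof = LocalOutlierFactor(n_neighbors=20, contamination="auto")
labels = lof.fit_predict(X)  # 1 = not an outlier, -1 = outlier
n_outliers = int((labels == -1).sum())
print(n_outliers)
```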
As we can see, the results are different for the 4 algorithms. In addition, each algorithm has parameters that can change the result. The default parameters are generally a good starting point. Regardless of the method used, it is important to check the data after each iteration. Sometimes a combination of methods is needed to find and remove outliers. It is always important to understand the data in front of you, and any decision you make (whether to remove outliers or replace them with some other value) must be backed by a business definition. Without a doubt, though, cleaning the data is fundamental to achieving more accurate predictions.