Data cleansing

Introduction

When starting a new data science project, the discussion often begins with which algorithms and/or techniques would be a good fit. Names and concepts such as Deep Learning, CatBoost, XGBoost, K-Means or recommendation algorithms start to be floated as possible solutions to the problem posed.

However, what is not always appreciated is that this is only the final part of the project, and the one that, on average, takes the least time. There is another step that is highly relevant, takes more time, and that many data scientists do not enjoy: data cleansing.

Performing data inspection and cleaning can make the difference between the success and failure of a project. The saying “garbage in, garbage out” is well known in the data science world: an algorithm, no matter how good it is, is trained on the incoming data and will not be able to deliver good results if that data is not good.

It is estimated that data cleaning takes between 60% and 70% of the total time of a project [1]. And it is understandable that it takes so much time: we must be sure that the cleaning is done well, because the resulting data will feed the algorithms we use to solve the problem at hand.

What does “cleaning the data” consist of? In a nutshell, it consists of reviewing and verifying that all the data we have (whether numbers, text or images) make sense. If the data is about height, are the numbers consistent? Are they within an acceptable range? Does someone appear to be 5 m tall? If I am analyzing complaints from Spanish-speaking stores, is it normal for text in another language to appear? If I am studying gestures in images of people, can images of animals appear?
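
As a minimal illustration of this kind of sanity check, a few lines of pandas are enough to flag implausible values. The DataFrame df, the column height_m and the acceptable range below are hypothetical and depend on the problem:

    import pandas as pd

    # Hypothetical data: heights recorded in meters
    df = pd.DataFrame({"height_m": [1.65, 1.80, 5.0, 1.72, -0.3]})

    # Flag values outside a plausible human range (assumed here to be 0.4 m to 2.5 m)
    suspicious = df[(df["height_m"] < 0.4) | (df["height_m"] > 2.5)]
    print(suspicious)  # the 5.0 m and -0.3 m rows should be reviewed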

It is important that the data you are working with is accurate, complete, of good quality, reliable, unbiased and valid.

Data cleansing process

Since every project is different, it is hard to have a checklist that tells us what to do step by step. However, there are basic checks that cut across every project (note: from here on, we assume numerical data and variables containing names).

  • Remove duplicates or irrelevant data: Having duplicate data is more common than you might think. It usually happens at the data collection stage: when there are several sources and you are trying to put the data together, you can end up with duplicates. Therefore, after collecting the data, it is important to review the table you are going to work with and discard repeated rows. Irrelevant data is data that does not influence or impact the problem you are trying to solve. For example, if you are analyzing average income in Asian countries and you have data from European countries, it is best to eliminate the latter. This will make the analysis faster and reduce the chances of errors (see the duplicates sketch after this list).
  • Correcting structural errors: These appear when transferring data or merging different databases, and show up as strange nomenclature, typographical or grammatical errors. For example, “N/A” and “Not applicable” may both appear and should be standardized. Or, for a person, “Brain Food”, “Brainfood” and “Brain-Food” are the same thing, but the computer sees three different names. All of this has to be analyzed carefully to avoid errors (see the standardization sketch after this list).
  • Correcting outliers: This is perhaps one of the most common situations a Data Scientist encounters. An outlier is a value that “escapes” the normal range of the variable being studied. In the height example above, a person who is 5 m tall is clearly an outlier. Here we have two options: eliminate the value (which implies eliminating the entire record), or change it to, for example, the average of the heights. What to do with the outlier will depend on the problem we are working on. One important point: not every value that looks like an outlier actually is one. We may be reviewing supermarket sales and see a sale that is far above the average. Immediately, one might think of deleting the record, but perhaps the right thing to do is to look at the context: did the sale occur in a special period (Christmas, New Year)? Does the supermarket also sell to wholesalers? It is very important, before making any decision, to analyze the problem and the context of the data well (see the outlier sketch after this list).
  • Handling missing data: Another common situation a Data Scientist encounters. Missing data are values that are absent from the tables we are working with. There are several options for dealing with them (see the missing-data sketch after this list), such as, for example:
    • Delete all records that contain missing data in some field: this is a good option as long as the records to be deleted are few. If they represent a significant percentage of the data (more than 15%, for example), it is better to consider another alternative.
    • Replace missing data with values based on other observations: There are several techniques for this (using the k-nearest-neighbors algorithm, predicting the missing data, etc.). The most common is to replace it with the average (e.g., if someone’s height is missing, you can replace it with the average of the heights present in the data) or with the median (in case the average is skewed by extreme values in the data).
  • Validation: Once you are done with the above steps (and any other project-specific ones), it is important to take the time to look at the data and answer questions such as:
    • Does the data make sense?
    • Does the data allow me to find relationships that help solve the problem?
    • Do the data help to confirm or disprove the different hypotheses being put forward?
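
The sketches that follow illustrate these steps with pandas; the DataFrames, column names and thresholds are hypothetical and only meant to show the shape of each operation. First, the duplicates sketch, which removes repeated and irrelevant rows:

    import pandas as pd

    df = pd.DataFrame({
        "country": ["Japan", "Japan", "Germany", "India"],
        "income":  [40000, 40000, 50000, 7000],
    })

    # Drop exact duplicate rows left over from merging several sources
    df = df.drop_duplicates()

    # Keep only the rows relevant to the question (here, income in Asian countries)
    asian_countries = ["Japan", "India", "China"]   # assumed scope of the study
    df = df[df["country"].isin(asian_countries)]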
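
The standardization sketch, which collapses placeholder values and inconsistent spellings (the column name and normalization rules are assumptions):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"company": ["Brain Food", "Brainfood", "Brain-Food",
                                   "N/A", "Not applicable"]})

    # Treat the different "not available" spellings as real missing values
    df["company"] = df["company"].replace(["N/A", "Not applicable"], np.nan)

    # Collapse spelling variants so the computer sees a single name
    df["company"] = (df["company"]
                     .str.lower()
                     .str.replace(r"[\s\-]", "", regex=True)
                     .replace({"brainfood": "Brain Food"}))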
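
The outlier sketch uses the interquartile-range rule, one common heuristic among several; whether a flagged row is deleted, replaced or kept depends on the context discussed above:

    import pandas as pd

    df = pd.DataFrame({"sales": [120, 135, 128, 110, 142, 9800]})

    # Interquartile-range (IQR) rule: values far outside the middle 50% are suspects
    q1, q3 = df["sales"].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = df[(df["sales"] < lower) | (df["sales"] > upper)]
    print(outliers)  # review these against their context before deciding what to do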
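
Finally, the missing-data sketch shows the two options mentioned: dropping incomplete records when they are few, or imputing them from other observations (mean, median, or a KNN-based imputer). The 15% threshold comes from the text; the column is hypothetical:

    import pandas as pd

    df = pd.DataFrame({"height_m": [1.65, None, 1.80, 1.72, None, 1.58]})

    missing_share = df["height_m"].isna().mean()

    if missing_share < 0.15:
        # Few gaps: simply drop the incomplete records
        df = df.dropna(subset=["height_m"])
    else:
        # Otherwise impute: mean, or median if the average is skewed by extreme values
        df["height_m"] = df["height_m"].fillna(df["height_m"].median())

    # For multi-column data, scikit-learn also offers a KNN-based imputer, e.g.:
    # from sklearn.impute import KNNImputer
    # df[:] = KNNImputer(n_neighbors=5).fit_transform(df)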

If the data is not analyzed and cleaned properly, it can lead to false or erroneous conclusions about the problem, which in turn can lead to incorrect business decisions.

Final thoughts

At Brain Food we know the importance of data and, therefore, for each project we carry out a rigorous analysis to verify its quality and its potential to lead us to a solution to the problem. These analyses are done using algorithms in Python or R, supported by Power BI for visualization. Python has libraries that automate certain tasks and help visualize the data, such as pandas-profiling, Sweetviz, D-Tale and AutoViz, while in R we find summarytools, explore and dataMaid, among others. These tools help us dive into the data and find relationships that allow us to find the best solution to the problem.
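
As a rough sketch of how two of these libraries are typically invoked (the input file and report names are placeholders, and pandas-profiling is nowadays distributed as ydata-profiling):

    import pandas as pd
    import sweetviz
    from ydata_profiling import ProfileReport  # formerly pandas-profiling

    df = pd.read_csv("data.csv")               # hypothetical input file

    # One-line automated exploratory report
    ProfileReport(df).to_file("profile_report.html")

    # Comparable automated EDA report with Sweetviz
    sweetviz.analyze(df).show_html("sweetviz_report.html")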

Bibliography

[1] Gil Press, “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says,” Forbes, March 23, 2016. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=35bd98ab6f63