Quick Guide to Data Cleaning with Examples

Maria Chojnowska

5 April 2023, 10 min read

thumbnail post

Data cleaning is a critical part of every data science project. There’s no doubt about it. You should know that preparing data for proper analysis is often time-consuming and challenging. Data scientists tend to spend a lot of time at this level of data exploration. Moreover, this step sometimes requires domain knowledge. Here’s what usually happens Data scientists often get data from different sources. The data in question might be inconsistent, dirty, or missing. That’s why data cleaning is so important. This is where data scientists decide how to deal with such problems and preserve as much valuable data as possible. In this article, I present a few handy data-cleaning techniques every data scientist needs to know.

What's inside

  1. Let’s start with data cleaning
  2. Merging data from two CSV files
  3. Making data consistent
  4. Deleting columns
  5. Missing values
  6. Mapping values
  7. Outliers
  8. Scaling
  9. Normalization
  10. Summarize
  11. Contact us

Let’s start with data cleaning

First things first. What is the definition of data cleaning?

It is identifying and correcting or removing errors, inconsistencies, and inaccuracies in data to improve its quality and usefulness. Data cleaning aims to ensure that data is accurate, complete, consistent, and formatted correctly for analysis.

The data I will use in this article comes from a travel company that offers trips from London to New Zealand and Australia.

We get two tables:

  • traveler (users)
  • travel (travel records)

First of all, we need to import Pandas.

import pandas as pd

If you’d like to follow this tutorial, prepare this data:

# preparing data 
traveler = pd.DataFrame({
  'user_id': [136, 284, 101, 529, 800, 823], 
  'age': [None, 38, 30, 43, 49, 28], 
  'name': ["Ann", "Ben", "Tom", "Bianca", "Caroline", "Kate"]
}) 
travel = pd.DataFrame({
  'user_id': [101, 284, 136, 800, 101, 800, 823, 529, 284], 
  'date_of_journey': [
    '2018-01-16', 
    '2017-07-13',
    '2019-10-10',
    '2018/03/20', 
    '2019-12-24', 
    '2017-10-17',
    '2016/11/02', 
    '2019/09/14', 
    '2019-08-07'
  ],
  'duration': [10, 10,7,13,7,11,14, 8, 12], 
  'destination': [
    "New Zeland", 
    "australia",
    "Australia", 
    "New_Zealand", 
    "Australia/",
    "Australia", 
    "New Zealand", 
    "Australia", 
    "New_zealand"
  ], 
  'cost': [None, 2325.0, 1760.0, 2740.0, 4000.0, 2475.0, 3140.0, 1840.0, 2910.0], 
  'currency': [None, 'EUR', 'GBP', 'GBP', 'GBP','EUR', 'GBP', 'GBP', 'GBP']
})
traveler.to_csv("traveler.csv", index=False) 
travel.to_csv("travel.csv", index=False)

Now that we’ve got our data, it’s time to do some data cleaning.

Merging data from two CSV files

Our first step is merging data from two CSV files. The data we need is separated into two tables, so we need to combine them using the user_id column in both files. Here’s how you merge this data in Pandas.

traveler = pd.read_csv("traveler.csv") 
travel = pd.read_csv("travel.csv")
data = pd.merge(travel, traveler, on='user_id')

Making data consistent

Date

To create visualizations of data in time, we need to transform it to get the same data format within the column. In our case, one column includes several date formats, so we need to use infer_datetime_format and allow Pandas to try to infer the proper format. Note that the library might not work properly if a user is too creative with the date format.

data.date_of_journey = pd.to_datetime(data.date_of_journey, infer_datetime_format=True)

Cost

When we analyze the cost and currency columns, we easily notice that costs are given in two currencies. We want our data to be consistent, so changing the cost values to (for example) GBP is a good idea.

We can do that by using Pandas slicing. First, we need to choose only the cells where the cost value is given in the currency 'EUR' and update these cells (multiplying by 0.8). In the second step, we change all the "EUR" to "GBP" in the currency column.

data.loc[
  data.currency == 'EUR', 
  ['cost']
] = data.cost \* 0.8 
data.currency.replace("EUR", "GBP", inplace=True)

Destination

As we can see, the destination column contains more unique values than expected. It should include two values, which is why we need to fix it by changing every string respectively to new_zealand or to australia. In the second cell, all categories occur in the destination column.

data.destination.describe()
# We need to get all compromised categories and try to unify them 
categories = data.destination.drop_duplicates() 
print(categories)
data.destination.loc[data.destination.str.lower().str.startswith('n', na=False)] = "new_zealand" 
data.destination.loc[data.destination.str.lower().str.startswith('a', na=False)] = "australia"

Deleting columns

If there are columns that we suppose could be dismissed from analysis, we can drop them at this point. In our example, we will drop currency since it’s the same for all records. We will also delete the name column because we don’t need this information for our analysis.

data = data.drop(['currency', 'name'], axis=1)

Missing values

We can check whether there are any missing values data.isna() in the first cell.

If we spot missing values, we can perform one of the two following actions:

  1. If we can't figure out how to fill the missing values in a way that isn’t going to be misleading, we can delete the rows that include incomplete data by using data.dropna().

  2. If we have enough knowledge to fill in the missing values, we should do it. The idea here is to preserve as much data/rows as possible. We can use many strategies to do that, but domain knowledge will be required for this task. For example, in the column that contains only positive values, we can fill the empty values with (-1) to highlight its difference. Another solution is using some arbitrarily chosen value or calculated values like mean, max, min value.

data.isna()

In our case, we’re going to fill the missing values with:

  • cost with mean travel costs for 'new_zealand'
  • age with a mean value calculated from age column
values = {
  'cost': data.cost.where(data.destination=='new_zealand').mean(), 
  'age': round(data.age.mean())
} 
print('values:', values) 
data = data.fillna(value=values, limit=1)

Mapping values

Now it’s time to map the destination column values to {'new_zealand': 0, 'australia': 1}, which is more readable for a machine than strings.

data.destination.replace(('new_zealand', 'australia'), (0,1), inplace=True)

Outliers

Outliers are observations that are significantly different from other observations, sometimes called abnormal observations or anomalies. The easiest way to notice outliers is by preparing plots. To do that, we will use scatter plot and boxplot.

In both plots, we can see that one observation differs notably from the others (marked with red). Using our business knowledge, we can decide whether such an observation is an error, or maybe we can explain it somehow and, ultimately, it’s valuable for us. For example, when analyzing health data, outliers can indicate anomalies that might be caused by disease.

In our situation, let's say that this anomaly is caused by the dates when the user was traveling (from Christmas to New Year's Eve), which probably caused the trip cost to skyrocket. We can delete it because such a situation is abnormal and misleading.

We can also use clustering, standard deviation analysis, and other strategies to detect outliers.

import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib.patches import Rectangle 

plt.rcParams['figure.figsize'] = [10, 7]

x = data.duration y = data.cost z = data.destination

fig, ax = plt.subplots()

scatter = ax.scatter(x, y, c=z, s=np.pi\*100, marker='o', alpha=0.5)

# produce a legend with the unique colors from the scatter 
legend = ax.legend(\*scatter.legend_elements(), loc=1, title="Destination") 
ax.add_artist(legend) 
legend.get_texts()[0].set_text('new_zealand') 
legend.get_texts()[1].set_text('australia')

plt.title('Travel costs vs travel duration for Australia and New Zealand') 
plt.xlabel('travel duration') 
plt.ylabel('travel costs') 
plt.show()

boxplot = data.boxplot(column=['cost'], by=['destination']) plt.xticks([1, 2], ['new_zealand', 'australia'])

data = data.drop(data.index[data.cost == data.cost.max()], axis=0)

Scaling

Data scaling, also known as feature scaling (or sometimes mistakenly called a normalization - more about that below!), is a process of transforming numerical data into a standardized format that can be more easily compared and analyzed.

In data scaling, the original numerical values of features in a dataset are transformed to fall within a specific range or have a particular distribution. Data scaling aims to ensure that features are on a similar scale, which can improve the performance of machine learning algorithms, especially those that rely on distance or magnitude calculations.

Data scaling is important in many data analysis and machine learning workflows. By scaling data, you can make it easier to compare different features, identify patterns, and build more accurate models.

Sometimes scaling data can be useful or necessary because many analytics methods are sensitive to the data scale. But what does that mean? Scaling refers to changing the range of features. For example, scaling transforms data to fit a specific range like [0-1]. We can use several scaling methods here. Our pick is the Min-Max algorithm, which uses the min and max values for calculation.

We will scale the cost column.

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler() 
scaler.fit(data[['cost']]) 
print('Scaled cost value: \\n', scaler.transform(data[['cost']]))

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6)) 
ax = plt.subplot(121) 
plt.scatter(data.duration, data.cost, s=np.pi\*100, marker='o', alpha=0.5) 
plt.title('Cost data') 
plt.xlabel('duration', multialignment='center') 
plt.ylabel('cost', multialignment='center') 
plt.grid(True)

plt.subplot(122) 
plt.scatter(data.duration, scaler.transform(data[['cost']]), 
s=np.pi\*100, marker='o', alpha=0.5) 
plt.title('Scaled cost data') 
plt.xlabel('duration', multialignment='center') 
plt.ylabel('cost', multialignment='center') 
plt.grid(True)

plt.show()

Conclusion: The scatterplots above show that the points are in the same locations, but their values have changed.

Normalization

Remember that scaling isn’t the same thing as normalization, even if sometimes these terms are confused.

Data normalization is the process of organizing data in a database to reduce redundancy and dependency while improving data integrity and consistency. Normalization aims to make the database more efficient, easier to use, and less prone to errors or inconsistencies.

Normalization involves breaking down a database into smaller, more manageable tables and establishing relationships between them. This is typically done by dividing the data into smaller units or "atomic" values and grouping related data into tables. Each table is given a primary key, uniquely identifying each row of data, and data repeated across tables is stored only once.

Normalization is a more radical process than scaling. It changes the shape of the data distribution to the bell curve ( normal distribution).

Summarize

I hope this article helps you take your first steps in data cleaning and building the foundation for further analysis. To sum it up for you:

  1. Define your goals: Before you start cleaning your data, define your goals and what you hope to achieve with your data analysis. This will help you prioritize your cleaning tasks and ensure that your data is fit for purpose.

  2. Collect your data: Collect all the relevant data you need to achieve your goals. This may involve gathering data from multiple sources or systems.

  3. Assess your data: Assess the quality of your data by checking for missing values, duplicate records, and inconsistencies. Identify any patterns or outliers that may affect your analysis.

  4. Clean your data: Clean your data by removing or correcting errors, filling in missing values, and standardizing formats. You may also need to restructure your data or combine datasets to make them more usable.

  5. Validate your data: Validate your data to ensure that it is accurate and consistent. This may involve cross-checking your data against external sources or running quality checks.

  6. Document your data cleaning process: Include any changes made in case you need to refer back to it later or share it with others.

  7. Test your data: Test your data to ensure it is fit for purpose and meets your goals. This may involve running some initial data analysis or building a prototype model.

  8. Iterate: Data cleaning is an iterative process, and you may need to repeat some or all of the steps above as you refine your data analysis.

In addition to these steps, it's also important to consider data normalization, which involves standardizing data values and units of measurement to eliminate inconsistencies and make it easier to compare different data points.

Normalization can help ensure that your data is accurate, meaningful and can facilitate more accurate analysis and interpretation.

By following these steps, you can ensure that your data is clean, accurate, and fit for purpose and that your data analysis is based on a solid foundation.

Contact us

Are you looking for more information about data cleaning and more data-related topics? Follow our company blog, where our experts share their knowledge about data science with our community, contact us through the form or at hello@sunscrapers.com.

Tags

python
business intelligence

Share

Recent posts