Quick guide to data cleaning with examples


Ewelina Szmielew

5 December 2019, 7 min read


Data cleaning is a critical part of every data science project, there’s no doubt about it. But here’s something you should know: preparing data for proper analysis is often time-consuming and challenging. Data scientists tend to spend a lot of time on this stage of data exploration, and it sometimes requires domain knowledge as well.

Here’s what usually happens:

Data scientists often get data from different sources. The data in question might be inconsistent, dirty, or missing. That’s why data cleaning is so important. This is where data scientists make decisions about how to deal with such problems and preserve as much valuable data as possible.

In this article, I present a few handy data cleaning techniques every data scientist needs to know.

Let’s get started with data cleaning

The data I’m going to use in this article comes from a travel company that offers trips from London to New Zealand and Australia.

We get two tables:

  • traveler (users)
  • travel (travel records)

First of all, we need to import Pandas.

import pandas as pd

If you’d like to follow this tutorial, prepare this data:

# preparing data
traveler = pd.DataFrame({
    'user_id': [136, 284, 101, 529, 800, 823],
    'age': [None, 38, 30, 43, 49, 28],
    'name': ["Ann", "Ben", "Tom", "Bianca", "Caroline", "Kate"]})

travel = pd.DataFrame({
    'user_id': [101, 284, 136, 800, 101, 800, 823, 529, 284],
    'date_of_journey': ['2018-01-16', '2017-07-13', '2019-10-10', '2018/03/20',
                        '2019-12-24', '2017-10-17', '2016/11/02', '2019/09/14', '2019-08-07'],
    'duration': [10, 10, 7, 13, 7, 11, 14, 8, 12],
    'destination': ["New Zeland", "australia", "Australia", "New_Zealand", "Australia/",
                    "Australia", "New Zealand", "Australia", "New_zealand"],
    'cost': [None, 2325.0, 1760.0, 2740.0, 4000.0, 2475.0, 3140.0, 1840.0, 2910.0],
    'currency': [None, 'EUR', 'GBP', 'GBP', 'GBP', 'EUR', 'GBP', 'GBP', 'GBP']})

traveler.to_csv("traveler.csv", index=False)
travel.to_csv("travel.csv", index=False)

Now that we’ve got our data, it’s time to do some data cleaning.

Merging data from two CSV files

Our first step is merging data from two CSV files. The data we need are separated into two tables, so we need to combine them using the user_id column located in both files. Here’s how you merge this data in Pandas.

traveler = pd.read_csv("traveler.csv")
travel = pd.read_csv("travel.csv")

data = pd.merge(travel, traveler, on='user_id')
data
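By default, pd.merge performs an inner join, so a traveler with no matching travel record silently disappears from the result. One possible sanity check, sketched here on a small made-up pair of tables, is an outer merge with indicator=True:

```python
import pandas as pd

# Tiny made-up tables for illustration: Bianca (529) has no trips recorded
traveler = pd.DataFrame({'user_id': [136, 529], 'name': ['Ann', 'Bianca']})
travel = pd.DataFrame({'user_id': [136, 136], 'duration': [10, 7]})

# An outer merge keeps unmatched rows, and indicator=True adds a _merge
# column telling us which side each row came from
check = pd.merge(travel, traveler, on='user_id', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
```

If unmatched turns out to be empty, the inner join loses nothing.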

Making data consistent


To create visualizations of data over time, we need to transform the date column so that all of its values share the same format. In our case, the column mixes several date formats, so we use infer_datetime_format and let Pandas try to infer the proper format. Note that parsing might fail if a user was too creative with the date format.

data.date_of_journey = pd.to_datetime(data.date_of_journey, infer_datetime_format=True)
data
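In newer versions of Pandas, infer_datetime_format is deprecated and strict parsing is the default, so if you know which formats occur in the column it can be safer to normalise them yourself and parse with one explicit format. A minimal sketch (the sample dates are made up):

```python
import pandas as pd

dates = pd.Series(['2018-01-16', '2018/03/20'])

# Normalise the separator first, then parse with a single explicit format;
# errors='coerce' turns anything still unparseable into NaT instead of raising
parsed = pd.to_datetime(dates.str.replace('/', '-'), format='%Y-%m-%d', errors='coerce')
```

Rows where parsed is NaT can then be inspected by hand.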


When we analyze the cost and currency columns, we easily notice that costs are given in two currencies. We want our data to be consistent, so it’s a good idea to change the cost values to (for example) GBP.

We can do that by using Pandas slicing. First, we select only the cells where the cost is given in the currency 'EUR' and update them (multiplying by 0.8). In the second step, we change every "EUR" to "GBP" in the currency column.

data.loc[data.currency == 'EUR', ['cost']] = data.cost * 0.8
data.currency.replace("EUR", "GBP", inplace=True)
data
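With more than two currencies, one .loc assignment per currency gets tedious. A lookup table of conversion rates handles the general case; the rates below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'cost': [100.0, 200.0, 300.0],
                   'currency': ['EUR', 'GBP', 'USD']})

# Hypothetical conversion rates into GBP; replace with real ones
rates = {'EUR': 0.8, 'GBP': 1.0, 'USD': 0.75}

# map() looks up each row's rate, so every currency converts in one pass
df['cost'] = df['cost'] * df['currency'].map(rates)
df['currency'] = 'GBP'
```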


As we can see, the destination column contains more unique values than expected. It should include only two values, so we need to fix that by mapping every string to either new_zealand or australia. The first cell below lists all the categories that occur in the destination column; the second cell unifies them.

# We need to get all inconsistent categories and try to unify them
categories = data.destination.drop_duplicates()
print(categories)

data.loc[data.destination.str.lower().str.startswith('n', na=False), 'destination'] = "new_zealand"
data.loc[data.destination.str.lower().str.startswith('a', na=False), 'destination'] = "australia"
data
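A slightly more compact variant of the same first-letter trick is to normalise the strings and map them in one chained expression; the sample values below mirror the dirty categories in our data:

```python
import pandas as pd

dest = pd.Series(['New Zeland', 'australia', 'New_Zealand', 'Australia/'])

# Lower-case, strip stray slashes, then map by the first letter,
# just like the startswith approach above
cleaned = dest.str.lower().str.strip('/').str[0].map({'n': 'new_zealand', 'a': 'australia'})
```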

Deleting columns

If there are columns that we suppose could be dismissed from analysis, we can drop them at this point. In our example, we will drop currency since it’s the same for all records. We will also delete the name column because we don’t need this information for our analysis.

data = data.drop(['currency', 'name'], axis=1)
data

Missing values

We can check whether there are any missing values with data.isna() in the first cell.

If we spot missing values, we can perform one of the two following actions:

  1. If we can't figure out how to fill the missing values in a way that isn’t going to be misleading, we can delete the rows that include incomplete data by using data.dropna().
  2. If we have enough knowledge to fill in the missing values, we should definitely do it. The idea is to preserve as many rows as we can. There are many strategies for this, but they require domain knowledge as well. For example, in a column that contains only positive values we can fill the gaps with (-1) to mark them as clearly different. Another option is to use an arbitrarily chosen value or a calculated one, such as the mean, max, or min.

In our case, we’re going to fill the missing values with:

  • cost with mean travel costs for 'new_zealand'
  • age with mean value calculated from age column
values = {'cost': data.cost.where(data.destination == 'new_zealand').mean(),
          'age': round(data.age.mean())}
print('values:', values)
data = data.fillna(value=values, limit=1)
data
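A natural refinement is to fill each missing cost with the mean of its own destination rather than a single constant; groupby plus transform does this in one line. A small made-up frame shows the idea:

```python
import pandas as pd

df = pd.DataFrame({'destination': ['nz', 'nz', 'au', 'au'],
                   'cost': [1000.0, None, 2000.0, 2400.0]})

# transform('mean') returns a per-row series of group means,
# so fillna fills each gap with the mean of its own destination
df['cost'] = df['cost'].fillna(df.groupby('destination')['cost'].transform('mean'))
```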

Mapping values

Now it’s time to map the destination column values to {'new_zealand': 0, 'australia': 1} which is more readable for a machine than strings.

data.destination.replace(('new_zealand', 'australia'), (0, 1), inplace=True)
data
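When there are many categories, writing the mapping by hand becomes error-prone; Pandas can assign the integer codes itself via the category dtype. Note that the codes follow alphabetical order, so they may differ from a hand-written mapping:

```python
import pandas as pd

dest = pd.Series(['new_zealand', 'australia', 'australia'])

# cat.codes numbers the categories alphabetically:
# here 'australia' becomes 0 and 'new_zealand' becomes 1
codes = dest.astype('category').cat.codes
```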


Outliers

Outliers are observations that differ significantly from the other observations; they are sometimes called abnormal observations or anomalies. The easiest way to spot outliers is by preparing plots. To do that, we will use a scatter plot and a boxplot.

In both of the plots, we can see that one observation differs notably from the others (marked with red). Using our business knowledge, we can decide whether such an observation is an error or maybe we can explain it somehow and, ultimately, it’s valuable for us. For example, when analyzing health data, outliers can indicate anomalies that might be caused by disease.

In our situation, let's say that this anomaly is caused by the dates when the user was traveling (from Christmas to New Year's Eve), which probably caused the trip cost to skyrocket. We can decide to delete it because such a situation is abnormal and can be misleading.

To detect outliers, we can also use clustering, analyse standard deviation, and many other strategies.
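One simple non-visual strategy from that list is the classic 1.5 * IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as a potential outlier. A sketch on made-up numbers:

```python
import pandas as pd

values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Flag everything outside the 1.5*IQR fences as a potential outlier
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```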

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 7]

x = data.duration
y = data.cost
z = data.destination

fig, ax = plt.subplots()

scatter = ax.scatter(x, y, c=z, s=np.pi*100, marker='o', alpha=0.5)

# produce a legend with the unique colors from the scatter
legend = ax.legend(*scatter.legend_elements(), loc=1, title="Destination")
ax.add_artist(legend)
legend.get_texts()[0].set_text('new_zealand')
legend.get_texts()[1].set_text('australia')

plt.title('Travel costs vs travel duration for Australia and New Zealand')
plt.xlabel('travel duration')
plt.ylabel('travel costs')
plt.show()
boxplot = data.boxplot(column=['cost'], by=['destination'])
plt.xticks([1, 2], ['new_zealand', 'australia']);
data = data.drop(data.index[data.cost == data.cost.max()], axis=0)
data


Scaling data

Sometimes scaling data is useful or even necessary, because many analytics methods are sensitive to the scale of the data. But what does that mean? Scaling refers to changing the range of features, for example transforming data to fit a specific range like [0, 1].

We can use several scaling methods here. Our pick is the Min-Max algorithm, which uses the min and max value for calculation.
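Under the hood, Min-Max scaling is a one-line formula: (x - min) / (max - min), which maps the smallest value to 0 and the largest to 1. A sketch with made-up numbers, without sklearn:

```python
import pandas as pd

cost = pd.Series([1000.0, 2000.0, 4000.0])

# Min-Max scaling by hand: the smallest value maps to 0, the largest to 1
scaled = (cost - cost.min()) / (cost.max() - cost.min())
```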

We will try to scale the cost column.

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaler.fit(data[['cost']])
print('Scaled cost value: \n', scaler.transform(data[['cost']]))
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
ax = plt.subplot(121)
plt.scatter(data.duration, data.cost, s=np.pi*100, marker='o', alpha=0.5)
plt.title('Cost data')
plt.xlabel('duration', multialignment='center')
plt.ylabel('cost', multialignment='center')
plt.grid(True)

plt.subplot(122)
plt.scatter(data.duration, scaler.transform(data[['cost']]), s=np.pi*100, marker='o', alpha=0.5)
plt.title('Scaled cost data')
plt.xlabel('duration', multialignment='center')
plt.ylabel('cost', multialignment='center')
plt.grid(True)


CONCLUSION: On the scatterplots above, we can see that the points keep the same relative positions, but their values have changed to the new range.


Remember that scaling isn’t the same thing as normalization, even though the two terms are often confused. Normalization is a more radical process that changes the shape of the data distribution toward the bell curve (normal distribution).

I hope this article helps you take your first steps in data cleaning and building the foundation for further analysis.

