Data cleaning is a critical part of every data science project; there's no doubt about it. But here's something you should know: preparing data for proper analysis is often time-consuming and challenging, and data scientists tend to spend a lot of time at this stage of data exploration. Moreover, this step sometimes requires domain knowledge.

Here’s what usually happens:

Data scientists often get data from different sources. The data in question might be inconsistent, dirty, or missing. That’s why data cleaning is so important. This is where data scientists make decisions about how to deal with such problems and preserve as much valuable data as possible.

In this article, I present a few handy data cleaning techniques every data scientist needs to know. 

Let’s get started with data cleaning

The data I’m going to use in this article comes from a travel company that offers trips from London to New Zealand and Australia. 

We get two tables:

  • traveler (users)
  • travel (travel records)

First of all, we need to import Pandas.

import pandas as pd

If you’d like to follow this tutorial, prepare this data:

# preparing data
traveler = pd.DataFrame({
    'user_id': [136, 284, 101, 529, 800, 823],
    'age': [None, 38, 30, 43, 49, 28],
    'name': ["Ann", "Ben", "Tom", "Bianca", "Caroline", "Kate"]
})

travel = pd.DataFrame({
    'user_id': [101, 284, 136, 800, 101, 800, 823, 529, 284],
    'date_of_journey': ['2018-01-16', '2017-07-13', '2019-10-10', '2018/03/20',
                        '2019-12-24', '2017-10-17', '2016/11/02', '2019/09/14', '2019-08-07'],
    'duration': [10, 10, 7, 13, 7, 11, 14, 8, 12],
    'destination': ["New Zeland", "australia", "Australia", "New_Zealand", "Australia/",
                    "Australia", "New Zealand", "Australia", "New_zealand"],
    'cost': [None, 2325.0, 1760.0, 2740.0, 4000.0, 2475.0, 3140.0, 1840.0, 2910.0],
    'currency': [None, 'EUR', 'GBP', 'GBP', 'GBP', 'EUR', 'GBP', 'GBP', 'GBP']
})

traveler.to_csv("traveler.csv", index=False)
travel.to_csv("travel.csv", index=False)

Now that we’ve got our data, it’s time to do some data cleaning. 

1. Merging data from two CSV files

Our first step is merging data from two CSV files. The data we need is split across two tables, so we need to combine them using the user_id column present in both files. Here's how you merge this data in Pandas.

traveler = pd.read_csv("traveler.csv")
travel = pd.read_csv("travel.csv")

data = pd.merge(travel, traveler, on='user_id')
data
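
By default, pd.merge performs an inner join, so only user_ids that appear in both tables end up in the result. If we wanted to keep travelers who have no trips recorded, one option (a small sketch, not part of the original workflow) is a left join with the traveler table on the left:

# keep every traveler; those without matching travel records get NaN in the travel columns
data_all = pd.merge(traveler, travel, on='user_id', how='left')
data_all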

2. Making data consistent

2a. Date

To create visualizations of the data over time, we need to transform the date_of_journey column so that it uses a single, consistent date format. In our case, the column mixes several date formats, so we use infer_datetime_format and let Pandas try to infer the proper format. Note that this might not work properly if a user was too creative with the date format.

data.date_of_journey = pd.to_datetime(data.date_of_journey, infer_datetime_format=True)
data
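
If some entries are too messy for Pandas to infer, a possible fallback (a sketch, not used later in this article) is to pass errors='coerce', which turns unparseable dates into NaT so we can inspect them or treat them as missing values:

# unparseable dates become NaT instead of raising an error
data.date_of_journey = pd.to_datetime(data.date_of_journey, errors='coerce')
data[data.date_of_journey.isna()]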

2b. Cost

When we analyze the cost and currency columns, we easily notice that costs are given in two currencies. We want our data to be consistent, so it's a good idea to convert all the cost values to a single currency, for example GBP.

We can do that by using Pandas slicing. First, we select only the rows where the currency is 'EUR' and update their cost (multiplying by an exchange rate of 0.8). In the second step, we change all the "EUR" values to "GBP" in the currency column.

# convert EUR costs to GBP (rate 0.8), then relabel the currency
data.loc[data.currency == 'EUR', 'cost'] = data.cost * 0.8
data['currency'] = data['currency'].replace("EUR", "GBP")
data
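
If the data contained more than two currencies, a slightly more general alternative to the two lines above (a sketch, with conversion rates assumed purely for illustration) is to map each currency to a rate and multiply:

# assumed conversion rates to GBP, for illustration only
rates = {'GBP': 1.0, 'EUR': 0.8}
data['cost'] = data['cost'] * data['currency'].map(rates)
data['currency'] = 'GBP'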

2c. Destination

As we can see, the destination column contains more unique values than expected. It should only include two values, so we need to fix that by mapping every string to either new_zealand or australia. In the second cell, we list all the categories that occur in the destination column.

data.destination.describe()
# list all the messy categories so we can unify them
categories = data.destination.drop_duplicates()
print(categories)
data.loc[data.destination.str.lower().str.startswith('n', na=False), 'destination'] = "new_zealand"
data.loc[data.destination.str.lower().str.startswith('a', na=False), 'destination'] = "australia"
data
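
An alternative to the startswith trick is an explicit mapping of every messy spelling we printed in the previous step. It's more typing, but it makes every correction visible; the dictionary below covers exactly the spellings present in our sample data:

# explicit clean-up map built from the categories printed above
fixes = {"New Zeland": "new_zealand", "New Zealand": "new_zealand",
         "New_Zealand": "new_zealand", "New_zealand": "new_zealand",
         "australia": "australia", "Australia": "australia", "Australia/": "australia"}
data['destination'] = data['destination'].replace(fixes)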

3. Deleting columns

If there are columns that we know we won't need in the analysis, we can drop them at this point. In our example, we will drop currency, since after the conversion it's the same for all records. We will also delete the name column because we don't need this information for our analysis.

data = data.drop(['currency', 'name'], axis=1)
data

4. Missing values

We can check whether there are any missing values by calling data.isna() in the first cell.

If we spot missing values, we can perform one of the two following actions:

  1. If we can’t figure out how to fill the missing values in a way that isn’t going to be misleading, we can delete the rows that include incomplete data by using data.dropna().
  2. If we have enough knowledge to fill in the missing values, we should definitely do it. The idea here is to preserve as many rows as we can. There are many strategies we can use, but domain knowledge will be required for this task as well. For example, in a column that contains only positive values, we can fill the empty cells with -1 to make them stand out. Another solution is to use some arbitrarily chosen value, or a calculated value such as the mean, max, or min.
data.isna()

In our case, we’re going to fill the missing values with:

  • cost: the mean travel cost for ‘new_zealand’
  • age: the mean value of the age column
values = {'cost': data.cost.where(data.destination == 'new_zealand').mean(),
          'age': round(data.age.mean())}
print('values:', values)
data = data.fillna(value=values, limit=1)
data
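
A slightly more general alternative to the dictionary above (a sketch that gives the same result on our data) is to fill each missing cost with the mean cost of that row's destination, computed with groupby and transform:

# fill each missing cost with the mean cost of that row's destination
data['cost'] = data['cost'].fillna(data.groupby('destination')['cost'].transform('mean'))
# fill the missing age with the (rounded) overall mean age
data['age'] = data['age'].fillna(round(data['age'].mean()))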

5. Mapping values

Now it’s time to map the destination column values with {‘new_zealand’: 0, ‘australia’: 1}, since numeric labels are easier for a machine to process than strings.

data['destination'] = data['destination'].replace({'new_zealand': 0, 'australia': 1})
data

6. Outliers

Outliers are observations that differ significantly from the other observations; they are sometimes called abnormal observations or anomalies. The easiest way to notice outliers is by preparing plots. To do that, we will use a scatter plot and a boxplot.

In both of the plots, we can see that one observation differs notably from the others. Using our business knowledge, we can decide whether such an observation is an error, or whether we can explain it somehow and it's ultimately valuable for us. For example, when analyzing health data, outliers can indicate anomalies that might be caused by disease.

In our situation, let’s say that this anomaly is caused by the dates when the user was traveling (from Christmas to New Year's Eve), which probably caused the trip cost to skyrocket. We can decide to delete it because such a situation is unusual and can be misleading.

To detect outliers, we can also use clustering, analyze the standard deviation, and apply many other strategies.
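
For example, here's a quick sketch (not used in the rest of the article) of the interquartile range (IQR) rule applied to the cost column. Note that on a dataset as small as ours, statistical thresholds like this may not flag the point we spot visually, which is exactly why plots and business knowledge matter here:

# flag cost values more than 1.5 * IQR outside the quartiles
q1, q3 = data.cost.quantile(0.25), data.cost.quantile(0.75)
iqr = q3 - q1
outliers = data[(data.cost < q1 - 1.5 * iqr) | (data.cost > q3 + 1.5 * iqr)]
print(outliers)

Now, here are the plots mentioned above: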

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
plt.rcParams['figure.figsize'] = [10, 7]

x = data.duration
y = data.cost
z = data.destination

fig, ax = plt.subplots()

scatter = ax.scatter(x, y, c=z, s=np.pi*100, marker='o',  alpha=0.5)

# produce a legend with the unique colors from the scatter
legend = ax.legend(*scatter.legend_elements(),
                    loc=1, title="Destination")
ax.add_artist(legend)
legend.get_texts()[0].set_text('new_zealand')
legend.get_texts()[1].set_text('australia')

plt.title('Travel costs vs travel duration for Australia and New Zealand')
plt.xlabel('travel duration')
plt.ylabel('travel costs')
plt.show()
boxplot = data.boxplot(column=['cost'], by=['destination'])
plt.xticks([1, 2], ['new_zealand', 'australia']);
data = data.drop(data.index[data.cost == data.cost.max()], axis=0)
data

7. Scaling

Sometimes scaling data can be useful or even necessary, because many analytics methods are sensitive to the scale of the data. But what does that mean? Scaling refers to changing the range of features, for example transforming the data so that it fits a specific range such as [0, 1].

We can use several scaling methods here. Our pick is the Min-Max algorithm, which uses the minimum and maximum values of a feature for the calculation.
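
As a quick sketch of the arithmetic, Min-Max scaling maps each value x to (x - x_min) / (x_max - x_min), so the smallest cost in the column becomes 0 and the largest becomes 1.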

We will try to scale the cost column.

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaler.fit(data[['cost']])
print('Scaled cost value: \n', scaler.transform(data[['cost']]))
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
ax = plt.subplot(121)
plt.scatter(data.duration, data.cost, s=np.pi*100, marker='o',  alpha=0.5)
plt.title('Cost data')
plt.xlabel('duration', multialignment='center')
plt.ylabel('cost', multialignment='center')
plt.grid(True)

plt.subplot(122)
plt.scatter(data.duration, scaler.transform(data[['cost']]), s=np.pi*100, marker='o', alpha=0.5)
plt.title('Scaled cost data')
plt.xlabel('duration', multialignment='center')
plt.ylabel('cost', multialignment='center')
plt.grid(True)

plt.show()

CONCLUSION: On the scatterplots above, we can see that the points form the same pattern, but the cost values have been rescaled to the [0, 1] range.

Normalization

Remember that scaling isn’t the same thing as normalization, even though the two terms are sometimes confused with one another. Normalization is a more radical process that changes the shape of the data distribution so that it resembles the bell curve (normal distribution).
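
As a rough sketch of what normalization can look like in practice (not part of the original workflow), scikit-learn's PowerTransformer applies a power transform (Yeo-Johnson by default) that pushes a feature's distribution towards a Gaussian shape:

from sklearn.preprocessing import PowerTransformer

# transform the cost column towards a normal (bell-curve) distribution
pt = PowerTransformer()
print(pt.fit_transform(data[['cost']]))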

I hope this article helps you take your first steps in data cleaning and building the foundation for further analysis. 

If you’d like to learn more about data science, be sure to check the data science category on our blog – it’s full of helpful resources for both beginner and advanced developers.

Ewelina Szmielew
Backend Engineer

Ewelina is a Python developer specializing in web development. She’s interested in data science and always eager to learn new things about machine learning. She recently developed an interest in functional languages such as Haskell and Elixir. After hours, Ewelina goes for walks with her dog Tościk and explores another passion, sewing.
