
Data cleaning is a critical part of every data science project; there's no doubt about it. But here's something you should know: preparing data for proper analysis is often time-consuming and challenging. Data scientists tend to spend a significant share of their time on this stage of data exploration. Moreover, this step sometimes requires domain knowledge.

Here’s what usually happens:

Data scientists often get data from different sources. The data in question might be inconsistent, dirty, or missing. That’s why data cleaning is so important. This is where data scientists make decisions about how to deal with such problems and preserve as much valuable data as possible.

In this article, I present a few handy data cleaning techniques every data scientist needs to know.

## Let’s get started with data cleaning

The data I’m going to use in this article comes from a travel company that offers trips from London to New Zealand and Australia.

We get two tables:

• traveler (users)
• travel (travel records)

First of all, we need to import Pandas.

```python
import pandas as pd
```

If you’d like to follow this tutorial, prepare this data:

```python
# preparing data
traveler = pd.DataFrame({
    'user_id': [136, 284, 101, 529, 800, 823],
    'age': [None, 38, 30, 43, 49, 28],
    'name': ["Ann", "Ben", "Tom", "Bianca", "Caroline", "Kate"]
})

travel = pd.DataFrame({
    'user_id': [101, 284, 136, 800, 101, 800, 823, 529, 284],
    'date_of_journey': ['2018-01-16', '2017-07-13', '2019-10-10', '2018/03/20',
                        '2019-12-24', '2017-10-17', '2016/11/02', '2019/09/14',
                        '2019-08-07'],
    'duration': [10, 10, 7, 13, 7, 11, 14, 8, 12],
    'destination': ["New Zeland", "australia", "Australia", "New_Zealand",
                    "Australia/", "Australia", "New Zealand", "Australia",
                    "New_zealand"],
    'cost': [None, 2325.0, 1760.0, 2740.0, 4000.0, 2475.0, 3140.0, 1840.0, 2910.0],
    'currency': [None, 'EUR', 'GBP', 'GBP', 'GBP', 'EUR', 'GBP', 'GBP', 'GBP']
})

traveler.to_csv("traveler.csv", index=False)
travel.to_csv("travel.csv", index=False)
```

Now that we’ve got our data, it’s time to do some data cleaning.

## 1. Merging data from two CSV files

Our first step is merging data from two CSV files. The data we need are separated into two tables, so we need to combine them using the user_id column located in both files. Here’s how you merge this data in Pandas.

```python
traveler = pd.read_csv("traveler.csv")
travel = pd.read_csv("travel.csv")

data = pd.merge(travel, traveler, on='user_id')
data
```

## 2. Making data consistent

### 2a. Date

To create visualizations of data over time, we need to transform the column so that every value shares the same date format. In our case, the column includes several date formats, so we use infer_datetime_format and let Pandas try to infer the proper format for each value. Note that inference might fail if a user was too creative with the date format.

```python
data.date_of_journey = pd.to_datetime(data.date_of_journey, infer_datetime_format=True)
data
```
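If inference fails, or if you know exactly which formats appear in the column, a more explicit fallback is to parse each known format separately and combine the results. This is a minimal sketch on a standalone series; the date strings here are hypothetical examples, not our travel data.

```python
import pandas as pd

# Hypothetical mixed-format dates
dates = pd.Series(['2018-01-16', '2018/03/20', '16.01.2018'])

# Try each known format; errors='coerce' turns non-matching rows into NaT,
# and combine_first keeps the first successful parse for each row.
parsed = (
    pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
    .combine_first(pd.to_datetime(dates, format='%Y/%m/%d', errors='coerce'))
    .combine_first(pd.to_datetime(dates, format='%d.%m.%Y', errors='coerce'))
)
print(parsed)
```

The advantage over automatic inference is that ambiguous values (such as 03/04/2018) are parsed the way you decided, not the way the library guessed.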

### 2b. Cost

When we analyze the cost and currency columns, we easily notice that costs are given in two currencies. We want our data to be consistent, so it’s a good idea to change the cost values to (for example) GBP.

We can do that by using Pandas slicing. First, we select only the cells where the cost value is given in the currency 'EUR' and update them, multiplying by 0.8 (our assumed EUR-to-GBP exchange rate). In the second step, we change every "EUR" to "GBP" in the currency column.

```python
data.loc[data.currency == 'EUR', ['cost']] = data.cost * 0.8
data.currency.replace("EUR", "GBP", inplace=True)
data
```

### 2c. Destination

As we can see, the destination column contains more unique values than expected. It should include only two, so we need to fix it by mapping every variant to either new_zealand or australia. The second cell lists all the categories that occur in the destination column.

```python
data.destination.describe()
```

```python
# We need to get all the inconsistent categories and try to unify them
categories = data.destination.drop_duplicates()
print(categories)
```

```python
data.loc[data.destination.str.lower().str.startswith('n', na=False), 'destination'] = "new_zealand"
data.loc[data.destination.str.lower().str.startswith('a', na=False), 'destination'] = "australia"
data
```

## 3. Deleting columns

If there are columns we believe won't be needed in the analysis, we can drop them at this point. In our example, we will drop currency, since after the conversion it's the same for all records. We will also delete the name column because we don't need that information for our analysis.

```python
data = data.drop(['currency', 'name'], axis=1)
data
```

## 4. Missing values

We can check whether there are any missing values with data.isna() in the first cell below.

If we spot missing values, we can perform one of the two following actions:

1. If we can’t figure out how to fill the missing values in a way that isn’t misleading, we can delete the rows that contain incomplete data by using data.dropna().
2. If we have enough knowledge to fill in the missing values, we should definitely do it. The idea is to preserve as many rows as we can. Many different strategies are available here, but domain knowledge is required for this task as well. For example, in a column that contains only positive values, we can fill the gaps with -1 to make the imputed entries stand out. Another option is an arbitrarily chosen constant or a calculated value such as the mean, max, or min.
```python
data.isna()
```
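The two strategies can be compared side by side on a toy frame (the data here is made up for illustration, not our travel set):

```python
import pandas as pd

# A tiny frame with one missing value in each column
df = pd.DataFrame({'cost': [100.0, None, 300.0], 'age': [25.0, 30.0, None]})

# Strategy 1: drop incomplete rows - simple, but we lose data
dropped = df.dropna()
print(len(dropped))  # only the fully complete rows survive

# Strategy 2: fill gaps with a calculated value (here: the column mean)
filled = df.fillna({'cost': df.cost.mean(), 'age': df.age.mean()})
print(filled)
```

With only three rows, dropping leaves a single complete row, while filling keeps all three, which is exactly the trade-off described above.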

In our case, we’re going to fill the missing values with:

• cost: the mean travel cost for 'new_zealand'
• age: the mean of the age column
```python
values = {'cost': data.cost.where(data.destination == 'new_zealand').mean(),
          'age': round(data.age.mean())}
print('values:', values)

data = data.fillna(value=values, limit=1)
data
```

## 5. Mapping values

Now it’s time to map the destination column values to {‘new_zealand’: 0, ‘australia’: 1} which is more readable for a machine than strings.

```python
data.destination.replace(('new_zealand', 'australia'), (0, 1), inplace=True)
data
```
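An equivalent approach uses Series.map with an explicit dictionary; this is just a sketch on a standalone series, not a change to the steps above. One useful property of map is that any value without a mapping becomes NaN, which makes leftover typos easy to spot.

```python
import pandas as pd

destination = pd.Series(['new_zealand', 'australia', 'australia'])

# Map each category to its numeric code
mapped = destination.map({'new_zealand': 0, 'australia': 1})
print(mapped.tolist())
```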

## 6. Outliers

Outliers are observations that differ significantly from the rest of the data, sometimes called abnormal observations or anomalies. The easiest way to notice outliers is by preparing plots. To do that, we will use a scatter plot and a boxplot.

In both of the plots, we can see that one observation differs notably from the others (marked with red). Using our business knowledge, we can decide whether such an observation is an error or maybe we can explain it somehow and, ultimately, it’s valuable for us. For example, when analyzing health data, outliers can indicate anomalies that might be caused by disease.

In our situation, let’s say that this anomaly is caused by the dates when the user was traveling (from Christmas to New Year’s Eve), which probably caused the trip cost to skyrocket. We can decide to delete it because such a situation is abnormal and could be misleading.

To detect outliers, we can also use clustering, analyse standard deviation, and many other strategies.

```python
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 7]

x = data.duration
y = data.cost
z = data.destination

fig, ax = plt.subplots()
scatter = ax.scatter(x, y, c=z, s=np.pi * 100, marker='o', alpha=0.5)

# produce a legend with the unique colors from the scatter
legend = ax.legend(*scatter.legend_elements(), loc=1, title="Destination")
ax.add_artist(legend)
legend.get_texts()[0].set_text('new_zealand')
legend.get_texts()[1].set_text('australia')

plt.title('Travel costs vs travel duration for Australia and New Zealand')
plt.xlabel('travel duration')
plt.ylabel('travel costs')
plt.show()
```
```python
boxplot = data.boxplot(column=['cost'], by=['destination'])
plt.xticks([1, 2], ['new_zealand', 'australia']);
```
```python
data = data.drop(data.index[data.cost == data.cost.max()], axis=0)
data
```
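Plots aren't the only option mentioned above; a simple statistical rule works too. This is a sketch of the interquartile range (IQR) rule on a hypothetical cost series (not our cleaned travel data): anything outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR] is flagged as a potential outlier.

```python
import pandas as pd

# Hypothetical cost values with one obvious anomaly at the end
cost = pd.Series([1760.0, 1840.0, 2325.0, 2475.0, 2740.0, 2910.0, 3140.0, 9000.0])

q1, q3 = cost.quantile(0.25), cost.quantile(0.75)
iqr = q3 - q1

# Flag values outside the 1.5 * IQR fences
outliers = cost[(cost < q1 - 1.5 * iqr) | (cost > q3 + 1.5 * iqr)]
print(outliers.tolist())
```

This is the same rule a boxplot uses to draw its whiskers, so it tends to agree with what you'd flag visually.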

## 7. Scaling

Sometimes scaling data can be useful or necessary, because many analytics methods are sensitive to the scale of the data. But what does that mean? Scaling refers to changing the range of features, for example transforming the data to fit a specific range such as [0, 1].

We can use several scaling methods here. Our pick is the Min-Max algorithm, which uses the min and max value for calculation.

We will try to scale the cost column.

```python
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaler.fit(data[['cost']])
print('Scaled cost value: \n', scaler.transform(data[['cost']]))
```
```python
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))

ax = plt.subplot(121)
plt.scatter(data.duration, data.cost, s=np.pi * 100, marker='o', alpha=0.5)
plt.title('Cost data')
plt.xlabel('duration', multialignment='center')
plt.ylabel('cost', multialignment='center')
plt.grid(True)

plt.subplot(122)
plt.scatter(data.duration, scaler.transform(data[['cost']]), s=np.pi * 100,
            marker='o', alpha=0.5)
plt.title('Scaled cost data')
plt.xlabel('duration', multialignment='center')
plt.ylabel('cost', multialignment='center')
plt.grid(True)

plt.show()
```

CONCLUSION: On the scatter plots above, we can see that the points keep the same relative positions, but their values have changed.

## Normalization

Remember that scaling isn’t the same thing as normalization, even though these terms are sometimes confused with one another. Normalization is a more radical process that reshapes the data distribution toward a bell curve (normal distribution).
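To make the difference concrete, here is a sketch using scikit-learn's PowerTransformer (a Yeo-Johnson power transform) on synthetic skewed data, not our travel set. Unlike Min-Max scaling, which only squeezes the range, this transform changes the shape of the distribution itself.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample (exponential distribution)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=500).reshape(-1, 1)

# Yeo-Johnson reshapes the data toward a normal distribution;
# standardize=True (the default) then centers it to zero mean, unit variance
transformer = PowerTransformer(method='yeo-johnson')
normalized = transformer.fit_transform(skewed)

print(normalized.mean(), normalized.std())
```

Min-Max scaling applied to the same sample would leave it just as skewed as before, only confined to [0, 1]; that is the essential difference between the two operations.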

I hope this article helps you take your first steps in data cleaning and building the foundation for further analysis.

If you’d like to learn more about data science, be sure to check the data science category on our blog – it’s full of helpful resources for both beginner and advanced developers.

