How To Use Data Cleaning Python Tools

Published:17 December 2021 - 6 min. read

Michael Nguyen Tu Image

Michael Nguyen Tu

Read more tutorials by Michael Nguyen Tu!

Data is the lifeblood of every company, and in a machine learning setting, data is generated from several sources. Data cleaning is crucial for a machine learning setting to work correctly. But how do you perform data cleaning? Data cleaning Python tools are just what you need!

In this tutorial, you will learn what data cleaning is and how to clean data with Python tools so that you can enjoy fresh and clean data.

Prerequisites

This tutorial will be a hands-on demonstration. If you’d like to follow along, be sure you have the following:

  • A Window or Linux machine – This tutorial uses Windows 10 21H1 Build 19043.
  • Jupyter Lab (version 3.12.1 is used in this tutorial) and Python 3 or higher.

Importing Data Cleaning Python Pandas Library

Python has several built-in libraries to help with data cleaning. The two most popular libraries are pandas and numpy, but you’ll be using pandas for this tutorial. Pandas library allows you to work with pandas dataframe for data analysis and manipulation.

Before you can perform data cleansing with Python pandas, import the pandas library and your dataset (CSV file) first:

Launch your JupyterLab, and then drag and drop the Pokémon dataset into your JupyterLab.

Now, run the below commands in sequence to read the dataset and display a preview of the data, so you can check if you have any import errors.

# Import the pandas library and set pd as the standard way 
# to reference pandas.
import pandas as pd
# Read the data from the dataset into your pandas dataframe.
data = pd.read_csv("pokemon.csv")
# Display a preview of the data.
data.head()
Importing pokemon.csv into JupyterLab.
Importing pokemon.csv into JupyterLab.

Removing Whitespaces in Datasets

Now that you’ve imported your dataset, you can start cleaning your data. There are many ways to clean your dataset, like removing whitespaces. Whitespaces unnecessarily increase the size of your dataset in your database and make finding duplicate data a challenge.

1. Check your dataset if there are whitespaces like what you see in the Name, Type, and Weaknesses columns below. You’ll remove these irrelevant parts of the data systematically.

Viewing Whitespaces in Dataset
Viewing Whitespaces in Dataset

2. Copy and paste the following codes to your code shell, and press Shift+Enter keys to execute the code. The code below passes the column name to the replace() function to remove leading and trailing whitespaces in your dataset.

# remove whitespaces from Name column
data["Name"].str.replace(' ', '')
# remove whitespaces from Weight column
data["Type"].str.replace(' ', '')
# remove whitespaces from Type column
data["Weaknesses"].str.replace(' ', '')

3. Finally, check your dataset again to confirm the whitespaces are gone similar to the one below.

Verifying Whitespaces are Removed
Verifying Whitespaces are Removed

Removing Duplicate Values

Whitespaces is not the only one you’ll need to look out for in a dataset. With tons of data in your dataset, you may have overlooked some duplicates. So what’s the process of detecting and removing duplicates? You’ll first look for duplicates by column name in your dataset and remove them.

Each entry in your dataset should have unique data under the Name column. But as you can see below, Blastoise has two entries, one at row 10 and another one at row 11. Since the Height column should only contain numbers, you’ll remove the entry at row 11, which has the excess inches text in its Height column.

Reviewing Duplicates in Dataset
Reviewing Duplicates in Dataset

Run the following commands to remove the first duplicate data.drop_duplicates and keep the last (keep="last") occurrence.

data.drop_duplicates(subset=["Name"], keep="last", inplace=True)
data.head(15)

As you can see below, the duplicate on row 10 is now gone, so you don’t have to worry about fixing that excess “inches” string anymore.

Verifying Duplicates are Removed
Verifying Duplicates are Removed

Filling in the Missing Values

So far, you’ve tackled removing excess data (whitespaces and duplicates), but what about missing data? With the data.info() command, you can check columns with missing data in your dataset.

From this point, filling in the missing data is crucial, or else you’ll get an error when running commands in the following sections.

1. Run the data.info() command below to check for missing values in your dataset.

data.info()

There’s a total of 151 entries in the dataset. In the output shown below, you can tell that three columns are missing data. Both the Height and Weight columns have 150 entries, and the Type column only has 149 entries.

Finding missing data
Finding missing data

2. Next, run the following command to show all entries with at least one (.any(axis=1)) missing data data.isnull().

data[data.isnull().any(axis=1)]

Notice below that the Height, Weight, and Type columns have the Not a Number (NaN) value. The NaN values indicate the columns have null or missing data.

In the output below, you can see Golbat is missing Height and Weight data that you’ll fill in on the next step, so be sure to note Golbat’s entry number (42).

Finding entries with missing data
Finding entries with missing data

3. Look for Golbat’s information on the Pokémon website on your web browser. In Golbat’s data below, you can see the Height value is 5′ 03″ (63 inches), while the Weight value is 121.3 lbs. Note the height and weight value to fill in the missing data for Golbat in your dataset.

height and weight value to fill in the missing data for Golbat in your dataset
height and weight value to fill in the missing data for Golbat in your dataset

Now, run the following commands to fill Golbat’s missing data in your dataset.

The same set of commands apply to modifying existing values in the dataset

# Pass in ID number (Golbat's entry number=42)
golbat = data.loc[42]
# Sets the Height Value
golbat["Height (in)"] = 63
# Sets the Weight
golbat["Weight (lbs)"] = 121.30

5. Run the data.loc[] command below, where 42 is the entry’s ID number, to list the entry’s data and check any empty values.

data.loc[42]

Below, you can see that Golbat’s data are all filled in completely.

Viewing Entry Data
Viewing Entry Data

After filling in the missing data and there is still at least one element missing, you should remove the entire row that’s missing data from the dataset.

6. Finally, repeat the same steps (three to five) to fill in the missing data for other entries.

Fixing Formatting Errors

Instead of missing data, another typical scenario in a dataset is formatting errors. Inaccurate records can be a pain, but no worries, you can still fix them up!

Perhaps you have an entry in your data set with words separated by dashes like the one below instead of commas and spaces. If so, running the apply() and replace() commands will do the trick.

1. Run the command below to see how the data looks in your dataframe. Replace the 104 with the entry number of the data with a formatting error.

data.loc[104]
Showing Data of Specific Entry
Showing Data of Specific Entry

2 Next, run the command below to replace dashes with commas (lambda x: x.replace(" -", ",")) in the data entry’s Weaknesses (data["Weaknesses"]) column.

data["Weaknesses"] = data["Weaknesses"].apply(lambda x: x.replace(" -", ","))

3. Rerun the data.loc[104] command as you did in step one to check for any dashes in the data.

data.loc[104]

As you can see below, the output shows commas now separate the words.

Replacing dashes with space commas
Replacing dashes with space commas

Correcting Misspelled Words

Besides formatting errors, misspelled words in a dataset can also make it hard to analyze data. The good news is that you can use some ready-made spell-checker Python libraries. But since you already have pandas installed, you don’t have to worry about installing anything else.

1. Run the following commands to list all unique words unique() in the Type column. Replace Type if you prefer to list unique words from other columns.

# Turn output into a list of unique words from the Type column
unique_type = list(data["Type"].str.split(", ", expand=True).stack().unique())
# Print out the list
unique_type

As you can see below, there are two misspelled words (Posion and Fie) that should be “Poison” and “Fire.” Now you can go through the dataset, find which rows have misspelled words, and fix them.

Finding Misspelled Words
Finding Misspelled Words

Run the below command to show all rows that contains() the word Posion from the Type column. The regex argument is set to false (regex=False) to treat the string (Posion) as a literal string and not a regular expression.

data[data["Type"].str.contains("Posion", regex=False)]

In the following output, there are four rows (Arbok(24), Nidorina(30), Nidoqueen(30) and Nidoran(32)), that have the misspelled word Posion in the Type column.

Viewing Misspelled Words
Viewing Misspelled Words

3. Now, run the commands below to replace Posion for all entries in the Type column with the word Poison.

# Replace Posion with the word Poison
data["Type"] = data["Type"].apply(lambda x: x.replace("Posion", "Poison"))
	# Lists data entries from 0-30
data.head(30)
Replacing Misspelled Word "Posion" with "Poison"
Replacing Misspelled Word “Posion” with “Poison”

If the replacement is successful, you’ll see you’ve corrected the misspelled words from “Posion” to “Poison” in entry numbers 24 and 30-32.

Verifying Misspelled Words are Corrected
Verifying Misspelled Words are Corrected

4. Finally, repeat the steps (two to three) to correct other misspelled words.

Conclusion

In this tutorial, you’ve learned how to perform data cleaning with Python in many ways for different use cases. You’ve also come to realize that pandas, a popular Python library, is just right around the corner to let you save time cleaning data.

With this newfound knowledge, why not learn more about handy pandas techniques in Python for data manipulation?

Hate ads? Want to support the writer? Get many of our tutorials packaged as an ATA Guidebook.

Explore ATA Guidebooks

Looks like you're offline!