Cleaning and preparing data is often the most time-consuming part of any analysis. Before meaningful insights can be drawn, the data must be cleaned, formatted, and made consistent. In this post, we’ll explore essential techniques for preparing data with Python, NumPy, and Pandas: handling missing data, ensuring consistent formatting, and performing aggregation and analysis on a sample survey dataset.
The Jupyter notebook associated with this blog post, which includes more detail than is covered here, can be found in our GitHub repo here.
Generating Test Data with NumPy and Pandas
To illustrate common data cleaning scenarios, let’s create a sample survey dataset:
import numpy as np
import pandas as pd
# Define parameters
np.random.seed(42)
num_responses = 100
null_percentage = 0.03
num_nulls = int(num_responses * null_percentage)
# Survey data categories
survey_data = {
    "Age": np.random.randint(18, 65, size=num_responses),
    "Satisfaction_Level": np.random.choice(['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied'], size=num_responses),
    "Recommend_To_Friend": np.random.choice([True, False], size=num_responses),
    "Usage_Frequency": np.random.choice(['Daily', 'Weekly', 'Monthly', 'Rarely'], size=num_responses),
    "Favorite_Feature": np.random.choice(['Feature A', 'Feature B', 'Feature C', 'Feature D'], size=num_responses)
}
# Create DataFrame
df = pd.DataFrame(survey_data)
# Cast the boolean column to object so it can hold NaN alongside True/False
df['Recommend_To_Friend'] = df['Recommend_To_Friend'].astype(object)
# Randomly set 3% of responses to NULL
for col in df.columns:
    null_indices = np.random.choice(df.index, num_nulls, replace=False)
    df.loc[null_indices, col] = np.nan
df.head()
This code generates a dataset with some missing values, simulating a typical survey response scenario.
Handling Missing Data
One of the first tasks in data cleaning is handling missing values. Let’s check how many values are missing in each column:
missing_data = df.isnull().sum()
print(missing_data)
Output:
Age 3
Satisfaction_Level 3
Recommend_To_Friend 3
Usage_Frequency 3
Favorite_Feature 3
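Raw counts are useful, but it can also help to see what share of each column is missing before deciding how to handle it. A minimal sketch (using a small stand-in frame, since the survey data above is randomly generated):

```python
import numpy as np
import pandas as pd

# Small stand-in frame with a few deliberate gaps
df = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Satisfaction_Level": ["Neutral", "Satisfied", np.nan, "Satisfied"],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

A column missing a few percent of its values is usually safe to impute or drop; a column missing most of its values may be better excluded entirely.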
Here are three techniques to handle missing data:
1. Removing Missing Data
If only a small amount of data is missing, you can drop the rows with missing values:
df_cleaned = df.dropna()
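By default, dropna() removes a row if any column is missing. When only certain columns matter, you can restrict the check with the subset parameter; a small sketch on a stand-in frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0],
    "Satisfaction_Level": ["Neutral", "Satisfied", None],
})

# Drop only rows missing Satisfaction_Level; rows missing just Age survive
df_cleaned = df.dropna(subset=["Satisfaction_Level"])
print(df_cleaned)
```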
2. Imputation (Filling Missing Data)
You can fill missing values with substitutes like the mean, median, or mode:
df['Age'] = df['Age'].fillna(df['Age'].mean())
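The median and mode mentioned above work the same way. A short sketch on stand-in columns: the median is more robust to outliers for numeric data, while the mode (most frequent value) is the natural choice for categorical columns, where a mean is undefined.

```python
import numpy as np
import pandas as pd

# Stand-in columns with gaps; the real survey frame works the same way
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Usage_Frequency": ["Daily", "Weekly", None, "Daily"],
})

# Median imputation for a numeric column
df["Age"] = df["Age"].fillna(df["Age"].median())

# Mode imputation for a categorical column (mode() returns a Series,
# so take its first entry)
df["Usage_Frequency"] = df["Usage_Frequency"].fillna(df["Usage_Frequency"].mode()[0])
```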
3. Flagging Missing Data
You can create a new column to flag rows where data is missing, which helps track these instances during analysis:
df['Satisfaction_Level_missing'] = df['Satisfaction_Level'].isnull()
Practical Examples of Data Cleaning
Once you’ve handled missing data, other cleaning tasks may include:
1. Ensuring Consistent Formatting
Standardizing input formats is essential. For example, if survey responses for Usage_Frequency include both “Daily” and “daily”, you can convert all entries to lowercase:
df['Usage_Frequency'] = df['Usage_Frequency'].str.lower()
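In real survey data, stray whitespace is as common as inconsistent casing, so the two fixes are often chained. A small sketch on a hypothetical set of raw responses:

```python
import pandas as pd

# Hypothetical raw responses with mixed casing and stray whitespace
s = pd.Series(["Daily", "daily ", " WEEKLY", "Weekly"])

# Strip whitespace first, then lowercase, so all variants collapse to one label
s = s.str.strip().str.lower()
print(s.unique())
```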
2. Handling Duplicates
Check for and remove duplicate entries to prevent skewed results:
df.drop_duplicates(inplace=True)
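Before dropping anything, it’s worth counting how many duplicates exist; in survey data, some identical responses may be legitimate. A sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame containing one exact duplicate row
df = pd.DataFrame({
    "Age": [25, 30, 25],
    "Favorite_Feature": ["Feature A", "Feature B", "Feature A"],
})

# duplicated() flags repeat occurrences of a row; sum() counts them
dup_count = df.duplicated().sum()
print(f"Duplicate rows: {dup_count}")

df = df.drop_duplicates()
```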
Filtering and Sorting Data
Once your data is clean, you can filter and sort it for analysis. For example, you can filter for users who are “Unsatisfied” and sort them by age:
filtered_data = df[df['Satisfaction_Level'] == 'Unsatisfied']
sorted_data = filtered_data.sort_values(by='Age', ascending=True)
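Filters can also combine multiple conditions. A quick sketch on a stand-in frame, using & to require both conditions (note the parentheses around each clause, which Pandas requires):

```python
import pandas as pd

df = pd.DataFrame({
    "Satisfaction_Level": ["Unsatisfied", "Satisfied", "Unsatisfied"],
    "Age": [45, 30, 22],
})

# Unsatisfied respondents over 40
older_unhappy = df[(df["Satisfaction_Level"] == "Unsatisfied") & (df["Age"] > 40)]
print(older_unhappy)
```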
Aggregating and Summarizing Data
Aggregating data allows you to generate summary statistics, like counting respondents by satisfaction level or calculating the average age:
satisfaction_counts = df['Satisfaction_Level'].value_counts()
average_age = df['Age'].mean()
print(satisfaction_counts)
print(f"Average Age: {average_age:.2f}")
Output:
Satisfaction_Level
Very Unsatisfied 27
Neutral 21
Satisfied 20
Very Satisfied 15
Unsatisfied 13
Name: count, dtype: int64
Average Age: 40.76
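Summaries can also be computed per group. A sketch on a stand-in slice of survey data, computing the average age and respondent count per usage bucket in one call:

```python
import pandas as pd

# Stand-in slice; the real survey frame aggregates the same way
df = pd.DataFrame({
    "Usage_Frequency": ["daily", "weekly", "daily", "monthly"],
    "Age": [25.0, 34.0, 41.0, 30.0],
})

# agg() applies several summary statistics per group at once
usage_summary = df.groupby("Usage_Frequency")["Age"].agg(["mean", "count"])
print(usage_summary)
```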
Practical Data Analysis Example
With cleaned data, you can now analyze trends. Here’s how you might check whether satisfaction correlates with recommending the product to a friend (converting the True/False column back to floats first, since it was cast to object to hold NaN):
df['Recommend_To_Friend'] = df['Recommend_To_Friend'].astype(float)
recommendation_analysis = df.groupby('Satisfaction_Level')['Recommend_To_Friend'].mean()
print(recommendation_analysis)
Output:
Satisfaction_Level
Neutral 0.33
Satisfied 0.61
Unsatisfied 0.38
Very Satisfied 0.47
Very Unsatisfied 0.46
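Note that the groups above print in alphabetical order, which obscures any trend along the satisfaction scale. One way to fix this is to convert the column to an ordered categorical; a sketch on a stand-in frame with the same level names:

```python
import pandas as pd

levels = ["Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"]

df = pd.DataFrame({
    "Satisfaction_Level": ["Satisfied", "Neutral", "Very Satisfied", "Unsatisfied"],
    "Recommend_To_Friend": [1.0, 0.0, 1.0, 0.0],
})

# An ordered categorical makes groupby follow the scale, not the alphabet
df["Satisfaction_Level"] = pd.Categorical(
    df["Satisfaction_Level"], categories=levels, ordered=True
)
# observed=True keeps only levels that actually appear in the data
rates = df.groupby("Satisfaction_Level", observed=True)["Recommend_To_Friend"].mean()
print(rates)
```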
Conclusion
Cleaning data is the crucial first step in preparing it for analysis, and often the most time-consuming one. By following the steps in this post with Python, NumPy, and Pandas, you have a solid foundation for handling missing data, ensuring consistency, and setting up your analysis to deliver meaningful insights.
Stay tuned for the next post, where we’ll explore creating visualizations with the cleaned data!