Cleaning and preparing data is often the most time-consuming part of any analysis. Before meaningful insights can be drawn, the data must be cleaned, formatted, and made consistent. In this post, we’ll explore essential techniques for preparing data with Python, NumPy, and Pandas: handling missing data, ensuring consistent formatting, and performing aggregation and analysis on a sample survey dataset.
The Jupyter notebook associated with this blog post, which includes more detail than is covered here, can be found in our GitHub repo here.
Generating Test Data with NumPy and Pandas
To illustrate common data cleaning scenarios, let’s create a sample survey dataset:
import numpy as np
import pandas as pd
# Define parameters
np.random.seed(42)
num_responses = 100
null_percentage = 0.03
num_nulls = int(num_responses * null_percentage)
# Survey data categories
survey_data = {
    "Age": np.random.randint(18, 65, size=num_responses),
    "Satisfaction_Level": np.random.choice(['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied'], size=num_responses),
    "Recommend_To_Friend": np.random.choice([True, False], size=num_responses),
    "Usage_Frequency": np.random.choice(['Daily', 'Weekly', 'Monthly', 'Rarely'], size=num_responses),
    "Favorite_Feature": np.random.choice(['Feature A', 'Feature B', 'Feature C', 'Feature D'], size=num_responses)
}
# Create DataFrame
df = pd.DataFrame(survey_data)
# Cast the boolean column to object so it can hold NaN alongside True/False
df['Recommend_To_Friend'] = df['Recommend_To_Friend'].astype(object)
# Randomly set 3% of responses to NULL
for col in df.columns:
    null_indices = np.random.choice(df.index, num_nulls, replace=False)
    df.loc[null_indices, col] = np.nan
df.head()
This code generates a dataset with some missing values, simulating a typical survey response scenario.
Handling Missing Data
One of the first tasks in data cleaning is handling missing values. Let’s check how many values are missing in each column:
missing_data = df.isnull().sum()
print(missing_data)
Output:
Age 3
Satisfaction_Level 3
Recommend_To_Friend 3
Usage_Frequency 3
Favorite_Feature 3
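Raw counts are useful, but it can also help to see what share of each column is missing before deciding how to handle it. A minimal sketch (using a small stand-in frame, since the survey data above is randomly generated):

```python
import numpy as np
import pandas as pd

# Small stand-in frame with a few deliberate gaps
df = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Satisfaction_Level": ["Neutral", "Satisfied", np.nan, "Satisfied"],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

A column missing a few percent of its values is usually safe to impute or drop; a column missing most of its values may be better excluded entirely.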
Here are three techniques to handle missing data:
1. Removing Missing Data
If only a small amount of data is missing, you can drop the rows with missing values:
df_cleaned = df.dropna()
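By default, dropna() removes a row if any column is missing. When only certain columns matter, you can restrict the check with the subset parameter; a small sketch on a stand-in frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0],
    "Satisfaction_Level": ["Neutral", "Satisfied", None],
})

# Drop only rows missing Satisfaction_Level; rows missing just Age survive
df_cleaned = df.dropna(subset=["Satisfaction_Level"])
print(df_cleaned)
```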
2. Imputation (Filling Missing Data)
You can fill missing values with substitutes like the mean, median, or mode:
df['Age'] = df['Age'].fillna(df['Age'].mean())
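The median and mode mentioned above work the same way. A short sketch on stand-in columns: the median is more robust to outliers for numeric data, while the mode (most frequent value) is the natural choice for categorical columns, where a mean is undefined.

```python
import numpy as np
import pandas as pd

# Stand-in columns with gaps; the real survey frame works the same way
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Usage_Frequency": ["Daily", "Weekly", None, "Daily"],
})

# Median imputation for a numeric column
df["Age"] = df["Age"].fillna(df["Age"].median())

# Mode imputation for a categorical column (mode() returns a Series,
# so take its first entry)
df["Usage_Frequency"] = df["Usage_Frequency"].fillna(df["Usage_Frequency"].mode()[0])
```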
3. Flagging Missing Data
You can create a new column to flag rows where data is missing, which helps track these instances during analysis:
df['Satisfaction_Level_missing'] = df['Satisfaction_Level'].isnull()
Practical Examples of Data Cleaning
Once you’ve handled missing data, other cleaning tasks may include:
1. Ensuring Consistent Formatting
Standardizing input formats is essential. For example, if survey responses for Usage_Frequency include both “Daily” and “daily”, you can convert all entries to lowercase:
df['Usage_Frequency'] = df['Usage_Frequency'].str.lower()
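In real survey data, stray whitespace is as common as inconsistent casing, so the two fixes are often chained. A small sketch on a hypothetical set of raw responses:

```python
import pandas as pd

# Hypothetical raw responses with mixed casing and stray whitespace
s = pd.Series(["Daily", "daily ", " WEEKLY", "Weekly"])

# Strip whitespace first, then lowercase, so all variants collapse to one label
s = s.str.strip().str.lower()
print(s.unique())
```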
2. Handling Duplicates
Check for and remove duplicate entries to prevent skewed results:
df.drop_duplicates(inplace=True)
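Before dropping anything, it’s worth counting how many duplicates exist; in survey data, some identical responses may be legitimate. A sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame containing one exact duplicate row
df = pd.DataFrame({
    "Age": [25, 30, 25],
    "Favorite_Feature": ["Feature A", "Feature B", "Feature A"],
})

# duplicated() flags repeat occurrences of a row; sum() counts them
dup_count = df.duplicated().sum()
print(f"Duplicate rows: {dup_count}")

df = df.drop_duplicates()
```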
Filtering and Sorting Data
Once your data is clean, you can filter and sort it for analysis. For example, you can filter for users who are “Unsatisfied” and sort them by age:
filtered_data = df[df['Satisfaction_Level'] == 'Unsatisfied']
sorted_data = filtered_data.sort_values(by='Age', ascending=True)
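Filters can also combine multiple conditions. A quick sketch on a stand-in frame, using & to require both conditions (note the parentheses around each clause, which Pandas requires):

```python
import pandas as pd

df = pd.DataFrame({
    "Satisfaction_Level": ["Unsatisfied", "Satisfied", "Unsatisfied"],
    "Age": [45, 30, 22],
})

# Unsatisfied respondents over 40
older_unhappy = df[(df["Satisfaction_Level"] == "Unsatisfied") & (df["Age"] > 40)]
print(older_unhappy)
```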
Aggregating and Summarizing Data
Aggregating data allows you to generate summary statistics, like counting respondents by satisfaction level or calculating the average age:
satisfaction_counts = df['Satisfaction_Level'].value_counts()
average_age = df['Age'].mean()
print(satisfaction_counts)
print(f"Average Age: {average_age:.2f}")
Output:
Satisfaction_Level
Very Unsatisfied 27
Neutral 21
Satisfied 20
Very Satisfied 15
Unsatisfied 13
Name: count, dtype: int64
Average Age: 40.76
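Summaries can also be computed per group. A sketch on a stand-in slice of survey data, computing the average age and respondent count per usage bucket in one call:

```python
import pandas as pd

# Stand-in slice; the real survey frame aggregates the same way
df = pd.DataFrame({
    "Usage_Frequency": ["daily", "weekly", "daily", "monthly"],
    "Age": [25.0, 34.0, 41.0, 30.0],
})

# agg() applies several summary statistics per group at once
usage_summary = df.groupby("Usage_Frequency")["Age"].agg(["mean", "count"])
print(usage_summary)
```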
Practical Data Analysis Example
With cleaned data, you can now analyze trends. Here’s how you might check whether satisfaction correlates with recommending the product to a friend (converting the True/False column back to floats first, since it was cast to object to hold NaN):
df['Recommend_To_Friend'] = df['Recommend_To_Friend'].astype(float)
recommendation_analysis = df.groupby('Satisfaction_Level')['Recommend_To_Friend'].mean()
print(recommendation_analysis)
Output:
Satisfaction_Level
Neutral 0.33
Satisfied 0.61
Unsatisfied 0.38
Very Satisfied 0.47
Very Unsatisfied 0.46
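Note that the groups above print in alphabetical order, which obscures any trend along the satisfaction scale. One way to fix this is to convert the column to an ordered categorical; a sketch on a stand-in frame with the same level names:

```python
import pandas as pd

levels = ["Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"]

df = pd.DataFrame({
    "Satisfaction_Level": ["Satisfied", "Neutral", "Very Satisfied", "Unsatisfied"],
    "Recommend_To_Friend": [1.0, 0.0, 1.0, 0.0],
})

# An ordered categorical makes groupby follow the scale, not the alphabet
df["Satisfaction_Level"] = pd.Categorical(
    df["Satisfaction_Level"], categories=levels, ordered=True
)
# observed=True keeps only levels that actually appear in the data
rates = df.groupby("Satisfaction_Level", observed=True)["Recommend_To_Friend"].mean()
print(rates)
```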
Conclusion
Cleaning data is the crucial first step in preparing it for analysis, and often the most time-consuming one. By following the steps in this post with Python, NumPy, and Pandas, you have a solid foundation for handling missing data, ensuring consistency, and setting up your analysis to deliver meaningful insights.
Stay tuned for the next post, where we’ll explore creating visualizations with the cleaned data!