Exploratory data analysis, or EDA, helps us build a foundational understanding of our dataset. By exploring the data, we learn its structure and patterns, which helps us build better predictive models. This process takes a fair amount of time because it involves several data preprocessing tasks.
For this tutorial, we’ll need three libraries: pandas, matplotlib, and numpy. You can install these using pip or conda with pip install pandas matplotlib numpy or conda install pandas matplotlib numpy. We will also need a sample CSV dataset. In this case we use the US Census Income dataset, which we got from MOSTLY AI. We will cover:
- Loading a CSV File into Pandas
- Exploring a Pandas Dataframe
- Variable Analysis and Visualization
- Handling Missing Data
- Summary of the Basics of Exploratory Data Analysis
Loading a CSV File into Pandas
We start by importing the pandas library under the conventional alias pd. We use pandas’ read_csv function to load our dataset. The head() method returns the first five rows of our data by default. To see the size of the entire dataset, check the shape attribute of the dataframe.
import pandas as pd
df = pd.read_csv("us-census-income.csv")
print(df.head())
After loading your desired dataset, the first step is variable identification. This means deciding which column is the dependent (target) variable and which columns are the independent (predictor) variables. These are the output and the inputs of the model, and they change based on the needs of the analysis. We use the Python library pandas to load our dataset and to identify or modify the variables and their data types.
The result of calling head() on our example dataset looks like the image below.
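The loading step can be tried without the real file. The sketch below is only an illustration: the rows are made up and the column names are assumptions, not the actual contents of the census CSV. It shows head() with and without an argument, and the shape attribute.

```python
import io

import pandas as pd

# Hypothetical stand-in for the census CSV: made-up rows, assumed column names.
csv_text = """age,workclass,hours-per-week
39,State-gov,40
50,Self-emp,13
38,Private,40
53,Private,40
28,Private,40
37,Private,40
49,Private,16
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

first_rows = sample.head()     # first five rows by default
first_two = sample.head(2)     # pass n to change the number of rows
n_rows, n_cols = sample.shape  # shape is an attribute, not a method

print(first_rows)
print(n_rows, n_cols)
```

With a real file, the only change is passing the file path to read_csv instead of the in-memory text.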
Exploring a Pandas Dataframe
We identify the variable types and get more info from the following code:
print(df.dtypes)
df.info()  # info() prints its summary itself and returns None
print(df.describe())
Printing df.dtypes gives the data type of each variable.
Calling df.info() prints a summary of the entire dataset: the column names, non-null counts, data types, and memory usage.
Printing df.describe() returns basic statistics such as count, mean, std, min, max, and percentiles for the numeric columns. Our data set has two integer columns: age and hours-per-week.
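By default describe() skips text columns. As a rough sketch on a small made-up dataframe (the values are invented and the column names are assumed to match the census data), we can list the numeric columns explicitly and summarize a text column on its own:

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 40],
    "workclass": ["State-gov", "Self-emp", "Private", "Private"],
})

# Columns that describe() would summarize by default.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Calling describe() on a text column reports count, unique, top, and freq.
text_summary = sample["workclass"].describe()

print(numeric_cols)
print(text_summary)
```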
We can also sort the data by any column we choose. Here is an example that uses the sort_values() method to sort by age in ascending order.
print(df.sort_values(by=['age'], ascending=True))
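sort_values() also accepts several columns and a sort direction per column. The following is a minimal sketch on invented data, not the census file. Note that sort_values() returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Invented rows for illustration only.
sample = pd.DataFrame({
    "age": [39, 50, 38, 50],
    "hours-per-week": [40, 13, 40, 45],
})

# Oldest first.
oldest_first = sample.sort_values(by="age", ascending=False)

# Oldest first, and within the same age the fewest hours first.
by_age_then_hours = sample.sort_values(
    by=["age", "hours-per-week"], ascending=[False, True]
)

print(by_age_then_hours)
```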
Variable Analysis and Visualization
Next, we do some variable analysis. The two most common variable analysis techniques are univariate analysis and bivariate analysis. Univariate analysis examines one variable on its own, looking at its range and distribution. Bivariate analysis looks for a relationship between two variables. One way to perform variable analysis is to create charts that visualize the distribution of values for certain variables.
We use univariate analysis to find any missing values or values that are extremely different from the majority of our data. The type of univariate analysis used depends on whether the variable is continuous or categorical. For continuous variables, we need to understand the central tendency and spread of the variable. Central tendency is measured with the mean, median, or mode, and spread with measures such as the range or standard deviation. Both can be visualized with a box plot or a histogram. For categorical variables, we instead use frequency tables to understand the distribution of each category.
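To make these ideas concrete, here is a small sketch on invented data (the column names are assumed to match the census data). It computes central tendency for a continuous variable, builds a frequency table for a categorical one, and does one simple bivariate comparison by averaging age within each category.

```python
import pandas as pd

# Invented rows for illustration only.
sample = pd.DataFrame({
    "age": [25, 32, 47, 32, 51, 38],
    "workclass": ["Private", "Private", "State-gov",
                  "Private", "Self-emp", "State-gov"],
})

# Univariate, continuous: central tendency of age.
mean_age = sample["age"].mean()
median_age = sample["age"].median()
mode_age = sample["age"].mode().iloc[0]  # mode() can return several values

# Univariate, categorical: frequency table of workclass.
workclass_counts = sample["workclass"].value_counts()

# Bivariate: mean age within each workclass.
mean_age_by_class = sample.groupby("workclass")["age"].mean()

print(mean_age, median_age, mode_age)
print(workclass_counts)
print(mean_age_by_class)
```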
Let’s create a histogram of the age distribution. This is a way to explore how many people we have in each age group in our dataset. We use the Python plotting library Matplotlib to create this histogram. We use plt.figure to create a new figure. Calling add_subplot(1, 1, 1) adds a subplot at the first position of a 1 x 1 grid in the figure. Next, we choose which data column to include and how many bars (bins) the histogram should show. After giving the graph a proper title and labels for the x and y axes, we use plt.show() to display the visualization.
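import matplotlib.pyplot as plt
# Create a figure with a single subplot (1 x 1 grid, first position)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# Plot the age column in 8 bins; ec sets the bar edge color
ax.hist(df['age'], bins=8, color="lightblue", ec="purple")
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('# of people')
plt.show()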
From this exploratory visualization, we can see that most of the records in our dataset come from people in their late teens to mid-40s.
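If we want the numbers behind a histogram rather than the picture, NumPy's histogram function returns the count in each bin and the bin edges. The sketch below uses invented ages, not the census data, with the same choice of 8 equal-width bins.

```python
import numpy as np

# Invented ages for illustration only.
ages = np.array([18, 22, 25, 31, 34, 38, 45, 52, 67, 90])

# 8 equal-width bins between the smallest and largest age.
counts, edges = np.histogram(ages, bins=8)

# Each bin includes its left edge; only the last bin includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Eight bins always produce nine edges, and the counts add up to the number of values.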
Handling Missing Data
Missing and outlier values in the dataset reduce a model’s fit and lead to bias, since the data cannot be analyzed completely. In our dataset, any missing values have been replaced with a ?. So in our code, we replace ? values with NaN using pandas’ replace() method together with NumPy’s np.nan. Then, we use the isnull() method to build a table of booleans marking which cells are empty. Next, we use a for loop in Python to count the values in each column. What we get for each column is a count of True (missing) and False (present) cells.
import numpy as np
df.replace("?", np.nan, inplace = True)
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
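A more compact way to get the same information is to sum the boolean table, which gives one missing-value count per column. The sketch below uses invented rows with ? placeholders (the column names are assumptions) and also shows two common follow-ups: dropping incomplete rows or filling them with a placeholder. The "Unknown" label is an arbitrary choice for illustration, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; "?" marks a missing value as in the tutorial.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "?"],
    "occupation": ["Sales", "Tech-support", "?", "Sales"],
})

cleaned = sample.replace("?", np.nan)

# True counts as 1, so the sum is the number of missing cells per column.
missing_per_column = cleaned.isna().sum()

# Option 1: keep only rows with no missing values.
complete_rows = cleaned.dropna()

# Option 2: fill missing categories with an explicit placeholder label.
filled = cleaned.fillna({"workclass": "Unknown", "occupation": "Unknown"})

print(missing_per_column)
```

Which option fits depends on how much data is missing and whether the missingness itself carries information.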
Summary of the Basics of Exploratory Data Analysis
In this post we looked at how to do basic exploratory data analysis with Python. We used the pandas, numpy, and matplotlib libraries to check out our data. We used pandas to import our data from a CSV file into a dataframe and view the data. Then, we looked at different information about the data, including the data types of the columns. After looking at the data with pandas, we used matplotlib to visualize our data by plotting it. Finally, we used numpy’s NaN value together with pandas to find the missing data from the original CSV file.
