Basics of Exploratory Data Analysis

Exploratory data analysis, or EDA, gives us a foundational understanding of our dataset. By exploring the data and its patterns first, we are in a better position to build efficient predictive models. This process takes a fair amount of time because it involves several data preprocessing tasks.

For this tutorial, we’ll need three libraries: pandas, matplotlib, and numpy. You can install these using pip or conda with pip install pandas matplotlib numpy or conda install pandas matplotlib numpy. We will also need a sample CSV dataset. In this case we use the US Census Income dataset which we got from MOSTLY AI. We will cover:

  • Loading a CSV File into Pandas
  • Exploring a Pandas Dataframe
  • Variable Analysis and Visualization
  • Handling Missing Data
  • Summary of the Basics of Exploratory Data Analysis

Loading a CSV File into Pandas

We start by importing the pandas library under the conventional alias pd. We use pandas’ read_csv function to load our dataset. The head() method returns the first five rows of our data by default; pass a number, such as head(10), to see more. To check the size of the entire dataset, use df.shape, which gives the number of rows and columns.

import pandas as pd

df = pd.read_csv("us-census-income.csv")

print(df.head())

After loading your desired dataset, the first step is variable identification. This means deciding which column is the dependent (target) variable and which columns are the independent (predictor) variables. In other words, they are the outputs and inputs of the analysis, and they change based on the analysis needs. We use the Python library, Pandas, to load our dataset and identify or modify the variables and their data types.
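As a minimal sketch of variable identification, the snippet below separates a target column from the predictors. The inline CSV is a made-up stand-in for the census file, and the column name income is an assumption about the dataset, so adjust it to whatever your target column is actually called.

```python
import io

import pandas as pd

# Tiny inline stand-in for the census CSV; the rows are made up for illustration.
csv_text = """age,workclass,hours-per-week,income
39,State-gov,40,<=50K
50,Self-emp,13,<=50K
38,Private,40,>50K
"""
df = pd.read_csv(io.StringIO(csv_text))

# Assumed target column name; change it to match your own dataset.
target_column = "income"

y = df[target_column]                 # dependent (target) variable
X = df.drop(columns=[target_column])  # independent (predictor) variables

print(X.dtypes)
print(y.head())
```

Keeping the target in its own variable makes it harder to accidentally include it among the predictors later on.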

The result of calling head() on our example dataset looks like the image below.

Exploring a Pandas Dataframe

We identify the variable types and get more information with the following code:

print(df.dtypes)
df.info()
print(df.describe())

Printing df.dtypes shows the data type of each column.

Calling df.info() prints a summary of the entire DataFrame: the number of rows, each column’s non-null count and data type, and the memory usage. It prints directly and returns None, so it does not need to be wrapped in print().

Printing df.describe() returns basic statistical details like count, mean, std, min, max, and percentiles for the numeric columns. Our data set has two integer columns: age and hours-per-week.
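By default describe() skips text columns. The sketch below, which uses a small made-up DataFrame rather than the real census data, shows one way to summarize the non-numeric columns as well by passing exclude="number".

```python
import pandas as pd

# Made-up rows for illustration; the real dataset has many more columns.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 40],
    "workclass": ["State-gov", "Self-emp", "Private", "Private"],
})

# Default behavior: only the numeric columns are summarized.
numeric_summary = df.describe()

# Dropping the numeric columns leaves the text columns, which are
# summarized with count, unique, top (most common value), and freq.
text_summary = df.describe(exclude="number")

print(numeric_summary)
print(text_summary)
```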

We can also sort the data by any column value we choose. Using the sort_values() method, here is an example of sorting by age in ascending order.

print(df.sort_values(by=['age'], ascending=True))

Variable Analysis and Visualization

Next, we do some variable analysis. The two most common variable analysis techniques are univariate analysis and bivariate analysis. Univariate analysis looks at one variable on its own, typically its range and distribution. Bivariate analysis looks for a relationship between two variables. One way to perform variable analysis is to create charts that visualize the distribution of values for certain variables.
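As a rough illustration of bivariate analysis, the sketch below compares two numeric columns with a correlation coefficient and compares a categorical column against a numeric one with a grouped mean. The values are made up, and these two techniques are only common examples, not the only options.

```python
import pandas as pd

# Made-up sample; column names mirror the census data used in this post.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "hours-per-week": [20, 40, 45, 50, 30],
    "workclass": ["Private", "Private", "Self-emp", "Private", "Self-emp"],
})

# Numeric vs. numeric: Pearson correlation, a value between -1 and 1.
correlation = df["age"].corr(df["hours-per-week"])

# Categorical vs. numeric: average hours worked within each workclass.
mean_hours_by_workclass = df.groupby("workclass")["hours-per-week"].mean()

print(correlation)
print(mean_hours_by_workclass)
```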

We use univariate analysis to find any missing values or values that are extremely different from the majority of our data. The type of univariate analysis used depends on whether the variable is continuous or categorical. Continuous variables require understanding the central tendency and spread of the variable. Central tendency is measured using the mean, median, and mode, and spread using measures such as the range and standard deviation. Both are visualized using a box plot or a histogram. For categorical variables, we instead use frequency tables to understand the distribution of each category.

Let’s create a histogram of the age distribution. This is a way to explore how many people we have in each age group in our dataset. We use the Python graph plotting library, Matplotlib, to create this histogram. We use plt.figure to create a new figure. Using add_subplot(1, 1, 1) adds a subplot at the first position of a 1 x 1 grid in the figure. Next, we choose which data column to include and how many bars (bins) the histogram should show. After giving the graph a proper title and labels for the x and y axes, we use plt.show() to display the visualization.

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.hist(df['age'], bins=8, color="lightblue", ec="purple")
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('# of people')
plt.show()

Using this exploratory visualization task, we can see that our dataset has most information coming from people in their late teens to mid-40s. 
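The histogram covers a continuous variable. For a categorical variable, a frequency table does the same job, and for spotting extreme values a common rule of thumb is to flag anything more than 1.5 times the interquartile range outside the middle 50% of the data. The sketch below uses made-up values, and the 1.5 x IQR cutoff is a convention assumed here, not something taken from the dataset.

```python
import pandas as pd

# Made-up sample for illustration.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 90],
    "workclass": ["Private", "Private", "Private",
                  "Self-emp", "State-gov", "Private"],
})

# Frequency table: how many rows fall into each category.
workclass_counts = df["workclass"].value_counts()

# Same table expressed as a share of all rows.
workclass_share = df["workclass"].value_counts(normalize=True)

# Simple outlier check based on the interquartile range (IQR).
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df.loc[(df["age"] < lower) | (df["age"] > upper), "age"]

print(workclass_counts)
print(workclass_share)
print(outliers)
```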

Handling Missing Data

Missing and outlier values in the dataset reduce a model’s fit and lead to bias, since the data will not be analyzed completely. In our dataset, any missing values have been replaced with a ?. So in our code, we replace ? values with NaN (NumPy’s np.nan) using pandas’ replace() method. Then, we use the isnull() method to build a DataFrame of booleans marking which cells are empty. Next, we use a for loop in Python to count the values in each column. What we get, per column, is a count of True (missing) and False (present) cells.

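import numpy as np

# Treat the "?" placeholder as a real missing value
df.replace("?", np.nan, inplace=True)

# Boolean DataFrame: True where a cell is missing
missing_data = df.isnull()

# Print the True/False counts for every column
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")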
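A more compact way to get the same counts is isnull().sum(). Once the gaps are known, two common options are dropping incomplete rows or filling a column with its most frequent value. The sketch below uses a small made-up DataFrame; which strategy is appropriate depends on your data, so treat both as illustrations rather than recommendations.

```python
import numpy as np
import pandas as pd

# Made-up rows using "?" as the missing-value placeholder.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["?", "Sales", "?", "Sales"],
})
df = df.replace("?", np.nan)

# Number of missing cells per column, in a single call.
missing_per_column = df.isnull().sum()

# Option 1: keep only the rows that have no missing values.
complete_rows = df.dropna()

# Option 2: fill one column with its most frequent value (the mode).
most_common_workclass = df["workclass"].mode()[0]
filled = df.fillna({"workclass": most_common_workclass})

print(missing_per_column)
print(complete_rows)
print(filled)
```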

Summary of the Basics of Exploratory Data Analysis

In this post we looked at how to do basic exploratory data analysis with Python. We used the pandas, numpy, and matplotlib libraries to check out our data. We used pandas to import our data from a CSV file into a dataframe and view the data. Then, we looked at different information about the data, including the data types of the columns. After looking at the data with pandas, we used matplotlib to visualize the age distribution as a histogram. Finally, we used numpy’s NaN value together with pandas to mark and count the missing data from the original CSV file.

Yujian Tang
