Exploratory data analysis, or EDA, gives us a foundational understanding of our dataset. By exploring the data, we learn its structure and patterns so we can build effective predictive models. This process takes a fair amount of time because it involves several data preprocessing tasks.

For this tutorial, we’ll need three libraries: `pandas`, `matplotlib`, and `numpy`. You can install these using `pip` or `conda` with `pip install pandas matplotlib numpy` or `conda install pandas matplotlib numpy`. We will also need a sample CSV dataset. In this case we use the US Census Income dataset, which we got from MOSTLY AI. We will cover:

- Loading a CSV File into Pandas
- Exploring a Pandas Dataframe
- Variable Analysis and Visualization
- Handling Missing Data
- Summary of the Basics of Exploratory Data Analysis

## Loading a CSV File into Pandas

We start by importing the `pandas` library under the conventional alias `pd`. We use pandas’ `read_csv` function to load our dataset. The `head()` method returns the first five rows of our data by default. To see the size of the entire dataset, check `df.shape`.

```
import pandas as pd

# Load the CSV file into a dataframe
df = pd.read_csv("us-census-income.csv")
# Call head() with parentheses to get the first five rows
print(df.head())
```

After loading your desired dataset, the first step is variable identification. This means deciding which column is the dependent (target) variable and which columns are the independent (predictor) variables. They are the output and the inputs of the analysis, and they change based on the analysis needs. We use the Python library pandas to load our dataset and identify or modify the variables and their data types.
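As a rough sketch of variable identification, the snippet below splits a small made-up table into a target and predictors and changes one column's data type. The table contents and the choice of `income` as the target column are assumptions for illustration, not taken from the actual dataset.

```python
import pandas as pd

# Made-up miniature stand-in for the census data (hypothetical values).
# Treating "income" as the target column is an assumption.
sample = pd.DataFrame({
    "age": [25, 38, 52, 46],
    "hours-per-week": [40, 50, 45, 20],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
    "income": ["<=50K", ">50K", ">50K", "<=50K"],
})

# Dependent (output) variable
target = sample["income"]
# Independent (input) variables: every column except the target
predictors = sample.drop(columns=["income"])

# Modify a column's data type when the default is not what we want
predictors["education"] = predictors["education"].astype("category")

print(predictors.dtypes)
print(target.value_counts())
```

With a different question in mind, for example predicting `hours-per-week`, a different column would become the target.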

The result of calling `head()` on our example dataset looks like the image below.
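To make the difference between `head()`, `tail()`, and `shape` concrete, here is a small self-contained sketch. It reads a few made-up rows from an in-memory string instead of the real CSV file, so the values shown are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the real file has many more rows and columns.
csv_text = """age,education,hours-per-week
39,Bachelors,40
50,Bachelors,13
38,HS-grad,40
53,11th,40
28,Bachelors,40
37,Masters,40
49,9th,16
"""

people = pd.read_csv(io.StringIO(csv_text))

print(people.head())   # first five rows (the default)
print(people.tail(2))  # last two rows
print(people.shape)    # (number of rows, number of columns)
```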

## Exploring a Pandas Dataframe

We identify the variable types and get more info from the following code:

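```
print(df.dtypes)      # data type of each column
df.info()             # prints its summary directly and returns None
print(df.describe())  # summary statistics for the numeric columns
```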

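Printing `df.dtypes` gives the data type of each column. Note that `dtypes` is an attribute, not a method, so it takes no parentheses.

Calling `df.info()` prints a summary of the entire dataframe: the column names, the number of non-null values in each column, the data types, and the memory usage.

Printing `df.describe()` returns basic statistical details like count, mean, std, min, max, and percentiles for the numeric columns. Our dataset has two integer columns: `age` and `hours-per-week`.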
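By default `describe()` skips text columns. A minimal sketch of how one might separate numeric from non-numeric columns and summarize the latter is shown below. It uses a made-up table, so the values are hypothetical.

```python
import pandas as pd

# Made-up miniature table (hypothetical values)
mini = pd.DataFrame({
    "age": [25, 38, 52, 46],
    "hours-per-week": [40, 50, 45, 20],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
})

# Split the column names by kind of data type
numeric_cols = mini.select_dtypes(include=["number"]).columns.tolist()
other_cols = mini.select_dtypes(exclude=["number"]).columns.tolist()
print(numeric_cols)
print(other_cols)

# For non-numeric columns, describe() reports count, unique,
# top (the most frequent value), and freq (how often it occurs)
text_summary = mini.describe(exclude=["number"])
print(text_summary)
```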

We can also sort the data by any column value we choose. Using the `sort_values()` method, here is an example of sorting by age in ascending order.

`print(df.sort_values(by=['age'], ascending=True))`
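`sort_values()` also accepts several columns and a sort direction per column. The sketch below, on a made-up table, sorts by age from oldest to youngest and breaks ties by `hours-per-week`. It also shows that the original dataframe is left unchanged.

```python
import pandas as pd

# Made-up miniature table (hypothetical values)
mini = pd.DataFrame({
    "age": [38, 25, 38, 52],
    "hours-per-week": [50, 40, 20, 45],
})

# Oldest first; ties on age are broken by hours-per-week, lowest first
ordered = mini.sort_values(by=["age", "hours-per-week"], ascending=[False, True])

# sort_values returns a new dataframe; reset_index gives it a clean 0..n-1 index
ordered = ordered.reset_index(drop=True)

print(ordered)
print(mini)  # the original row order is untouched
```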


## Variable Analysis and Visualization

Next, we do some variable analysis. The two most common variable analysis techniques are univariate analysis and bivariate analysis. Univariate analysis looks at one variable on its own, focusing on its range and distribution. Bivariate analysis looks for the relationship between two variables. One way to perform variable analysis is to create charts to visualize the distribution of values for certain variables.
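As a quick illustration of the bivariate case, the sketch below looks at two pairs of variables in a made-up table: a correlation between two continuous columns, and group averages for a continuous column split by a categorical one. The values and the `income` column are assumptions for illustration.

```python
import pandas as pd

# Made-up miniature table (hypothetical values and column names)
mini = pd.DataFrame({
    "age": [22, 35, 47, 58, 29, 41],
    "hours-per-week": [20, 40, 45, 50, 35, 40],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

# Two continuous variables: Pearson correlation coefficient (between -1 and 1)
age_hours_corr = mini["age"].corr(mini["hours-per-week"])
print(age_hours_corr)

# One categorical and one continuous variable: compare the group averages
mean_hours = mini.groupby("income")["hours-per-week"].mean()
print(mean_hours)
```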

We use univariate analysis to find any missing values or values that are extremely different from the majority of our data. The type of univariate analysis used depends on whether the variable is continuous or categorical. For continuous variables, we want to understand the central tendency and spread. These are measured using statistics such as the mean, median, mode, and standard deviation, and visualized using a box plot or a histogram. For categorical variables, we use frequency tables to understand the distribution of each category.
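The sketch below shows both flavors of univariate analysis on a made-up table: summary statistics for a continuous column and a frequency table for a categorical one. The values are hypothetical.

```python
import pandas as pd

# Made-up miniature table (hypothetical values)
mini = pd.DataFrame({
    "age": [22, 35, 35, 47, 61],
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "HS-grad"],
})

# Continuous variable: central tendency and spread
age_mean = mini["age"].mean()
age_median = mini["age"].median()
age_mode = mini["age"].mode().tolist()  # mode() can return several values
age_std = mini["age"].std()
print(age_mean, age_median, age_mode, age_std)

# Categorical variable: frequency table of each category
education_counts = mini["education"].value_counts()
print(education_counts)
```

Passing `normalize=True` to `value_counts()` returns proportions instead of raw counts, which can be easier to compare across datasets of different sizes.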

Let’s create a histogram of the age distribution. This is a way to explore how many people we have in each age group in our dataset. We use the Python plotting library Matplotlib to create this histogram. We use `plt.figure` to create a new figure. Calling `add_subplot(1, 1, 1)` adds a subplot at the first position of a 1 x 1 grid in the figure. Next, we choose which data column to plot and how many bins (bars) the histogram should show. After giving the graph a proper title and labels for the `x` and `y` axes, we call `plt.show()` to display the visualization.

```
import matplotlib.pyplot as plt

fig = plt.figure()
# One subplot at the first position of a 1 x 1 grid
ax = fig.add_subplot(1, 1, 1)
# Histogram of the age column with 8 bins
ax.hist(df['age'], bins=8, color="lightblue", ec="purple")
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('# of people')
plt.show()
```

Using this exploratory visualization, we can see that most of the people in our dataset are in their late teens to mid-40s.
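To read the numbers behind a histogram rather than estimating them from the bars, one option is NumPy's `histogram` function, which computes bin edges and counts without drawing anything. The ages below are made up for illustration and are not taken from the census data.

```python
import numpy as np

# Made-up ages (hypothetical values)
ages = np.array([17, 19, 23, 26, 31, 34, 38, 42, 45, 58, 67, 81])

# Split the range from the youngest to the oldest age into 8 equal-width bins
counts, edges = np.histogram(ages, bins=8)

# Print each bin's range and how many ages fall into it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

With the same data and the same `bins` value, these counts should match the bar heights that `ax.hist` draws.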

## Handling Missing Data

Missing and outlier values in the dataset reduce a model’s fit and lead to bias, since the data cannot be analyzed completely. In our dataset, any missing values have been replaced with a `?`. So in our code, we replace `?` values with NumPy’s `NaN` using pandas’ `replace()` method. Then, we use the `isnull()` method to detect any empty values. It returns a dataframe of booleans in which `True` marks a missing cell. Next, we use a for loop in Python to count the `True` and `False` values in each column. What we get is, for each column, the number of cells that are empty and the number that are not.

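```
import numpy as np

# Missing values are encoded as "?" in this dataset; convert them to NaN
df.replace("?", np.nan, inplace=True)
# Boolean dataframe: True where a cell is missing
missing_data = df.isnull()
# For each column, count the missing (True) and present (False) cells
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
```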
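A more compact way to get the same counts, plus two common ways to act on them, is sketched below on a made-up table. The column names and values are hypothetical, and dropping rows or filling with the most frequent value are only two of several possible strategies, not necessarily the right ones for this dataset.

```python
import numpy as np
import pandas as pd

# Made-up miniature table where "?" marks a missing value (hypothetical values)
mini = pd.DataFrame({
    "age": [25, 38, 52, 46],
    "workclass": ["Private", "?", "Private", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
})

cleaned = mini.replace("?", np.nan)

# Number of missing cells per column, in one line
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)

# Option 1: keep only the rows that have no missing values
complete_rows = cleaned.dropna()

# Option 2: fill the gaps in one column with its most frequent value
most_common = cleaned["workclass"].mode()[0]
filled = cleaned.fillna({"workclass": most_common})
print(filled)
```

If the placeholder is known before loading, passing `na_values="?"` to `read_csv` converts it to `NaN` at load time.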

## Summary of the Basics of Exploratory Data Analysis

In this post we looked at how to do basic exploratory data analysis with Python. We used the `pandas`, `numpy`, and `matplotlib` libraries to check out our data. We used `pandas` to import our data from a CSV file into a dataframe and view the data. Then, we looked at different information about the data, including the data types of the columns. After looking at the data with `pandas`, we used `matplotlib` to visualize our data by plotting it. Finally, we used `numpy`’s `NaN` value together with `pandas` to mark and count the missing data from the original CSV file.
