Exploratory Data Analysis using Data Visualization Techniques

Definition

Exploratory Data Analysis (EDA) is the process of using statistical and visualization techniques to uncover the important characteristics of a dataset before further analysis.

Reasons for EDA

To identify outliers, irrelevant data, and missing values.
To avoid training an inaccurate model.
To avoid building a model that looks accurate but was trained on the wrong data.

Let's take the example of building a model that predicts the survival of breast cancer patients after surgery, using data from research conducted at the University of Chicago between 1958 and 1970:

The dataset contains the following attributes (three features and the class label):

Patient’s age at the time of operation (numerical).
Year of operation (year minus 1900, numerical).
Number of positive axillary nodes detected (numerical).
Survival status (class attribute):
1: the patient survived 5 years or longer post-operation.
2: the patient died within 5 years post-operation.

The steps to take for our EDA are:

I) Importing libraries and loading data
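
A minimal sketch of this step is shown below. It assumes the data is a header-less CSV with the four attributes in the order listed above; the helper name load_survival_data, the file name "haberman.csv" and the four inline rows are illustrative assumptions, not taken from the linked code.

```python
import io
import pandas as pd

# Column names used throughout this article.
COLUMNS = ["patient_age", "operation_year", "positive_axillary_nodes", "survival_status"]

def load_survival_data(source):
    """Read a header-less CSV (file path or file-like object) into a DataFrame."""
    return pd.read_csv(source, header=None, names=COLUMNS)

# Illustrative stand-in rows so the sketch runs without the real file.
# With the real data you would pass a path such as "haberman.csv" (assumed name).
sample_csv = io.StringIO("30,64,1,1\n30,62,3,1\n34,59,0,2\n38,69,21,2\n")
df = load_survival_data(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few records
df.info()         # dtypes and non-null counts, a quick check for missing values
```

Calling df.shape, df.head() and df.info() right after loading is a common first look at the size, the content and the completeness of the data.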

II) Understanding the data (click the replit link below for code and further observation):

https://replit.com/@NivethaB2/edaandvisualization#main.py

As can be seen, the dataset contains 305 rows and 4 columns.

Using df['survival_status'].value_counts(), the output shows:

1    224
2     81

This means that, out of a total of 305 patients, the number of patients who survived 5 years or longer post-operation (224) is nearly 3 times the number of patients who died within 5 years (81).
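
The class balance can be made explicit with a short calculation. The sketch below rebuilds the two counts reported above by hand instead of reading the dataset, so the variable names are illustrative only.

```python
import pandas as pd

# Class counts reported above: 224 patients with label 1, 81 with label 2.
counts = pd.Series({1: 224, 2: 81}, name="survival_status")

proportions = counts / counts.sum()   # share of each class
ratio = counts.loc[1] / counts.loc[2] # survivors per non-survivor

print(proportions.round(3))
print(round(ratio, 2))  # about 2.77, i.e. "nearly 3 times"
```

On the real DataFrame, df['survival_status'].value_counts(normalize=True) gives the same proportions directly. An imbalance like this is worth keeping in mind when judging a classifier later, since always predicting the majority class would already be right about 73% of the time.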

Data Preparation

The original class labels, 1 (survived 5 years or longer) and 2 (died within 5 years), are not self-explanatory.

We therefore map the values 1 and 2 in the column survival_status to the categorical values ‘yes’ and ‘no’ respectively, such that:
survival_status = 1 → survival_status = ‘yes’
survival_status = 2 → survival_status = ‘no’

df['survival_status'] = df['survival_status'].map({1:"yes", 2:"no"})

General statistical analysis

On average, patients were operated on at about 52 years of age (the mean of roughly 63 in the summary belongs to operation_year, that is, 1963).
The average number of positive axillary nodes detected is about 4.
As indicated by the 50th percentile, the median number of positive axillary nodes is 1.
As indicated by the 75th percentile, 75% of the patients have at most 4 nodes detected.

Note that there is a significant difference between the mean (about 4) and the median (1) of the positive axillary node count. This is because there are some outliers in our data, and the mean, unlike the median, is influenced by the presence of outliers.
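
To see why a few outliers separate the mean from the median, here is a small sketch on made-up node counts (not the real data): most values are small and one is extreme.

```python
import pandas as pd

# Made-up node counts: mostly small values plus one extreme value.
nodes = pd.Series([0, 0, 1, 1, 1, 2, 3, 4, 52])

print(nodes.mean())    # pulled upward by the single large value
print(nodes.median())  # depends only on the middle of the sorted values

# Dropping the extreme value moves the mean far more than the median.
trimmed = nodes[nodes < 50]
print(trimmed.mean(), trimmed.median())
```

On the real DataFrame, df.describe() reports the mean together with the 25th, 50th and 75th percentiles, which is where the figures in this section come from.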

Uni-variate Data Analysis

This kind of analysis is done by considering one variable at a time.

Let’s say our aim is to correctly determine the survival status given three features: patient’s age, operation year, and positive axillary node count. Which of these 3 variables is most useful for distinguishing between the class labels ‘yes’ and ‘no’? To answer this, we’ll draw distribution plots (also called probability density function or PDF plots) with each feature in turn on the X-axis. The values on the Y-axis in each case represent the normalized density.
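
"Normalized density" means the bar heights are scaled so that the total area under the histogram equals 1. A minimal sketch with NumPy, using synthetic ages rather than the patient data, illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(loc=50, scale=10, size=300)  # synthetic values, for illustration only

# density=True rescales the bin counts so the histogram integrates to 1.
density, edges = np.histogram(ages, bins=20, density=True)
widths = np.diff(edges)

area = float(np.sum(density * widths))
print(area)  # 1.0 up to floating-point rounding
```

Because of this scaling, the Y-axis values are densities, not counts or probabilities, so two classes of different sizes can be compared on the same axes.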

Distribution Plots

1. Patient’s age

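# Histogram plus KDE curve per class, with the Y-axis normalized to density
sns.FacetGrid(df, hue = "survival_status").map(sns.histplot, "patient_age", kde = True, stat = "density").add_legend()
plt.show()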

Patients aged 40 to 60 form the largest age group.
There is a high overlap between the class labels. This implies that the survival status of the patient post-operation cannot be discerned from the patient’s age alone.

2. Operation year

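# Histogram plus KDE curve per class, with the Y-axis normalized to density
sns.FacetGrid(df, hue = "survival_status").map(sns.histplot, "operation_year", kde = True, stat = "density").add_legend()
plt.show()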

There is a large overlap between the class labels, suggesting that no firm conclusion about the survival status can be drawn from the operation year alone.

3. Number of positive axillary nodes

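# Histogram plus KDE curve per class, with the Y-axis normalized to density
sns.FacetGrid(df, hue = "survival_status").map(sns.histplot, "positive_axillary_nodes", kde = True, stat = "density").add_legend()
plt.show()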

Box plots and Violin plots

Box plots summarize data with five numbers: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum data values.

Below is an example of a box plot:

plt.figure(figsize = (15, 4))
plt.subplot(1,3,1)
sns.boxplot(x = 'survival_status', y = 'patient_age', data = df)
plt.subplot(1,3,2)
sns.boxplot(x = 'survival_status', y = 'operation_year', data = df)
plt.subplot(1,3,3)
sns.boxplot(x = 'survival_status', y = 'positive_axillary_nodes', data = df)
plt.show()
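
The five numbers behind a box plot can also be computed directly. The sketch below uses made-up values and NumPy's default (linear interpolation) percentiles. It also applies the common 1.5 × IQR rule, which is seaborn's default for deciding which points are drawn individually as outliers beyond the whiskers.

```python
import numpy as np

# Made-up node counts, for illustration only.
values = np.array([0, 0, 1, 1, 2, 3, 4, 7, 13, 52])

# Five-number summary: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)

# Points farther than 1.5 * IQR from the box are treated as outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)
```

With this rule the whiskers stop at the most extreme values inside the fences, so in the plotted box the "minimum" and "maximum" are the whisker ends, not necessarily the smallest and largest observations.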

Violin plots are more informative than box plots because they also show the underlying distribution of the data. Below is an example of a violin plot:

plt.figure(figsize = (15, 4))
plt.subplot(1,3,1)
sns.violinplot(x = 'survival_status', y = 'patient_age', data = df)
plt.subplot(1,3,2)
sns.violinplot(x = 'survival_status', y = 'operation_year', data = df)
plt.subplot(1,3,3)
sns.violinplot(x = 'survival_status', y = 'positive_axillary_nodes', data = df)
plt.show()

Bi-variate data analysis

Pair plot

We'll plot a pair plot to visualize the relationship between the features in a pairwise manner.

sns.set_style('whitegrid')
sns.pairplot(df, hue = 'survival_status')
plt.show()

Joint plot

It provides bi-variate plots with uni-variate marginal distributions.

sns.jointplot(x = 'patient_age', y = 'positive_axillary_nodes', data = df)
plt.show()

Heatmap

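A heatmap of the correlation matrix shows how strongly each pair of numerical features is correlated, which helps when assessing features for regression analysis.

Although correlated features do not necessarily hurt the predictive performance of a statistical model, they can distort the post-modeling analysis, for example when interpreting the contribution of individual features.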

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming 'df' is your DataFrame

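# Generate the heatmap from the numeric columns only
# (survival_status now holds 'yes'/'no' strings, which corr() cannot use)
sns.heatmap(df.corr(numeric_only=True), cmap='YlGnBu', annot=True)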

# Show the plot
plt.show()

Multivariate analysis with Contour plot

A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format. It lets us read a third quantity off a two-dimensional plot; in the plot below, that quantity is the estimated joint density of the two features.

sns.jointplot(x = 'patient_age',  y = 'operation_year' , data = df,  kind = 'kde', fill = True)
plt.show()
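
To make the idea of "constant z slices" concrete, here is a small sketch that draws contours of a synthetic surface with matplotlib. The surface, the grid and the chosen levels are assumptions for illustration and are unrelated to the patient data.

```python
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt

# Synthetic surface z = f(x, y): a single smooth bump centered at the origin.
x = np.linspace(-3, 3, 60)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)           # both arrays have shape (50, 60)
Z = np.exp(-(X**2 + Y**2) / 2)

levels = [0.2, 0.4, 0.6, 0.8]      # the constant z values to slice at
fig, ax = plt.subplots(figsize=(5, 4))
contours = ax.contour(X, Y, Z, levels=levels)  # one closed curve per level
ax.set_xlabel("x")
ax.set_ylabel("y")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # in an interactive session, plt.show() works as well
```

Each curve connects the points where the surface has the same height, so closely spaced curves indicate a steep region. In the KDE joint plot above, the height is the estimated density instead of an explicit function.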

Article courtesy of Analytics Vidhya

Brian Githinji