Quick Review of DataFrame & Matplotlib

11 minute read

In this post, I created my own reference guide to dataframe and matplotlib.

Motivation

In the previous posts, I have used the dataframe module of the pandas library quite extensively, especially when building simple Scikit-learn based ML models or analyzing data. In particular, while importing the data from Excel or CSV file formats, preprocessing the data to remove rows with missing or invalid values, and one-hot-encoding columns, I have used various functions of the dataframe module. That being said, it occurred to me that having a quick reference for the functions may be useful in the future, should I use those functions for a data analysis project or something. Even though the Pandas website already has a complete compendium of all the dataframe module functions, I thought that using such a personal reference guide would be more convenient than searching the web every time. (Quick side note, the official compilation of dataframe functions can be found at this website). Of course, I plan to update this rather short list as I learn more about dataframe functions.

For similar reasons, I decided to do the same with matplotlib, explaining in my own language, the various visualization methods of the library. Since I am planning to publish a similar future post on R, I believe that having a subsection in this post about matplotlib will facilitate differentiating between those two. So without any further due, let’s get straight to dataframe.

Review of DataFrame Functions

Importing File & Multiple Headers

Import Excel File: .read_excel()

import pandas as pd
file = pd.read_excel('C:/Users/firef/Downloads/creditCard.xlsx')
file.head()
Unnamed: 0 X1 X2 X3 X4 X5 X6 X7 X8 X9 ... X15 X16 X17 X18 X19 X20 X21 X22 X23 Y
0 ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
1 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
2 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
3 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
4 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0

5 rows × 25 columns

Specifying the header parameter header=1 in .read_excel() removes the header row.

file1 = pd.read_excel('C:/Users/firef/Downloads/creditCard.xlsx', header=1) 
file1.head()
# print(type(file1)) # file1 is a pandas.core.frame.DataFrame object
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
29995 29996 220000 1 3 1 39 0 0 0 0 ... 88004 31237 15980 8500 20000 5003 3047 5000 1000 0
29996 29997 150000 1 3 2 43 -1 -1 -1 -1 ... 8979 5190 0 1837 3526 8998 129 0 0 0
29997 29998 30000 1 2 2 37 4 3 2 -1 ... 20878 20582 19357 0 0 22000 4200 2000 3100 1
29998 29999 80000 1 3 1 41 1 -1 0 0 ... 52774 11855 48944 85900 3409 1178 1926 52964 1804 1
29999 30000 50000 1 2 1 46 0 0 0 0 ... 36535 32428 15313 2078 1800 1430 1000 1000 1000 1

5 rows × 25 columns

Since we passed in the Excel file into pd.read_excel() function, the resulting data type of file1 is pandas.core.frame.DataFrame.

type(file1)
pandas.core.frame.DataFrame

Modifying DataFrame Objects


Removing Columns: .drop('name', axis=1)

file1.drop('ID', axis=1, inplace=False).head() # inplace=False : 
LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 20000 2 2 1 24 2 2 -1 -1 -2 ... 0 0 0 0 689 0 0 0 0 1
1 120000 2 2 2 26 -1 2 0 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 90000 2 2 2 34 0 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 50000 2 2 1 37 0 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 50000 1 2 1 57 -1 0 -1 0 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 24 columns

The inplace=False returns a copy of the original DataFrame object, with the desired columns removed.

file1.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns

As seen above, the original DataFrame object has not been changed.


Removing Rows: axis=0

file1.drop(0, axis=0, inplace=False).head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0
5 6 50000 1 1 2 37 0 0 0 0 ... 19394 19619 20024 2500 1815 657 1000 1000 800 0

5 rows × 25 columns


Retrieving Part of DataFrame Object: .Series.to_frame()

# First Step: Getting a part of the DataFrame object as a series
onlyMarriage = file1['MARRIAGE'] # gets only the MARRIAGE column
onlyMarriage.head() # type of onlyMarriage = pandas.core.series.Series

0    1
1    2
2    2
3    1
4    1
Name: MARRIAGE, dtype: int64
# Converting the series into a dataframe object

dfMarriage = pd.Series.to_frame(onlyMarriage, name="MARRIAGE")
dfMarriage # parameter `name` is the name of the new column
MARRIAGE
0 1
1 2
2 2
3 1
4 1
... ...
29995 1
29996 2
29997 2
29998 1
29999 1

30000 rows × 1 columns


One-Hot Encoding Columns: .get_dummies()

# First, check the unique values of the about-to-split column

file1['EDUCATION'].unique() 
array([2, 1, 3, 5, 4, 6, 0], dtype=int64)
# Specify the columns to be split using one-hot encoding in columns = [] parameter
pdFile = pd.get_dummies(file1, columns=['EDUCATION', 'SEX'])
pdFile.head()
ID LIMIT_BAL MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 ... default payment next month EDUCATION_0 EDUCATION_1 EDUCATION_2 EDUCATION_3 EDUCATION_4 EDUCATION_5 EDUCATION_6 SEX_1 SEX_2
0 1 20000 1 24 2 2 -1 -1 -2 -2 ... 1 0 0 1 0 0 0 0 0 1
1 2 120000 2 26 -1 2 0 0 0 2 ... 1 0 0 1 0 0 0 0 0 1
2 3 90000 2 34 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 1
3 4 50000 1 37 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 1
4 5 50000 1 57 -1 0 -1 0 0 0 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 32 columns


Replacing Values In Column: .replace()

file1.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns

file1['EDUCATION'].replace(to_replace=2, value=1, inplace=True)
file1.head() # Values in Education column changed to one
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 1 20000 2 1 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 1 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 1 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 1 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 1 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns


Trimming Dataset (Downsampling): resample(), .concat()

# CAUTION: Split the DataFrame object beforehand to ensure the proportion of different values 
# before the sampling equals the proportion of different values after the sampling. 

no_default = file1[file1['default payment next month'] == 0]
default = file1[file1['default payment next month'] == 1]

no_default.head() # default payment next month column is all zero
# default.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
2 3 90000 2 1 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 1 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 1 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0
5 6 50000 1 1 2 37 0 0 0 0 ... 19394 19619 20024 2500 1815 657 1000 1000 800 0
6 7 500000 1 1 2 29 0 0 0 0 ... 542653 483003 473944 55000 40000 38000 20239 13750 13770 0

5 rows × 25 columns

from sklearn.utils import resample

file1_no_default_downsampled = resample(no_default, replace=False, n_samples = 100, random_state=42)
file1_default_downsampled = resample(default, replace=False, n_samples = 100, random_state=42)

print("Length of the new downsampled dataframe object: ", len(file1_no_default_downsampled))

file1_no_default_downsampled.head() # resampled file1_no_default_downsampled
Length of the new downsampled dataframe object:  100
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
7510 7511 380000 2 3 2 31 -1 -1 -1 -1 ... 11147 12483 13680 9240 15233 11202 12493 13748 18061 0
15325 15326 240000 2 1 1 35 -2 -2 -2 -2 ... 5638 2582 4127 7375 4908 5638 2587 4127 4942 0
18666 18667 50000 2 1 2 23 -1 -1 -2 -2 ... 0 0 0 0 0 0 0 0 0 0
7494 7495 330000 1 1 2 32 0 0 0 0 ... 141453 112633 121242 5500 4723 5500 4000 10700 4500 0
1239 1240 80000 2 1 1 35 -1 -1 -1 -1 ... 396 396 396 9796 13443 396 396 0 396 0

5 rows × 25 columns

# Concatenate the two parts:

file1_downsampled = pd.concat([file1_no_default_downsampled, file1_default_downsampled])
len(file1_downsampled) # concatenate two pre-separated parts consisting of 100 rows each
200

Appending Rows: append()

file1.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns

# adds the extraRow to the end 
extraRow = [0, 10000, 2, 2, 1, 23, 2, 2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1]
file1.loc[len(file1.index)] = extraRow
file1.tail()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
29996 29997 150000 1 3 2 43 -1 -1 -1 -1 ... 8979 5190 0 1837 3526 8998 129 0 0 0
29997 29998 30000 1 2 2 37 4 3 2 -1 ... 20878 20582 19357 0 0 22000 4200 2000 3100 1
29998 29999 80000 1 3 1 41 1 -1 0 0 ... 52774 11855 48944 85900 3409 1178 1926 52964 1804 1
29999 30000 50000 1 2 1 46 0 0 0 0 ... 36535 32428 15313 2078 1800 1430 1000 1000 1000 1
30000 0 10000 2 2 1 23 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1

5 rows × 25 columns