Quick Review of DataFrame & Matplotlib

11 minute read

In this post, I created my own reference guide to dataframe and matplotlib.

Motivation

In the previous posts, I have used the dataframe module of the pandas library quite extensively, especially when building simple Scikit-learn based ML models or analyzing data. In particular, while importing the data from Excel or CSV file formats, preprocessing the data to remove rows with missing or invalid values, and one-hot-encoding columns, I have used various functions of the dataframe module. That being said, it occurred to me that having a quick reference for the functions may be useful in the future, should I use those functions for a data analysis project or something. Even though the Pandas website already has a complete compendium of all the dataframe module functions, I thought that using such a personal reference guide would be more convenient than searching the web every time. (Quick side note, the official compilation of dataframe functions can be found at this website). Of course, I plan to update this rather short list as I learn more about dataframe functions.

For similar reasons, I decided to do the same with matplotlib, explaining in my own language, the various visualization methods of the library. Since I am planning to publish a similar future post on R, I believe that having a subsection in this post about matplotlib will facilitate differentiating between those two. So without any further due, let’s get straight to dataframe.

Review of `DataFrame` Functions

Importing File & Multiple Headers

Import Excel File: `.read_excel()`

import pandas as pd

file = pd.read_excel('C:/Users/firef/Downloads/creditCard.xlsx')
file.head()

	Unnamed: 0	X1	X2	X3	X4	X5	X6	X7	X8	X9	...	X15	X16	X17	X18	X19	X20	X21	X22	X23	Y
0	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
1	1	20000	2	2	1	24	2	2	-1	-1	...	0	0	0	0	689	0	0	0	0	1
2	2	120000	2	2	2	26	-1	2	0	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
3	3	90000	2	2	2	34	0	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
4	4	50000	2	2	1	37	0	0	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000	0

5 rows × 25 columns

Specifying the header parameter header=1 in .read_excel() removes the header row.

file1 = pd.read_excel('C:/Users/firef/Downloads/creditCard.xlsx', header=1) 
file1.head()
# print(type(file1)) # file1 is a pandas.core.frame.DataFrame object

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
29995	29996	220000	1	3	1	39	0	0	0	0	...	88004	31237	15980	8500	20000	5003	3047	5000	1000	0
29996	29997	150000	1	3	2	43	-1	-1	-1	-1	...	8979	5190	0	1837	3526	8998	129	0	0	0
29997	29998	30000	1	2	2	37	4	3	2	-1	...	20878	20582	19357	0	0	22000	4200	2000	3100	1
29998	29999	80000	1	3	1	41	1	-1	0	0	...	52774	11855	48944	85900	3409	1178	1926	52964	1804	1
29999	30000	50000	1	2	1	46	0	0	0	0	...	36535	32428	15313	2078	1800	1430	1000	1000	1000	1

5 rows × 25 columns

Since we passed in the Excel file into pd.read_excel() function, the resulting data type of file1 is pandas.core.frame.DataFrame.

type(file1)

pandas.core.frame.DataFrame

Modifying `DataFrame` Objects

Removing Columns: `.drop('name', axis=1)`

file1.drop('ID', axis=1, inplace=False).head() # inplace=False : 

	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
0	20000	2	2	1	24	2	2	-1	-1	-2	...	0	0	0	0	689	0	0	0	0	1
1	120000	2	2	2	26	-1	2	0	0	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	90000	2	2	2	34	0	0	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
3	50000	2	2	1	37	0	0	0	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
4	50000	1	2	1	57	-1	0	-1	0	0	...	20940	19146	19131	2000	36681	10000	9000	689	679	0

5 rows × 24 columns

The inplace=False returns a copy of the original DataFrame object, with the desired columns removed.

file1.head()

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
0	1	20000	2	2	1	24	2	2	-1	-1	...	0	0	0	0	689	0	0	0	0	1
1	2	120000	2	2	2	26	-1	2	0	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	3	90000	2	2	2	34	0	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
3	4	50000	2	2	1	37	0	0	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
4	5	50000	1	2	1	57	-1	0	-1	0	...	20940	19146	19131	2000	36681	10000	9000	689	679	0

5 rows × 25 columns

As seen above, the original DataFrame object has not been changed.

Removing Rows: `axis=0`

file1.drop(0, axis=0, inplace=False).head()

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
1	2	120000	2	2	2	26	-1	2	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	3	90000	2	2	2	34	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
3	4	50000	2	2	1	37	0	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
4	5	50000	1	2	1	57	-1	0	-1	...	20940	19146	19131	2000	36681	10000	9000	689	679	0
5	6	50000	1	1	2	37	0	0	0	...	19394	19619	20024	2500	1815	657	1000	1000	800	0

5 rows × 25 columns

Retrieving Part of DataFrame Object: `.Series.to_frame()`

# First Step: Getting a part of the DataFrame object as a series
onlyMarriage = file1['MARRIAGE'] # gets only the MARRIAGE column
onlyMarriage.head() # type of onlyMarriage = pandas.core.series.Series

  1
  2
  2
  1
  1
Name: MARRIAGE, dtype: int64

# Converting the series into a dataframe object

dfMarriage = pd.Series.to_frame(onlyMarriage, name="MARRIAGE")
dfMarriage # parameter `name` is the name of the new column

	MARRIAGE
0	1
1	2
2	2
3	1
4	1
...	...
29995	1
29996	2
29997	2
29998	1
29999	1

30000 rows × 1 columns

One-Hot Encoding Columns: `.get_dummies()`

# First, check the unique values of the about-to-split column

file1['EDUCATION'].unique() 

array([2, 1, 3, 5, 4, 6, 0], dtype=int64)

# Specify the columns to be split using one-hot encoding in columns = [] parameter
pdFile = pd.get_dummies(file1, columns=['EDUCATION', 'SEX'])
pdFile.head()

	ID	LIMIT_BAL	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	...	default payment next month	EDUCATION_2	SEX_1	SEX_2
0	1	20000	1	24	2	2	-1	-1	-2	-2	...	1	1	0	1
1	2	120000	2	26	-1	2	0	0	0	2	...	1	1	0	1
2	3	90000	2	34	0	0	0	0	0	0	...	0	1	0	1
3	4	50000	1	37	0	0	0	0	0	0	...	0	1	0	1
4	5	50000	1	57	-1	0	-1	0	0	0	...	0	1	1	0

5 rows × 32 columns

Replacing Values In Column: `.replace()`

file1.head()

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
0	1	20000	2	2	1	24	2	2	-1	-1	...	0	0	0	0	689	0	0	0	0	1
1	2	120000	2	2	2	26	-1	2	0	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	3	90000	2	2	2	34	0	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
3	4	50000	2	2	1	37	0	0	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
4	5	50000	1	2	1	57	-1	0	-1	0	...	20940	19146	19131	2000	36681	10000	9000	689	679	0

5 rows × 25 columns

file1['EDUCATION'].replace(to_replace=2, value=1, inplace=True)
file1.head() # Values in Education column changed to one

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
0	1	20000	2	1	1	24	2	2	-1	-1	...	0	0	0	0	689	0	0	0	0	1
1	2	120000	2	1	2	26	-1	2	0	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	3	90000	2	1	2	34	0	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
3	4	50000	2	1	1	37	0	0	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
4	5	50000	1	1	1	57	-1	0	-1	0	...	20940	19146	19131	2000	36681	10000	9000	689	679	0

5 rows × 25 columns

Trimming Dataset (Downsampling): `resample()`, `.concat()`

# CAUTION: Split the DataFrame object beforehand to ensure the proportion of different values 
# before the sampling equals the proportion of different values after the sampling. 

no_default = file1[file1['default payment next month'] == 0]
default = file1[file1['default payment next month'] == 1]

no_default.head() # default payment next month column is all zero
# default.head()

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_3	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6
2	3	90000	2	1	2	34	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000
3	4	50000	2	1	1	37	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000
4	5	50000	1	1	1	57	-1	-1	...	20940	19146	19131	2000	36681	10000	9000	689	679
5	6	50000	1	1	2	37	0	0	...	19394	19619	20024	2500	1815	657	1000	1000	800
6	7	500000	1	1	2	29	0	0	...	542653	483003	473944	55000	40000	38000	20239	13750	13770

5 rows × 25 columns

from sklearn.utils import resample

file1_no_default_downsampled = resample(no_default, replace=False, n_samples = 100, random_state=42)
file1_default_downsampled = resample(default, replace=False, n_samples = 100, random_state=42)

print("Length of the new downsampled dataframe object: ", len(file1_no_default_downsampled))

file1_no_default_downsampled.head() # resampled file1_no_default_downsampled

Length of the new downsampled dataframe object:  100

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6
7510	7511	380000	2	3	2	31	-1	-1	-1	-1	...	11147	12483	13680	9240	15233	11202	12493	13748	18061
15325	15326	240000	2	1	1	35	-2	-2	-2	-2	...	5638	2582	4127	7375	4908	5638	2587	4127	4942
18666	18667	50000	2	1	2	23	-1	-1	-2	-2	...	0	0	0	0	0	0	0	0	0
7494	7495	330000	1	1	2	32	0	0	0	0	...	141453	112633	121242	5500	4723	5500	4000	10700	4500
1239	1240	80000	2	1	1	35	-1	-1	-1	-1	...	396	396	396	9796	13443	396	396	0	396

5 rows × 25 columns

# Concatenate the two parts:

file1_downsampled = pd.concat([file1_no_default_downsampled, file1_default_downsampled])
len(file1_downsampled) # concatenate two pre-separated parts consisting of 100 rows each

Appending Rows: `append()`

file1.head()

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
0	1	20000	2	2	1	24	2	2	-1	-1	...	0	0	0	0	689	0	0	0	0	1
1	2	120000	2	2	2	26	-1	2	0	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	3	90000	2	2	2	34	0	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
3	4	50000	2	2	1	37	0	0	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
4	5	50000	1	2	1	57	-1	0	-1	0	...	20940	19146	19131	2000	36681	10000	9000	689	679	0

5 rows × 25 columns

# adds the extraRow to the end 
extraRow = [0, 10000, 2, 2, 1, 23, 2, 2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1]
file1.loc[len(file1.index)] = extraRow
file1.tail()

	ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
29996	29997	150000	1	3	2	43	-1	-1	-1	-1	...	8979	5190	0	1837	3526	8998	129	0	0	0
29997	29998	30000	1	2	2	37	4	3	2	-1	...	20878	20582	19357	0	0	22000	4200	2000	3100	1
29998	29999	80000	1	3	1	41	1	-1	0	0	...	52774	11855	48944	85900	3409	1178	1926	52964	1804	1
29999	30000	50000	1	2	1	46	0	0	0	0	...	36535	32428	15313	2078	1800	1430	1000	1000	1000	1
30000	0	10000	2	2	1	23	2	2	-1	-1	...	0	0	0	0	689	0	0	0	0	1

5 rows × 25 columns

Twitter Facebook LinkedIn

Junhyung Park

Quick Review of DataFrame & Matplotlib

Motivation

Review of `DataFrame` Functions

Importing File & Multiple Headers

Import Excel File: `.read_excel()`

Modifying `DataFrame` Objects

Removing Columns: `.drop('name', axis=1)`

Removing Rows: `axis=0`

Retrieving Part of DataFrame Object: `.Series.to_frame()`

One-Hot Encoding Columns: `.get_dummies()`

Replacing Values In Column: `.replace()`

Trimming Dataset (Downsampling): `resample()`, `.concat()`

Appending Rows: `append()`

You May Also Enjoy

Thoughts on OCD (Obsessive-Compulsive Disorder)

Episode with Gastritis

Reflections Concerning Biomedical Research

Reflections on 2022-2024

Junhyung Park

Motivation

Review of DataFrame Functions

Importing File & Multiple Headers

Import Excel File: .read_excel()

Modifying DataFrame Objects

Removing Columns: .drop('name', axis=1)

Removing Rows: axis=0

Retrieving Part of DataFrame Object: .Series.to_frame()

One-Hot Encoding Columns: .get_dummies()

Replacing Values In Column: .replace()

Trimming Dataset (Downsampling): resample(), .concat()

Appending Rows: append()

You May Also Enjoy

Thoughts on OCD (Obsessive-Compulsive Disorder)

Episode with Gastritis

Reflections Concerning Biomedical Research

Reflections on 2022-2024

Review of `DataFrame` Functions

Import Excel File: `.read_excel()`

Modifying `DataFrame` Objects

Removing Columns: `.drop('name', axis=1)`

Removing Rows: `axis=0`

Retrieving Part of DataFrame Object: `.Series.to_frame()`

One-Hot Encoding Columns: `.get_dummies()`

Replacing Values In Column: `.replace()`

Trimming Dataset (Downsampling): `resample()`, `.concat()`

Appending Rows: `append()`