Data Transformation with R

10 minute read

This post is a continuation of my recent blog posts on learning R. This post, along with the others, references Hadley Wickham and Garret Grolemund’s R for Data Science, which authorizes small scale, non-profit documentations of the example code from the book.

Dplyr Functions: `filter()`, `arrange()`, `mutate()`, `select()`, `summarize()`

Dataset: `nycflights13`

library(tidyverse)
# install.packages("nycflights13")
library(nycflights13)

head(flights) # flights departing NYC in 2013

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	2013-01-01 05:00:00
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	2013-01-01 05:00:00
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00

`filter()`: Filter Rows

head(filter(flights, arr_delay > 120)) # flight arrivals delayed by more than two hours(120min)

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	811	630	101	1047	830	137	MQ	4576	N531MQ	LGA	CLT	118	544	6	30	2013-01-01 06:00:00
2013	1	1	848	1835	853	1001	1950	851	MQ	3944	N942MQ	JFK	BWI	41	184	18	35	2013-01-01 18:00:00
2013	1	1	957	733	144	1056	853	123	UA	856	N534UA	EWR	BOS	37	200	7	33	2013-01-01 07:00:00
2013	1	1	1114	900	134	1447	1222	145	UA	1086	N76502	LGA	IAH	248	1416	9	0	2013-01-01 09:00:00
2013	1	1	1505	1310	115	1638	1431	127	EV	4497	N17984	EWR	RIC	63	277	13	10	2013-01-01 13:00:00
2013	1	1	1525	1340	105	1831	1626	125	B6	525	N231JB	EWR	MCO	152	937	13	40	2013-01-01 13:00:00

head(filter(flights, month==12, day==25))

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	12	25	456	500	-4	649	651	-2	US	1895	N156UW	EWR	CLT	98	529	5	0	2013-12-25 05:00:00
2013	12	25	524	515	9	805	814	-9	UA	1016	N32404	EWR	IAH	203	1400	5	15	2013-12-25 05:00:00
2013	12	25	542	540	2	832	850	-18	AA	2243	N5EBAA	JFK	MIA	146	1089	5	40	2013-12-25 05:00:00
2013	12	25	546	550	-4	1022	1027	-5	B6	939	N665JB	JFK	BQN	191	1576	5	50	2013-12-25 05:00:00
2013	12	25	556	600	-4	730	745	-15	AA	301	N3JLAA	LGA	ORD	123	733	6	0	2013-12-25 06:00:00
2013	12	25	557	600	-3	743	752	-9	DL	731	N369NB	LGA	DTW	88	502	6	0	2013-12-25 06:00:00

`arrange()`: Change the Order of Rows

df <- tibble(x = c(5,4,3,2,1, NA))
arrange(df, x) # arrange by ascending order
arrange(df, desc(x))

x
1
2
3
4
5
NA

x
5
4
3
2
1
NA

head(arrange(flights, year, month, day)) # arranges the rows by ASCENDING ORDER with respect to the year, month, and day 

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	2013-01-01 05:00:00
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	2013-01-01 05:00:00
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00

head(arrange(flights, desc(year, month, day)))

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	2013-01-01 05:00:00
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	2013-01-01 05:00:00
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00

`select()`: Select specific columns

head(select(flights, arr_time, arr_delay, flight, tailnum)) # select specific columns: arr_time, arr_delay, flight, tailnum

arr_time	arr_delay	flight	tailnum
830	11	1545	N14228
850	20	1714	N24211
923	33	1141	N619AA
1004	-18	725	N804JB
812	-25	461	N668DN
740	12	1696	N39463

head(select(flights, year:day)) # To view the end of the data, run: `tail(select(flights, year:day))`

year	month	day
2013	1	1
2013	1	1
2013	1	1
2013	1	1
2013	1	1
2013	1	1

head(select(flights, -(dep_delay:time_hour)))

year	month	day	dep_time	sched_dep_time
2013	1	1	517	515
2013	1	1	533	529
2013	1	1	542	540
2013	1	1	544	545
2013	1	1	554	600
2013	1	1	554	558

`mutate()`: Add new columns

new_tibble <- select(flights, arr_time, sched_arr_time) # create new tibble consisting of two columns: arr_time, sched_arr_time

head(mutate(new_tibble, arrival_delay = arr_time - sched_arr_time)) # mutate the new tibble by adding a new column: arr_time - sched_Arr_time

arr_time	sched_arr_time	arrival_delay
830	819	11
850	830	20
923	850	73
1004	1022	-18
812	837	-25
740	728	12

`summarize()`: Collapse/summarize data

The below code “summarizes” the flights data frame by providing the average delay of all the planes departing New York City in 2013.

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

delay
12.63907

The function group_by() allows one to summarize data by groups. In the above code, the data frame is grouped by year, month, and day, which ensures that when the summarize() function is applied with dep_delay as a parameter, the resulting summary is that of average delay per day.

by_day <- group_by(flights, year, month, day) # group the flights data set by year, month, and day
head(summarize(by_day, delay=mean(dep_delay, na.rm = TRUE)))  # create new column called "delay" as a new statistical summary 
                                                              # "delay" is the mean of the "dep_delay" column grouped by day  
                                                              # ra.rm refers to removing missing values ("not available") when calculating the mean

year	month	day	delay
2013	1	1	11.548926
2013	1	2	13.858824
2013	1	3	10.987832
2013	1	4	8.951595
2013	1	5	5.732218
2013	1	6	7.148014

To look at the relationship between distance and delay, we first group the flights by destination. This involves a three-step procedure: group, summarize, and filter.

Group

Summarize

Filter

by_dest <- group_by(flights, dest) # group by "dest" column of the flights data set. "group_by" always before "summarize()"
delay <- summarize(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))
                                   # by_dest = grouped data set, 
                                   # "count" is the variable (column) assigned to number of observations in the summary data
                                   # "count" is the number of flights assigned to a certain destination
                                   # "dist" is the mean distance of the flights grouped by destination, with n/a data removed.
                                   # "delay" is the mean delay time of flights grouped by destination
head(delay)

dest	count	dist	delay
ABQ	254	1826.0000	4.381890
ACK	265	199.0000	4.852273
ALB	439	143.0000	14.397129
ANC	8	3370.0000	-2.500000
ATL	17215	757.1082	11.300113
AUS	2439	1514.2530	6.019909

delay <- filter(delay, count > 20, dest != "HNL") # filter() finds the data with "count" exceeding 20, and removes dest = "HNL" 
head(delay) # "ANC," along with others such as "HNL," have been removed. 

dest	count	dist	delay
ABQ	254	1826.0000	4.381890
ACK	265	199.0000	4.852273
ALB	439	143.0000	14.397129
ATL	17215	757.1082	11.300113
AUS	2439	1514.2530	6.019909
AVL	275	583.5818	8.003831

ggplot(data = delay, mapping= aes(x = dist, y = delay)) +   # graph of delay
    geom_point(aes(size=count), alpha = 1/3) + # "geom_point" is a point geometric object.
    geom_smooth(se = FALSE) # "geom_smooth" is a geometric object that is displayed as a smooth conditional regression line. 

`geom_smooth()` using method = 'loess' and formula 'y ~ x'

The graph above shows that flights with air distance greater than approximately 1,000 kilometers tend to have less delay time, compared to those with less distance. Also, delay time appears to increase as the distance approaches 600 kilometers, although the opposite appears to be true as the distance climbs toward 2,000 kilometers.

Although this was a relatively simple visualization task, the steps above involved individually naming objects in each step, which may get very confusing as the complexity of the task increases. To solve this inefficiency problem, as well as to vastly improve readability, there is a technique called “piping”–which involves a “pipe” operator “%>%.” Pronounced “then,” the pipe operator simplifies multistep procedures, as shown below.

delays2 <- flights %>% # "then" group by dest
    group_by(dest) %>% # "then" summarize the data by adding statistical summaries--"count," "dist," and "delay"--as columns. 
    summarize(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE),
    ) %>%             # "then" filter the data frame, or choose the rows with count > 20, and dest != "HNL"
    filter(count > 20, dest != "HNL")

head(delays2)

dest	count	dist	delay
ABQ	254	1826.0000	4.381890
ACK	265	199.0000	4.852273
ALB	439	143.0000	14.397129
ATL	17215	757.1082	11.300113
AUS	2439	1514.2530	6.019909
AVL	275	583.5818	8.003831

When using summarize() to add a new statistical summary as a column to the data set, such as in the example below, if there are any missing values in the data set, the new statistical summary will show up as “NA.” For example, for the first six days of January, 2013, there are missing data with the departure delay time (dep_delay), which is why the mean column is “NA” for those first six days.

contains_missing = flights %>% 
    group_by(year, month, day) %>%
    summarize(mean = mean(dep_delay)) # no "na.rm = TRUE" argument
                                        
head(contains_missing) 

year	month	day	mean
2013	1	1	NA
2013	1	2	NA
2013	1	3	NA
2013	1	4	NA
2013	1	5	NA
2013	1	6	NA

# create new data frame with removed missing values

not_cancelled  <- flights %>% 
    filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled2 = not_cancelled %>% 
    group_by(year, month, day) %>%
    summarize(mean = mean(dep_delay)) # no "na.rm = TRUE" argument

head(not_cancelled2)

year	month	day	mean
2013	1	1	11.435620
2013	1	2	13.677802
2013	1	3	10.907778
2013	1	4	8.965859
2013	1	5	5.732218
2013	1	6	7.145959

delays <- not_cancelled %>%
    group_by(tailnum) %>%
    summarize(
        delay = mean(arr_delay, na.rm = TRUE),
        n = n()
    )

ggplot(data = delays, mapping=aes(x=n, y=delay)) + 
    geom_point(alpha = 1/10)

`summarize()` functions: `mean()`, `sd()`, `IQR()`, `max()`,`min()`

# average arrival delay time per day

two_arr_delays = not_cancelled %>%
    group_by(year, month, day) %>%
    summarize(
        # mean arrival delay time: 
        avg_delay1 = mean(arr_delay),
        # mean arrival delay time > 0:    
        avg_delay2 = mean(arr_delay[arr_delay > 0]),
    )

head(mean_arr_delay)

year	month	day	avg_delay1	avg_delay2
2013	1	1	12.651023	32.48156
2013	1	2	12.692888	32.02991
2013	1	3	5.733333	27.66087
2013	1	4	-1.932819	28.30976
2013	1	5	-1.525802	22.55882
2013	1	6	4.236429	24.37270

`sd()`: standard deviation

# destinations with the highest standard deviation in distance
sd_distance = not_cancelled %>% 
    group_by(dest) %>%
    summarize(distance_sd = sd(distance)) %>%
    arrange(desc(distance_sd))
    
head(sd_distance)

dest	distance_sd
EGE	10.542765
SAN	10.350094
SFO	10.216017
HNL	10.004197
SEA	9.977993
LAS	9.907786

`min()`, `max()`

# first and last flights on each day

min_max_dep_time = not_cancelled %>%
    group_by(year, month, day) %>%
    summarize(
        first = min(dep_time),
        last = max(dep_time)
    )

head(min_max_dep_time)

year	month	day	first	last
2013	1	1	517	2356
2013	1	2	42	2354
2013	1	3	32	2349
2013	1	4	25	2358
2013	1	5	14	2357
2013	1	6	16	2355

`n_distinct()`: number of unique values

n_unique_carriers = not_cancelled %>%
    group_by(dest) %>%
    summarize(carriers = n_distinct(carrier)) %>%
    arrange(desc(carriers))

head(n_unique_carriers)

dest	carriers
ATL	7
BOS	7
CLT	7
ORD	7
TPA	7
AUS	6

tailnum = not_cancelled %>%
    count(tailnum, wt=distance)

head(tailnum)

tailnum	n
D942DN	3418
N0EGMQ	239143
N10156	109664
N102UW	25722
N103US	24619
N104UW	24616

popular_dest <- flights %>%
    group_by(dest)
    filter(n() > 365)

head(popular_dest)

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	2013-01-01 05:00:00
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	2013-01-01 05:00:00
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00

Twitter Facebook LinkedIn

Junhyung Park

Data Transformation with R

Dplyr Functions: `filter()`, `arrange()`, `mutate()`, `select()`, `summarize()`

Dataset: `nycflights13`

`filter()`: Filter Rows

`arrange()`: Change the Order of Rows

`select()`: Select specific columns

`mutate()`: Add new columns

`summarize()`: Collapse/summarize data

`summarize()` functions: `mean()`, `sd()`, `IQR()`, `max()`,`min()`

`sd()`: standard deviation

`min()`, `max()`

`n_distinct()`: number of unique values

You May Also Enjoy

A Brief Look into Antidepressants

Thoughts on OCD (Obsessive-Compulsive Disorder)

Episode with Gastritis

Reflections Concerning Biomedical Research

Junhyung Park

Dplyr Functions: filter(), arrange(), mutate(), select(), summarize()

Dataset: nycflights13

filter(): Filter Rows

arrange(): Change the Order of Rows

select(): Select specific columns

mutate(): Add new columns

summarize(): Collapse/summarize data

summarize() functions: mean(), sd(), IQR(), max(),min()

sd(): standard deviation

min(), max()

n_distinct(): number of unique values

You May Also Enjoy

A Brief Look into Antidepressants

Thoughts on OCD (Obsessive-Compulsive Disorder)

Episode with Gastritis

Reflections Concerning Biomedical Research

Dplyr Functions: `filter()`, `arrange()`, `mutate()`, `select()`, `summarize()`

Dataset: `nycflights13`

`filter()`: Filter Rows

`arrange()`: Change the Order of Rows

`select()`: Select specific columns

`mutate()`: Add new columns

`summarize()`: Collapse/summarize data

`summarize()` functions: `mean()`, `sd()`, `IQR()`, `max()`,`min()`

`sd()`: standard deviation

`min()`, `max()`

`n_distinct()`: number of unique values