This post continues my recent series of blog posts on learning R. Like the others, it draws on Hadley Wickham and Garrett Grolemund’s R for Data Science, which permits small-scale, non-profit reproduction of the example code from the book.
library(tidyverse); library(nycflights13) # dplyr verbs and ggplot come from the tidyverse; the flights data set comes from nycflights13
head(filter(flights, arr_delay > 120)) # flight arrivals delayed by more than two hours (120 min)
year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  sched_arr_time  arr_delay  carrier  flight  tailnum  origin  dest  air_time  distance  hour  minute  time_hour
2013      1    1       811             630        101      1047             830        137  MQ         4576  N531MQ   LGA     CLT        118       544     6      30  2013-01-01 06:00:00
2013      1    1       848            1835        853      1001            1950        851  MQ         3944  N942MQ   JFK     BWI         41       184    18      35  2013-01-01 18:00:00
2013      1    1       957             733        144      1056             853        123  UA          856  N534UA   EWR     BOS         37       200     7      33  2013-01-01 07:00:00
2013      1    1      1114             900        134      1447            1222        145  UA         1086  N76502   LGA     IAH        248      1416     9       0  2013-01-01 09:00:00
2013      1    1      1505            1310        115      1638            1431        127  EV         4497  N17984   EWR     RIC         63       277    13      10  2013-01-01 13:00:00
2013      1    1      1525            1340        105      1831            1626        125  B6          525  N231JB   EWR     MCO        152       937    13      40  2013-01-01 13:00:00
head(filter(flights, month == 12, day == 25)) # flights on December 25
year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  sched_arr_time  arr_delay  carrier  flight  tailnum  origin  dest  air_time  distance  hour  minute  time_hour
2013     12   25       456             500         -4       649             651         -2  US         1895  N156UW   EWR     CLT         98       529     5       0  2013-12-25 05:00:00
2013     12   25       524             515          9       805             814         -9  UA         1016  N32404   EWR     IAH        203      1400     5      15  2013-12-25 05:00:00
2013     12   25       542             540          2       832             850        -18  AA         2243  N5EBAA   JFK     MIA        146      1089     5      40  2013-12-25 05:00:00
2013     12   25       546             550         -4      1022            1027         -5  B6          939  N665JB   JFK     BQN        191      1576     5      50  2013-12-25 05:00:00
2013     12   25       556             600         -4       730             745        -15  AA          301  N3JLAA   LGA     ORD        123       733     6       0  2013-12-25 06:00:00
2013     12   25       557             600         -3       743             752         -9  DL          731  N369NB   LGA     DTW         88       502     6       0  2013-12-25 06:00:00
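As a small self-contained sketch of how filter() combines conditions: arguments separated by commas must all hold, so they behave like a logical "and". The tibble `toy` and its values are made up for illustration and are not taken from the flights data.

```r
library(dplyr)

# hypothetical toy data, not taken from the flights data set
toy <- tibble(
  month = c(12, 12, 1, 12),
  day   = c(25, 24, 25, 25),
  delay = c(5, 130, 200, NA)
)

# comma-separated conditions must all be TRUE (logical AND)
xmas <- filter(toy, month == 12, day == 25)

# the same rows, written with an explicit &
xmas_and <- filter(toy, month == 12 & day == 25)

# filter() keeps only rows where the condition is TRUE,
# so the row with a missing delay is dropped as well
late <- filter(toy, delay > 120)

print(xmas)
print(late)
```

Rows whose condition evaluates to NA are excluded, which is worth keeping in mind when a column such as arr_delay contains missing values.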
arrange(): Change the Order of Rows
df <- tibble(x = c(5, 4, 3, 2, 1, NA))
arrange(df, x) # arrange in ascending order; NA is placed last
x
1
2
3
4
5
NA
arrange(df, desc(x)) # arrange in descending order; NA is still placed last
x
5
4
3
2
1
NA
head(arrange(flights, year, month, day)) # arranges the rows in ascending order by year, then month, then day
year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  sched_arr_time  arr_delay  carrier  flight  tailnum  origin  dest  air_time  distance  hour  minute  time_hour
2013      1    1       517             515          2       830             819         11  UA         1545  N14228   EWR     IAH        227      1400     5      15  2013-01-01 05:00:00
2013      1    1       533             529          4       850             830         20  UA         1714  N24211   LGA     IAH        227      1416     5      29  2013-01-01 05:00:00
2013      1    1       542             540          2       923             850         33  AA         1141  N619AA   JFK     MIA        160      1089     5      40  2013-01-01 05:00:00
2013      1    1       544             545         -1      1004            1022        -18  B6          725  N804JB   JFK     BQN        183      1576     5      45  2013-01-01 05:00:00
2013      1    1       554             600         -6       812             837        -25  DL          461  N668DN   LGA     ATL        116       762     6       0  2013-01-01 06:00:00
2013      1    1       554             558         -4       740             728         12  UA         1696  N39463   EWR     ORD        150       719     5      58  2013-01-01 05:00:00
head(arrange(flights, desc(year, month, day)))
# desc() is meant to wrap a single column. The output below is in the same order as the ascending result, so month and day were not reversed here.
# To sort all three columns in descending order, wrap each one separately: arrange(flights, desc(year), desc(month), desc(day))
year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  sched_arr_time  arr_delay  carrier  flight  tailnum  origin  dest  air_time  distance  hour  minute  time_hour
2013      1    1       517             515          2       830             819         11  UA         1545  N14228   EWR     IAH        227      1400     5      15  2013-01-01 05:00:00
2013      1    1       533             529          4       850             830         20  UA         1714  N24211   LGA     IAH        227      1416     5      29  2013-01-01 05:00:00
2013      1    1       542             540          2       923             850         33  AA         1141  N619AA   JFK     MIA        160      1089     5      40  2013-01-01 05:00:00
2013      1    1       544             545         -1      1004            1022        -18  B6          725  N804JB   JFK     BQN        183      1576     5      45  2013-01-01 05:00:00
2013      1    1       554             600         -6       812             837        -25  DL          461  N668DN   LGA     ATL        116       762     6       0  2013-01-01 06:00:00
2013      1    1       554             558         -4       740             728         12  UA         1696  N39463   EWR     ORD        150       719     5      58  2013-01-01 05:00:00
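Because desc() reverses one column at a time, a fully descending sort needs one desc() call per column. Here is a minimal sketch on a made-up tibble (`toy` and its values are illustrative only, not rows from flights):

```r
library(dplyr)

# hypothetical toy data
toy <- tibble(
  year  = c(2013, 2013, 2013, 2013),
  month = c(1, 12, 12, 6),
  day   = c(1, 30, 31, 15)
)

# wrap each sorting column in its own desc() call
latest_first <- arrange(toy, desc(year), desc(month), desc(day))

print(latest_first)
```

The first row of `latest_first` is December 31 and the last row is January 1, which is the reverse of what arrange(toy, year, month, day) would give.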
select(): Select specific columns
head(select(flights, arr_time, arr_delay, flight, tailnum)) # select specific columns: arr_time, arr_delay, flight, tailnum
arr_time  arr_delay  flight  tailnum
     830         11    1545  N14228
     850         20    1714  N24211
     923         33    1141  N619AA
    1004        -18     725  N804JB
     812        -25     461  N668DN
     740         12    1696  N39463
head(select(flights, year:day)) # To view the end of the data, run: `tail(select(flights, year:day))`
year  month  day
2013      1    1
2013      1    1
2013      1    1
2013      1    1
2013      1    1
2013      1    1
head(select(flights, -(dep_delay:time_hour))) # drop every column from dep_delay through time_hour
year  month  day  dep_time  sched_dep_time
2013      1    1       517             515
2013      1    1       533             529
2013      1    1       542             540
2013      1    1       544             545
2013      1    1       554             600
2013      1    1       554             558
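Besides naming columns or ranges, select() also accepts helper functions that match columns by name. The sketch below uses starts_with(), ends_with(), and contains() on a made-up one-row tibble whose column names merely imitate those in flights:

```r
library(dplyr)

# hypothetical one-row tibble with flights-like column names
toy <- tibble(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  carrier = "UA"
)

delays   <- select(toy, ends_with("delay"))   # columns whose names end in "delay"
arrivals <- select(toy, starts_with("arr"))   # columns whose names start with "arr"
no_times <- select(toy, -contains("time"))    # drop columns whose names contain "time"

print(names(delays))
print(names(arrivals))
print(names(no_times))
```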
mutate(): Add new columns
new_tibble <- select(flights, arr_time, sched_arr_time) # create new tibble consisting of two columns: arr_time, sched_arr_time
head(mutate(new_tibble, arrival_delay = arr_time - sched_arr_time)) # mutate the new tibble by adding a new column: arr_time - sched_arr_time
arr_time  sched_arr_time  arrival_delay
     830             819             11
     850             830             20
     923             850             73
    1004            1022            -18
     812             837            -25
     740             728             12
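One caveat about the subtraction above: arr_time and sched_arr_time are clock times written as HHMM numbers (923 means 9:23), so the difference is only a true number of minutes when both times fall in the same hour. The third row gives 73, while the arr_delay column for that flight reads 33. A minimal sketch of a fix, assuming a hypothetical helper `to_minutes()` that converts HHMM to minutes since midnight (it does not handle arrivals that cross midnight):

```r
library(dplyr)

# three arrival/scheduled pairs copied from the output above
toy <- tibble(
  arr_time       = c(830, 923, 1004),
  sched_arr_time = c(819, 850, 1022)
)

# hypothetical helper: HHMM clock time -> minutes since midnight
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

toy <- mutate(
  toy,
  naive_diff    = arr_time - sched_arr_time,                          # plain subtraction of HHMM numbers
  arrival_delay = to_minutes(arr_time) - to_minutes(sched_arr_time)   # difference in minutes
)

print(toy)
```

With the conversion, the three delays come out as 11, 33, and -18 minutes, matching the arr_delay values shown for the same flights earlier in the post.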
summarize(): Collapse/summarize data
The code below “summarizes” the flights data frame by computing the average departure delay, for each day, of the planes departing New York City in 2013.
The function group_by() allows one to summarize data by groups. In the code below, the data frame is grouped by year, month, and day, so that when summarize() is applied to dep_delay, the resulting summary is the average delay per day.
by_day <- group_by(flights, year, month, day) # group the flights data set by year, month, and day
head(summarize(by_day, delay = mean(dep_delay, na.rm = TRUE)))
# "delay" is a new summary column: the mean of the "dep_delay" column within each day
# na.rm = TRUE removes missing values ("not available") before calculating the mean
year  month  day      delay
2013      1    1  11.548926
2013      1    2  13.858824
2013      1    3  10.987832
2013      1    4   8.951595
2013      1    5   5.732218
2013      1    6   7.148014
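To make the grouping step concrete, here is a minimal sketch on a made-up tibble (`toy` and its values are illustrative). It produces one output row per group, and n() counts the rows in each group, including those with a missing delay:

```r
library(dplyr)

# hypothetical toy data: two days, one missing delay
toy <- tibble(
  day       = c(1, 1, 2, 2, 2),
  dep_delay = c(10, NA, 0, 6, 30)
)

by_day <- group_by(toy, day)

daily <- summarize(
  by_day,
  delay   = mean(dep_delay, na.rm = TRUE), # mean delay per day, ignoring NA
  flights = n()                            # number of rows per day
)

print(daily)
```

Day 1 has a mean delay of 10 (the NA is ignored) over 2 rows, and day 2 has a mean delay of 12 over 3 rows.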
To look at the relationship between distance and delay, we first group the flights by destination. This involves a three-step procedure: group, summarize, and filter.
Group
Summarize
Filter
by_dest <- group_by(flights, dest) # group by the "dest" column of the flights data set; group_by() always comes before summarize()
delay <- summarize(by_dest, # by_dest = grouped data set
  count = n(), # "count" is the number of flights to each destination
  dist = mean(distance, na.rm = TRUE), # "dist" is the mean distance of the flights grouped by destination, with NA values removed
  delay = mean(arr_delay, na.rm = TRUE) # "delay" is the mean arrival delay of the flights grouped by destination
)
head(delay)
dest  count       dist      delay
ABQ     254  1826.0000   4.381890
ACK     265   199.0000   4.852273
ALB     439   143.0000  14.397129
ANC       8  3370.0000  -2.500000
ATL   17215   757.1082  11.300113
AUS    2439  1514.2530   6.019909
delay <- filter(delay, count > 20, dest != "HNL") # filter() keeps the rows with "count" exceeding 20 and removes dest = "HNL"
head(delay) # "ANC," along with others such as "HNL," has been removed.
dest  count       dist      delay
ABQ     254  1826.0000   4.381890
ACK     265   199.0000   4.852273
ALB     439   143.0000  14.397129
ATL   17215   757.1082  11.300113
AUS    2439  1514.2530   6.019909
AVL     275   583.5818   8.003831
ggplot(data = delay, mapping = aes(x = dist, y = delay)) + # graph of delay against distance
  geom_point(aes(size = count), alpha = 1/3) + # "geom_point" is a point geometric object.
  geom_smooth(se = FALSE) # "geom_smooth" is a geometric object that is displayed as a smooth conditional mean line.
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
The graph above shows that flights covering more than approximately 1,000 miles (the distance column is recorded in miles) tend to have less delay than shorter flights. Delay also appears to increase as the distance approaches 600 miles, and to decrease as the distance climbs toward 2,000 miles.
Although this was a relatively simple visualization task, each step above required naming an intermediate object, which can get very confusing as the complexity of the task increases. To avoid this, and to vastly improve readability, there is a technique called “piping,” which uses the pipe operator “%>%.” Pronounced “then,” the pipe operator simplifies multistep procedures, as shown below.
delays2 <- flights %>% # "then" group by dest
  group_by(dest) %>% # "then" summarize the data by adding statistical summaries ("count," "dist," and "delay") as columns
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% # "then" filter the data frame, or choose the rows with count > 20 and dest != "HNL"
  filter(count > 20, dest != "HNL")
head(delays2)
dest  count       dist      delay
ABQ     254  1826.0000   4.381890
ACK     265   199.0000   4.852273
ALB     439   143.0000  14.397129
ATL   17215   757.1082  11.300113
AUS    2439  1514.2530   6.019909
AVL     275   583.5818   8.003831
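The pipe only changes how a call is written: x %>% f(y) is equivalent to f(x, y). The sketch below runs the same group, summarize, and filter steps twice on a made-up tibble, once as nested calls and once piped (`toy`, `nested`, and `piped` are illustrative names):

```r
library(dplyr)

# hypothetical toy data
toy <- tibble(
  dest  = c("ATL", "ATL", "BOS", "BOS", "BOS"),
  delay = c(10, 20, 5, NA, 7)
)

# nested calls: read from the inside out
nested <- filter(
  summarize(
    group_by(toy, dest),
    count = n(),
    delay = mean(delay, na.rm = TRUE)
  ),
  count > 2
)

# piped calls: read from top to bottom
piped <- toy %>%
  group_by(dest) %>%
  summarize(count = n(), delay = mean(delay, na.rm = TRUE)) %>%
  filter(count > 2)

print(piped)
```

Both versions return the single row for BOS (3 flights, mean delay 6); the piped form simply lists the steps in the order they happen.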
When summarize() computes a statistic such as mean() without the na.rm = TRUE argument, as in the example below, any group that contains a missing value yields “NA.” For example, each of the first six days of January 2013 has missing departure delay values (dep_delay), which is why the mean column is “NA” for those first six days.
contains_missing <- flights %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay)) # no "na.rm = TRUE" argument
head(contains_missing)
year  month  day  mean
2013      1    1    NA
2013      1    2    NA
2013      1    3    NA
2013      1    4    NA
2013      1    5    NA
2013      1    6    NA
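The NA results come from how mean() itself treats missing values, which can be checked on a plain vector without any grouping (the vector `delays` is made up for illustration):

```r
# hypothetical delays with one missing value
delays <- c(4, NA, 11)

na_mean    <- mean(delays)               # a single NA makes the whole mean NA
clean_mean <- mean(delays, na.rm = TRUE) # NA is dropped before averaging

print(na_mean)
print(clean_mean)
```

The first call returns NA and the second returns 7.5, the mean of 4 and 11.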
# create new data frame with missing values removed
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled2 <- not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay)) # no "na.rm = TRUE" argument
head(not_cancelled2)
# average arrival delay time per day
two_arr_delays <- not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    avg_delay1 = mean(arr_delay), # mean arrival delay time
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # mean of the arrival delays greater than 0
  )
head(two_arr_delays)
year  month  day  avg_delay1  avg_delay2
2013      1    1   12.651023    32.48156
2013      1    2   12.692888    32.02991
2013      1    3    5.733333    27.66087
2013      1    4   -1.932819    28.30976
2013      1    5   -1.525802    22.55882
2013      1    6    4.236429    24.37270
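The expression arr_delay[arr_delay > 0] is ordinary logical subsetting: the comparison yields TRUE or FALSE for each element, and only the TRUE positions are kept before the mean is taken. A short sketch on a made-up vector:

```r
# hypothetical arrival delays; negative values are early arrivals
arr_delay <- c(-5, 10, 30, -20)

all_mean      <- mean(arr_delay)                # every flight counts
positive_mean <- mean(arr_delay[arr_delay > 0]) # only the flights that arrived late

print(all_mean)
print(positive_mean)
```

Here the overall mean is 3.75, while the mean of the positive delays (10 and 30) is 20, which mirrors why avg_delay2 is larger than avg_delay1 in the table above.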
sd(): standard deviation
# destinations with the highest standard deviation in distance
sd_distance <- not_cancelled %>%
  group_by(dest) %>%
  summarize(distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))
head(sd_distance)
dest  distance_sd
EGE     10.542765
SAN     10.350094
SFO     10.216017
HNL     10.004197
SEA      9.977993
LAS      9.907786
min(), max()
# first and last flights on each day
min_max_dep_time <- not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(first = min(dep_time), last = max(dep_time))
head(min_max_dep_time)
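Since no output is shown for this last example, here is a minimal sketch of the same pattern on a made-up tibble, so the shape of the result is visible (`toy` and its departure times are illustrative, not taken from flights):

```r
library(dplyr)

# hypothetical departure times (HHMM) on two days
toy <- tibble(
  day      = c(1, 1, 1, 2, 2),
  dep_time = c(517, 2359, 1200, 600, 2130)
)

daily <- toy %>%
  group_by(day) %>%
  summarize(
    first  = min(dep_time), # earliest departure of the day
    last   = max(dep_time), # latest departure of the day
    spread = sd(dep_time)   # standard deviation of the departure times
  )

print(daily)
```

Each day collapses to one row: day 1 runs from 517 to 2359, and day 2 from 600 to 2130.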