First Attempt at R
In this post, I will go over what I learned during my first attempt at R.
Motivation
R is one of the two main tools that data scientists and statisticians use in their work. Given its prominence in the industry, I had come across the language countless times in books and videos, but only over the past few days have I finally had the chance to study it in detail. As usual, this post documents what I have studied and doubles as a reference guide for my future self. (As a side note, this post was originally written in R Markdown, then converted to .ipynb and .md files using Jupyter.) With that said, let's get to some R.
RStudio vs Jupyter
Like many newcomers to R, I started with RStudio, studying basic grammar with the widely known books R for Data Science by Hadley Wickham and Garrett Grolemund and Hands-On Programming with R by Garrett Grolemund, as well as Data Analysis with R by Hoon Park. While taking these first steps, I began drafting this post in RStudio's editor, combining R code blocks and commentary in an R Markdown file. Then, using a clever trick from this guide, I converted (or "knitted") the R Markdown file to a .md file and test-uploaded it to my GitHub repository. This worked by setting output to md_document with variant: markdown_github in the .rmd file's YAML front matter, a format I was not familiar with.
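The same conversion can also be requested from the R console instead of the front matter. The following is only a sketch: it assumes the rmarkdown package and pandoc are installed, and the function name knit_to_github_md and the file name in the usage comment are hypothetical.

```r
# Render an R Markdown file to GitHub-flavoured Markdown.
# Assumes the rmarkdown package (and pandoc) are installed.
knit_to_github_md <- function(rmd_path) {
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("Please install the 'rmarkdown' package first.")
  }
  rmarkdown::render(
    rmd_path,
    output_format = rmarkdown::md_document(variant = "markdown_github")
  )
}

# Usage (hypothetical file name):
# knit_to_github_md("first-attempt-at-r.Rmd")
```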
Although RStudio worked fine, after some trial and error I decided to finish the rest of this post in Jupyter, both because I found RStudio somewhat slow at downloading packages and because I prefer Jupyter's simple, elegant interface. To give RStudio some credit, though, I plan to come back to it later if I need to knit .rmd files to HTML or Word, publish R files on the web using RPubs, or create applications using Shiny.
Some Notes on R Grammar
Lists: Appending Lists
listFruit <- c('Apples', 'Apples', 'Bananas', 'Bananas', 'Pineapples', 'Pineapples', 'Oranges', 'Cucumbers')
print(unique(listFruit))
[1] "Apples" "Bananas" "Pineapples" "Oranges" "Cucumbers"
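Because c() builds a vector, new elements can be appended with c() itself or with append() before duplicates are removed. A minimal sketch, using a made-up fruits vector rather than the one above:

```r
fruits <- c("Apples", "Bananas")

# c() concatenates, so it also appends to the end
fruits <- c(fruits, "Oranges")

# append() can insert at a position; after = 1 places the value
# right after the first element
fruits <- append(fruits, "Cucumbers", after = 1)

print(fruits)
# [1] "Apples"    "Cucumbers" "Bananas"   "Oranges"

# Appending a duplicate and then de-duplicating leaves four elements
fruits_unique <- unique(c(fruits, "Apples"))
print(length(fruits_unique))  # 4
```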
Lists: Reversing Logical Elements
logical_var <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
logical_var # prints: FALSE, TRUE, FALSE, TRUE, TRUE
!logical_var # prints: T, F, T, F, F
[1] FALSE TRUE FALSE TRUE TRUE
[1] TRUE FALSE TRUE FALSE FALSE
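Negation becomes more useful once the logical vector is used as an index or counted. A small sketch; the values vector is hypothetical data added for illustration:

```r
logical_var <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
values <- c(10, 20, 30, 40, 50)  # hypothetical data

values[logical_var]    # elements where the flag is TRUE: 20 40 50
values[!logical_var]   # elements where the flag is FALSE: 10 30
sum(logical_var)       # TRUE counts as 1, so this is 3
sum(!logical_var)      # 2
```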
Data Frame: Create & Access
Create a Data Frame
id <- c('43', '44', '45', '46')
name <- c('Bush', 'Obama', 'Trump', 'Biden')
age <- c(75,59,75,78)
mStatus <- c(T, T, T, T)
df <- data.frame(id,name,age,mStatus)
df # displays dataframe chart
str(df) # displays type / values
id | name | age | mStatus |
---|---|---|---|
43 | Bush | 75 | TRUE |
44 | Obama | 59 | TRUE |
45 | Trump | 75 | TRUE |
46 | Biden | 78 | TRUE |
'data.frame': 4 obs. of 4 variables:
$ id : Factor w/ 4 levels "43","44","45",..: 1 2 3 4
$ name : Factor w/ 4 levels "Biden","Bush",..: 2 3 4 1
$ age : num 75 59 75 78
$ mStatus: logi TRUE TRUE TRUE TRUE
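The str() output above shows id and name as factors, which is what data.frame() did by default before R 4.0.0. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the same code may print chr instead of Factor. Setting the argument explicitly, as in this sketch, removes the dependence on the R version:

```r
name <- c("Bush", "Obama", "Trump", "Biden")
age <- c(75, 59, 75, 78)

# Keep strings as plain character vectors (the default since R 4.0.0)
df_chr <- data.frame(name, age, stringsAsFactors = FALSE)
class(df_chr$name)   # "character"

# Convert strings to factors (the default before R 4.0.0)
df_fct <- data.frame(name, age, stringsAsFactors = TRUE)
class(df_fct$name)   # "factor"
levels(df_fct$name)  # factor levels are sorted alphabetically
```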
Access Data in Rows, Columns
df[2,3] # 2nd row, 3rd column, value = 59
df[c(2,3), c(2,4)] # print (row, column) values = (2,2), (2,4), (3,2), (3,4)
59
 | name | mStatus |
---|---|---|
2 | Obama | TRUE |
3 | Trump | TRUE |
df$name # access column data using $
df$name[3:4] # column 'name', third, fourth rows
[1] Bush Obama Trump Biden
[1] Trump Biden
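Besides numeric positions and $, columns can be selected by name and rows by a logical condition. A short sketch that rebuilds a trimmed version of the data frame (with strings kept as characters, which is an assumption made here for readable output):

```r
df <- data.frame(
  id = c("43", "44", "45", "46"),
  name = c("Bush", "Obama", "Trump", "Biden"),
  age = c(75, 59, 75, 78),
  stringsAsFactors = FALSE
)

df[["name"]]               # same column as df$name
df[2, "age"]               # 59: row 2, column selected by name
df[df$age > 70, "name"]    # "Bush" "Trump" "Biden"
df[df$name == "Obama", ]   # whole row(s) matching a condition
```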
Structure Display
str(df) # display the structure: number of observations and variables, plus each column's type
'data.frame': 4 obs. of 4 variables:
$ id : Factor w/ 4 levels "43","44","45",..: 1 2 3 4
$ name : Factor w/ 4 levels "Biden","Bush",..: 2 3 4 1
$ age : num 75 59 75 78
$ mStatus: logi TRUE TRUE TRUE TRUE
id <- c('39', '40', '41', '42')
name <- c('Carter', 'Reagan', 'Bush', 'Clinton')
age <- c(96,93,94,74)
mStatus <- c(T, T, T, T)
df1 <- data.frame(id,name,age,mStatus)
df1
id | name | age | mStatus |
---|---|---|---|
39 | Carter | 96 | TRUE |
40 | Reagan | 93 | TRUE |
41 | Bush | 94 | TRUE |
42 | Clinton | 74 | TRUE |
Combine Data Frames
bothdf <- rbind(df1, df)
bothdf
id | name | age | mStatus |
---|---|---|---|
39 | Carter | 96 | TRUE |
40 | Reagan | 93 | TRUE |
41 | Bush | 94 | TRUE |
42 | Clinton | 74 | TRUE |
43 | Bush | 75 | TRUE |
44 | Obama | 59 | TRUE |
45 | Trump | 75 | TRUE |
46 | Biden | 78 | TRUE |
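rbind() stacks rows only when both data frames have the same column names. The sketch below uses small made-up data frames (a, b, and c_bad are hypothetical) to show one successful call and one that fails:

```r
a <- data.frame(id = "41", name = "Bush", stringsAsFactors = FALSE)
b <- data.frame(id = "42", name = "Clinton", stringsAsFactors = FALSE)

# Same column names: rows are stacked in order
combined <- rbind(a, b)
nrow(combined)  # 2

# Different column names: rbind() raises an error, caught here
c_bad <- data.frame(id = "43", surname = "Bush", stringsAsFactors = FALSE)
result <- tryCatch(rbind(a, c_bad), error = function(e) "rbind failed")
result  # "rbind failed"
```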
Retrieve: head, tail, min, max, median, quantile
head(bothdf,3)
tail(bothdf, 3)
min(bothdf$age) # min
max(bothdf$age) # max
median(bothdf$age) # median
quantile(bothdf$age) # quartiles (0%, 25%, 50%, 75%, 100%)
df3 <- bothdf # work on a copy from here on
id | name | age | mStatus |
---|---|---|---|
39 | Carter | 96 | TRUE |
40 | Reagan | 93 | TRUE |
41 | Bush | 94 | TRUE |
 | id | name | age | mStatus |
---|---|---|---|---|
6 | 44 | Obama | 59 | TRUE |
7 | 45 | Trump | 75 | TRUE |
8 | 46 | Biden | 78 | TRUE |
59
96
76.5
0% | 25% | 50% | 75% | 100% |
---|---|---|---|---|
59 | 74.75 | 76.5 | 93.25 | 96 |
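quantile() also accepts a probs argument for percentiles other than the quartiles, and summary() and range() give related overviews. A brief sketch using the same ages as a plain vector:

```r
ages <- c(96, 93, 94, 74, 75, 59, 75, 78)

quantile(ages, probs = 0.5)           # the median, 76.5
quantile(ages, probs = c(0.1, 0.9))   # arbitrary percentiles
summary(ages)                         # min, quartiles, mean, and max at once
range(ages)                           # 59 96
```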
Retrieve: Sections of the Data Frame
subset(df3, age > 80) # only those with age above 80.
id | name | age | mStatus |
---|---|---|---|
39 | Carter | 96 | TRUE |
40 | Reagan | 93 | TRUE |
41 | Bush | 94 | TRUE |
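subset() can combine conditions and pick columns through its select argument, and the same result can be written with bracket indexing. A sketch on a reduced, made-up version of the data frame:

```r
df3 <- data.frame(
  name = c("Carter", "Reagan", "Bush", "Clinton"),
  age = c(96, 93, 94, 74),
  stringsAsFactors = FALSE
)

# Rows with age strictly between 80 and 95, keeping two columns
by_subset <- subset(df3, age > 80 & age < 95, select = c(name, age))

# Equivalent bracket form: logical row filter, columns by name
by_bracket <- df3[df3$age > 80 & df3$age < 95, c("name", "age")]

all.equal(by_subset, by_bracket)  # TRUE
by_subset$name                    # "Reagan" "Bush"
```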
Add: New Column
Nationality <- c("American1", "American2", "American3", "American4")
df3$new_column <- Nationality # adds a column named new_column; the 4 values are recycled across the 8 rows
df3
id | name | age | mStatus | new_column |
---|---|---|---|---|
39 | Carter | 96 | TRUE | American1 |
40 | Reagan | 93 | TRUE | American2 |
41 | Bush | 94 | TRUE | American3 |
42 | Clinton | 74 | TRUE | American4 |
43 | Bush | 75 | TRUE | American1 |
44 | Obama | 59 | TRUE | American2 |
45 | Trump | 75 | TRUE | American3 |
46 | Biden | 78 | TRUE | American4 |
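The four values fill eight rows because R recycles a shorter vector when its length divides the number of rows evenly. The sketch below, on a hypothetical eight-row data frame, shows the implicit recycling, the explicit rep() form, and a length that is rejected:

```r
df <- data.frame(id = 1:8)

# Length 4 divides 8 evenly, so the values are repeated twice
df$group <- c("a", "b", "c", "d")

# rep() makes the repetition explicit
df$group2 <- rep(c("a", "b", "c", "d"), length.out = nrow(df))
identical(df$group, df$group2)  # TRUE

# Length 3 does not divide 8, so the assignment raises an error
outcome <- tryCatch({
  df$bad <- c("x", "y", "z")
  "ok"
}, error = function(e) "error")
outcome  # "error"
```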
Delete: Columns
new_df3 <- df3[ , -c(3,4)] # creates a copy of the df3 data frame, deletes columns "age", "mStatus"
new_df3
id | name | new_column |
---|---|---|
39 | Carter | American1 |
40 | Reagan | American2 |
41 | Bush | American3 |
42 | Clinton | American4 |
43 | Bush | American1 |
44 | Obama | American2 |
45 | Trump | American3 |
46 | Biden | American4 |
df3[ , c(5)] <- list(NULL) # delete the fifth column, new_column, from the original data frame, df3
head(df3)
id | name | age | mStatus |
---|---|---|---|
39 | Carter | 96 | TRUE |
40 | Reagan | 93 | TRUE |
41 | Bush | 94 | TRUE |
42 | Clinton | 74 | TRUE |
43 | Bush | 75 | TRUE |
44 | Obama | 59 | TRUE |
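Deleting by position breaks if the column order changes, so deleting by name can be safer. A sketch on a two-row, made-up data frame:

```r
df <- data.frame(id = c("39", "40"), name = c("Carter", "Reagan"),
                 age = c(96, 93), mStatus = c(TRUE, TRUE),
                 stringsAsFactors = FALSE)

# Keep every column except the named ones
trimmed <- df[, !(names(df) %in% c("age", "mStatus"))]
names(trimmed)  # "id" "name"

# Remove a single column in place
df$mStatus <- NULL
names(df)       # "id" "name" "age"
```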
Change: Column Names
colnames(df3)
[1] "id" "name" "age" "mStatus"
colnames(df3) <- c("C1", "C2", "C3", "C4")
head(df3)
colnames(df3) <- c("ID", "Name", "Age", "Marital Status")
head(df3)
C1 | C2 | C3 | C4 |
---|---|---|---|
39 | Carter | 96 | TRUE |
40 | Reagan | 93 | TRUE |
41 | Bush | 94 | TRUE |
42 | Clinton | 74 | TRUE |
43 | Bush | 75 | TRUE |
44 | Obama | 59 | TRUE |
ID | Name | Age | Marital Status |
---|---|---|---|
39 | Carter | 96 | TRUE |
40 | Reagan | 93 | TRUE |
41 | Bush | 94 | TRUE |
42 | Clinton | 74 | TRUE |
43 | Bush | 75 | TRUE |
44 | Obama | 59 | TRUE |
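To rename a single column without retyping every name, the names can be indexed by a condition. Note that a name containing a space, such as "Marital Status", then needs backticks or double brackets to be accessed. A sketch on a small, made-up data frame:

```r
df <- data.frame(id = c("39", "40"), age = c(96, 93), mStatus = c(TRUE, TRUE))

# Rename only the column currently called "mStatus"
names(df)[names(df) == "mStatus"] <- "Marital Status"
names(df)  # "id" "age" "Marital Status"

# A name containing a space needs backticks or [[ ]]
df$`Marital Status`
df[["Marital Status"]]
```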
The rest of this post, which deals with visualization in R, has been moved to the next post because of the length of this document.