First Attempt at R

5 minute read

In this post, I will go over what I learned during my first attempt at R.

Motivation

R is one of the two main tools that data scientists and statisticians use in their work. Given its prominence in the industry, I had come across the language countless times in books and videos, but only over the past few days have I finally had the chance to study it in detail. As usual, this post documents what I have studied and serves as a reference guide for my future self. (As a side note, this post was originally written in R Markdown and later converted to .ipynb and .md files using Jupyter.) With that being said, let’s get to some R.

RStudio vs Jupyter

As with any newcomer to R, I started with RStudio, studying basic grammar with the widely known books R for Data Science by Hadley Wickham and Garrett Grolemund and Hands-On Programming with R by Garrett Grolemund, as well as Data Analysis with R by Hoon Park. While taking my first steps with R, I began writing this post in the RStudio editor, combining R code blocks and commentary in an R Markdown file. Then, using a clever trick provided by this guide, I converted (or “knitted”) the R Markdown file to a .md file and test-uploaded it to my GitHub repository. This worked by setting output to md_document with variant: markdown_github in the .rmd file’s YAML front matter, a format I was not familiar with.

Although RStudio worked fine, after some trial and error I decided to use Jupyter to finish the rest of this post, partly because I found RStudio somewhat slow at downloading packages and partly because I prefer Jupyter’s simple, elegant interface. To give RStudio some credit, though, I plan to come back to it later if I need to knit .rmd files to HTML or Word, publish R documents on the web using RPubs, or create applications using Shiny.

Some Notes on R Grammar

Lists: Unique Elements

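listFruit <- c('Apples', 'Apples', 'Bananas', 'Bananas', 'Pineapples', 'Pineapples', 'Oranges', 'Cucumbers') # c() builds a character vector
print(unique(listFruit)) # unique() drops repeated values, keeping the first occurrence of each
[1] "Apples"     "Bananas"    "Pineapples" "Oranges"    "Cucumbers"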
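As a small extension, elements can also be appended to a vector before duplicates are removed. The following is a minimal sketch; the names `fruits` and `more_fruits` are made up for illustration and do not appear in the post.

```r
# Hypothetical vectors, used only for this illustration
fruits <- c("Apples", "Bananas")
more_fruits <- c("Bananas", "Oranges")

# c() concatenates vectors end to end
combined <- c(fruits, more_fruits)

# append() can also insert at a given position (here: after the 1st element)
inserted <- append(fruits, "Cucumbers", after = 1)

# unique() keeps the first occurrence of each value, in original order
deduped <- unique(combined)

# duplicated() flags the repeats that unique() would drop
repeats <- duplicated(combined)

print(combined)
print(inserted)
print(deduped)
print(repeats)
```

`c()` is usually enough for appending at the end; `append()` is mainly useful when the position matters.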

Lists: Reversing Logical Elements

logical_var <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
logical_var # prints: FALSE TRUE FALSE TRUE TRUE
!logical_var # ! negates each element; prints: TRUE FALSE TRUE FALSE FALSE

[1] FALSE  TRUE FALSE  TRUE  TRUE
[1]  TRUE FALSE  TRUE FALSE FALSE
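Negated logical vectors are most useful for filtering. Here is a minimal sketch, assuming a made-up vector `scores` and a pass mark of 60 (neither comes from the post).

```r
# Hypothetical example values
scores <- c(55, 90, 42, 78, 81)

# Comparison yields a logical vector, one element per score
passed <- scores >= 60

# Indexing with the negated vector keeps the elements where passed is FALSE
failed_scores <- scores[!passed]

# TRUE counts as 1 and FALSE as 0 when summed
n_passed <- sum(passed)

# all() and any() collapse a logical vector to a single value
all_passed <- all(passed)
any_failed <- any(!passed)

print(failed_scores)
print(n_passed)
```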

Data Frame: Create & Access

Create a Data Frame

id <- c('43', '44', '45', '46')
name <- c('Bush', 'Obama', 'Trump', 'Biden')
age <- c(75,59,75,78)
mStatus <- c(T, T, T, T)

df <- data.frame(id,name,age,mStatus)
df # displays dataframe chart
str(df) # displays type / values
  id  name age mStatus
1 43  Bush  75    TRUE
2 44 Obama  59    TRUE
3 45 Trump  75    TRUE
4 46 Biden  78    TRUE
'data.frame':	4 obs. of  4 variables:
 $ id     : Factor w/ 4 levels "43","44","45",..: 1 2 3 4
 $ name   : Factor w/ 4 levels "Biden","Bush",..: 2 3 4 1
 $ age    : num  75 59 75 78
 $ mStatus: logi  TRUE TRUE TRUE TRUE
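The `str()` output above shows `id` and `name` stored as factors, which was the default for text columns in R versions before 4.0.0 (from 4.0.0 on, the default is character). Below is a minimal sketch of setting `stringsAsFactors` explicitly so the result does not depend on the R version; the names `df_chr` and `name_factor` are my own.

```r
id <- c("43", "44", "45", "46")
name <- c("Bush", "Obama", "Trump", "Biden")
age <- c(75, 59, 75, 78)
mStatus <- c(TRUE, TRUE, TRUE, TRUE)

# stringsAsFactors = FALSE keeps text columns as character vectors
df_chr <- data.frame(id, name, age, mStatus, stringsAsFactors = FALSE)
str(df_chr)

# A factor can still be created for a single column when categories matter
df_chr$name_factor <- factor(df_chr$name)
levels(df_chr$name_factor)
```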

Access Data in Rows, Columns

df[2,3] # 2nd row, 3rd column, value = 59
df[c(2,3), c(2,4)] # rows 2 and 3, columns 2 and 4: cells (2,2), (2,4), (3,2), (3,4)

[1] 59

   name mStatus
2 Obama    TRUE
3 Trump    TRUE

df$name # access column data using $
df$name[3:4] # column 'name', third and fourth rows

[1] Bush  Obama Trump Biden
Levels: Biden Bush Obama Trump
[1] Trump Biden
Levels: Biden Bush Obama Trump
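Positions are easy to get wrong, so the same cells can also be reached by column name or by a condition. This is a minimal sketch on a reduced copy of the data frame (the `mStatus` column is left out for brevity); the result names are my own.

```r
df <- data.frame(
  id = c("43", "44", "45", "46"),
  name = c("Bush", "Obama", "Trump", "Biden"),
  age = c(75, 59, 75, 78),
  stringsAsFactors = FALSE
)

# Row by position, column by name: the same cell as df[2, 3]
second_age <- df[2, "age"]

# [[ ]] extracts a column as a vector, like df$name
names_vec <- df[["name"]]

# A logical condition on one column selects rows
older <- df[df$age > 70, "name"]

# drop = FALSE keeps the result as a one-column data frame
one_col_df <- df[, "name", drop = FALSE]

print(second_age)
print(older)
```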

Structure Display

str(df) # display the structure: number of observations and variables, and each column's type

'data.frame':	4 obs. of  4 variables:
 $ id     : Factor w/ 4 levels "43","44","45",..: 1 2 3 4
 $ name   : Factor w/ 4 levels "Biden","Bush",..: 2 3 4 1
 $ age    : num  75 59 75 78
 $ mStatus: logi  TRUE TRUE TRUE TRUE

# a second data frame with the same columns, used in the next section
id <- c('39', '40', '41', '42')
name <- c('Carter', 'Reagan', 'Bush', 'Clinton')
age <- c(96,93,94,74)
mStatus <- c(T, T, T, T)

df1 <- data.frame(id,name,age,mStatus)
df1

  id    name age mStatus
1 39  Carter  96    TRUE
2 40  Reagan  93    TRUE
3 41    Bush  94    TRUE
4 42 Clinton  74    TRUE

Combine Data Frames

bothdf <- rbind(df1, df)
bothdf

  id    name age mStatus
1 39  Carter  96    TRUE
2 40  Reagan  93    TRUE
3 41    Bush  94    TRUE
4 42 Clinton  74    TRUE
5 43    Bush  75    TRUE
6 44   Obama  59    TRUE
7 45   Trump  75    TRUE
8 46   Biden  78    TRUE
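`rbind()` on data frames matches columns by name rather than by position, and it fails when the names differ. The following is a minimal sketch with made-up data frames (`presidents_a`, `presidents_b`, `mismatch`) to illustrate both cases.

```r
presidents_a <- data.frame(id = c("41", "42"), age = c(94, 74),
                           stringsAsFactors = FALSE)

# Same columns, but listed in a different order
presidents_b <- data.frame(age = c(75, 59), id = c("43", "44"),
                           stringsAsFactors = FALSE)

# Columns are aligned by name, so the order difference does not matter
stacked <- rbind(presidents_a, presidents_b)
print(stacked)

# A data frame with a different column name cannot be stacked
mismatch <- data.frame(id = "45", years = 75, stringsAsFactors = FALSE)
result <- tryCatch(
  rbind(presidents_a, mismatch),
  error = function(e) "names do not match"
)
print(result)
```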

Retrieve: head, tail, min, max, median, quantile

head(bothdf, 3) # first three rows
tail(bothdf, 3) # last three rows

min(bothdf$age) # min
max(bothdf$age) # max
median(bothdf$age) # median
quantile(bothdf$age) # quartiles (0%, 25%, 50%, 75%, 100%)

df3 <- bothdf # working copy used in the following sections

  id   name age mStatus
1 39 Carter  96    TRUE
2 40 Reagan  93    TRUE
3 41   Bush  94    TRUE
  id  name age mStatus
6 44 Obama  59    TRUE
7 45 Trump  75    TRUE
8 46 Biden  78    TRUE

[1] 59
[1] 96
[1] 76.5
   0%   25%   50%   75%  100% 
59.00 74.75 76.50 93.25 96.00 
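A few related base R functions give the same kind of summary in one call or at other cut points. This is a minimal sketch using the eight ages from the combined data frame, copied into a plain vector `ages` (my own name) so the block runs on its own.

```r
# Ages from the combined data frame, in row order
ages <- c(96, 93, 94, 74, 75, 59, 75, 78)

# summary() reports min, quartiles, median, mean, and max together
print(summary(ages))

# quantile() accepts arbitrary probabilities via probs
deciles <- quantile(ages, probs = c(0.1, 0.9))
print(deciles)

# range() returns c(min, max); IQR() is the 75% minus the 25% quantile
age_range <- range(ages)
age_iqr <- IQR(ages)
age_mean <- mean(ages)
```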

Retrieve: Sections of the Data Frame

subset(df3, age > 80) # only those with age above 80

  id   name age mStatus
1 39 Carter  96    TRUE
2 40 Reagan  93    TRUE
3 41   Bush  94    TRUE
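The same filter can be written with logical indexing, and `subset()` can combine conditions and pick columns. Here is a minimal sketch on a small made-up data frame `people`; the 70 and 95 thresholds are arbitrary.

```r
people <- data.frame(
  name = c("Carter", "Reagan", "Bush", "Clinton", "Obama"),
  age = c(96, 93, 94, 74, 59),
  stringsAsFactors = FALSE
)

# Logical indexing: the same rows as subset(people, age > 80)
over_80 <- people[people$age > 80, ]

# Conditions combine with & (and) or | (or); select limits the columns
in_range <- subset(people, age > 70 & age < 95, select = name)

print(over_80)
print(in_range)
```

`subset()` reads more naturally at the console, while bracket indexing is the more common choice inside functions.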

Add: New Column

Nationality <- c("American1", "American2", "American3", "American4")

df3$new_column <- Nationality # adds a column named "new_column"; the four values are recycled to fill all eight rows
df3

  id    name age mStatus new_column
1 39  Carter  96    TRUE  American1
2 40  Reagan  93    TRUE  American2
3 41    Bush  94    TRUE  American3
4 42 Clinton  74    TRUE  American4
5 43    Bush  75    TRUE  American1
6 44   Obama  59    TRUE  American2
7 45   Trump  75    TRUE  American3
8 46   Biden  78    TRUE  American4
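The four values fill eight rows because R recycles a shorter vector when its length divides the number of rows evenly; otherwise the assignment fails. The sketch below uses a made-up data frame `people` to show recycling, a column computed from another column, and the failing case.

```r
people <- data.frame(age = c(96, 93, 94, 74, 75, 59, 75, 78))

# Length 4 divides 8 rows evenly, so the values are repeated twice
people$group <- c("A", "B", "C", "D")

# A new column can also be computed from an existing one
people$over_80 <- people$age > 80

# Length 3 does not divide 8, so this assignment raises an error
outcome <- tryCatch({
  people$bad <- c("x", "y", "z")
  "assigned"
}, error = function(e) "length mismatch")

print(people)
print(outcome)
```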

Delete: Columns

new_df3 <- df3[ , -c(3,4)] # creates a copy of df3 without columns 3 and 4 ("age", "mStatus")
new_df3

  id    name new_column
1 39  Carter  American1
2 40  Reagan  American2
3 41    Bush  American3
4 42 Clinton  American4
5 43    Bush  American1
6 44   Obama  American2
7 45   Trump  American3
8 46   Biden  American4

df3[ , c(5)] <- list(NULL) # deletes column 5 ("new_column") from the original data frame, df3
head(df3)

  id    name age mStatus
1 39  Carter  96    TRUE
2 40  Reagan  93    TRUE
3 41    Bush  94    TRUE
4 42 Clinton  74    TRUE
5 43    Bush  75    TRUE
6 44   Obama  59    TRUE
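Column positions shift as columns are added or removed, so deleting by name can be less error-prone. This is a minimal sketch on a made-up two-row data frame `people`.

```r
people <- data.frame(
  id = c("39", "40"),
  name = c("Carter", "Reagan"),
  age = c(96, 93),
  mStatus = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep every column whose name is NOT in the given set (returns a copy)
trimmed <- people[, !(names(people) %in% c("age", "mStatus"))]

# Assigning NULL to a column removes it from the data frame in place
people$mStatus <- NULL

print(names(trimmed))
print(names(people))
```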

Change: Column Names

colnames(df3)

[1] "id"      "name"    "age"     "mStatus"

colnames(df3) <- c("C1", "C2", "C3", "C4")
head(df3)

colnames(df3) <- c("ID", "Name", "Age", "Marital Status")
head(df3)

  C1      C2 C3   C4
1 39  Carter 96 TRUE
2 40  Reagan 93 TRUE
3 41    Bush 94 TRUE
4 42 Clinton 74 TRUE
5 43    Bush 75 TRUE
6 44   Obama 59 TRUE
  ID    Name Age Marital Status
1 39  Carter  96           TRUE
2 40  Reagan  93           TRUE
3 41    Bush  94           TRUE
4 42 Clinton  74           TRUE
5 43    Bush  75           TRUE
6 44   Obama  59           TRUE
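A single column can be renamed without retyping the whole vector of names, and a name containing a space (such as "Marital Status") needs backticks or `[[ ]]` to be accessed. Below is a minimal sketch on a made-up data frame `people`; the name "Person ID" is chosen only for illustration.

```r
people <- data.frame(
  id = c("39", "40"),
  name = c("Carter", "Reagan"),
  age = c(96, 93),
  stringsAsFactors = FALSE
)

# Rename one column by matching its current name
names(people)[names(people) == "age"] <- "Age"

# Rename by position; the new name contains a space
colnames(people)[1] <- "Person ID"

# Names with spaces need backticks with $, or a string with [[ ]]
ids <- people$`Person ID`
same_ids <- people[["Person ID"]]

print(names(people))
print(ids)
```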

The rest of this post, which covers visualization in R, has been moved to the next post because of the length of this document.
