Visualization with R

5 minute read

In this post, I will continue on from the previous post on learning R, but will focus primarily on data visualization. As stated before, the contents of this blog post heavily rely on Hadley Wickham and Garrett Grolemund’s book, R for Data Science.

Basic Graphs

Boxplot

boxN <- c(10, 25, 30, 45, 55, 68, 80, 90, 100)

quantile(boxN) # shows min, 1st, 2nd, 3rd Quartiles, max
boxplot(boxN)

seq1 <- c(25, 77, 52, 89, 90, 55, 34, 9, 72, 86, 56, 24, 59, 23, 14, 21, 12, 13, 73)
seq2 <- c(45,  64,  20,  43,  44, 85,  51,  62,  98,  74,  88,  96,  94,  36,  65,  97,  82,  50,  30,  99, 37)

boxplot(seq1, seq2, names=c("Seq A", "Seq B"))

Histogram

hist(seq2, main="List of Random Numbers, 1 to 100", xlab="Intervals", ylab="Frequency")

message("Variance: ", var(seq2))
 
message("Standard Deviation: ", sd(seq2))
Variance: 649.990476190476
Standard Deviation25.4949107900082

Pie Chart

Grade <- c('A', 'B', 'C', 'D', 'B', 'A', 'D', 'F', 'A', 'B', 'C', 'D', 'B', 'A', 'D', 'F', 'A','B', 'C', 'D', 'B', 'A', 'D', 'F', 'A')
length(Grade)

tableG <- table(Grade)
pie(x=table(Grade))

25

pie(x=table(Grade), col=c("cyan", "lightcyan", "blue", "skyblue", "cyan"), main="Pie Chart of Grades")
pie(x=table(Grade), col=heat.colors(5))

Barplot

barplot(table(Grade)
       , names.arg=c("A", "B", "C", "D", "F")
       , main = "Distribution of Grades"
       , xlab = "Grade"
       , ylab = "Number of People"
       , col = heat.colors(5))

R4DS

Installing R Packages: tidyverse

To install the tidyverse package ecosystem, simply type: install.packages("tidyverse"). To reload the tidyverse ecosystem into the notebook, add the following codeblock: library(tidyverse).

Graphing with ggplot()

head(mpg,7) # mpg is the dataset containing info on thirty-eight cars, provided by U.S.E.P.A. 
manufacturermodeldisplyearcyltransdrvctyhwyflclass
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5)f 21 29 p compact
audi a4 2.0 2008 4 manual(m6)f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5)f 18 26 p compact
audi a4 3.1 2008 6 auto(av) f 18 27 p compact

The general form of the ggplot function is as follows: ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)). The geom_function specifies the “geom” of the plot, or the type of geometrical object the plot will use. Here are a few examples of the geom_functions:

Plot with…

  • Points: geom_point()
  • Lines: geom_smooth()
  • Bars: geom_bar()
  • Histograms: geom_histogram()
  • Dotplot: geom_dotplot()

Scatterplot

ggplot(data = mpg) + # ggplot(data = mpg) creates an empty graph 
  geom_point(mapping = aes(x = displ, y = hwy)) # function geom_point() takes the argument "mapping": displ is mapped to x-axis, hwy is mapped to y. 

Mapping a VARIABLE (class) to an AESTHETIC (color parameter) shows the cars’ classes.

Scatterplot With Different Colors

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = class)) 

# "displ" : Displacement of the cars' engines
# "hwy" : Fuel consumption efficiency

Scatterplot With Different Shapes

# mapping a VARIABLE (class) to another AESTHETIC (shape) shows the cars' classes. 
# mapping seven categorical variables (car types) to shape produces WARNING, 
# since having more than six variables can make discrimination between points difficult.   

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, shape = class)) 

Warning message:
"The shape palette can deal with a maximum of 6 discrete values because
more than 6 becomes difficult to discriminate; you have 7. Consider
specifying shapes manually if you must have them."Warning message:
"Removed 62 rows containing missing values (geom_point)."

Scatterplot With Uniform Color

# Manually adjust the color aesthetic by specifying the argument outside of aes()
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = "red") 

Subplots for Scatterplot

# splitting the plot above into seven subplots based on the cars' class. 
ggplot(data=mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~ class, nrow = 2)

Regression Lines

ggplot(data = mpg) + 
    geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
    # the gray area are the confidence bands around each nonlinear regression line. 
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Regression Lines + Scatterplot

library(tidyverse) # load the tidyverse collection of packages

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + # Use global mapping 
    geom_point(mapping = aes(color = class)) + # add another mapping argument using color = class, differentiates class by color 
    geom_smooth( # smooth line
        data = filter(mpg, class == "subcompact"), # "filter," or change/override the global mapping up top, 
                                                   # smooth regression line over "subcompact" class dots.
        se = TRUE                                  # "se" specifies condition for confidence bands. se = FALSE removes the confidence bands
    )
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Barplot

library(tidyverse)

ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut))
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.1       v purrr   0.3.2  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

The function ggplot(data = diamonds) loads the diamond dataset into ggplot(), which consists of statistical information of about 54,000 diamonds. The function geom_bar() specifies that the chart will be a bar chart, with the x-axis assigned to “cut”–one of the many statistical parameters of the diamond dataset.

ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = cut))

The following chart separates each bar into distinct sections, depending on the clarity–another category of the diamonds dataset.

ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity))

To more easily compare the proportions of diamonds within each bar, one needs to set the bars to the same height by assigning proportion to the y-axis. This is done by specifying the position parameter to fill, outside the mapping = aes() argument.

ggplot(data = diamonds) +
    geom_bar(
        mapping = aes(x = cut, fill = clarity),
        position = "fill"
    )

Notice that the bars are at the same height, and that the y-axis stands for proportion. If comparing proportions within a single bar is visually difficult, one can separate the bar into many bars with respect to the clarity.

ggplot(data = diamonds) + 
    geom_bar(
        mapping = aes(x=cut, fill=clarity),
        position = "dodge"
    )

Maps: United States, U.K, East Asia

United States

nz <- map_data("state")

ggplot(nz, aes(long, lat, group = group)) +
    geom_polygon(fill = "white", color = "black") +
    coord_quickmap()

United Kingdom

uk <- map_data("world", region = c("UK"))

ggplot(uk, aes(long, lat, group = group)) +
    geom_polygon(fill = "white", color = "black") +
    coord_quickmap()

East Asia

korea <- map_data("world", region = c("North Korea", "Japan", "South Korea"))

ggplot(korea, aes(long, lat, group = group)) +
    geom_polygon(fill = "white", color = "black") +
    coord_quickmap()

Coxcomb Chart

bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()
bar + coord_polar()

Tags:

Categories:

Updated: