EDA with R

9 minute read

In this post, I will go over the what is perhaps the most important step in using R to analyze data–tidying. In fact, provided that many R functions work best on tidy data, I felt it necessary to learn how to correctly organize and format data. I hope that this post would be a useful reference guide for future data analysis projects using R. In addition, this post will briefly go over some exploratory data analysis examples in R for Data Science by Hadley Wickam and Garrett Grolemund.

library(tidyverse)
table1 # sample tibble
countryyearcasespopulation
Afghanistan1999 745 19987071
Afghanistan2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

The picture below, which was taken from this paper and is available for redistribution under a Creative Commons license, shows a direct comparison between a tidy and an untidy dataset.

Untidy Data (top) vs Tidy Data (bottom)

For the tidy dataset, the attributes of three observations–A, B, and C–are aligned horizontally. The columns are variables, and the rows are the observations. In general, packages in R such as dpylr work best with tidy data, which makes the process of tidying messy data crucial in any efficient analysis projects. The following are some of the many things that one could do with tidy datasets.

table1 %>%
    mutate(rate = cases / population * 1000) # add a new column called rate, with rate = cases / pop * 1000
countryyearcasespopulationrate
Afghanistan1999 745 19987071 0.0372741
Afghanistan2000 2666 20595360 0.1294466
Brazil 1999 37737 172006362 0.2193930
Brazil 2000 80488 174504898 0.4612363
China 1999 212258 1272915272 0.1667495
China 2000 213766 1280428583 0.1669488
table1 %>%
    count(year, wt = cases) # count(variable1, wt = variable2) "groups by sum" 
                            # group by year, sum by cases
yearn
1999 250740
2000 296920
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
    geom_line(aes(group = country), color = 'grey50') +
    geom_point(aes(color = country))

The four components of tidying data are as follows:

  • Gathering
  • Spreading
  • Separating
  • Uniting

Move values into columns (gather)

To move column-name values into columns, use gather. The function “gathers” values that are column names and moves those values into columns. To illustrate, gather is used to change table4a into table1[1:2].

table4a # to be changed to table1[1:2]
country19992000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766
table1[1:2]
countryyear
Afghanistan1999
Afghanistan2000
Brazil 1999
Brazil 2000
China 1999
China 2000
table4a %>%
    gather('1999', '2000', key = "year", value = "cases") # '1999' and '2000' are to be moved to columns, "cases" is added as new column 
countryyearcases
Afghanistan1999 745
Brazil 1999 37737
China 1999 212258
Afghanistan2000 2666
Brazil 2000 80488
China 2000 213766

Move column data into separate columns (spread)

table2 # to be converted to table1
             # Observe the "type" column.
             # Values in the "type" column (cases, population) are moved to separate categories.  
countryyeartypecount
Afghanistan1999 cases 745
Afghanistan1999 population 19987071
Afghanistan2000 cases 2666
Afghanistan2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583
table1
countryyearcasespopulation
Afghanistan1999 745 19987071
Afghanistan2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583
spread(table2, key = type, value = count) # observe two new columns: case, population 
countryyearcasespopulation
Afghanistan1999 745 19987071
Afghanistan2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

Separating columns (separate)

table3 # separate the data in the third column into two separate columns 
countryyearrate
Afghanistan 1999 745/19987071
Afghanistan 2000 2666/20595360
Brazil 1999 37737/172006362
Brazil 2000 80488/174504898
China 1999 212258/1272915272
China 2000 213766/1280428583
table3 %>%
    separate(rate, into = c("cases", "population"), sep = "/")
    # separates the data in the "rate" column into two separate columns: cases, population
countryyearcasespopulation
Afghanistan1999 745 19987071
Afghanistan2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583
newTable <- table3 %>%
    separate(rate, into = c("cases", "population"), sep = "/") %>%
    # separates the data in the "rate" column into two separate columns: cases, population
    separate(year, into = c("century", "year"), sep = 2) 
    # separates the year column into two columns, with two digits each

newTable
countrycenturyyearcasespopulation
Afghanistan19 99 745 19987071
Afghanistan20 00 2666 20595360
Brazil 19 99 37737 172006362
Brazil 20 00 80488 174504898
China 19 99 212258 1272915272
China 20 00 213766 1280428583

Combining columns (unite)

Uniting combines two columns into one, which makes it the opposite of separating.

newTable %>%
    unite(new, century, year) # two columns (century, year) combined with _ as the connector. 
countrynewcasespopulation
Afghanistan19_99 745 19987071
Afghanistan20_00 2666 20595360
Brazil 19_99 37737 172006362
Brazil 20_00 80488 174504898
China 19_99 212258 1272915272
China 20_00 213766 1280428583
newTable %>%
    unite(new, century, year, sep = "") # two columns (century, year) combined without any connector. 
countrynewcasespopulation
Afghanistan1999 745 19987071
Afghanistan2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

Exploratory Data Analysis

EDA, although not strictly defined, consists of three parts:

  • Visualization
  • Transformation
  • Modeling

In this post, a toy dataset called diamonds supplied by tidyverse will be used for EDA. The diamonds dataset contains information on approximately 54,000 diamonds, such as the quality of the cut and the price.

To use the dataset, first load tidyverse and diamonds.

library(tidyverse)
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.1.1       √ purrr   0.3.2  
√ tibble  2.1.1       √ dplyr   0.8.0.1
√ tidyr   0.8.3       √ stringr 1.4.0  
√ readr   1.3.1       √ forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Since the cut variable is categorical, use a bar chart (geom_bar) to visualize the diamonds’ distribution.

  • Variables:

    • Categorical:
      • cut
      • color
      • clarity
    • Continuous:
      • price
      • carat
      • length, width, depth (x,y,z)
ggplot(data = diamonds) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Distribution based on Cut") + 
    geom_bar(mapping = aes(x = cut, fill = clarity)) # use `fill` to see cross-section for categorical variable
diamonds %>%
    count(cut) # count() shows the number of diamonds in each category
cutn
Fair 1610
Good 4906
Very Good12082
Premium 13791
Ideal 21551

Meanwhile, since carat is a continuous variable, use a histogram (geom_histogram) to view the distribution.

ggplot(data = diamonds) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Distribution based on Carat") + 
    geom_histogram(mapping = aes(x = carat, fill = color), binwidth = 0.5)
diamonds %>%
    count(cut_width(carat, 0.5))
cut_width(carat, 0.5)n
[-0.25,0.25] 785
(0.25,0.75] 29498
(0.75,1.25] 15977
(1.25,1.75] 5313
(1.75,2.25] 2002
(2.25,2.75] 322
(2.75,3.25] 32
(3.25,3.75] 5
(3.75,4.25] 4
(4.25,4.75] 1
(4.75,5.25] 1

To view a smaller subset of the data satisfying a certain condition or several conditions, use filter().

smaller <- diamonds %>%
    filter(carat < 2 && color == "E")

ggplot(data = smaller, mapping = aes(x = carat, color = clarity)) +
    geom_histogram(binwidth = 0.1)

To view multiple distributions over a single variable, one must put multiple histograms over one another, which would obviously be hard to visualize. This is where geom_freqpoly comes in–by replacing histograms with lines, visualization is made much easier.

library(tidyverse)

ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Distribution of Cut per Carat") + 
    geom_freqpoly(binwidth = 0.1)

Replacing Values

When preprocessing data, there will almost always be strange data–data whose values do not make any sense. When convinced enough that those pieces of data are invalid, two options remain: one, simply drop the rows containing the strange data, or two, replace those values with NA (not available). The book recommends that the second option, for the reason that in some cases, there may not be enough data to simply delete those rows.

tibbleW <- tribble(
    ~w, ~x, ~y, ~z, 
    "a", 2, 1, "q",
    "b", 3, 100, "w",
    "c", 4, 1, "x",
    "d", 5, 110, "y",
) 

head(tibbleW)
wxyz
a 2 1q
b 3 100w
c 4 1x
d 5 110y
tibbleW1 <- tibbleW %>%
    mutate(y = ifelse(y < 10, NA, y)) # look to the `y` column, replace any value in the y column less than 10 with "NA"

head(tibbleW1)
wxyz
a 2 NAq
b 3 100w
c 4 NAx
d 5 110y

Covariation

To visualize the relationship between two variables–one categorical and the other continuous–use geom_freqpoly()to graph the distribution of observations based on the continuous variable, for each categorical variable.

The following example shows the diamonds’ distribution of price (continuous), categorized by the cut quality (categorical).

 ggplot(data = diamonds) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Price Distribution per Cut") + 
    geom_freqpoly(mapping = aes(x = price, color = cut), binwidth = 500)

In determining the relationship between price and quality, however, one must observe the fact that there are a different number of diamonds per cut category.

ggplot(diamonds) +
    geom_bar(mapping = aes(x = cut))

In the above geom_freqpoly graph, the y axis is set to be “count”, while the x axis is “price”. The most visibly striking aspect of the graph is the tall yellow peak at about 1,200 dillars for price. This yellow peak is taller than the other distributions’ peaks at about the same price–far taller, in fact. This gives the illusion that the percentage of diamonds that occupy that 1,200 peak in price is also far greater for the ideal category than it is for the good category.

This assumption is invalid though, and for two reasons. First, the y axis is “count,” not “percentage” or “density,” and second, there are a lot more diamonds in the ideal category than the good category. In short, one cannot compare the difference in distribution by comparing the difference in count.

Another problem with the graph is that one cannot easily compare the mean diamond price of each cut quality. The graph may deceive people into believing that the 1,200 dollar yellow peak implies a mean price closer to 1,200 dollars for ideal-cut diamonds than other mean prices. However, this is not necessarily true, given the difference in the area under each graph corresponding to the diamond count for each cut quality.

ggplot(
    data = diamonds,
        mapping = aes(x = price, y = ..density..) # notice the argument "y = ..density"
    ) +

    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Price Density Distribution per Cut") +

    geom_freqpoly(mapping = aes(color = cut), binwidth = 500)

The graph above, in which the area under each line is the same, suggests that fair-cut diamonds have an average price closer to 2,500 dollars, rather than 1,200 dollars. In fact, it appears as though fair-cut diamonds have the highest mean price among the diamonds of each cut quality.

s1 <- subset(diamonds, cut=='Fair')
# head(s1[7])
s1[7] %>%
   summarize(mean = mean(price)) # average price of fairly cut diamonds = $4358

idealD1 <- diamonds %>%
    filter(cut == "Good") %>%
    summarize(mean = mean(price))
idealD1 # average price of good cut diamonds = $3457 

idealD3 <- diamonds %>%
    filter(cut == "Premium") %>%
    summarize(mean = mean(price))
idealD3 # average price of premium cut diamonds = $3457 

idealD4 <- diamonds %>%
    filter(cut == "Ideal") %>%
    summarize(mean = mean(price))
idealD4 # average price of ideally cut diamonds = $3457 

mean
4358.758
mean
3928.864
mean
4584.258
mean
3457.542
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
    geom_boxplot()

To order the boxplots in increasing median values, add a x = reorder() argument to mapping = aes().

# Ordered by increasing median values 

ggplot(data = diamonds) +
    geom_boxplot(mapping = aes(
                    x = reorder(cut, price, FUN = median),  
                    y = price))

To get a more visually striking picture of the distribution of diamonds in terms of cut and color, use a heatmap-like geom object called “geom_tile.”

diamonds %>%
    count(color, cut) %>%
    ggplot(mapping = aes(x = color, y = cut)) + 
        geom_tile(mapping = aes(fill = n)) 

From the geom_tile diagram above, it appears as though ideal-cut diamonds with color G are the majority, followed by the diamonds in the adjacent cut categories, and finally the fair-cut diamonds.

Tags:

Categories:

Updated: