EDA with R

9 minute read

In this post, I will go over the what is perhaps the most important step in using R to analyze data–tidying. In fact, provided that many R functions work best on tidy data, I felt it necessary to learn how to correctly organize and format data. I hope that this post would be a useful reference guide for future data analysis projects using R. In addition, this post will briefly go over some exploratory data analysis examples in R for Data Science by Hadley Wickam and Garrett Grolemund.

library(tidyverse)
table1 # sample tibble

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

The picture below, which was taken from this paper and is available for redistribution under a Creative Commons license, shows a direct comparison between a tidy and an untidy dataset.

For the tidy dataset, the attributes of three observations–A, B, and C–are aligned horizontally. The columns are variables, and the rows are the observations. In general, packages in R such as dpylr work best with tidy data, which makes the process of tidying messy data crucial in any efficient analysis projects. The following are some of the many things that one could do with tidy datasets.

table1 %>%
    mutate(rate = cases / population * 1000) # add a new column called rate, with rate = cases / pop * 1000

country	year	cases	population	rate
Afghanistan	1999	745	19987071	0.0372741
Afghanistan	2000	2666	20595360	0.1294466
Brazil	1999	37737	172006362	0.2193930
Brazil	2000	80488	174504898	0.4612363
China	1999	212258	1272915272	0.1667495
China	2000	213766	1280428583	0.1669488

table1 %>%
    count(year, wt = cases) # count(variable1, wt = variable2) "groups by sum" 
                            # group by year, sum by cases

year	n
1999	250740
2000	296920

library(ggplot2)
ggplot(table1, aes(year, cases)) + 
    geom_line(aes(group = country), color = 'grey50') +
    geom_point(aes(color = country))

The four components of tidying data are as follows:

Gathering
Spreading
Separating
Uniting

Move values into columns (`gather`)

To move column-name values into columns, use gather. The function “gathers” values that are column names and moves those values into columns. To illustrate, gather is used to change table4a into table1[1:2].

table4a # to be changed to table1[1:2]

country	1999	2000
Afghanistan	745	2666
Brazil	37737	80488
China	212258	213766

table1[1:2]

country	year
Afghanistan	1999
Afghanistan	2000
Brazil	1999
Brazil	2000
China	1999
China	2000

table4a %>%
    gather('1999', '2000', key = "year", value = "cases") # '1999' and '2000' are to be moved to columns, "cases" is added as new column 

country	year	cases
Afghanistan	1999	745
Brazil	1999	37737
China	1999	212258
Afghanistan	2000	2666
Brazil	2000	80488
China	2000	213766

Move column data into separate columns (`spread`)

table2 # to be converted to table1
             # Observe the "type" column.
             # Values in the "type" column (cases, population) are moved to separate categories.  

country	year	type	count
Afghanistan	1999	cases	745
Afghanistan	1999	population	19987071
Afghanistan	2000	cases	2666
Afghanistan	2000	population	20595360
Brazil	1999	cases	37737
Brazil	1999	population	172006362
Brazil	2000	cases	80488
Brazil	2000	population	174504898
China	1999	cases	212258
China	1999	population	1272915272
China	2000	cases	213766
China	2000	population	1280428583

table1

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

spread(table2, key = type, value = count) # observe two new columns: case, population 

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

Separating columns (`separate`)

table3 # separate the data in the third column into two separate columns 

country	year	rate
Afghanistan	1999	745/19987071
Afghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583

table3 %>%
    separate(rate, into = c("cases", "population"), sep = "/")
    # separates the data in the "rate" column into two separate columns: cases, population

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

newTable <- table3 %>%
    separate(rate, into = c("cases", "population"), sep = "/") %>%
    # separates the data in the "rate" column into two separate columns: cases, population
    separate(year, into = c("century", "year"), sep = 2) 
    # separates the year column into two columns, with two digits each

newTable

country	century	year	cases	population
Afghanistan	19	99	745	19987071
Afghanistan	20	00	2666	20595360
Brazil	19	99	37737	172006362
Brazil	20	00	80488	174504898
China	19	99	212258	1272915272
China	20	00	213766	1280428583

Combining columns (`unite`)

Uniting combines two columns into one, which makes it the opposite of separating.

newTable %>%
    unite(new, century, year) # two columns (century, year) combined with _ as the connector. 

country	new	cases	population
Afghanistan	19_99	745	19987071
Afghanistan	20_00	2666	20595360
Brazil	19_99	37737	172006362
Brazil	20_00	80488	174504898
China	19_99	212258	1272915272
China	20_00	213766	1280428583

newTable %>%
    unite(new, century, year, sep = "") # two columns (century, year) combined without any connector. 

country	new	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

Exploratory Data Analysis

EDA, although not strictly defined, consists of three parts:

Visualization
Transformation
Modeling

In this post, a toy dataset called diamonds supplied by tidyverse will be used for EDA. The diamonds dataset contains information on approximately 54,000 diamonds, such as the quality of the cut and the price.

To use the dataset, first load tidyverse and diamonds.

library(tidyverse)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.1.1       √ purrr   0.3.2  
√ tibble  2.1.1       √ dplyr   0.8.0.1
√ tidyr   0.8.3       √ stringr 1.4.0  
√ readr   1.3.1       √ forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Since the cut variable is categorical, use a bar chart (geom_bar) to visualize the diamonds’ distribution.

Variables:
- Categorical:
  - cut
  - color
  - clarity
- Continuous:
  - price
  - carat
  - length, width, depth (x,y,z)

ggplot(data = diamonds) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Distribution based on Cut") + 
    geom_bar(mapping = aes(x = cut, fill = clarity)) # use `fill` to see cross-section for categorical variable

diamonds %>%
    count(cut) # count() shows the number of diamonds in each category

cut	n
Fair	1610
Good	4906
Very Good	12082
Premium	13791
Ideal	21551

Meanwhile, since carat is a continuous variable, use a histogram (geom_histogram) to view the distribution.

ggplot(data = diamonds) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Distribution based on Carat") + 
    geom_histogram(mapping = aes(x = carat, fill = color), binwidth = 0.5)

diamonds %>%
    count(cut_width(carat, 0.5))

cut_width(carat, 0.5)	n
[-0.25,0.25]	785
(0.25,0.75]	29498
(0.75,1.25]	15977
(1.25,1.75]	5313
(1.75,2.25]	2002
(2.25,2.75]	322
(2.75,3.25]	32
(3.25,3.75]	5
(3.75,4.25]	4
(4.25,4.75]	1
(4.75,5.25]	1

To view a smaller subset of the data satisfying a certain condition or several conditions, use filter().

smaller <- diamonds %>%
    filter(carat < 2 && color == "E")

ggplot(data = smaller, mapping = aes(x = carat, color = clarity)) +
    geom_histogram(binwidth = 0.1)

To view multiple distributions over a single variable, one must put multiple histograms over one another, which would obviously be hard to visualize. This is where geom_freqpoly comes in–by replacing histograms with lines, visualization is made much easier.

library(tidyverse)

ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Distribution of Cut per Carat") + 
    geom_freqpoly(binwidth = 0.1)

Replacing Values

When preprocessing data, there will almost always be strange data–data whose values do not make any sense. When convinced enough that those pieces of data are invalid, two options remain: one, simply drop the rows containing the strange data, or two, replace those values with NA (not available). The book recommends that the second option, for the reason that in some cases, there may not be enough data to simply delete those rows.

tibbleW <- tribble(
    ~w, ~x, ~y, ~z, 
    "a", 2, 1, "q",
    "b", 3, 100, "w",
    "c", 4, 1, "x",
    "d", 5, 110, "y",
) 

head(tibbleW)

w	x	y	z
a	2	1	q
b	3	100	w
c	4	1	x
d	5	110	y

tibbleW1 <- tibbleW %>%
    mutate(y = ifelse(y < 10, NA, y)) # look to the `y` column, replace any value in the y column less than 10 with "NA"

head(tibbleW1)

w	x	y	z
a	2	NA	q
b	3	100	w
c	4	NA	x
d	5	110	y

Covariation

To visualize the relationship between two variables–one categorical and the other continuous–use geom_freqpoly()to graph the distribution of observations based on the continuous variable, for each categorical variable.

The following example shows the diamonds’ distribution of price (continuous), categorized by the cut quality (categorical).

 ggplot(data = diamonds) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Price Distribution per Cut") + 
    geom_freqpoly(mapping = aes(x = price, color = cut), binwidth = 500)

In determining the relationship between price and quality, however, one must observe the fact that there are a different number of diamonds per cut category.

ggplot(diamonds) +
    geom_bar(mapping = aes(x = cut))

In the above geom_freqpoly graph, the y axis is set to be “count”, while the x axis is “price”. The most visibly striking aspect of the graph is the tall yellow peak at about 1,200 dillars for price. This yellow peak is taller than the other distributions’ peaks at about the same price–far taller, in fact. This gives the illusion that the percentage of diamonds that occupy that 1,200 peak in price is also far greater for the ideal category than it is for the good category.

This assumption is invalid though, and for two reasons. First, the y axis is “count,” not “percentage” or “density,” and second, there are a lot more diamonds in the ideal category than the good category. In short, one cannot compare the difference in distribution by comparing the difference in count.

Another problem with the graph is that one cannot easily compare the mean diamond price of each cut quality. The graph may deceive people into believing that the 1,200 dollar yellow peak implies a mean price closer to 1,200 dollars for ideal-cut diamonds than other mean prices. However, this is not necessarily true, given the difference in the area under each graph corresponding to the diamond count for each cut quality.

ggplot(
    data = diamonds,
        mapping = aes(x = price, y = ..density..) # notice the argument "y = ..density"
    ) +

    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Price Density Distribution per Cut") +

    geom_freqpoly(mapping = aes(color = cut), binwidth = 500)

The graph above, in which the area under each line is the same, suggests that fair-cut diamonds have an average price closer to 2,500 dollars, rather than 1,200 dollars. In fact, it appears as though fair-cut diamonds have the highest mean price among the diamonds of each cut quality.

s1 <- subset(diamonds, cut=='Fair')
# head(s1[7])
s1[7] %>%
   summarize(mean = mean(price)) # average price of fairly cut diamonds = $4358

idealD1 <- diamonds %>%
    filter(cut == "Good") %>%
    summarize(mean = mean(price))
idealD1 # average price of good cut diamonds = $3457 

idealD3 <- diamonds %>%
    filter(cut == "Premium") %>%
    summarize(mean = mean(price))
idealD3 # average price of premium cut diamonds = $3457 

idealD4 <- diamonds %>%
    filter(cut == "Ideal") %>%
    summarize(mean = mean(price))
idealD4 # average price of ideally cut diamonds = $3457 

mean
4358.758

mean
3928.864

mean
4584.258

mean
3457.542

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
    geom_boxplot()

To order the boxplots in increasing median values, add a x = reorder() argument to mapping = aes().

# Ordered by increasing median values 

ggplot(data = diamonds) +
    geom_boxplot(mapping = aes(
                    x = reorder(cut, price, FUN = median),  
                    y = price))

To get a more visually striking picture of the distribution of diamonds in terms of cut and color, use a heatmap-like geom object called “geom_tile.”

diamonds %>%
    count(color, cut) %>%
    ggplot(mapping = aes(x = color, y = cut)) + 
        geom_tile(mapping = aes(fill = n)) 

From the geom_tile diagram above, it appears as though ideal-cut diamonds with color G are the majority, followed by the diamonds in the adjacent cut categories, and finally the fair-cut diamonds.

Twitter Facebook LinkedIn

Junhyung Park

EDA with R

Move values into columns (`gather`)

Move column data into separate columns (`spread`)

Separating columns (`separate`)

Combining columns (`unite`)

Exploratory Data Analysis

Replacing Values

Covariation

You May Also Enjoy

Graph Theory I

Recursion I

SVD, PCA

Distributions II

Junhyung Park

Move values into columns (gather)

Move column data into separate columns (spread)

Separating columns (separate)

Combining columns (unite)

Exploratory Data Analysis

Replacing Values

Covariation

You May Also Enjoy

Graph Theory I

Recursion I

SVD, PCA

Distributions II

Move values into columns (`gather`)

Move column data into separate columns (`spread`)

Separating columns (`separate`)

Combining columns (`unite`)