
Why dplyr?

The different tasks you can do with dplyr can be done with base R and other R packages. So why learn dplyr? In my opinion, you should learn dplyr because it

Is easy to use and understand
Is fast and efficient
Simplifies data manipulation
Fits within broader philosophy of data science (e.g. the tidyverse)

dplyr and the other tidyverse packages were designed to work well together because they share a common “grammar” and philosophy. The most important principle of dplyr is that its main functions take a data.frame as input and return a data.frame as output. This simple consistency makes it possible to reason about what each function is doing with the data. More importantly, it means that once you learn one function, you can learn the others with relative ease.
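To make the data-frame-in, data-frame-out idea concrete, here is a minimal sketch on a made-up table. The toy data and its column names are my own invention and are not part of the datasets used later in this post.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(id = 1:4, score = c(10, 25, 31, 8))

# Each verb takes a data frame and returns a data frame,
# so the result of one step can be fed straight into the next
step1 <- filter(toy, score > 9)           # keep rows with score above 9
step2 <- mutate(step1, half = score / 2)  # add a new column

is.data.frame(step1)
is.data.frame(step2)
```

Because every step hands back a data frame, the same two steps could also be written as one chain with the pipe: toy %>% filter(score > 9) %>% mutate(half = score / 2).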

I would encourage you to read R for Data Science if you want to dive deeper into the philosophy of data science from the tidyverse perspective.

dplyr Verbs

Below, you will find 3 topics that I think are the most important aspects of dplyr:

Single-table verbs
Summary and grouping functions
Two-table verbs

Single-table verbs will help you slice and dice your data and create new variables. I’ve focused on the 5 most widely used single-table verbs. Summary verbs help you create useful summaries of your data quickly, either overall or within groups. Two-table verbs help you systematically merge two datasets and make it clear what the result of your merge will look like.

The functions described below are the main workhorse functions in dplyr but there are many others. I would encourage you to visit the dplyr website to see more tutorials and a complete function reference.

Setup & Example Data

Below is the code needed to set up our R session. We’ll need dplyr, which is loaded with the tidyverse package (you can also run library(dplyr)). I prefer to load the tidyverse package because it automatically loads a number of useful packages in a single line of code. We also need the psych and psychTools packages in order to grab the data we need (in recent versions, the example datasets ship with psychTools). If you don’t have them, run install.packages(c("psych", "psychTools")).

# Load packages
library(tidyverse) 
library(psych)
library(psychTools)

# Datasets from the psych package
data("bfi")
data("bfi.dictionary")

# Convert datasets to tibbles
bfi <- as_tibble(bfi)

This dataset is a sample of 2,800 observations on 25 self-report personality items taken from the International Personality Item Pool (IPIP) and collected as part of the SAPA project (see the psych package documentation for more details).

To get a sense for these data, we can use glimpse() from dplyr to print out the dimensions of the dataset, the variables and their types, and the first few observations of each variable.

bfi data

glimpse(bfi)

## Rows: 2,800
## Columns: 28
## $ A1        <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5, 4, 4, …
## $ A2        <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5, 4, 4, …
## $ A3        <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5, 5, 6, …
## $ A4        <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4, 4, 5, …
## $ A5        <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5, 3, 5, …
## $ C1        <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5, 5, 1, …
## $ C2        <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5, 4, 1, …
## $ C3        <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5, 5, 1, …
## $ C4        <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4, 4, 5, …
## $ C5        <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3, 6, 6, …
## $ E1        <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2, 1, 1, …
## $ E2        <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2, 2, 1, …
## $ E3        <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, 4, 4, 4,…
## $ E4        <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6, 5, 5, …
## $ E5        <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, …
## $ N1        <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6, 5, 5, …
## $ N2        <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5, 6, 5, …
## $ N3        <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5, 5, 5, …
## $ N4        <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4, 5, 1, …
## $ N5        <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, 4, 2, 1,…
## $ O1        <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5, 4, 4, …
## $ O2        <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1, 2, 1, …
## $ O3        <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4, 2, 5, …
## $ O4        <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5, 4, 3, …
## $ O5        <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4, 2, 2, …
## $ gender    <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, …
## $ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, N…
## $ age       <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, 16, 17, …

Single Table Verbs

The first set of dplyr verbs that we will talk about and use are single-table verbs. Single-table operations are the most common and widely used operations in data manipulation. When I say “data manipulation”, I am referring to:

selecting the relevant columns from a larger dataset (i.e. variables)
renaming variables with more useful labels
filtering the relevant rows (i.e. observations or cases) to the ones you want to analyze
arranging or sorting the observations in ways that help you inspect your data
creating new variables based on existing variables (e.g. creating scale scores from a set of items).

These operations are likely very familiar to you but, at least in my early experiences with R, it was not always clear how they were executed in R. On top of this, it wasn’t clear to me that this process was or could be systematic. dplyr makes these operations more explicit and helps you think about how to do them systematically. In fact, the so-called 5 most important verbs of dplyr do exactly what they sound like:

function	description
`select()`	Select relevant columns of your data
`rename()`	Rename the columns of your data
`filter()`	Filter your data according to logical statements
`arrange()`	Sort your data on a certain column, ascending or descending
`mutate()`	Create new variables and add them to your dataset

Below are some examples of ways that I use these verbs in my work:

Select

Selecting variables is probably one of the most powerful dplyr operations. All you need to do to select variables in a dataset is write out their names (unquoted), like so: select(data, var1, var2, var3). This code will select var1, var2, and var3 from the dataset data and give you a new dataset that contains only those columns.

You can also “deselect” columns. For example, let’s say you didn’t want var1, var2, and var3 in your data but you wanted to keep everything else. Simply write the following code: select(data, -var1, -var2, -var3). The - will drop those columns and give you all other columns in your data.

You can also do more complicated things (see below). For example, you can select all columns that have a particular prefix, suffix, or contain a particular word or certain letter sequences. Take a look at the example:

Example

First, because I am not terribly familiar with the BFI dataset from the psych package, I want to figure out which variables I actually need to look at personality. Luckily, the psych package provides a bfi.dictionary for this exact purpose.

bfi.dictionary %>% 
  rownames_to_column() %>% 
  rename(bfi_item = rowname) %>% 
  as_tibble()

## # A tibble: 28 × 8
##    bfi_item ItemLabel Item            Giant3  Big6      Little12  Keying IPIP100
##    <chr>    <fct>     <fct>           <fct>   <fct>     <fct>      <int> <fct>  
##  1 A1       q_146     Am indifferent… Cohesi… Agreeabl… Compassi…     -1 B5:A   
##  2 A2       q_1162    Inquire about … Cohesi… Agreeabl… Compassi…      1 B5:A   
##  3 A3       q_1206    Know how to co… Cohesi… Agreeabl… Compassi…      1 B5:A   
##  4 A4       q_1364    Love children.  Cohesi… Agreeabl… Compassi…      1 B5:A   
##  5 A5       q_1419    Make people fe… Cohesi… Agreeabl… Compassi…      1 B5:A   
##  6 C1       q_124     Am exacting in… Stabil… Conscien… Orderlin…      1 B5:C   
##  7 C2       q_530     Continue until… Stabil… Conscien… Orderlin…      1 B5:C   
##  8 C3       q_619     Do things acco… Stabil… Conscien… Orderlin…      1 B5:C   
##  9 C4       q_626     Do things in a… Stabil… Conscien… Industri…     -1 B5:C   
## 10 C5       q_1949    Waste my time.  Stabil… Conscien… Industri…     -1 B5:C   
## # … with 18 more rows

After searching through the codebook, I was able to deduce that all the personality variables have a capital-letter prefix for the trait that they measure, followed by a trailing digit indicating the item number. Now I can quickly select subsets of these items depending on my needs.

Select by variable name

Let’s say I just want to select a couple variables. This is the most straightforward way to use select. For example, I can select gender, age, and the 5 items that measure Agreeableness like so:

# Spell out the variables you want to select
bfi %>% select(gender, age, A1, A2, A3, A4, A5)

## # A tibble: 2,800 × 7
##    gender   age    A1    A2    A3    A4    A5
##     <int> <int> <int> <int> <int> <int> <int>
##  1      1    16     2     4     3     4     4
##  2      2    18     2     4     5     2     5
##  3      2    17     5     4     5     4     4
##  4      2    17     4     4     6     5     5
##  5      1    17     2     3     3     4     5
##  6      2    21     6     6     5     6     5
##  7      1    18     2     5     5     3     5
##  8      1    19     4     3     1     5     1
##  9      1    19     4     3     6     3     3
## 10      2    17     2     5     6     6     5
## # … with 2,790 more rows

Alternatively, let’s say I want everything but gender, age, and education:

# select everything but gender, age, and education 
bfi %>% select(-gender, -age, -education)

## # A tibble: 2,800 × 25
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     2     4     3     4     4     2     3     3     4     4     3     3     3
##  2     2     4     5     2     5     5     4     4     3     4     1     1     6
##  3     5     4     5     4     4     4     5     4     2     5     2     4     4
##  4     4     4     6     5     5     4     4     3     5     5     5     3     4
##  5     2     3     3     4     5     4     4     5     3     2     2     2     5
##  6     6     6     5     6     5     6     6     6     1     3     2     1     6
##  7     2     5     5     3     5     5     4     4     2     3     4     3     4
##  8     4     3     1     5     1     3     2     4     2     4     3     6     4
##  9     4     3     6     3     3     6     6     3     4     5     5     3    NA
## 10     2     5     6     6     5     6     5     6     2     1     2     2     4
## # … with 2,790 more rows, and 12 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>

Or, what if you know you just want the first few columns of the dataset and you don’t want to type their names?

# select the first 5 columns
bfi %>% select(1:5)

## # A tibble: 2,800 × 5
##       A1    A2    A3    A4    A5
##    <int> <int> <int> <int> <int>
##  1     2     4     3     4     4
##  2     2     4     5     2     5
##  3     5     4     5     4     4
##  4     4     4     6     5     5
##  5     2     3     3     4     5
##  6     6     6     5     6     5
##  7     2     5     5     3     5
##  8     4     3     1     5     1
##  9     4     3     6     3     3
## 10     2     5     6     6     5
## # … with 2,790 more rows

Select by string matches

Using select with variable names is powerful but can involve a lot of typing if you need to select many variables. An even more powerful way to select is to utilize “select helpers”. These include (descriptions from select_helpers help page):

starts_with(): Starts with a prefix.
ends_with(): Ends with a suffix.
contains(): Contains a literal string.
num_range(): Matches a numerical range like x01, x02, x03.
one_of(): Matches variable names in a character vector.
matches(): Matches a regular expression.
everything(): Matches all variables.
last_col(): Select last variable, possibly with an offset.

I use starts_with(), ends_with(), and contains() most frequently. However, matches() is the most powerful as it allows you to leverage regular expressions. A full discussion of regular expressions is beyond the scope of this post. In brief, regular expressions allow you to do complex string pattern matching.

Examples

For the following examples, I will be printing out the results of using different select_helpers. Note that I am going to print out column names (with names()) rather than the whole dataset for brevity.

starts_with(): Let’s say we only want Conscientiousness items:

bfi %>% select(starts_with("C")) %>% names()

## [1] "C1" "C2" "C3" "C4" "C5"

ends_with(): how about the first item of each scale?:

bfi %>% select(ends_with("1")) %>% names()

## [1] "A1" "C1" "E1" "N1" "O1"

contains(): how about all the Openness items?:

bfi %>% select(contains("O")) %>% names()

## [1] "O1"        "O2"        "O3"        "O4"        "O5"        "education"

Note that in this case contains() wasn’t great because education contains an “o” and matching ignores case by default. We can fix that by specifying that we want a capital “O”.

bfi %>% select(contains("O", ignore.case = FALSE)) %>% names()

## [1] "O1" "O2" "O3" "O4" "O5"

num_range(): how about the last three items of Emotional Stability?:

bfi %>% select(num_range(prefix = "N", 3:5)) %>% names()

## [1] "N3" "N4" "N5"

one_of(): Let’s say you wanted the last item of Emotional Stability but you weren’t sure if there are 5 or 6 items:

bfi %>% select(one_of("N5","N6")) %>% names()

## Warning: Unknown columns: `N6`

## [1] "N5"

matches(): Let’s do something more complicated. How about finding all of the Conscientiousness items, the last 2 items of Emotional Stability, and the first three items of Openness?

bfi %>% select(matches("^O[1-3]|^N[4-5]|^C")) %>% names()

##  [1] "C1" "C2" "C3" "C4" "C5" "N4" "N5" "O1" "O2" "O3"
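A side note on one_of(): in recent versions of dplyr (through the tidyselect package) it is superseded by any_of() and all_of(). The sketch below uses a made-up tibble rather than the BFI data, so the column values are purely illustrative. any_of() skips names that are missing, while all_of() treats a missing name as an error.

```r
library(dplyr)

# Hypothetical stand-in for a few BFI columns (values are made up)
toy <- tibble(N4 = c(2, 5), N5 = c(3, 5), age = c(16, 18))

wanted <- c("N5", "N6")  # "N6" does not exist in `toy`

# any_of() silently skips names that are not present
picked <- toy %>% select(any_of(wanted)) %>% names()
picked

# all_of() is strict: it raises an error if any name is missing,
# which we catch here so the script keeps running
strict <- tryCatch(
  toy %>% select(all_of(wanted)) %>% names(),
  error = function(e) "error: missing column"
)
strict
```

Which one to use depends on whether a missing column should be tolerated or should stop your analysis.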

Rename

Renaming variables is pretty simple in dplyr. Simply type rename(data, new.name.1 = old.name.1, new.name.2 = old.name.2). Here, you provide a new name (e.g. new.name.1) and set it equal to the old name in the data (e.g. old.name.1). See the examples to see how rename() works.

Note that the more complicated examples are there to inspire you to learn how to systematically and programmatically change many variables all at once.

Examples

# Old names
names(bfi)

##  [1] "A1"        "A2"        "A3"        "A4"        "A5"        "C1"       
##  [7] "C2"        "C3"        "C4"        "C5"        "E1"        "E2"       
## [13] "E3"        "E4"        "E5"        "N1"        "N2"        "N3"       
## [19] "N4"        "N5"        "O1"        "O2"        "O3"        "O4"       
## [25] "O5"        "gender"    "education" "age"

Rename just a couple of variables:

# Simple rename
bfi %>% 
  rename(Agreeableness_1 = A1, Agreeableness_2 = A2) %>% 
  names()

##  [1] "Agreeableness_1" "Agreeableness_2" "A3"              "A4"             
##  [5] "A5"              "C1"              "C2"              "C3"             
##  [9] "C4"              "C5"              "E1"              "E2"             
## [13] "E3"              "E4"              "E5"              "N1"             
## [17] "N2"              "N3"              "N4"              "N5"             
## [21] "O1"              "O2"              "O3"              "O4"             
## [25] "O5"              "gender"          "education"       "age"

Rename a set of variables using rename_with(), which combines a selection with a renaming function (it supersedes the older rename_at() and the deprecated funs()):

# More complicated rename
bfi %>% 
  rename_with(~ str_replace(.x, "^A", "Agreeableness_"), .cols = matches("^A\\d")) %>% 
  names()

##  [1] "Agreeableness_1" "Agreeableness_2" "Agreeableness_3" "Agreeableness_4"
##  [5] "Agreeableness_5" "C1"              "C2"              "C3"             
##  [9] "C4"              "C5"              "E1"              "E2"             
## [13] "E3"              "E4"              "E5"              "N1"             
## [17] "N2"              "N3"              "N4"              "N5"             
## [21] "O1"              "O2"              "O3"              "O4"             
## [25] "O5"              "gender"          "education"       "age"

# Trait label for each of the 25 items, in the same order as the item columns
labels <- bfi.dictionary %>% filter(str_detect(ItemLabel,"\\d$")) %>% pull(Big6) %>% as.character()

# Even more complicated rename
bfi %>% 
  rename_with(~ str_replace(.x, "^[[:upper:]]", labels), .cols = matches("\\d$")) %>% 
  names()

##  [1] "Agreeableness1"       "Agreeableness2"       "Agreeableness3"      
##  [4] "Agreeableness4"       "Agreeableness5"       "Conscientiousness1"  
##  [7] "Conscientiousness2"   "Conscientiousness3"   "Conscientiousness4"  
## [10] "Conscientiousness5"   "Extraversion1"        "Extraversion2"       
## [13] "Extraversion3"        "Extraversion4"        "Extraversion5"       
## [16] "Emotional Stability1" "Emotional Stability2" "Emotional Stability3"
## [19] "Emotional Stability4" "Emotional Stability5" "Openness1"           
## [22] "Openness2"            "Openness3"            "Openness4"           
## [25] "Openness5"            "gender"               "education"           
## [28] "age"

Arrange

Arranging (sorting) the rows of your data is also very straightforward. Simply indicate which variable you want to use to arrange the data: arrange(data, column.to.arrange.by). You can wrap a column in desc() to have the rows ordered in descending order instead.

Example

bfi %>% arrange(age)

## # A tibble: 2,800 × 28
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     2     5     5     4     5     5     3     5     2     4     2     5     5
##  2     1     4     5    NA     5     4     1     6     5     3     1     1    NA
##  3     1     6     6     6     6     5     6     5     1     1     1     1     5
##  4     1     6     6     6     5     2     5     5     5     2     1     1     5
##  5    NA     6     4    NA     4     4    NA     6    NA     1     1     5    NA
##  6     1     6     6     5     2     4     2     2     5     4     6     1     5
##  7     4     4     2     4     4     4     4     4     3     4     3     5     2
##  8     2     5     3     2     2     4     2     5     5     5     6     6     4
##  9     1     6     6     6     6     6     6     6     1     1     1     1     6
## 10     4     5     5     4     2     2     2     4     3     1     5     5     2
## # … with 2,790 more rows, and 15 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>, gender <int>, education <int>, age <int>

bfi %>% arrange(desc(age))

## # A tibble: 2,800 × 28
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     1     3     4     4     4     4     1     5     3     2     6     2     1
##  2     4     3     1     1     5     6     6     4     1     1     6     2     6
##  3     1     6     6     4     6     2     4     4     1     2     2     1     6
##  4     2     4     5     6     6     5     3     5     2     2     2     4     4
##  5     1     5     6     5     6     4     3     2     4     5     2     1     2
##  6     2     4     4     3     4     3     2     3     4     6     6     5     4
##  7     2     4     4     4     4     5     5     4     1     2     4     4     4
##  8     1     4     3     6     5     5     4     2     4     5     5     5     1
##  9     2     6     6     6     5     5     5     5     1     3     2     2     5
## 10     5     4     2     6     4     5     6     4     2     5     1     6     4
## # … with 2,790 more rows, and 15 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>, gender <int>, education <int>, age <int>
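arrange() also accepts several columns, in which case later columns break ties in earlier ones. Here is a small sketch on made-up data (the id column is hypothetical and only there to make the resulting order easy to read):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  gender = c(1, 2, 1, 2),
  age    = c(30, 25, 22, 40),
  id     = c("a", "b", "c", "d")
)

# Sort by gender (ascending), then by age (descending) within each gender
sorted <- toy %>% arrange(gender, desc(age))
sorted$id
```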

Filter

Filtering data is an operation that you will undoubtedly need to use all the time. You filter data any time you need to create a subset of a larger dataset. To perform this operation you need to supply filter() with a logical expression. This expression is evaluated for every row of the dataset, and only rows that meet your criteria (i.e. rows for which the expression evaluates to TRUE) are kept. Take a look at the example:

Example

This dataset is pretty big (N = 2,800). I might want to use everyone in this dataset, but it’s easy to see how certain research questions may not require the entire sample. For example, maybe I only want to look at participants who are 40 or younger, because 40 is a meaningful cutoff for certain questions.

Whatever the case, you can quickly subset your data using filter(). Below I use the expression age <= 40 inside my call to filter(). This expression will help filter() figure out which individuals are 40 or younger and keep only those individuals.

bfi %>% 
  filter(age <= 40) %>% 
  select(age,everything()) %>% 
  arrange(desc(age))

## # A tibble: 2,358 × 28
##      age    A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1    40     1     5     5     6     5     4     4     4     3     4     4     3
##  2    40     1     5     5     5     2     6    NA     6     1     1     3     2
##  3    40     1     5    NA     5     6     6     6     1     1     1     6     1
##  4    40     1     6     4     6     6     5     4     5     1     2     5     1
##  5    40     4     4     4    NA     5     4     4     4    NA     3     4     2
##  6    40     1     5     5     6     5     5     4     4     4     4     1     2
##  7    40     3     5     5     6     5     5     4     5     2     3     2     4
##  8    40     3     6     5     6     6     5     6     5     1     3     1     2
##  9    40     1     6     6     6     6     1     5     6     1     1     1     6
## 10    40     1     6     4     6     6     5     6     4     3     4     3     2
## # … with 2,348 more rows, and 15 more variables: E3 <int>, E4 <int>, E5 <int>,
## #   N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>,
## #   O3 <int>, O4 <int>, O5 <int>, gender <int>, education <int>
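Two details of filter() are worth a quick sketch: conditions separated by commas are combined with AND, and rows where the condition evaluates to NA are dropped. The toy data below is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy <- tibble(
  age       = c(16, 35, 52, 28, NA),
  education = c(NA, 3, 5, 2, 4)
)

# Conditions separated by commas are combined with AND
young_educated <- toy %>% filter(age <= 40, education >= 3)

# filter() drops rows where the condition is NA;
# keep them explicitly with is.na() if that is what you want
young_or_unknown <- toy %>% filter(age <= 40 | is.na(age))

young_educated
young_or_unknown
```

With missing data as common as it is in the BFI items, it is worth checking how many rows a filter removes because of NA rather than because of the condition itself.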

Mutate

Mutate is the final (mainstream) verb among the single-table dplyr verbs. It’s a little more complicated than the others but I still think it’s intuitive. The point of mutate() is to create new variables based on existing variables and add them to your data.

To use mutate() simply give your new variable a name followed by an =. Then, express how you want to calculate your new variable. See below for examples:

Example

To see how mutate() works, let’s create composite scores for each personality trait.

bfi %>% 
  rowwise() %>% # make sure to calculate means across rows not columns
  mutate(
    Agreeableness     = mean(c(A1,A2,A3,A4,A5),na.rm=TRUE),
    Conscientiousness = mean(c(C1,C2,C3,C4,C5),na.rm=TRUE),
    Extraversion      = mean(c(E1,E2,E3,E4,E5),na.rm=TRUE),
    Neuroticism       = mean(c(N1,N2,N3,N4,N5),na.rm=TRUE),
    Openness          = mean(c(O1,O2,O3,O4,O5),na.rm=TRUE)
  ) %>% 
  select(Agreeableness, Conscientiousness, Extraversion, Neuroticism, Openness)

## # A tibble: 2,800 × 5
## # Rowwise: 
##    Agreeableness Conscientiousness Extraversion Neuroticism Openness
##            <dbl>             <dbl>        <dbl>       <dbl>    <dbl>
##  1           3.4               3.2         3.4          2.8      3.8
##  2           3.6               4           3            3.8      3.2
##  3           4.4               4           3.8          3.6      3.6
##  4           4.8               4.2         4            2.8      3.6
##  5           3.4               3.6         3.6          3.2      3.2
##  6           5.6               4.4         4            3        3.8
##  7           4                 3.6         4.2          1.4      3.8
##  8           2.8               3           3.2          4.2      3.4
##  9           3.8               4.8         3.75         3.6      5  
## 10           4.8               4           3.6          4.2      3.6
## # … with 2,790 more rows
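One caveat: the Keying column of bfi.dictionary marks some items (for example A1) with -1, meaning they are reverse-keyed, so a plain mean of the raw items is only a rough composite. Below is a minimal sketch of reverse scoring before averaging. It assumes the 1 to 6 response scale seen in the data, uses made-up responses, and swaps rowwise() for base R's rowMeans(); the name A1_rev is my own.

```r
library(dplyr)

# Made-up responses on a 1-6 scale; A1 is treated as reverse-keyed,
# following the -1 in the Keying column of the dictionary
toy <- tibble(
  A1 = c(2, 6, 1),
  A2 = c(4, 1, 6),
  A3 = c(3, NA, 6)
)

scored <- toy %>%
  mutate(
    A1_rev        = 7 - A1,  # reverse a 1-6 item: 1 <-> 6, 2 <-> 5, 3 <-> 4
    Agreeableness = rowMeans(cbind(A1_rev, A2, A3), na.rm = TRUE)
  )

scored$Agreeableness
```

rowMeans() works on whole columns at once, so it avoids rowwise() and tends to be faster on large datasets.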

Putting it all together

Now that you have been introduced to the most important single-table dplyr verbs, let’s see how we might complete all of these steps in a single chain of function calls:

bfi %>% 
  filter(age <= 40) %>% 
  rowwise() %>% 
  mutate(Agreeableness     = mean(c(A1,A2,A3,A4,A5),na.rm=TRUE),
         Conscientiousness = mean(c(C1,C2,C3,C4,C5),na.rm=TRUE),
         Extraversion      = mean(c(E1,E2,E3,E4,E5),na.rm=TRUE),
         Neuroticism       = mean(c(N1,N2,N3,N4,N5),na.rm=TRUE),
         Openness          = mean(c(O1,O2,O3,O4,O5),na.rm=TRUE)) %>% 
  select(age,education,gender,Agreeableness, Conscientiousness, Extraversion, Neuroticism, Openness) %>% 
  rename_all(tolower) %>% 
  arrange(desc(age))

## # A tibble: 2,358 × 8
## # Rowwise: 
##      age education gender agreeableness conscientiousness extraversion
##    <int>     <int>  <int>         <dbl>             <dbl>        <dbl>
##  1    40         3      2          4.4               3.8          3.8 
##  2    40         5      2          3.6               3.5          2.8 
##  3    40         3      2          4.25              3            2.8 
##  4    40         3      1          4.6               3.4          4.2 
##  5    40         5      2          4.25              3.75         3.75
##  6    40         1      2          4.4               4.2          3.4 
##  7    40         3      2          4.8               3.8          3.8 
##  8    40         2      2          5.2               4            4   
##  9    40         3      2          5                 2.8          3.6 
## 10    40         3      2          4.6               4.4          3.6 
## # … with 2,348 more rows, and 2 more variables: neuroticism <dbl>,
## #   openness <dbl>

Summarizing and Grouping

Summarizing data can be tedious. It involves taking raw data and turning those data into useful summary statistics (e.g. means, standard deviations, minimum and maximum values, ranges, etc.). Furthermore, it’s often useful to create such summaries within subgroups. For example, you may want to create summary values for each condition of an experiment or some other grouping variable.

dplyr has a set of functions that specifically handle these operations and make it very easy and systematic to create the summaries you want to create.

Summarize

Summarizing in dplyr works much like mutate(). Using the function summarize(), we specify the dataset we want to summarize, give the name of the summary variable we want to create, and then the operation that computes it. For example, if we wanted to find the mean of a single variable in a dataset we might write summarize(data, summary.variable = mean(var1)). The result is a data frame with one row and one column: the mean of var1.

Example - Simple Summaries

Here, I want to know, for the whole dataset, what the mean, median, standard deviation, minimum, and maximum ages in the BFI dataset are.

bfi %>% 
  summarize(
    mean    = mean(age,na.rm=T),
    median  = median(age,na.rm=T),
    sd      = sd(age,na.rm=T),
    min     = min(age,na.rm=T),
    max     = max(age,na.rm=T)
)

## # A tibble: 1 × 5
##    mean median    sd   min   max
##   <dbl>  <dbl> <dbl> <int> <int>
## 1  28.8     26  11.1     3    86
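If you want the same summary for several columns, recent versions of dplyr (1.0.0 and later) provide across(). The sketch below is a minimal illustration on made-up data, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data with missing values
toy <- tibble(
  age = c(20, 30, NA, 40),
  A1  = c(2, 4, 6, NA)
)

# Apply the same summary function to several columns at once
means <- toy %>%
  summarize(across(c(age, A1), ~ mean(.x, na.rm = TRUE)))

means
```

The result is still a one-row data frame, with one column per summarized variable.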

Group By

The power of summarize() becomes much greater when you use it in conjunction with group_by(). The point of group_by() is to group data into categories so that operations are performed within each category. For example, maybe we want to know the mean of a particular variable but within each level of a grouping variable. We might write group_by(data, grouping.variable) %>% summarize(mean = mean(var1)). This will become clearer in the example below:

Example - Grouped Summaries

To see the utility of group_by() and summarize(), let’s suppose we wanted to know all the same summary statistics for age but within each education level. We could simply add one line to our already written code to make this happen seamlessly:

bfi %>% 
  group_by(education) %>% 
  summarize(
    mean    = mean(age,na.rm=T),
    median  = median(age,na.rm=T),
    sd      = sd(age,na.rm=T),
    min     = min(age,na.rm=T),
    max     = max(age,na.rm=T)
)

## # A tibble: 6 × 6
##   education  mean median    sd   min   max
##       <int> <dbl>  <dbl> <dbl> <int> <int>
## 1         1  25.1     20 10.4     14    62
## 2         2  31.5     27 12.2     17    86
## 3         3  27.2     24  9.45    11    63
## 4         4  33.0     30 10.3     18    70
## 5         5  35.3     32 11.0      3    74
## 6        NA  18.0     16  8.52     9    61

The possibilities are quite broad once you start getting used to the logic of grouping and summarizing variables. For example, you can make summary variables based on multiple grouping variables. Take a look:

bfi %>% 
  group_by(education,gender) %>% 
  summarize(
    mean    = mean(age,na.rm=T),
    median  = median(age,na.rm=T),
    sd      = sd(age,na.rm=T),
    min     = min(age,na.rm=T),
    max     = max(age,na.rm=T)
)

## `summarise()` has grouped output by 'education'. You can override using the `.groups` argument.

## # A tibble: 12 × 7
## # Groups:   education [6]
##    education gender  mean median    sd   min   max
##        <int>  <int> <dbl>  <dbl> <dbl> <int> <int>
##  1         1      1  25.2     20  9.82    15    53
##  2         1      2  25.1     20 10.8     14    62
##  3         2      1  31.5     27 12.3     18    65
##  4         2      2  31.5     27 12.3     17    86
##  5         3      1  25.4     22  8.13    16    60
##  6         3      2  28.0     25  9.83    11    63
##  7         4      1  33.2     30 10.7     20    70
##  8         4      2  32.9     30 10.2     18    59
##  9         5      1  33.9     30 12.0      3    74
## 10         5      2  36.1     34 10.3     19    66
## 11        NA      1  18.9     17  9.37    12    55
## 12        NA      2  17.4     16  7.98     9    61
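The message printed above the table is dplyr telling you that the result is still grouped by education. Here is a small sketch, on made-up data, of using the .groups argument to return an ungrouped result, with n() added to count the rows in each group:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  education = c(1, 1, 2, 2),
  gender    = c(1, 2, 1, 2),
  age       = c(20, 24, 30, 36)
)

by_both <- toy %>%
  group_by(education, gender) %>%
  summarize(mean_age = mean(age), n = n(), .groups = "drop")

# FALSE: the grouping has been dropped from the result
is_grouped_df(by_both)
```

Dropping the groups avoids surprises later, since any further mutate() or summarize() on a grouped result would silently run within each education level.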

Two-Table Verbs

Two-table verbs are dplyr functions that take two datasets and do something with them. Most commonly, these two-table verbs are used to merge data. However, as we will see, merging data is not necessarily a simple task and many problems arise when attempting even the simplest of merges.

In general, there are two types of joins:

Mutating joins, ones that add more variables to your data
Filtering joins, ones that operate only on the observations of the data and do not add any new variables to your data.

When using join functions, you will be explicitly supplying 2 data.frames and specific columns to match by. For example, you might want to merge two datasets from different timepoints in a longitudinal study. Thus, you will merge these datasets using a key, such as participant ID.
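As a quick sketch of the difference between the two kinds of joins, consider two made-up tables from different timepoints that share a participant id column. The table and column names here are hypothetical.

```r
library(dplyr)

# Hypothetical data from two timepoints, keyed by participant id
time1 <- tibble(id = c(1, 2, 3), score_t1 = c(10, 12, 9))
time2 <- tibble(id = c(2, 3, 4), score_t2 = c(14, 8, 11))

# Mutating join: keeps all rows of time1 and adds score_t2 where ids match
joined <- left_join(time1, time2, by = "id")

# Filtering joins: keep or drop rows of time1 without adding columns
in_both <- semi_join(time1, time2, by = "id")  # ids also present in time2
only_t1 <- anti_join(time1, time2, by = "id")  # ids missing from time2

joined
in_both
only_t1
```

Participant 1 has no match in time2, so score_t2 is NA for that row in the left join, and that participant is the only row returned by anti_join().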

Example Data

To show how two-table verbs work, I will need another dataset to merge with the BFI data. The dataset we will use is a sample of 1,525 subjects from the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project (see the psych package description for more details). The dataset contains variables that measure cognitive performance. Below is the code I used to get these data into R:

# load the data from psych package
data("ability")

# convert it to a tibble
ability <- as_tibble(ability)

# take a look at the variables
glimpse(ability)

## Rows: 1,525
## Columns: 16
## $ reason.4  <dbl> 0, 0, 0, 1, NA, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ reason.16 <dbl> 0, 0, 1, NA, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,…
## $ reason.17 <dbl> 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, NA, 1,…
## $ reason.19 <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, …
## $ letter.7  <dbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, NA, 1, 0, 1, 0, 1, 1, 1, 1, 1,…
## $ letter.33 <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, …
## $ letter.34 <dbl> 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, …
## $ letter.58 <dbl> 0, 0, 0, 0, NA, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,…
## $ matrix.45 <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ matrix.46 <dbl> 0, 0, 1, NA, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ matrix.47 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
## $ matrix.55 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, …
## $ rotate.3  <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ rotate.4  <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, NA, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,…
## $ rotate.6  <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, …
## $ rotate.8  <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, …

Matching values

The whole idea of joining datasets is predicated on the assumption that tables contain at least some of the same observations. In our case, we want to join the BFI data with the ability data so that we can look at participants who have completed personality and ability items. Our datasets do not actually have the same observations, or at least there is no column of unique observation identifiers to help us join the tables. As such, we are going to make some observation identifiers ourselves.

# Set seed so you get the same results as me
set.seed(1)

# make ID numbers for the 2800 observations in the BFI data
bfi_fake_ids <- bfi %>% 
  mutate(id = 1:n())

# make ID numbers for the 1525 observations in the ability data:
# 1000 IDs sampled from the BFI IDs plus 525 new ones
ability_fake_ids <- ability %>% 
  mutate(id = c(sample(bfi_fake_ids$id, 1000, replace = FALSE), 3001:3525))
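
Before joining, it can be worth checking that the key really identifies each row once and seeing how many keys the two tables share. The sketch below repeats the ID construction at a smaller scale; left_toy and right_toy are made-up tables, not the BFI or ability data:

```r
library(dplyr)

set.seed(1)

# made-up tables that mimic the ID construction above at a smaller scale
left_toy  <- tibble(id = 1:10)
right_toy <- tibble(id = c(sample(left_toy$id, 4, replace = FALSE), 101:103))

# a usable key appears exactly once per row, so this should be empty
dup_ids <- right_toy %>%
  count(id) %>%
  filter(n > 1)

# number of right-hand IDs that also exist in the left-hand table
n_shared <- sum(right_toy$id %in% left_toy$id)

dup_ids
n_shared
```

Because the IDs are sampled without replacement and the new IDs fall outside the left-hand range, there are no duplicates and exactly four shared IDs, whatever the seed.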

Mutating Joins

Remember, mutating joins merge together two datasets. They are ‘mutating’ because the resulting merged dataset will contain more variables.

Inner Join

Inner joins (using inner_join()) return only the observations that exist in both datasets. As such, if I do an inner_join() using the BFI and ability data, the newly joined dataset should only contain observations that match based on ID numbers:

inner_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id"))

## # A tibble: 1,000 × 45
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     2     4     3     4     4     2     3     3     4     4     3     3     3
##  2     4     3     6     3     3     6     6     3     4     5     5     3    NA
##  3     5     5     5     6     4     5     4     3     2     2     3     3     3
##  4     5     5     5     6     6     4     4     4     2     1     2     2     4
##  5     4     5     2     2     1     5     5     5     2     2     3     4     3
##  6     4     6     6     2     5     4     4     4     4     4     1     2     5
##  7     4     4     5     4     3     5     4     5     4     6     1     2     4
##  8     1     6     6     1     5     5     4     4     2     3     1     2     4
##  9     2     4     4     4     3     6     5     6     1     1     2     4     4
## 10     2     5     1     3     5     5     4     5     2     5     1     2     6
## # … with 990 more rows, and 32 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>, gender <int>, education <int>, age <int>, id <int>,
## #   reason.4 <dbl>, reason.16 <dbl>, reason.17 <dbl>, reason.19 <dbl>,
## #   letter.7 <dbl>, letter.33 <dbl>, letter.34 <dbl>, letter.58 <dbl>,
## #   matrix.45 <dbl>, matrix.46 <dbl>, matrix.47 <dbl>, matrix.55 <dbl>,
## #   rotate.3 <dbl>, rotate.4 <dbl>, rotate.6 <dbl>, rotate.8 <dbl>
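
The by = c("id" = "id") syntax reads as "match id in the left table to id in the right table". When the key has the same name in both tables, by = "id" is an equivalent shorthand; the named form matters when the names differ. A small sketch with hypothetical tables (people, scores, and the column subject are invented for this example):

```r
library(dplyr)

# hypothetical tables whose key columns have different names
people <- tibble(id = 1:3, trait = c("a", "b", "c"))
scores <- tibble(subject = c(2L, 3L, 4L), score = c(0.5, 0.7, 0.9))

# match people$id to scores$subject; the result keeps the left-hand name
joined_named <- inner_join(people, scores, by = c("id" = "subject"))

# same join after renaming the key so the shorthand can be used
joined_short <- inner_join(people, rename(scores, id = subject), by = "id")

joined_named
joined_short
```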

Left & Right Join

Left Join

A left_join() keeps all observations from the data.frame on the left and grabs only the observations from the right data.frame that match the left:

left_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id"))

## # A tibble: 2,800 × 45
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     2     4     3     4     4     2     3     3     4     4     3     3     3
##  2     2     4     5     2     5     5     4     4     3     4     1     1     6
##  3     5     4     5     4     4     4     5     4     2     5     2     4     4
##  4     4     4     6     5     5     4     4     3     5     5     5     3     4
##  5     2     3     3     4     5     4     4     5     3     2     2     2     5
##  6     6     6     5     6     5     6     6     6     1     3     2     1     6
##  7     2     5     5     3     5     5     4     4     2     3     4     3     4
##  8     4     3     1     5     1     3     2     4     2     4     3     6     4
##  9     4     3     6     3     3     6     6     3     4     5     5     3    NA
## 10     2     5     6     6     5     6     5     6     2     1     2     2     4
## # … with 2,790 more rows, and 32 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>, gender <int>, education <int>, age <int>, id <int>,
## #   reason.4 <dbl>, reason.16 <dbl>, reason.17 <dbl>, reason.19 <dbl>,
## #   letter.7 <dbl>, letter.33 <dbl>, letter.34 <dbl>, letter.58 <dbl>,
## #   matrix.45 <dbl>, matrix.46 <dbl>, matrix.47 <dbl>, matrix.55 <dbl>,
## #   rotate.3 <dbl>, rotate.4 <dbl>, rotate.6 <dbl>, rotate.8 <dbl>

Notice that the number of observations is equal to the number of observations in the left hand dataset:

# joined data
left_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id")) %>% nrow()

## [1] 2800

# data on the left hand side
bfi_fake_ids %>% nrow()

## [1] 2800
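
The rows on the left that have no match are not dropped; their right-hand variables are filled with NA. A minimal sketch with made-up tables (left_tbl and right_tbl are hypothetical):

```r
library(dplyr)

# made-up tables; IDs 1 and 3 have no match on the right
left_tbl  <- tibble(id = 1:4, x = c("a", "b", "c", "d"))
right_tbl <- tibble(id = c(2L, 4L, 9L), y = c(20, 40, 90))

lj <- left_join(left_tbl, right_tbl, by = "id")

# every left-hand row is kept
nrow(lj)

# unmatched rows carry NA in the columns that came from the right
n_unmatched <- sum(is.na(lj$y))
n_unmatched
```

Note that the row count only equals the left-hand count when the key is unique in the right-hand table, as it is here and in our fake IDs; duplicated keys on the right would repeat the matching left-hand rows.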

Right Join

A right_join() keeps all observations from the data.frame on the right and grabs only the observations from the left data.frame that match the right:

right_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id"))

## # A tibble: 1,525 × 45
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     2     4     3     4     4     2     3     3     4     4     3     3     3
##  2     4     3     6     3     3     6     6     3     4     5     5     3    NA
##  3     5     5     5     6     4     5     4     3     2     2     3     3     3
##  4     5     5     5     6     6     4     4     4     2     1     2     2     4
##  5     4     5     2     2     1     5     5     5     2     2     3     4     3
##  6     4     6     6     2     5     4     4     4     4     4     1     2     5
##  7     4     4     5     4     3     5     4     5     4     6     1     2     4
##  8     1     6     6     1     5     5     4     4     2     3     1     2     4
##  9     2     4     4     4     3     6     5     6     1     1     2     4     4
## 10     2     5     1     3     5     5     4     5     2     5     1     2     6
## # … with 1,515 more rows, and 32 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>, gender <int>, education <int>, age <int>, id <int>,
## #   reason.4 <dbl>, reason.16 <dbl>, reason.17 <dbl>, reason.19 <dbl>,
## #   letter.7 <dbl>, letter.33 <dbl>, letter.34 <dbl>, letter.58 <dbl>,
## #   matrix.45 <dbl>, matrix.46 <dbl>, matrix.47 <dbl>, matrix.55 <dbl>,
## #   rotate.3 <dbl>, rotate.4 <dbl>, rotate.6 <dbl>, rotate.8 <dbl>

Notice that the number of observations is equal to the number of observations in the right hand dataset:

# joined data
right_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id")) %>% nrow()

## [1] 1525

# data on the right hand side
ability_fake_ids %>% nrow()

## [1] 1525
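
A right join is a left join with the roles of the two tables swapped. The sketch below, using made-up tables, suggests that right_join(x, y) and left_join(y, x) hold the same information and differ only in column and row order:

```r
library(dplyr)

# made-up tables for illustration
left_tbl  <- tibble(id = 1:4, x = c("a", "b", "c", "d"))
right_tbl <- tibble(id = c(2L, 4L, 9L), y = c(20, 40, 90))

rj         <- right_join(left_tbl, right_tbl, by = "id")
lj_swapped <- left_join(right_tbl, left_tbl, by = "id")

# put both results in the same column and row order before comparing
rj_sorted <- rj %>% arrange(id)
lj_sorted <- lj_swapped %>% select(all_of(names(rj))) %>% arrange(id)

same_content <- isTRUE(all.equal(as.data.frame(rj_sorted),
                                 as.data.frame(lj_sorted)))
same_content
```

In practice, many people stick to left_join() and simply decide which table to put first.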

Full Join

A full_join() keeps all observations from both the left and right data.frames, regardless of matches:

full_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id"))

## # A tibble: 3,325 × 45
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     2     4     3     4     4     2     3     3     4     4     3     3     3
##  2     2     4     5     2     5     5     4     4     3     4     1     1     6
##  3     5     4     5     4     4     4     5     4     2     5     2     4     4
##  4     4     4     6     5     5     4     4     3     5     5     5     3     4
##  5     2     3     3     4     5     4     4     5     3     2     2     2     5
##  6     6     6     5     6     5     6     6     6     1     3     2     1     6
##  7     2     5     5     3     5     5     4     4     2     3     4     3     4
##  8     4     3     1     5     1     3     2     4     2     4     3     6     4
##  9     4     3     6     3     3     6     6     3     4     5     5     3    NA
## 10     2     5     6     6     5     6     5     6     2     1     2     2     4
## # … with 3,315 more rows, and 32 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>, gender <int>, education <int>, age <int>, id <int>,
## #   reason.4 <dbl>, reason.16 <dbl>, reason.17 <dbl>, reason.19 <dbl>,
## #   letter.7 <dbl>, letter.33 <dbl>, letter.34 <dbl>, letter.58 <dbl>,
## #   matrix.45 <dbl>, matrix.46 <dbl>, matrix.47 <dbl>, matrix.55 <dbl>,
## #   rotate.3 <dbl>, rotate.4 <dbl>, rotate.6 <dbl>, rotate.8 <dbl>

Notice that the number of observations is 3325. This number represents the total number of unique people that are either in the left hand or right hand dataset or both datasets.
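
The arithmetic behind that number: people in both datasets are counted once, so 2,800 + 1,525 - 1,000 = 3,325. The same relationship can be checked on a small scale. The tables below are made up, and the identity assumes the key is unique within each table:

```r
library(dplyr)

# made-up tables sharing the IDs 4 and 5
left_tbl  <- tibble(id = 1:5)
right_tbl <- tibble(id = 4:8)

n_inner <- nrow(inner_join(left_tbl, right_tbl, by = "id"))
n_full  <- nrow(full_join(left_tbl, right_tbl, by = "id"))

# full = left + right - both, when keys are unique in each table
n_expected <- nrow(left_tbl) + nrow(right_tbl) - n_inner

c(inner = n_inner, full = n_full, expected = n_expected)
```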

Filtering Joins

Remember, filtering joins only affect the observations in your data; they don’t add any new variables. You might want to do a filtering join if you want to work with the observations that appear in another dataset but you are not actually interested in using any of the variables in the other dataset. You might also do a filtering join to figure out why a join didn’t work.

Semi Join

A semi_join() simply keeps all observations in the left dataset that have a match in the right dataset. With unique IDs, as we have here, this returns the same observations as inner_join() except that no variables from the right dataset are added:

semi_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id"))

## # A tibble: 1,000 × 29
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     2     4     3     4     4     2     3     3     4     4     3     3     3
##  2     4     3     6     3     3     6     6     3     4     5     5     3    NA
##  3     5     5     5     6     4     5     4     3     2     2     3     3     3
##  4     5     5     5     6     6     4     4     4     2     1     2     2     4
##  5     4     5     2     2     1     5     5     5     2     2     3     4     3
##  6     4     6     6     2     5     4     4     4     4     4     1     2     5
##  7     4     4     5     4     3     5     4     5     4     6     1     2     4
##  8     1     6     6     1     5     5     4     4     2     3     1     2     4
##  9     2     4     4     4     3     6     5     6     1     1     2     4     4
## 10     2     5     1     3     5     5     4     5     2     5     1     2     6
## # … with 990 more rows, and 16 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>, gender <int>, education <int>, age <int>, id <int>

Notice that the two joins return the same number of observations:

semi_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id")) %>% nrow()

## [1] 1000

inner_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id")) %>% nrow()

## [1] 1000
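
The two counts agree here because each ID appears only once in the right hand dataset. When the right hand dataset repeats a key, the two joins diverge: inner_join() returns one row per match, while semi_join() never duplicates rows from the left. A sketch with made-up tables (participants and sessions are hypothetical):

```r
library(dplyr)

# made-up tables; participant 1 has two sessions
participants <- tibble(id = 1:3, group = c("ctl", "trt", "ctl"))
sessions     <- tibble(id = c(1L, 1L, 2L), session = c("a", "b", "a"))

n_inner <- nrow(inner_join(participants, sessions, by = "id"))
n_semi  <- nrow(semi_join(participants, sessions, by = "id"))

c(inner = n_inner, semi = n_semi)
```

Here the inner join has three rows (participant 1 appears twice) and the semi join has two.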

Anti Join

An anti_join() drops all the rows in the left dataset that have a match in the right dataset. In our case, the anti_join() will give us a dataset with all the participants that completed the BFI but did not complete the cognitive assessment for our ability dataset:

anti_join(bfi_fake_ids, ability_fake_ids, by = c("id" = "id"))

## # A tibble: 1,800 × 29
##       A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     2     4     5     2     5     5     4     4     3     4     1     1     6
##  2     5     4     5     4     4     4     5     4     2     5     2     4     4
##  3     4     4     6     5     5     4     4     3     5     5     5     3     4
##  4     2     3     3     4     5     4     4     5     3     2     2     2     5
##  5     6     6     5     6     5     6     6     6     1     3     2     1     6
##  6     2     5     5     3     5     5     4     4     2     3     4     3     4
##  7     4     3     1     5     1     3     2     4     2     4     3     6     4
##  8     2     5     6     6     5     6     5     6     2     1     2     2     4
##  9     4     4     5     6     5     4     3     5     3     2     1     3     2
## 10     2     5     5     5     5     5     4     5     4     5     3     3     4
## # … with 1,790 more rows, and 16 more variables: E4 <int>, E5 <int>, N1 <int>,
## #   N2 <int>, N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
## #   O4 <int>, O5 <int>, gender <int>, education <int>, age <int>, id <int>

Notice that the number of observations is 1800. This number represents the total number of people that completed the BFI assessment but did not complete the cognitive assessment. Swapping the order of the two datasets answers the opposite question:

anti_join(ability_fake_ids, bfi_fake_ids, by = c("id" = "id"))

## # A tibble: 525 × 17
##    reason.4 reason.16 reason.17 reason.19 letter.7 letter.33 letter.34 letter.58
##       <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
##  1        1         1         1         0        1         1         1         0
##  2        1         1         1         1        1         1         1         1
##  3        1         1         1         0        0         0         0         0
##  4        1         1         1         1        1         1         1         1
##  5        0         0         0         1        0         1         1         0
##  6        1        NA         1        NA        1         1         1         0
##  7        1         1         1         1        1         0         1         0
##  8        1         1         1         1        0         0         1         0
##  9        0         1         1         1        1         1         1         0
## 10        1         1         1         1        1         0         1         1
## # … with 515 more rows, and 9 more variables: matrix.45 <dbl>, matrix.46 <dbl>,
## #   matrix.47 <dbl>, matrix.55 <dbl>, rotate.3 <dbl>, rotate.4 <dbl>,
## #   rotate.6 <dbl>, rotate.8 <dbl>, id <int>

Notice that the number of observations is 525. This number represents the total number of people that completed the cognitive assessment but did not complete the BFI assessment.
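
As mentioned earlier, filtering joins can also help you figure out why a join didn’t work. One common pattern is to run anti_join() in both directions and inspect the keys that failed to match. The sketch below uses made-up tables (survey and lab are hypothetical) in which one ID differs only by letter case:

```r
library(dplyr)

# made-up tables; "p02" and "P02" look alike but do not match
survey <- tibble(id = c("p01", "p02", "p03"), mood = c(3, 4, 2))
lab    <- tibble(id = c("p01", "P02", "p04"), rt = c(510, 480, 530))

# survey rows without a partner in lab
unmatched_survey <- anti_join(survey, lab, by = "id")

# lab rows without a partner in survey
unmatched_lab <- anti_join(lab, survey, by = "id")

unmatched_survey$id
unmatched_lab$id
```

Seeing p02 on one side and P02 on the other suggests the keys need cleaning (for example, converting to a common case) before the real join.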
