Introducing dplyr

We will be working with the dplyr package today. dplyr is useful when working with tabular data to rename and select columns, and summarize insights from your data. dplyr is part of the tidyverse - a larger collection of packages designed for data science.

We will very briefly rely on readr package as well.

You might want to reference UVA Library’s Michele Claibourn’s Data Exploration and Wrangling page. That offers a useful, quick overview of the main dplyr functions.

The dplyr workshop materials are available on GitHub.

Our data: Albemarle County Homes

We will be working with Albemarle County real estate data. The original data came from the Albemarle County Office of Geographic Services. The RDS team has already cleaned this data from its original, raw, messy form. If you are interested in following the steps from raw to final form, check out the PhD Plus series for Exploratory Data Analysis and Data Preparation.

Quick note: R Markdown

We are going to run this session in R Markdown. R Markdown is great because it contains both your code and the narration that a reader needs to understand your work, all in one file. This also makes it MUCH easier for someone else to understand your work, or for you to understand your own work six months in the future. Simply re-run the code in your R Markdown (.rmd) file, and export the result as html, pdf, or other file types.

For this session, we will run code chunks by clicking on the green, right-pointing triangle arrow, or click into the code chunk and type CTRL/CMD+SHIFT+Enter to run the entire code chunk or CTRL/CMD+Enter to run a line or a highlighted section.

You can optionally clear the output by clicking the X in the upper right corner.

We will get some red text after loading our packages – that’s fine. It is just a regular message.

Load libraries, read in dataset

Let’s start by loading our packages:

library(dplyr) # "a consistent set of verbs that help you solve the most common data manipulation challenges"

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr) # "a fast and friendly way to read rectangular data"

Let’s read in our data with the readr::read_csv() function. We will also use here::here() for simple file management. The here package offers a simple way to work with your files, especially in an R Markdown setting.

Note from a file management perspective, we have our raw data in a folder designated specifically for raw data.

homes <- read_csv("../data-raw/albemarle_homes_2020.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   tmp = col_character(),
##   usecode = col_character(),
##   condition = col_character(),
##   cooling = col_character(),
##   lastsaledate1 = col_character(),
##   esdistrict = col_character(),
##   msdistrict = col_character(),
##   `High School District` = col_character(),
##   lastsaledate = col_character(),
##   condition2 = col_character()
## )
## i Use `spec()` for the full column specifications.

Let’s inspect and preview our data.

# inspect the data
homes

## # A tibble: 31,228 x 27
##    tmp   yearbuilt yearremodeled usecode condition finsqft cooling bedroom
##    <chr>     <dbl>         <dbl> <chr>   <chr>       <dbl> <chr>     <dbl>
##  1 3.00~      1939            NA Single~ Fair         1596 No Cen~       3
##  2 0030~      1981            NA Single~ Average      1616 No Cen~       6
##  3 3.00~      1999            NA Single~ Average       480 No Cen~       0
##  4 3.00~      1754            NA Single~ Substand~     828 No Cen~       1
##  5 3.00~      1880            NA Single~ Fair         1224 No Cen~       2
##  6 5.00~      1769          1988 Single~ Average      4875 Centra~       4
##  7 0050~      1818          1991 Single~ Average      5573 Centra~       6
##  8 6.00~      1935          1994 Single~ Average      1120 No Cen~       3
##  9 6.00~      1880          1978 Single~ Average      1458 Centra~       3
## 10 0060~      2005            NA Single~ Average      1888 Centra~       3
## # ... with 31,218 more rows, and 19 more variables: fullbath <dbl>,
## #   halfbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## #   improvementsvalue <dbl>, totalvalue <dbl>, lastsaleprice <dbl>,
## #   lastsaledate1 <chr>, esdistrict <chr>, msdistrict <chr>, `High School
## #   District` <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   condition2 <chr>, remodel <dbl>, fp <dbl>, landuse <dbl>

# alternative ways to preview the data
# View(homes)
# glimpse(homes)
# str(homes)
# head(homes)

Rename columns

Let’s tidy up our column names. It is helpful to have simple, meaningful names, that are formatted consistently. Sometimes you have to type column names out in R, so make it easy on yourself.

Note that I will use “column” and “variable” interchangeably. Later, I will also use “row” and “observation” interchangeably.

In R, a general rule is avoid spaces and dashes in your variable names, and start your variable names with a letter.

Let’s review and rename column names.

# review columns names
names(homes)

##  [1] "tmp"                  "yearbuilt"            "yearremodeled"       
##  [4] "usecode"              "condition"            "finsqft"             
##  [7] "cooling"              "bedroom"              "fullbath"            
## [10] "halfbath"             "totalrooms"           "lotsize"             
## [13] "landvalue"            "improvementsvalue"    "totalvalue"          
## [16] "lastsaleprice"        "lastsaledate1"        "esdistrict"          
## [19] "msdistrict"           "High School District" "censustract"         
## [22] "lastsaledate"         "age"                  "condition2"          
## [25] "remodel"              "fp"                   "landuse"

These names look good, except for that High School District - it is inconsistent with the format of the other district variables. It has spaces which means you have to wrap it in quotes to call it, and that’s annoying. Let’s rename it, and we will save the object this time.

# rename problematic column name
# use this format: new_name = old_name
homes <- rename(homes, hsdistrict = "High School District")
names(homes)

##  [1] "tmp"               "yearbuilt"         "yearremodeled"    
##  [4] "usecode"           "condition"         "finsqft"          
##  [7] "cooling"           "bedroom"           "fullbath"         
## [10] "halfbath"          "totalrooms"        "lotsize"          
## [13] "landvalue"         "improvementsvalue" "totalvalue"       
## [16] "lastsaleprice"     "lastsaledate1"     "esdistrict"       
## [19] "msdistrict"        "hsdistrict"        "censustract"      
## [22] "lastsaledate"      "age"               "condition2"       
## [25] "remodel"           "fp"                "landuse"

Select: select columns

To select a few columns in a data frame, use the select() function. The first argument in the function is the data frame (homes), and everything following that are columns to keep.

select(homes, totalvalue, yearbuilt, finsqft) # select three variables

## # A tibble: 31,228 x 3
##    totalvalue yearbuilt finsqft
##         <dbl>     <dbl>   <dbl>
##  1     428600      1939    1596
##  2     352600      1981    1616
##  3     334100      1999     480
##  4     340600      1754     828
##  5     610100      1880    1224
##  6     933400      1769    4875
##  7    2419200      1818    5573
##  8    1616800      1935    1120
##  9     578100      1880    1458
## 10     322300      2005    1888
## # ... with 31,218 more rows

select(homes, lotsize:totalvalue) # select variables between lotsize and totval

## # A tibble: 31,228 x 4
##    lotsize landvalue improvementsvalue totalvalue
##      <dbl>     <dbl>             <dbl>      <dbl>
##  1   56.0     347900             80700     428600
##  2    5.17    140300            212300     352600
##  3   80.8     312500             21600     334100
##  4   80.7     327100             13500     340600
##  5   65       497000            113100     610100
##  6    5.10    106400            827000     933400
##  7  454.     1617000            802200    2419200
##  8  266.     1502900            113900    1616800
##  9   74.5     436100            142000     578100
## 10    5       109800            212500     322300
## # ... with 31,218 more rows

select(homes, contains("year")) # contains a word

## # A tibble: 31,228 x 2
##    yearbuilt yearremodeled
##        <dbl>         <dbl>
##  1      1939            NA
##  2      1981            NA
##  3      1999            NA
##  4      1754            NA
##  5      1880            NA
##  6      1769          1988
##  7      1818          1991
##  8      1935          1994
##  9      1880          1978
## 10      2005            NA
## # ... with 31,218 more rows

select(homes, ends_with("district")) # ends with a word

## # A tibble: 31,228 x 3
##    esdistrict   msdistrict hsdistrict       
##    <chr>        <chr>      <chr>            
##  1 Broadus Wood Jouett     Albemarle        
##  2 Broadus Wood Jouett     Albemarle        
##  3 Broadus Wood Jouett     Albemarle        
##  4 Broadus Wood Jouett     Albemarle        
##  5 Broadus Wood Jouett     Albemarle        
##  6 Crozet       Henley     Western Albemarle
##  7 Crozet       Henley     Western Albemarle
##  8 Broadus Wood Jouett     Albemarle        
##  9 Broadus Wood Jouett     Albemarle        
## 10 Broadus Wood Jouett     Albemarle        
## # ... with 31,218 more rows

Here’s a slightly more complex example: select the variables that I know we will be reviewing (because it’s convenient to see them first in these displays) (e.g., totalvalue), add everything else back in, remove column condition2, and save to our homes object.

(condition2 is one of the variables in our dataframe that we don’t need. It is a duplicate of the condition variable, that tells us if the home is in “excellent,” “fair,” etc. condition.)

homes <- select(homes, totalvalue, esdistrict, finsqft, halfbath, everything(), -condition2)
homes

## # A tibble: 31,228 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     428600 Broadus W~    1596        0 3.00~      1939            NA Single~
##  2     352600 Broadus W~    1616        0 0030~      1981            NA Single~
##  3     334100 Broadus W~     480        0 3.00~      1999            NA Single~
##  4     340600 Broadus W~     828        0 3.00~      1754            NA Single~
##  5     610100 Broadus W~    1224        0 3.00~      1880            NA Single~
##  6     933400 Crozet        4875        0 5.00~      1769          1988 Single~
##  7    2419200 Crozet        5573        1 0050~      1818          1991 Single~
##  8    1616800 Broadus W~    1120        0 6.00~      1935          1994 Single~
##  9     578100 Broadus W~    1458        0 6.00~      1880          1978 Single~
## 10     322300 Broadus W~    1888        0 0060~      2005            NA Single~
## # ... with 31,218 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Your turn #1

Using the homes data, keep the condition, cooling, and bedroom variables only. (Do not update the data frame.)

select(homes, condition, cooling, bedroom)

## # A tibble: 31,228 x 3
##    condition   cooling        bedroom
##    <chr>       <chr>            <dbl>
##  1 Fair        No Central Air       3
##  2 Average     No Central Air       6
##  3 Average     No Central Air       0
##  4 Substandard No Central Air       1
##  5 Fair        No Central Air       2
##  6 Average     Central Air          4
##  7 Average     Central Air          6
##  8 Average     No Central Air       3
##  9 Average     Central Air          3
## 10 Average     Central Air          3
## # ... with 31,218 more rows

Quick review of our data

First, let’s review our data with the summary() function:

# get a summary of dataset variables
summary(homes)

##    totalvalue       esdistrict           finsqft        halfbath     
##  Min.   :   4300   Length:31228       Min.   : 144   Min.   :0.0000  
##  1st Qu.: 245800   Class :character   1st Qu.:1386   1st Qu.:0.0000  
##  Median : 344050   Mode  :character   Median :1860   Median :1.0000  
##  Mean   : 429529                      Mean   :2087   Mean   :0.6323  
##  3rd Qu.: 500200                      3rd Qu.:2573   3rd Qu.:1.0000  
##  Max.   :7859000                      Max.   :9649   Max.   :5.0000  
##                                                      NA's   :110     
##      tmp              yearbuilt    yearremodeled     usecode         
##  Length:31228       Min.   :1668   Min.   :   2    Length:31228      
##  Class :character   1st Qu.:1973   1st Qu.:1992    Class :character  
##  Mode  :character   Median :1989   Median :2002    Mode  :character  
##                     Mean   :1981   Mean   :1998                      
##                     3rd Qu.:2002   3rd Qu.:2008                      
##                     Max.   :2019   Max.   :2019                      
##                     NA's   :952    NA's   :28929                     
##   condition           cooling             bedroom          fullbath    
##  Length:31228       Length:31228       Min.   : 0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.: 3.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median : 3.000   Median :2.000  
##                                        Mean   : 3.386   Mean   :2.324  
##                                        3rd Qu.: 4.000   3rd Qu.:3.000  
##                                        Max.   :12.000   Max.   :9.000  
##                                        NA's   :1        NA's   :1      
##    totalrooms        lotsize            landvalue       improvementsvalue
##  Min.   : 0.000   Min.   :   0.0000   Min.   :      0   Min.   :      0  
##  1st Qu.: 6.000   1st Qu.:   0.1620   1st Qu.:  80000   1st Qu.: 154600  
##  Median : 7.000   Median :   0.6855   Median : 117000   Median : 225100  
##  Mean   : 7.073   Mean   :   4.7997   Mean   : 142752   Mean   : 286777  
##  3rd Qu.: 8.000   3rd Qu.:   2.7480   3rd Qu.: 151900   3rd Qu.: 340025  
##  Max.   :84.000   Max.   :1067.1500   Max.   :6651000   Max.   :4953700  
##  NA's   :9                                                               
##  lastsaleprice      lastsaledate1       msdistrict         hsdistrict       
##  Min.   :       0   Length:31228       Length:31228       Length:31228      
##  1st Qu.:       0   Class :character   Class :character   Class :character  
##  Median :  184000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  246097                                                           
##  3rd Qu.:  351000                                                           
##  Max.   :11925000                                                           
##  NA's   :1                                                                  
##   censustract    lastsaledate            age            remodel       
##  Min.   :101.0   Length:31228       Min.   :  0.00   Min.   :0.00000  
##  1st Qu.:103.0   Class :character   1st Qu.: 17.00   1st Qu.:0.00000  
##  Median :107.0   Mode  :character   Median : 30.00   Median :0.00000  
##  Mean   :107.5                      Mean   : 37.71   Mean   :0.07362  
##  3rd Qu.:111.0                      3rd Qu.: 46.00   3rd Qu.:0.00000  
##  Max.   :114.0                      Max.   :351.00   Max.   :1.00000  
##  NA's   :5                                                            
##        fp            landuse       
##  Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.00000  
##  Mean   :0.7296   Mean   :0.05229  
##  3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000  
##

Here’s a quick review of the school names in our dataset. We can use the distinct() function to select only “distinct” or unique rows in a data frame.

# review high school, middle school, and elementary school names
distinct(homes, hsdistrict)

## # A tibble: 4 x 1
##   hsdistrict       
##   <chr>            
## 1 Albemarle        
## 2 Western Albemarle
## 3 Monticello       
## 4 Unassigned

distinct(homes, msdistrict)

## # A tibble: 6 x 1
##   msdistrict
##   <chr>     
## 1 Jouett    
## 2 Henley    
## 3 Sutherland
## 4 Burley    
## 5 Unassigned
## 6 Walton

distinct(homes, esdistrict)

## # A tibble: 16 x 1
##    esdistrict      
##    <chr>           
##  1 Broadus Wood    
##  2 Crozet          
##  3 Meriwether Lewis
##  4 Baker-Butler    
##  5 Stony Point     
##  6 Stone-Robinson  
##  7 Unassigned      
##  8 Brownsville     
##  9 Greer           
## 10 Agnor-Hurt      
## 11 Woodbrook       
## 12 Hollymead       
## 13 Murray          
## 14 Red Hill        
## 15 Cale            
## 16 Scottsville

Your turn #2

Using the homes data, display the unique rows in the condition variable.

distinct(homes, condition)

## # A tibble: 8 x 1
##   condition  
##   <chr>      
## 1 Fair       
## 2 Average    
## 3 Substandard
## 4 Excellent  
## 5 Unknown    
## 6 Poor       
## 7 NULL       
## 8 Good

Filter: filter rows

To return specific rows, usefilter(). When we use filter(), we can use various logical tests:

==, >, >=, <, <=, !=
&, |
x %in% y (x is a member of y)
is.na(), !is.na()

filter(homes, esdistrict == "Stone-Robinson") # Look at Stone-Robinson esdistrict homes only

## # A tibble: 3,593 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     105300 Stone-Rob~    2840        0 3.60~      1920            NA Single~
##  2     244100 Stone-Rob~    1614        0 3.60~      1979            NA Single~
##  3     214900 Stone-Rob~    1418        0 3.60~      1976            NA Single~
##  4     139600 Stone-Rob~    1276        0 0360~      1754          1981 Single~
##  5     185700 Stone-Rob~    2050        1 0360~      1975            NA Single~
##  6     602500 Stone-Rob~    2396        0 3.60~      1871            NA Single~
##  7     424100 Stone-Rob~    2694        0 0360~      1997            NA Single~
##  8     677900 Stone-Rob~    2453        1 3.60~      1989            NA Single~
##  9    1268100 Stone-Rob~    5603        2 3.60~      2006            NA Single~
## 10     241700 Stone-Rob~    1889        0 3.60~      1910            NA Single~
## # ... with 3,583 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

filter(homes, yearbuilt > 2011) # homes built after 2011

## # A tibble: 2,765 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     586200 Broadus W~    2993        0 0070~      2017            NA Single~
##  2     284300 Broadus W~    1456        0 0070~      2018            NA Single~
##  3     965600 Broadus W~    2832        1 7.00~      2018            NA Single~
##  4     366100 Meriwethe~    2070        0 0070~      2018            NA Single~
##  5     191300 Broadus W~    1363        0 0070~      2013            NA Single~
##  6     563300 Meriwethe~    2325        0 0070~      2016            NA Single~
##  7     786300 Broadus W~    1908        1 0080~      2014            NA Single~
##  8     217300 Broadus W~    2014        0 0080~      2013            NA Single~
##  9     507200 Broadus W~    2184        0 8.00~      2015            NA Single~
## 10     487500 Broadus W~    2314        0 0080~      2018            NA Single~
## # ... with 2,755 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

filter(homes, condition %in% c("Poor", "Substandard"))

## # A tibble: 476 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     340600 Broadus W~     828        0 3.00~      1754            NA Single~
##  2     199000 Crozet        1538        0 6.00~      1900            NA Single~
##  3      74200 Crozet        1104        0 6.00~      1900            NA Single~
##  4     151400 Crozet        1272        0 6.00~      1960            NA Single~
##  5     123700 Meriwethe~    1739        0 7.00~      1910            NA Single~
##  6     213400 Meriwethe~    1598        0 7.00~      1920            NA Single~
##  7     128100 Meriwethe~     904        0 7.00~      1950            NA Single~
##  8     165400 Broadus W~     863        0 7.00~      1754            NA Single~
##  9      92800 Broadus W~     832        0 0070~      1940            NA Single~
## 10     822200 Broadus W~    1936        0 8.00~      1900            NA Single~
## # ... with 466 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Your turn #3

Using the homes data, keep the observations where finsqft is less than 1000. (Do not save the object.)

filter(homes, finsqft < 1000)

## # A tibble: 1,709 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     334100 Broadus W~     480        0 3.00~      1999            NA Single~
##  2     340600 Broadus W~     828        0 3.00~      1754            NA Single~
##  3      93600 Crozet         800        0 0060~      1960            NA Single~
##  4     124400 Crozet         975        0 0060~      1966            NA Single~
##  5     382100 Crozet         960        0 6.00~      1972            NA Single~
##  6     371600 Crozet         671        0 0060~      1996            NA Single~
##  7     109500 Crozet         618        0 6.00~      1940            NA Single~
##  8     914000 Crozet         864        0 6.00~      1936          1992 Single~
##  9     135100 Crozet         952        0 0060~      1895            NA Single~
## 10      94400 Crozet         884        0 0060~      1950            NA Single~
## # ... with 1,699 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Filter for missings

Let’s look at homes missing and not missing half-bath data.

filter(homes, is.na(halfbath)) # homes missing halfbath data

## # A tibble: 110 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     398800 Broadus W~    1512       NA 0090~        NA            NA Single~
##  2     362000 Broadus W~    1500       NA 0100~        NA            NA Single~
##  3     261900 Broadus W~    1120       NA 0180~        NA            NA Single~
##  4     303100 Broadus W~    1160       NA 0180~        NA            NA Single~
##  5     912700 Broadus W~    3737       NA 0190~        NA            NA Single~
##  6     550900 Broadus W~    2324       NA 0190~        NA            NA Single~
##  7     514200 Broadus W~    2356       NA 0190~        NA            NA Single~
##  8     284400 Broadus W~    1288       NA 0190~      2017            NA Single~
##  9     413900 Broadus W~    3488       NA 0190~        NA            NA Single~
## 10     499700 Broadus W~    2572       NA 0190~        NA            NA Single~
## # ... with 100 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

filter(homes, !is.na(halfbath)) # homes not missing halfbath data

## # A tibble: 31,118 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     428600 Broadus W~    1596        0 3.00~      1939            NA Single~
##  2     352600 Broadus W~    1616        0 0030~      1981            NA Single~
##  3     334100 Broadus W~     480        0 3.00~      1999            NA Single~
##  4     340600 Broadus W~     828        0 3.00~      1754            NA Single~
##  5     610100 Broadus W~    1224        0 3.00~      1880            NA Single~
##  6     933400 Crozet        4875        0 5.00~      1769          1988 Single~
##  7    2419200 Crozet        5573        1 0050~      1818          1991 Single~
##  8    1616800 Broadus W~    1120        0 6.00~      1935          1994 Single~
##  9     578100 Broadus W~    1458        0 6.00~      1880          1978 Single~
## 10     322300 Broadus W~    1888        0 0060~      2005            NA Single~
## # ... with 31,108 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Your turn #4

Using the homes data, keep the observations where totalrooms is missing. (Do not save the object.)

filter(homes, is.na(totalrooms))

## # A tibble: 9 x 26
##   totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##        <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
## 1     471000 Baker-But~    2243       NA 0210~        NA            NA Single~
## 2     375600 Baker-But~    1822        1 032G~      2018            NA Single~
## 3     130000 Baker-But~    1832        1 032G~      2019            NA Single~
## 4     220100 Agnor-Hurt    1576        0 0450~        NA            NA Duplex 
## 5     569300 Agnor-Hurt    3084        1 062F~        NA            NA Single~
## 6     503000 Agnor-Hurt    2548        1 062F~      2018            NA Single~
## 7     558300 Agnor-Hurt    2906        1 062G~        NA            NA Single~
## 8     525600 Agnor-Hurt    1875        0 062G~        NA            NA Single~
## 9     484200 Agnor-Hurt    2587        1 062H~      2017            NA Single~
## # ... with 18 more variables: condition <chr>, cooling <chr>, bedroom <dbl>,
## #   fullbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## #   improvementsvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## #   msdistrict <chr>, hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>,
## #   age <dbl>, remodel <dbl>, fp <dbl>, landuse <dbl>

Value matching with %in%

The %in% value matching operator can be used to match conditions provided in a vector using the c() function. You can think of %in% as is a member of. It can replace multiple conditional statements linked together with OR (i.e., with a vertical bar |).

Let’s focus on homes that are in the elementary school districts of two schools: Stone-Robinson and Cale. We can do this with a vertical bar | which is the OR operator.

filter(homes, esdistrict == "Stone-Robinson" | esdistrict == "Cale")

## # A tibble: 6,269 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     105300 Stone-Rob~    2840        0 3.60~      1920            NA Single~
##  2     244100 Stone-Rob~    1614        0 3.60~      1979            NA Single~
##  3     214900 Stone-Rob~    1418        0 3.60~      1976            NA Single~
##  4     139600 Stone-Rob~    1276        0 0360~      1754          1981 Single~
##  5     185700 Stone-Rob~    2050        1 0360~      1975            NA Single~
##  6     602500 Stone-Rob~    2396        0 3.60~      1871            NA Single~
##  7     424100 Stone-Rob~    2694        0 0360~      1997            NA Single~
##  8     677900 Stone-Rob~    2453        1 3.60~      1989            NA Single~
##  9    1268100 Stone-Rob~    5603        2 3.60~      2006            NA Single~
## 10     241700 Stone-Rob~    1889        0 3.60~      1910            NA Single~
## # ... with 6,259 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

That’s fine when we only have two matches, but it is easy to see how that could get tedious and lengthy if you have more statements you want to match. That’s where the value matching operator %in% comes in:

filter(homes, esdistrict %in% c("Stone-Robinson", "Cale"))

## # A tibble: 6,269 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     105300 Stone-Rob~    2840        0 3.60~      1920            NA Single~
##  2     244100 Stone-Rob~    1614        0 3.60~      1979            NA Single~
##  3     214900 Stone-Rob~    1418        0 3.60~      1976            NA Single~
##  4     139600 Stone-Rob~    1276        0 0360~      1754          1981 Single~
##  5     185700 Stone-Rob~    2050        1 0360~      1975            NA Single~
##  6     602500 Stone-Rob~    2396        0 3.60~      1871            NA Single~
##  7     424100 Stone-Rob~    2694        0 0360~      1997            NA Single~
##  8     677900 Stone-Rob~    2453        1 3.60~      1989            NA Single~
##  9    1268100 Stone-Rob~    5603        2 3.60~      2006            NA Single~
## 10     241700 Stone-Rob~    1889        0 3.60~      1910            NA Single~
## # ... with 6,259 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Your turn #5

Using the value matching operator, filter the homes dataset so only homes with condition of “Poor” or “Substandard” remain. (Do not save the object.)

filter(homes, condition %in% c("Poor", "Substandard"))

## # A tibble: 476 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     340600 Broadus W~     828        0 3.00~      1754            NA Single~
##  2     199000 Crozet        1538        0 6.00~      1900            NA Single~
##  3      74200 Crozet        1104        0 6.00~      1900            NA Single~
##  4     151400 Crozet        1272        0 6.00~      1960            NA Single~
##  5     123700 Meriwethe~    1739        0 7.00~      1910            NA Single~
##  6     213400 Meriwethe~    1598        0 7.00~      1920            NA Single~
##  7     128100 Meriwethe~     904        0 7.00~      1950            NA Single~
##  8     165400 Broadus W~     863        0 7.00~      1754            NA Single~
##  9      92800 Broadus W~     832        0 0070~      1940            NA Single~
## 10     822200 Broadus W~    1936        0 8.00~      1900            NA Single~
## # ... with 466 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Tip!

Remember:

select() returns specific columns
filter() returns specific rows

Arrange: sort rows

Use arrange() to sort the observations. Let’s take a closer look at which homes have the highest totalvalue.

#look for max last sale price
arrange(homes, totalvalue)

## # A tibble: 31,228 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1       4300 Scottsvil~     480        0 128A~      1754            NA Single~
##  2       9600 Stone-Rob~     896        0 0650~      1754            NA Single~
##  3      11800 Brownsvil~     862        0 054A~      1950            NA Single~
##  4      12600 Meriwethe~     672        0 2.90~      1965            NA Single~
##  5      15300 Red Hill      1024        0 1090~      1960            NA Single~
##  6      15900 Scottsvil~    1476        0 128A~      1920            NA Single~
##  7      16400 Brownsvil~    2152        0 085A~      1900            NA Single~
##  8      20200 Red Hill       624        0 1080~      1956            NA Single~
##  9      20400 Red Hill       912        0 1260~      1970            NA Single~
## 10      20600 Scottsvil~     829        0 128A~      1754            NA Single~
## # ... with 31,218 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Oops, we wanted the highest values at top. Let’s try the desc() helper function.

arrange(homes, desc(totalvalue))

## # A tibble: 31,228 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1    7859000 Stony Poi~    7680        2 3.50~      1860            NA Single~
##  2    7766800 Murray        7264        1 060E~      1939          2000 Single~
##  3    6755600 Stone-Rob~    1906        0 0920~      1754            NA Single~
##  4    6448900 Murray        6035        2 060E~      1942          2000 Single~
##  5    5980300 Stone-Rob~    2947        0 0910~      1754            NA Single~
##  6    5886700 Murray        5657        2 0590~      2009            NA Single~
##  7    5353100 Red Hill      7110        0 1200~      2014            NA Single~
##  8    5284500 Murray        6270        2 060E~      1932          2011 Single~
##  9    5251800 Red Hill      7194        3 1210~      1740          1994 Single~
## 10    5123000 Murray        8467        2 060E~      1941          2008 Single~
## # ... with 31,218 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Your turn #6

Using the homes data, sort the observations so the smallest values for finsqft are on top. (Do not save the object.)

arrange(homes, finsqft)

## # A tibble: 31,228 x 26
##    totalvalue esdistrict finsqft halfbath tmp   yearbuilt yearremodeled usecode
##         <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>         <dbl> <chr>  
##  1     228800 Brownsvil~     144        0 3.80~      1952            NA Single~
##  2      51500 Scottsvil~     176        0 130A~        NA            NA Single~
##  3     562600 Meriwethe~     221        0 4.10~      1754            NA Single~
##  4     107800 Brownsvil~     231        0 0380~      1995            NA Single~
##  5     320600 Stone-Rob~     272        0 6.60~      1900            NA Single~
##  6     144400 Crozet         300        0 0140~      1955            NA Single~
##  7     475500 Murray         304        0 5.70~      1973            NA Single~
##  8      41900 Stone-Rob~     315        0 6.60~      1963            NA Single~
##  9     163900 Brownsvil~     328        0 0710~      1986            NA Single~
## 10      40500 Crozet         348        0 0570~      1960            NA Single~
## # ... with 31,218 more rows, and 18 more variables: condition <chr>,
## #   cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## #   lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## #   hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Count: number of observations

When working with data, we often want to know the number of observations found for each categorical variable. For this task, dplyr provides count(). Let’s count the number of rows of data for each high school district:

count(homes, hsdistrict)

## # A tibble: 4 x 2
##   hsdistrict            n
## * <chr>             <int>
## 1 Albemarle         12254
## 2 Monticello        10105
## 3 Unassigned            5
## 4 Western Albemarle  8864

count(homes, hsdistrict, condition)

## # A tibble: 26 x 3
##    hsdistrict condition       n
##    <chr>      <chr>       <int>
##  1 Albemarle  Average      9119
##  2 Albemarle  Excellent     108
##  3 Albemarle  Fair          447
##  4 Albemarle  Good         2138
##  5 Albemarle  NULL          334
##  6 Albemarle  Poor           78
##  7 Albemarle  Substandard    30
##  8 Monticello Average      7572
##  9 Monticello Excellent      61
## 10 Monticello Fair          481
## # ... with 16 more rows

Your turn #7

Use the count() function to create a table displaying the counts of fullbath.

count(homes, fullbath)

## # A tibble: 11 x 2
##    fullbath     n
##  *    <dbl> <int>
##  1        0   168
##  2        1  4960
##  3        2 14679
##  4        3  8397
##  5        4  2267
##  6        5   595
##  7        6   125
##  8        7    25
##  9        8     8
## 10        9     3
## 11       NA     1

Pipe: a series of functions

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:

monticello <- filter(homes, hsdistrict == "Monticello")
monticello_intermediate <- select(monticello, totalvalue, age)

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

monticello_nest <- select(filter(homes, hsdistrict == "Monticello"), totalvalue, age)

This can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, first filter(), then select()).

The best option is a “pipe.” Pipes let you take the output of one function and send it directly to the next, which is useful when you want to apply many functions to the same data frame. The pipe operator looks like %>%. You can also type CTRL/CMD + Shift + M as a keyboard shortcut. (M since the pipe originally came from the magrittr package.) It is helpful to say yourself “and then” when you see a pipe.

You don’t need to put the data frame homes inside the parentheses. It already gets piped in right at the beginning. A pipe moves left to right.

Let’s recreate the above examples with a pipe:

monticello_pipe <- homes %>% 
  filter(hsdistrict == "Monticello") %>% 
  select(totalvalue, age)

What’s happening here:

Take homes and then,
Use filter() to retrieve only the observations that have hsdistrict equal to Monticello, and then
Use select() to keep the totalvalue and age columns.

Your turn #8

Using pipes, subset the homes data to include homes that are associated with the Western Albemarle high school district (hsdistrict), and retain only the columns yearbuilt, totalrooms, and condition.

Save to a new object: west_alb.

west_alb <- homes %>%
  filter(hsdistrict == "Western Albemarle") %>% 
  select(yearbuilt, totalrooms, condition)

Mutate: compute new variables

Let’s take a look at the summary statistics for totalvalue.

summary(homes$totalvalue)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4300  245800  344050  429529  500200 7859000

That max really is a lot higher than the other values. We can use dplyr::mutate() to create, modify or delete numeric or logical columns.

# use mutate to log transform total value into a new numeric column and to compute new logical variable
homes <- homes %>% 
  mutate(logtotalvalue = log(totalvalue)) %>% 
  select(logtotalvalue, everything()) #puts new variable first
homes

## # A tibble: 31,228 x 27
##    logtotalvalue totalvalue esdistrict finsqft halfbath tmp   yearbuilt
##            <dbl>      <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>
##  1          13.0     428600 Broadus W~    1596        0 3.00~      1939
##  2          12.8     352600 Broadus W~    1616        0 0030~      1981
##  3          12.7     334100 Broadus W~     480        0 3.00~      1999
##  4          12.7     340600 Broadus W~     828        0 3.00~      1754
##  5          13.3     610100 Broadus W~    1224        0 3.00~      1880
##  6          13.7     933400 Crozet        4875        0 5.00~      1769
##  7          14.7    2419200 Crozet        5573        1 0050~      1818
##  8          14.3    1616800 Broadus W~    1120        0 6.00~      1935
##  9          13.3     578100 Broadus W~    1458        0 6.00~      1880
## 10          12.7     322300 Broadus W~    1888        0 0060~      2005
## # ... with 31,218 more rows, and 20 more variables: yearremodeled <dbl>,
## #   usecode <chr>, condition <chr>, cooling <chr>, bedroom <dbl>,
## #   fullbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## #   improvementsvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## #   msdistrict <chr>, hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>,
## #   age <dbl>, remodel <dbl>, fp <dbl>, landuse <dbl>

Note that you have to remove missings for summary statistics. You won’t get an error, but the entire column will show up missing (NA).

#you will get an error
homes <- homes %>% 
  mutate(pricier = lastsaleprice > mean(lastsaleprice)) %>% 
  select(pricier, everything())
homes

## # A tibble: 31,228 x 28
##    pricier logtotalvalue totalvalue esdistrict finsqft halfbath tmp   yearbuilt
##    <lgl>           <dbl>      <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>
##  1 NA               13.0     428600 Broadus W~    1596        0 3.00~      1939
##  2 NA               12.8     352600 Broadus W~    1616        0 0030~      1981
##  3 NA               12.7     334100 Broadus W~     480        0 3.00~      1999
##  4 NA               12.7     340600 Broadus W~     828        0 3.00~      1754
##  5 NA               13.3     610100 Broadus W~    1224        0 3.00~      1880
##  6 NA               13.7     933400 Crozet        4875        0 5.00~      1769
##  7 NA               14.7    2419200 Crozet        5573        1 0050~      1818
##  8 NA               14.3    1616800 Broadus W~    1120        0 6.00~      1935
##  9 NA               13.3     578100 Broadus W~    1458        0 6.00~      1880
## 10 NA               12.7     322300 Broadus W~    1888        0 0060~      2005
## # ... with 31,218 more rows, and 20 more variables: yearremodeled <dbl>,
## #   usecode <chr>, condition <chr>, cooling <chr>, bedroom <dbl>,
## #   fullbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## #   improvementsvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## #   msdistrict <chr>, hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>,
## #   age <dbl>, remodel <dbl>, fp <dbl>, landuse <dbl>

It works as expected if you remove the missings from the calculation of the mean:

homes <- homes %>% 
  mutate(pricier = lastsaleprice > mean(lastsaleprice, na.rm = TRUE)) %>% 
  select(pricier, everything())
homes

## # A tibble: 31,228 x 28
##    pricier logtotalvalue totalvalue esdistrict finsqft halfbath tmp   yearbuilt
##    <lgl>           <dbl>      <dbl> <chr>        <dbl>    <dbl> <chr>     <dbl>
##  1 TRUE             13.0     428600 Broadus W~    1596        0 3.00~      1939
##  2 TRUE             12.8     352600 Broadus W~    1616        0 0030~      1981
##  3 FALSE            12.7     334100 Broadus W~     480        0 3.00~      1999
##  4 FALSE            12.7     340600 Broadus W~     828        0 3.00~      1754
##  5 FALSE            13.3     610100 Broadus W~    1224        0 3.00~      1880
##  6 FALSE            13.7     933400 Crozet        4875        0 5.00~      1769
##  7 FALSE            14.7    2419200 Crozet        5573        1 0050~      1818
##  8 TRUE             14.3    1616800 Broadus W~    1120        0 6.00~      1935
##  9 FALSE            13.3     578100 Broadus W~    1458        0 6.00~      1880
## 10 FALSE            12.7     322300 Broadus W~    1888        0 0060~      2005
## # ... with 31,218 more rows, and 20 more variables: yearremodeled <dbl>,
## #   usecode <chr>, condition <chr>, cooling <chr>, bedroom <dbl>,
## #   fullbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## #   improvementsvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## #   msdistrict <chr>, hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>,
## #   age <dbl>, remodel <dbl>, fp <dbl>, landuse <dbl>

Your turn #9

Create a new data frame from the homes data that meets the following criteria: contains only the hsdistrict column and a new column called total_baths containing a value that is equal to the total number of baths (fullbath plus halfbath times 0.5). Only the rows where total_baths is greater than 4.5 should be shown in the final data frame.

Hint: think about how the commands should be ordered to produce this data frame!

Save object as: homes_totalbaths.

homes_totalbaths <- homes %>% 
  mutate(total_baths = fullbath + halfbath * 0.5) %>% 
  filter(total_baths > 4.5) %>% 
  select(hsdistrict, total_baths)
homes_totalbaths

## # A tibble: 949 x 2
##    hsdistrict        total_baths
##    <chr>                   <dbl>
##  1 Albemarle                 5  
##  2 Albemarle                 5.5
##  3 Western Albemarle         6  
##  4 Western Albemarle         5  
##  5 Western Albemarle         5.5
##  6 Western Albemarle         6  
##  7 Western Albemarle         5.5
##  8 Western Albemarle         5.5
##  9 Western Albemarle         5.5
## 10 Western Albemarle         5  
## # ... with 939 more rows

Group_by and Summarize: compare groups with “split-apply-combine” data analysis

Many data analysis tasks can be approached using the split-apply-combine paradigm:

split the data into groups,
apply some analysis to each group,
and then combine the results as output.

dplyr makes this very easy through the use of the group_by() and summarize() functions. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. summarize() collapses each group into a single-row summary of that group.

Let’s compute the median total value by high school district:

homes %>% 
  group_by(hsdistrict) %>% 
  summarize(medtotalvalue = median(totalvalue, na.rm = TRUE))

## # A tibble: 4 x 2
##   hsdistrict        medtotalvalue
## * <chr>                     <dbl>
## 1 Albemarle                319650
## 2 Monticello               322100
## 3 Unassigned               685500
## 4 Western Albemarle        439700

Note that summarize and summarise are synonymous in dplyr.

You can also group by multiple variables. Here we are grouping by high schools and by condition. Let’s also sort the values so that the highest median total value is on top.

# group_by high school district and by condition
homes %>%
  group_by(hsdistrict, condition) %>%
  summarize(medtotalvalue = median(totalvalue, na.rm = TRUE),
            maxtotalvalue = max(totalvalue, na.rm = TRUE)) %>% 
  arrange(desc(medtotalvalue))

## `summarise()` has grouped output by 'hsdistrict'. You can override using the `.groups` argument.

## # A tibble: 26 x 4
## # Groups:   hsdistrict [4]
##    hsdistrict        condition   medtotalvalue maxtotalvalue
##    <chr>             <chr>               <dbl>         <dbl>
##  1 Unassigned        Good              1368150       1673500
##  2 Western Albemarle Excellent          700800       3508200
##  3 Unassigned        Substandard        685500        685500
##  4 Western Albemarle NULL               655600       2604700
##  5 Western Albemarle Unknown            594650        626700
##  6 Western Albemarle Good               541500       7766800
##  7 Monticello        NULL               456650       2843100
##  8 Albemarle         NULL               441150       2730200
##  9 Western Albemarle Average            428600       5886700
## 10 Monticello        Good               394400       5251800
## # ... with 16 more rows

Your turn #10

Use group_by() and summarize() to find the mean and max number of full baths for each middle school district. (Do not save the object.)

homes %>% 
  group_by(msdistrict) %>% 
  summarise(meanbath = mean(fullbath, na.rm = TRUE),
            maxbath = max(fullbath, na.rm = TRUE))

## # A tibble: 6 x 3
##   msdistrict meanbath maxbath
## * <chr>         <dbl>   <dbl>
## 1 Burley         2.42       9
## 2 Henley         2.46       8
## 3 Jouett         2.37       9
## 4 Sutherland     2.26       7
## 5 Unassigned     2.4        4
## 6 Walton         1.98       9

Export your data

Similar to the read_csv() function we used to read in a csv file, there are write functions that creates files from a data frame. This time let’s use write_rds() to save to a native R data file.

And since you did all that work, you might want to save a new, clean dataset. Do not write over your raw data!

It’s best practice to put your processed data in a folder separate from your raw data.

write_rds(homes, "../data-processed/albemarle_homes_2020_processed.Rds")

Another quick R Markdown note

Now that we are finished, we can wrap things up by clicking the “Knit…Knit to HTML” button at the top. We should see our complete document in the Viewer.

If you have never worked with R Markdown before, a “Install Required Packages” dialog will pop up that will ask if you want to install the required packages, which includes rmarkdown and knitr. Click Yes.

Cheat Sheets

RStudio’s Data Wrangling with dplyr and tidyr Cheat sheet
RStudio’s R Markdown Cheat Sheet

What we did not cover

There’s only so much we can cover in a 90 minute workshop! Here are more functions and packages to consider:

base::lapply(): apply a function over a list or vector
base::union(), intersect(), setdiff(), and setequal(): set operations
dplyr::bind_rows() and bind_cols(): bind data frames by row and column
dplyr::left_join(), right_join(), inner_join(), and full_join(): mutating joins on relational data
dplyr::semi_join() and anti_join(): filtering joins on relational data
dplyr::mutate(across()): applies the same transformation to multiple columns
tidyr::pivot_longer() and pivot_wider() to reshape data
tidyr::unite() and separate(): create new columns by pasting strings together and separate character columns with regular expressions
readr::parse_factors(): parse factors
lubridate: tidyverse package to work with dates and times
stringr: tidyverse package to work with strings
forcats: tidyverse package for working with categorical variables
janitor::clean_names(): clean column names, package offers more data cleaning functions

More Learning Resources

Similar to what we covered today:

Data Carpentry’s “R for Social Scientists - Introducing dplyr and tidyr” - good for dplyr and tidyr
R-Ladies Sydney’s RYouWithMe CleanItUp series - good for dplyr and tidyr

Expanded coverage of data wrangling:

R for Data Science Wrangle - covers more wrangling concepts than covered here

Previous UVA Library workshops on data wrangling using the Albemarle County homes data:

PhD Plus series for Exploratory Data Analysis and Data Preparation, by David Martin - step by step data prep for the Homes dataset we used today
UVA Library Research Data Services Fall 2019 Data Wrangling in R workshop, by Clay Ford - wide coverage of data wrangling concepts

Citation

This workshop was designed in part using Data Carpentry’s “R for Social Scientists - Introducing dplyr and tidyr” workshop, licensed under CC-BY 4.0 2018–2020 by The Carpentries, as well as R-Ladies Sydney RYouWithMe CleanItUp series.

Data Transformation in R with dplyr - Answers

Jennifer Huck

Spring 2021

Before we get started: