We will be working with the dplyr
package today. dplyr
is useful when working with tabular data to rename and select columns, and summarize insights from your data. dplyr
is part of the tidyverse - a larger collection of packages designed for data science.
We will very briefly rely on readr
package as well.
You might want to reference UVA Library’s Michele Claibourn’s Data Exploration and Wrangling page. That offers a useful, quick overview of the main dplyr functions.
The dplyr workshop materials are available on GitHub.
We will be working with Albemarle County real estate data. The original data came from the Albemarle County Office of Geographic Services. The RDS team has already cleaned this data from its original, raw, messy form. If you are interested in following the steps from raw to final form, check out the PhD Plus series for Exploratory Data Analysis and Data Preparation.
We are going to run this session in R Markdown. R Markdown is great because it contains both your code and the narration that a reader needs to understand your work, all in one file. This also makes it MUCH easier for someone else to understand your work, or for you to understand your own work six months in the future. Simply re-run the code in your R Markdown (.rmd) file, and export the result as html, pdf, or other file types.
For this session, we will run code chunks by clicking on the green, right-pointing triangle arrow, or click into the code chunk and type CTRL/CMD+SHIFT+Enter to run the entire code chunk or CTRL/CMD+Enter to run a line or a highlighted section.
You can optionally clear the output by clicking the X in the upper right corner.
We will get some red text after loading our packages – that’s fine. It is just a regular message.
Let’s start by loading our packages:
library(dplyr) # "a consistent set of verbs that help you solve the most common data manipulation challenges"
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr) # "a fast and friendly way to read rectangular data"
Let’s read in our data with the readr::read_csv()
function. We will also use here::here()
for simple file management. The here
package offers a simple way to work with your files, especially in an R Markdown setting.
Note from a file management perspective, we have our raw data in a folder designated specifically for raw data.
homes <- read_csv("../data-raw/albemarle_homes_2020.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## tmp = col_character(),
## usecode = col_character(),
## condition = col_character(),
## cooling = col_character(),
## lastsaledate1 = col_character(),
## esdistrict = col_character(),
## msdistrict = col_character(),
## `High School District` = col_character(),
## lastsaledate = col_character(),
## condition2 = col_character()
## )
## i Use `spec()` for the full column specifications.
Let’s inspect and preview our data.
# inspect the data
homes
## # A tibble: 31,228 x 27
## tmp yearbuilt yearremodeled usecode condition finsqft cooling bedroom
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 3.00~ 1939 NA Single~ Fair 1596 No Cen~ 3
## 2 0030~ 1981 NA Single~ Average 1616 No Cen~ 6
## 3 3.00~ 1999 NA Single~ Average 480 No Cen~ 0
## 4 3.00~ 1754 NA Single~ Substand~ 828 No Cen~ 1
## 5 3.00~ 1880 NA Single~ Fair 1224 No Cen~ 2
## 6 5.00~ 1769 1988 Single~ Average 4875 Centra~ 4
## 7 0050~ 1818 1991 Single~ Average 5573 Centra~ 6
## 8 6.00~ 1935 1994 Single~ Average 1120 No Cen~ 3
## 9 6.00~ 1880 1978 Single~ Average 1458 Centra~ 3
## 10 0060~ 2005 NA Single~ Average 1888 Centra~ 3
## # ... with 31,218 more rows, and 19 more variables: fullbath <dbl>,
## # halfbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## # improvementsvalue <dbl>, totalvalue <dbl>, lastsaleprice <dbl>,
## # lastsaledate1 <chr>, esdistrict <chr>, msdistrict <chr>, `High School
## # District` <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # condition2 <chr>, remodel <dbl>, fp <dbl>, landuse <dbl>
# alternative ways to preview the data
# View(homes)
# glimpse(homes)
# str(homes)
# head(homes)
Let’s tidy up our column names. It is helpful to have simple, meaningful names, that are formatted consistently. Sometimes you have to type column names out in R, so make it easy on yourself.
Note that I will use “column” and “variable” interchangeably. Later, I will also use “row” and “observation” interchangeably.
In R, a general rule is avoid spaces and dashes in your variable names, and start your variable names with a letter.
Let’s review and rename column names.
# review columns names
names(homes)
## [1] "tmp" "yearbuilt" "yearremodeled"
## [4] "usecode" "condition" "finsqft"
## [7] "cooling" "bedroom" "fullbath"
## [10] "halfbath" "totalrooms" "lotsize"
## [13] "landvalue" "improvementsvalue" "totalvalue"
## [16] "lastsaleprice" "lastsaledate1" "esdistrict"
## [19] "msdistrict" "High School District" "censustract"
## [22] "lastsaledate" "age" "condition2"
## [25] "remodel" "fp" "landuse"
These names look good, except for that High School District
- it is inconsistent with the format of the other district variables. It has spaces which means you have to wrap it in quotes to call it, and that’s annoying. Let’s rename it, and we will save the object this time.
# rename problematic column name
# use this format: new_name = old_name
homes <- rename(homes, hsdistrict = "High School District")
names(homes)
## [1] "tmp" "yearbuilt" "yearremodeled"
## [4] "usecode" "condition" "finsqft"
## [7] "cooling" "bedroom" "fullbath"
## [10] "halfbath" "totalrooms" "lotsize"
## [13] "landvalue" "improvementsvalue" "totalvalue"
## [16] "lastsaleprice" "lastsaledate1" "esdistrict"
## [19] "msdistrict" "hsdistrict" "censustract"
## [22] "lastsaledate" "age" "condition2"
## [25] "remodel" "fp" "landuse"
To select a few columns in a data frame, use the select()
function. The first argument in the function is the data frame (homes
), and everything following that are columns to keep.
select(homes, totalvalue, yearbuilt, finsqft) # select three variables
## # A tibble: 31,228 x 3
## totalvalue yearbuilt finsqft
## <dbl> <dbl> <dbl>
## 1 428600 1939 1596
## 2 352600 1981 1616
## 3 334100 1999 480
## 4 340600 1754 828
## 5 610100 1880 1224
## 6 933400 1769 4875
## 7 2419200 1818 5573
## 8 1616800 1935 1120
## 9 578100 1880 1458
## 10 322300 2005 1888
## # ... with 31,218 more rows
select(homes, lotsize:totalvalue) # select variables between lotsize and totval
## # A tibble: 31,228 x 4
## lotsize landvalue improvementsvalue totalvalue
## <dbl> <dbl> <dbl> <dbl>
## 1 56.0 347900 80700 428600
## 2 5.17 140300 212300 352600
## 3 80.8 312500 21600 334100
## 4 80.7 327100 13500 340600
## 5 65 497000 113100 610100
## 6 5.10 106400 827000 933400
## 7 454. 1617000 802200 2419200
## 8 266. 1502900 113900 1616800
## 9 74.5 436100 142000 578100
## 10 5 109800 212500 322300
## # ... with 31,218 more rows
select(homes, contains("year")) # contains a word
## # A tibble: 31,228 x 2
## yearbuilt yearremodeled
## <dbl> <dbl>
## 1 1939 NA
## 2 1981 NA
## 3 1999 NA
## 4 1754 NA
## 5 1880 NA
## 6 1769 1988
## 7 1818 1991
## 8 1935 1994
## 9 1880 1978
## 10 2005 NA
## # ... with 31,218 more rows
select(homes, ends_with("district")) # ends with a word
## # A tibble: 31,228 x 3
## esdistrict msdistrict hsdistrict
## <chr> <chr> <chr>
## 1 Broadus Wood Jouett Albemarle
## 2 Broadus Wood Jouett Albemarle
## 3 Broadus Wood Jouett Albemarle
## 4 Broadus Wood Jouett Albemarle
## 5 Broadus Wood Jouett Albemarle
## 6 Crozet Henley Western Albemarle
## 7 Crozet Henley Western Albemarle
## 8 Broadus Wood Jouett Albemarle
## 9 Broadus Wood Jouett Albemarle
## 10 Broadus Wood Jouett Albemarle
## # ... with 31,218 more rows
Here’s a slightly more complex example: select the variables that I know we will be reviewing (because it’s convenient to see them first in these displays) (e.g., totalvalue), add everything else back in, remove column condition2
, and save to our homes
object.
(condition2
is one of the variables in our dataframe that we don’t need. It is a duplicate of the condition
variable, that tells us if the home is in “excellent,” “fair,” etc. condition.)
homes <- select(homes, totalvalue, esdistrict, finsqft, halfbath, everything(), -condition2)
homes
## # A tibble: 31,228 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 428600 Broadus W~ 1596 0 3.00~ 1939 NA Single~
## 2 352600 Broadus W~ 1616 0 0030~ 1981 NA Single~
## 3 334100 Broadus W~ 480 0 3.00~ 1999 NA Single~
## 4 340600 Broadus W~ 828 0 3.00~ 1754 NA Single~
## 5 610100 Broadus W~ 1224 0 3.00~ 1880 NA Single~
## 6 933400 Crozet 4875 0 5.00~ 1769 1988 Single~
## 7 2419200 Crozet 5573 1 0050~ 1818 1991 Single~
## 8 1616800 Broadus W~ 1120 0 6.00~ 1935 1994 Single~
## 9 578100 Broadus W~ 1458 0 6.00~ 1880 1978 Single~
## 10 322300 Broadus W~ 1888 0 0060~ 2005 NA Single~
## # ... with 31,218 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Using the homes
data, keep the condition
, cooling,
and bedroom
variables only. (Do not update the data frame.)
select(homes, condition, cooling, bedroom)
## # A tibble: 31,228 x 3
## condition cooling bedroom
## <chr> <chr> <dbl>
## 1 Fair No Central Air 3
## 2 Average No Central Air 6
## 3 Average No Central Air 0
## 4 Substandard No Central Air 1
## 5 Fair No Central Air 2
## 6 Average Central Air 4
## 7 Average Central Air 6
## 8 Average No Central Air 3
## 9 Average Central Air 3
## 10 Average Central Air 3
## # ... with 31,218 more rows
First, let’s review our data with the summary()
function:
# get a summary of dataset variables
summary(homes)
## totalvalue esdistrict finsqft halfbath
## Min. : 4300 Length:31228 Min. : 144 Min. :0.0000
## 1st Qu.: 245800 Class :character 1st Qu.:1386 1st Qu.:0.0000
## Median : 344050 Mode :character Median :1860 Median :1.0000
## Mean : 429529 Mean :2087 Mean :0.6323
## 3rd Qu.: 500200 3rd Qu.:2573 3rd Qu.:1.0000
## Max. :7859000 Max. :9649 Max. :5.0000
## NA's :110
## tmp yearbuilt yearremodeled usecode
## Length:31228 Min. :1668 Min. : 2 Length:31228
## Class :character 1st Qu.:1973 1st Qu.:1992 Class :character
## Mode :character Median :1989 Median :2002 Mode :character
## Mean :1981 Mean :1998
## 3rd Qu.:2002 3rd Qu.:2008
## Max. :2019 Max. :2019
## NA's :952 NA's :28929
## condition cooling bedroom fullbath
## Length:31228 Length:31228 Min. : 0.000 Min. :0.000
## Class :character Class :character 1st Qu.: 3.000 1st Qu.:2.000
## Mode :character Mode :character Median : 3.000 Median :2.000
## Mean : 3.386 Mean :2.324
## 3rd Qu.: 4.000 3rd Qu.:3.000
## Max. :12.000 Max. :9.000
## NA's :1 NA's :1
## totalrooms lotsize landvalue improvementsvalue
## Min. : 0.000 Min. : 0.0000 Min. : 0 Min. : 0
## 1st Qu.: 6.000 1st Qu.: 0.1620 1st Qu.: 80000 1st Qu.: 154600
## Median : 7.000 Median : 0.6855 Median : 117000 Median : 225100
## Mean : 7.073 Mean : 4.7997 Mean : 142752 Mean : 286777
## 3rd Qu.: 8.000 3rd Qu.: 2.7480 3rd Qu.: 151900 3rd Qu.: 340025
## Max. :84.000 Max. :1067.1500 Max. :6651000 Max. :4953700
## NA's :9
## lastsaleprice lastsaledate1 msdistrict hsdistrict
## Min. : 0 Length:31228 Length:31228 Length:31228
## 1st Qu.: 0 Class :character Class :character Class :character
## Median : 184000 Mode :character Mode :character Mode :character
## Mean : 246097
## 3rd Qu.: 351000
## Max. :11925000
## NA's :1
## censustract lastsaledate age remodel
## Min. :101.0 Length:31228 Min. : 0.00 Min. :0.00000
## 1st Qu.:103.0 Class :character 1st Qu.: 17.00 1st Qu.:0.00000
## Median :107.0 Mode :character Median : 30.00 Median :0.00000
## Mean :107.5 Mean : 37.71 Mean :0.07362
## 3rd Qu.:111.0 3rd Qu.: 46.00 3rd Qu.:0.00000
## Max. :114.0 Max. :351.00 Max. :1.00000
## NA's :5
## fp landuse
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.0000 Median :0.00000
## Mean :0.7296 Mean :0.05229
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
##
Here’s a quick review of the school names in our dataset. We can use the distinct()
function to select only “distinct” or unique rows in a data frame.
# review high school, middle school, and elementary school names
distinct(homes, hsdistrict)
## # A tibble: 4 x 1
## hsdistrict
## <chr>
## 1 Albemarle
## 2 Western Albemarle
## 3 Monticello
## 4 Unassigned
distinct(homes, msdistrict)
## # A tibble: 6 x 1
## msdistrict
## <chr>
## 1 Jouett
## 2 Henley
## 3 Sutherland
## 4 Burley
## 5 Unassigned
## 6 Walton
distinct(homes, esdistrict)
## # A tibble: 16 x 1
## esdistrict
## <chr>
## 1 Broadus Wood
## 2 Crozet
## 3 Meriwether Lewis
## 4 Baker-Butler
## 5 Stony Point
## 6 Stone-Robinson
## 7 Unassigned
## 8 Brownsville
## 9 Greer
## 10 Agnor-Hurt
## 11 Woodbrook
## 12 Hollymead
## 13 Murray
## 14 Red Hill
## 15 Cale
## 16 Scottsville
Using the homes
data, display the unique rows in the condition
variable.
distinct(homes, condition)
## # A tibble: 8 x 1
## condition
## <chr>
## 1 Fair
## 2 Average
## 3 Substandard
## 4 Excellent
## 5 Unknown
## 6 Poor
## 7 NULL
## 8 Good
To return specific rows, usefilter()
. When we use filter()
, we can use various logical tests:
filter(homes, esdistrict == "Stone-Robinson") # Look at Stone-Robinson esdistrict homes only
## # A tibble: 3,593 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 105300 Stone-Rob~ 2840 0 3.60~ 1920 NA Single~
## 2 244100 Stone-Rob~ 1614 0 3.60~ 1979 NA Single~
## 3 214900 Stone-Rob~ 1418 0 3.60~ 1976 NA Single~
## 4 139600 Stone-Rob~ 1276 0 0360~ 1754 1981 Single~
## 5 185700 Stone-Rob~ 2050 1 0360~ 1975 NA Single~
## 6 602500 Stone-Rob~ 2396 0 3.60~ 1871 NA Single~
## 7 424100 Stone-Rob~ 2694 0 0360~ 1997 NA Single~
## 8 677900 Stone-Rob~ 2453 1 3.60~ 1989 NA Single~
## 9 1268100 Stone-Rob~ 5603 2 3.60~ 2006 NA Single~
## 10 241700 Stone-Rob~ 1889 0 3.60~ 1910 NA Single~
## # ... with 3,583 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
filter(homes, yearbuilt > 2011) # homes built after 2011
## # A tibble: 2,765 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 586200 Broadus W~ 2993 0 0070~ 2017 NA Single~
## 2 284300 Broadus W~ 1456 0 0070~ 2018 NA Single~
## 3 965600 Broadus W~ 2832 1 7.00~ 2018 NA Single~
## 4 366100 Meriwethe~ 2070 0 0070~ 2018 NA Single~
## 5 191300 Broadus W~ 1363 0 0070~ 2013 NA Single~
## 6 563300 Meriwethe~ 2325 0 0070~ 2016 NA Single~
## 7 786300 Broadus W~ 1908 1 0080~ 2014 NA Single~
## 8 217300 Broadus W~ 2014 0 0080~ 2013 NA Single~
## 9 507200 Broadus W~ 2184 0 8.00~ 2015 NA Single~
## 10 487500 Broadus W~ 2314 0 0080~ 2018 NA Single~
## # ... with 2,755 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
filter(homes, condition %in% c("Poor", "Substandard"))
## # A tibble: 476 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 340600 Broadus W~ 828 0 3.00~ 1754 NA Single~
## 2 199000 Crozet 1538 0 6.00~ 1900 NA Single~
## 3 74200 Crozet 1104 0 6.00~ 1900 NA Single~
## 4 151400 Crozet 1272 0 6.00~ 1960 NA Single~
## 5 123700 Meriwethe~ 1739 0 7.00~ 1910 NA Single~
## 6 213400 Meriwethe~ 1598 0 7.00~ 1920 NA Single~
## 7 128100 Meriwethe~ 904 0 7.00~ 1950 NA Single~
## 8 165400 Broadus W~ 863 0 7.00~ 1754 NA Single~
## 9 92800 Broadus W~ 832 0 0070~ 1940 NA Single~
## 10 822200 Broadus W~ 1936 0 8.00~ 1900 NA Single~
## # ... with 466 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Using the homes
data, keep the observations where finsqft
is less than 1000. (Do not save the object.)
filter(homes, finsqft < 1000)
## # A tibble: 1,709 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 334100 Broadus W~ 480 0 3.00~ 1999 NA Single~
## 2 340600 Broadus W~ 828 0 3.00~ 1754 NA Single~
## 3 93600 Crozet 800 0 0060~ 1960 NA Single~
## 4 124400 Crozet 975 0 0060~ 1966 NA Single~
## 5 382100 Crozet 960 0 6.00~ 1972 NA Single~
## 6 371600 Crozet 671 0 0060~ 1996 NA Single~
## 7 109500 Crozet 618 0 6.00~ 1940 NA Single~
## 8 914000 Crozet 864 0 6.00~ 1936 1992 Single~
## 9 135100 Crozet 952 0 0060~ 1895 NA Single~
## 10 94400 Crozet 884 0 0060~ 1950 NA Single~
## # ... with 1,699 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Let’s look at homes missing and not missing half-bath data.
filter(homes, is.na(halfbath)) # homes missing halfbath data
## # A tibble: 110 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 398800 Broadus W~ 1512 NA 0090~ NA NA Single~
## 2 362000 Broadus W~ 1500 NA 0100~ NA NA Single~
## 3 261900 Broadus W~ 1120 NA 0180~ NA NA Single~
## 4 303100 Broadus W~ 1160 NA 0180~ NA NA Single~
## 5 912700 Broadus W~ 3737 NA 0190~ NA NA Single~
## 6 550900 Broadus W~ 2324 NA 0190~ NA NA Single~
## 7 514200 Broadus W~ 2356 NA 0190~ NA NA Single~
## 8 284400 Broadus W~ 1288 NA 0190~ 2017 NA Single~
## 9 413900 Broadus W~ 3488 NA 0190~ NA NA Single~
## 10 499700 Broadus W~ 2572 NA 0190~ NA NA Single~
## # ... with 100 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
filter(homes, !is.na(halfbath)) # homes not missing halfbath data
## # A tibble: 31,118 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 428600 Broadus W~ 1596 0 3.00~ 1939 NA Single~
## 2 352600 Broadus W~ 1616 0 0030~ 1981 NA Single~
## 3 334100 Broadus W~ 480 0 3.00~ 1999 NA Single~
## 4 340600 Broadus W~ 828 0 3.00~ 1754 NA Single~
## 5 610100 Broadus W~ 1224 0 3.00~ 1880 NA Single~
## 6 933400 Crozet 4875 0 5.00~ 1769 1988 Single~
## 7 2419200 Crozet 5573 1 0050~ 1818 1991 Single~
## 8 1616800 Broadus W~ 1120 0 6.00~ 1935 1994 Single~
## 9 578100 Broadus W~ 1458 0 6.00~ 1880 1978 Single~
## 10 322300 Broadus W~ 1888 0 0060~ 2005 NA Single~
## # ... with 31,108 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Using the homes
data, keep the observations where totalrooms
is missing. (Do not save the object.)
filter(homes, is.na(totalrooms))
## # A tibble: 9 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 471000 Baker-But~ 2243 NA 0210~ NA NA Single~
## 2 375600 Baker-But~ 1822 1 032G~ 2018 NA Single~
## 3 130000 Baker-But~ 1832 1 032G~ 2019 NA Single~
## 4 220100 Agnor-Hurt 1576 0 0450~ NA NA Duplex
## 5 569300 Agnor-Hurt 3084 1 062F~ NA NA Single~
## 6 503000 Agnor-Hurt 2548 1 062F~ 2018 NA Single~
## 7 558300 Agnor-Hurt 2906 1 062G~ NA NA Single~
## 8 525600 Agnor-Hurt 1875 0 062G~ NA NA Single~
## 9 484200 Agnor-Hurt 2587 1 062H~ 2017 NA Single~
## # ... with 18 more variables: condition <chr>, cooling <chr>, bedroom <dbl>,
## # fullbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## # improvementsvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## # msdistrict <chr>, hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>,
## # age <dbl>, remodel <dbl>, fp <dbl>, landuse <dbl>
The %in%
value matching operator can be used to match conditions provided in a vector using the c()
function. You can think of %in%
as is a member of. It can replace multiple conditional statements linked together with OR (i.e., with a vertical bar |
).
Let’s focus on homes that are in the elementary school districts of two schools: Stone-Robinson and Cale. We can do this with a vertical bar |
which is the OR operator.
filter(homes, esdistrict == "Stone-Robinson" | esdistrict == "Cale")
## # A tibble: 6,269 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 105300 Stone-Rob~ 2840 0 3.60~ 1920 NA Single~
## 2 244100 Stone-Rob~ 1614 0 3.60~ 1979 NA Single~
## 3 214900 Stone-Rob~ 1418 0 3.60~ 1976 NA Single~
## 4 139600 Stone-Rob~ 1276 0 0360~ 1754 1981 Single~
## 5 185700 Stone-Rob~ 2050 1 0360~ 1975 NA Single~
## 6 602500 Stone-Rob~ 2396 0 3.60~ 1871 NA Single~
## 7 424100 Stone-Rob~ 2694 0 0360~ 1997 NA Single~
## 8 677900 Stone-Rob~ 2453 1 3.60~ 1989 NA Single~
## 9 1268100 Stone-Rob~ 5603 2 3.60~ 2006 NA Single~
## 10 241700 Stone-Rob~ 1889 0 3.60~ 1910 NA Single~
## # ... with 6,259 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
That’s fine when we only have two matches, but it is easy to see how that could get tedious and lengthy if you have more statements you want to match. That’s where the value matching operator %in%
comes in:
filter(homes, esdistrict %in% c("Stone-Robinson", "Cale"))
## # A tibble: 6,269 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 105300 Stone-Rob~ 2840 0 3.60~ 1920 NA Single~
## 2 244100 Stone-Rob~ 1614 0 3.60~ 1979 NA Single~
## 3 214900 Stone-Rob~ 1418 0 3.60~ 1976 NA Single~
## 4 139600 Stone-Rob~ 1276 0 0360~ 1754 1981 Single~
## 5 185700 Stone-Rob~ 2050 1 0360~ 1975 NA Single~
## 6 602500 Stone-Rob~ 2396 0 3.60~ 1871 NA Single~
## 7 424100 Stone-Rob~ 2694 0 0360~ 1997 NA Single~
## 8 677900 Stone-Rob~ 2453 1 3.60~ 1989 NA Single~
## 9 1268100 Stone-Rob~ 5603 2 3.60~ 2006 NA Single~
## 10 241700 Stone-Rob~ 1889 0 3.60~ 1910 NA Single~
## # ... with 6,259 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Using the value matching operator, filter the homes
dataset so only homes with condition
of “Poor” or “Substandard” remain. (Do not save the object.)
filter(homes, condition %in% c("Poor", "Substandard"))
## # A tibble: 476 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 340600 Broadus W~ 828 0 3.00~ 1754 NA Single~
## 2 199000 Crozet 1538 0 6.00~ 1900 NA Single~
## 3 74200 Crozet 1104 0 6.00~ 1900 NA Single~
## 4 151400 Crozet 1272 0 6.00~ 1960 NA Single~
## 5 123700 Meriwethe~ 1739 0 7.00~ 1910 NA Single~
## 6 213400 Meriwethe~ 1598 0 7.00~ 1920 NA Single~
## 7 128100 Meriwethe~ 904 0 7.00~ 1950 NA Single~
## 8 165400 Broadus W~ 863 0 7.00~ 1754 NA Single~
## 9 92800 Broadus W~ 832 0 0070~ 1940 NA Single~
## 10 822200 Broadus W~ 1936 0 8.00~ 1900 NA Single~
## # ... with 466 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Remember:
select()
returns specific columnsfilter()
returns specific rowsUse arrange()
to sort the observations. Let’s take a closer look at which homes have the highest totalvalue
.
#look for max last sale price
arrange(homes, totalvalue)
## # A tibble: 31,228 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 4300 Scottsvil~ 480 0 128A~ 1754 NA Single~
## 2 9600 Stone-Rob~ 896 0 0650~ 1754 NA Single~
## 3 11800 Brownsvil~ 862 0 054A~ 1950 NA Single~
## 4 12600 Meriwethe~ 672 0 2.90~ 1965 NA Single~
## 5 15300 Red Hill 1024 0 1090~ 1960 NA Single~
## 6 15900 Scottsvil~ 1476 0 128A~ 1920 NA Single~
## 7 16400 Brownsvil~ 2152 0 085A~ 1900 NA Single~
## 8 20200 Red Hill 624 0 1080~ 1956 NA Single~
## 9 20400 Red Hill 912 0 1260~ 1970 NA Single~
## 10 20600 Scottsvil~ 829 0 128A~ 1754 NA Single~
## # ... with 31,218 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Oops, we wanted the highest values at top. Let’s try the desc()
helper function.
arrange(homes, desc(totalvalue))
## # A tibble: 31,228 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 7859000 Stony Poi~ 7680 2 3.50~ 1860 NA Single~
## 2 7766800 Murray 7264 1 060E~ 1939 2000 Single~
## 3 6755600 Stone-Rob~ 1906 0 0920~ 1754 NA Single~
## 4 6448900 Murray 6035 2 060E~ 1942 2000 Single~
## 5 5980300 Stone-Rob~ 2947 0 0910~ 1754 NA Single~
## 6 5886700 Murray 5657 2 0590~ 2009 NA Single~
## 7 5353100 Red Hill 7110 0 1200~ 2014 NA Single~
## 8 5284500 Murray 6270 2 060E~ 1932 2011 Single~
## 9 5251800 Red Hill 7194 3 1210~ 1740 1994 Single~
## 10 5123000 Murray 8467 2 060E~ 1941 2008 Single~
## # ... with 31,218 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Using the homes
data, sort the observations so the smallest values for finsqft
are on top. (Do not save the object.)
arrange(homes, finsqft)
## # A tibble: 31,228 x 26
## totalvalue esdistrict finsqft halfbath tmp yearbuilt yearremodeled usecode
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 228800 Brownsvil~ 144 0 3.80~ 1952 NA Single~
## 2 51500 Scottsvil~ 176 0 130A~ NA NA Single~
## 3 562600 Meriwethe~ 221 0 4.10~ 1754 NA Single~
## 4 107800 Brownsvil~ 231 0 0380~ 1995 NA Single~
## 5 320600 Stone-Rob~ 272 0 6.60~ 1900 NA Single~
## 6 144400 Crozet 300 0 0140~ 1955 NA Single~
## 7 475500 Murray 304 0 5.70~ 1973 NA Single~
## 8 41900 Stone-Rob~ 315 0 6.60~ 1963 NA Single~
## 9 163900 Brownsvil~ 328 0 0710~ 1986 NA Single~
## 10 40500 Crozet 348 0 0570~ 1960 NA Single~
## # ... with 31,218 more rows, and 18 more variables: condition <chr>,
## # cooling <chr>, bedroom <dbl>, fullbath <dbl>, totalrooms <dbl>,
## # lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # lastsaleprice <dbl>, lastsaledate1 <chr>, msdistrict <chr>,
## # hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>, age <dbl>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
When working with data, we often want to know the number of observations found for each categorical variable. For this task, dplyr
provides count()
. Let’s count the number of rows of data for each high school district:
count(homes, hsdistrict)
## # A tibble: 4 x 2
## hsdistrict n
## * <chr> <int>
## 1 Albemarle 12254
## 2 Monticello 10105
## 3 Unassigned 5
## 4 Western Albemarle 8864
count(homes, hsdistrict, condition)
## # A tibble: 26 x 3
## hsdistrict condition n
## <chr> <chr> <int>
## 1 Albemarle Average 9119
## 2 Albemarle Excellent 108
## 3 Albemarle Fair 447
## 4 Albemarle Good 2138
## 5 Albemarle NULL 334
## 6 Albemarle Poor 78
## 7 Albemarle Substandard 30
## 8 Monticello Average 7572
## 9 Monticello Excellent 61
## 10 Monticello Fair 481
## # ... with 16 more rows
Use the count()
function to create a table displaying the counts of fullbath
.
count(homes, fullbath)
## # A tibble: 11 x 2
## fullbath n
## * <dbl> <int>
## 1 0 168
## 2 1 4960
## 3 2 14679
## 4 3 8397
## 5 4 2267
## 6 5 595
## 7 6 125
## 8 7 25
## 9 8 8
## 10 9 3
## 11 NA 1
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
monticello <- filter(homes, hsdistrict == "Monticello")
monticello_intermediate <- select(monticello, totalvalue, age)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
monticello_nest <- select(filter(homes, hsdistrict == "Monticello"), totalvalue, age)
This can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, first filter()
, then select()
).
The best option is a “pipe.” Pipes let you take the output of one function and send it directly to the next, which is useful when you want to apply many functions to the same data frame. The pipe operator looks like %>%
. You can also type CTRL/CMD + Shift + M as a keyboard shortcut. (M since the pipe originally came from the magrittr
package.) It is helpful to say yourself “and then” when you see a pipe.
You don’t need to put the data frame homes
inside the parentheses. It already gets piped in right at the beginning. A pipe moves left to right.
Let’s recreate the above examples with a pipe:
monticello_pipe <- homes %>%
filter(hsdistrict == "Monticello") %>%
select(totalvalue, age)
What’s happening here:
homes
and then,filter()
to retrieve only the observations that have hsdistrict equal to Monticello, and thenselect()
to keep the totalvalue
and age
columns.Using pipes, subset the homes
data to include homes that are associated with the Western Albemarle high school district (hsdistrict
), and retain only the columns yearbuilt
, totalrooms
, and condition
.
Save to a new object: west_alb
.
west_alb <- homes %>%
filter(hsdistrict == "Western Albemarle") %>%
select(yearbuilt, totalrooms, condition)
Let’s take a look at the summary statistics for totalvalue
.
summary(homes$totalvalue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4300 245800 344050 429529 500200 7859000
That max really is a lot higher than the other values. We can use dplyr::mutate()
to create, modify or delete numeric or logical columns.
# use mutate to log transform total value into a new numeric column and to compute new logical variable
homes <- homes %>%
mutate(logtotalvalue = log(totalvalue)) %>%
select(logtotalvalue, everything()) #puts new variable first
homes
## # A tibble: 31,228 x 27
## logtotalvalue totalvalue esdistrict finsqft halfbath tmp yearbuilt
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 13.0 428600 Broadus W~ 1596 0 3.00~ 1939
## 2 12.8 352600 Broadus W~ 1616 0 0030~ 1981
## 3 12.7 334100 Broadus W~ 480 0 3.00~ 1999
## 4 12.7 340600 Broadus W~ 828 0 3.00~ 1754
## 5 13.3 610100 Broadus W~ 1224 0 3.00~ 1880
## 6 13.7 933400 Crozet 4875 0 5.00~ 1769
## 7 14.7 2419200 Crozet 5573 1 0050~ 1818
## 8 14.3 1616800 Broadus W~ 1120 0 6.00~ 1935
## 9 13.3 578100 Broadus W~ 1458 0 6.00~ 1880
## 10 12.7 322300 Broadus W~ 1888 0 0060~ 2005
## # ... with 31,218 more rows, and 20 more variables: yearremodeled <dbl>,
## # usecode <chr>, condition <chr>, cooling <chr>, bedroom <dbl>,
## # fullbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## # improvementsvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## # msdistrict <chr>, hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>,
## # age <dbl>, remodel <dbl>, fp <dbl>, landuse <dbl>
Note that you have to remove missings for summary statistics. You won’t get an error, but the entire column will show up missing (NA).
#you will get an error
homes <- homes %>%
mutate(pricier = lastsaleprice > mean(lastsaleprice)) %>%
select(pricier, everything())
homes
## # A tibble: 31,228 x 28
## pricier logtotalvalue totalvalue esdistrict finsqft halfbath tmp yearbuilt
## <lgl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 NA 13.0 428600 Broadus W~ 1596 0 3.00~ 1939
## 2 NA 12.8 352600 Broadus W~ 1616 0 0030~ 1981
## 3 NA 12.7 334100 Broadus W~ 480 0 3.00~ 1999
## 4 NA 12.7 340600 Broadus W~ 828 0 3.00~ 1754
## 5 NA 13.3 610100 Broadus W~ 1224 0 3.00~ 1880
## 6 NA 13.7 933400 Crozet 4875 0 5.00~ 1769
## 7 NA 14.7 2419200 Crozet 5573 1 0050~ 1818
## 8 NA 14.3 1616800 Broadus W~ 1120 0 6.00~ 1935
## 9 NA 13.3 578100 Broadus W~ 1458 0 6.00~ 1880
## 10 NA 12.7 322300 Broadus W~ 1888 0 0060~ 2005
## # ... with 31,218 more rows, and 20 more variables: yearremodeled <dbl>,
## # usecode <chr>, condition <chr>, cooling <chr>, bedroom <dbl>,
## # fullbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## # improvementsvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## # msdistrict <chr>, hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>,
## # age <dbl>, remodel <dbl>, fp <dbl>, landuse <dbl>
It works as expected if you remove the missings from the calculation of the mean:
homes <- homes %>%
mutate(pricier = lastsaleprice > mean(lastsaleprice, na.rm = TRUE)) %>%
select(pricier, everything())
homes
## # A tibble: 31,228 x 28
## pricier logtotalvalue totalvalue esdistrict finsqft halfbath tmp yearbuilt
## <lgl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 TRUE 13.0 428600 Broadus W~ 1596 0 3.00~ 1939
## 2 TRUE 12.8 352600 Broadus W~ 1616 0 0030~ 1981
## 3 FALSE 12.7 334100 Broadus W~ 480 0 3.00~ 1999
## 4 FALSE 12.7 340600 Broadus W~ 828 0 3.00~ 1754
## 5 FALSE 13.3 610100 Broadus W~ 1224 0 3.00~ 1880
## 6 FALSE 13.7 933400 Crozet 4875 0 5.00~ 1769
## 7 FALSE 14.7 2419200 Crozet 5573 1 0050~ 1818
## 8 TRUE 14.3 1616800 Broadus W~ 1120 0 6.00~ 1935
## 9 FALSE 13.3 578100 Broadus W~ 1458 0 6.00~ 1880
## 10 FALSE 12.7 322300 Broadus W~ 1888 0 0060~ 2005
## # ... with 31,218 more rows, and 20 more variables: yearremodeled <dbl>,
## # usecode <chr>, condition <chr>, cooling <chr>, bedroom <dbl>,
## # fullbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## # improvementsvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## # msdistrict <chr>, hsdistrict <chr>, censustract <dbl>, lastsaledate <chr>,
## # age <dbl>, remodel <dbl>, fp <dbl>, landuse <dbl>
Create a new data frame from the homes
data that meets the following criteria: contains only the hsdistrict
column and a new column called total_baths
containing a value that is equal to the total number of baths (fullbath
plus halfbath
times 0.5). Only the rows where total_baths is greater than 4.5 should be shown in the final data frame.
Hint: think about how the commands should be ordered to produce this data frame!
Save object as: homes_totalbaths
.
homes_totalbaths <- homes %>%
mutate(total_baths = fullbath + halfbath * 0.5) %>%
filter(total_baths > 4.5) %>%
select(hsdistrict, total_baths)
homes_totalbaths
## # A tibble: 949 x 2
## hsdistrict total_baths
## <chr> <dbl>
## 1 Albemarle 5
## 2 Albemarle 5.5
## 3 Western Albemarle 6
## 4 Western Albemarle 5
## 5 Western Albemarle 5.5
## 6 Western Albemarle 6
## 7 Western Albemarle 5.5
## 8 Western Albemarle 5.5
## 9 Western Albemarle 5.5
## 10 Western Albemarle 5
## # ... with 939 more rows
Many data analysis tasks can be approached using the split-apply-combine paradigm:
dplyr
makes this very easy through the use of the group_by()
and summarize()
functions. group_by()
takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. summarize()
collapses each group into a single-row summary of that group.
Let’s compute the median total value by high school district:
homes %>%
group_by(hsdistrict) %>%
summarize(medtotalvalue = median(totalvalue, na.rm = TRUE))
## # A tibble: 4 x 2
## hsdistrict medtotalvalue
## * <chr> <dbl>
## 1 Albemarle 319650
## 2 Monticello 322100
## 3 Unassigned 685500
## 4 Western Albemarle 439700
Note that summarize
and summarise
are synonymous in dplyr.
You can also group by multiple variables. Here we are grouping by high schools and by condition. Let’s also sort the values so that the highest median total value is on top.
# group_by high school district and by condition
homes %>%
group_by(hsdistrict, condition) %>%
summarize(medtotalvalue = median(totalvalue, na.rm = TRUE),
maxtotalvalue = max(totalvalue, na.rm = TRUE)) %>%
arrange(desc(medtotalvalue))
## `summarise()` has grouped output by 'hsdistrict'. You can override using the `.groups` argument.
## # A tibble: 26 x 4
## # Groups: hsdistrict [4]
## hsdistrict condition medtotalvalue maxtotalvalue
## <chr> <chr> <dbl> <dbl>
## 1 Unassigned Good 1368150 1673500
## 2 Western Albemarle Excellent 700800 3508200
## 3 Unassigned Substandard 685500 685500
## 4 Western Albemarle NULL 655600 2604700
## 5 Western Albemarle Unknown 594650 626700
## 6 Western Albemarle Good 541500 7766800
## 7 Monticello NULL 456650 2843100
## 8 Albemarle NULL 441150 2730200
## 9 Western Albemarle Average 428600 5886700
## 10 Monticello Good 394400 5251800
## # ... with 16 more rows
Use group_by()
and summarize()
to find the mean and max number of full baths for each middle school district. (Do not save the object.)
homes %>%
group_by(msdistrict) %>%
summarise(meanbath = mean(fullbath, na.rm = TRUE),
maxbath = max(fullbath, na.rm = TRUE))
## # A tibble: 6 x 3
## msdistrict meanbath maxbath
## * <chr> <dbl> <dbl>
## 1 Burley 2.42 9
## 2 Henley 2.46 8
## 3 Jouett 2.37 9
## 4 Sutherland 2.26 7
## 5 Unassigned 2.4 4
## 6 Walton 1.98 9
Similar to the read_csv()
function we used to read in a csv file, there are write functions that creates files from a data frame. This time let’s use write_rds()
to save to a native R data file.
And since you did all that work, you might want to save a new, clean dataset. Do not write over your raw data!
It’s best practice to put your processed data in a folder separate from your raw data.
write_rds(homes, "../data-processed/albemarle_homes_2020_processed.Rds")
Now that we are finished, we can wrap things up by clicking the “Knit…Knit to HTML” button at the top. We should see our complete document in the Viewer.
If you have never worked with R Markdown before, a “Install Required Packages” dialog will pop up that will ask if you want to install the required packages, which includes rmarkdown and knitr. Click Yes.
There’s only so much we can cover in a 90 minute workshop! Here are more functions and packages to consider:
base::lapply()
: apply a function over a list or vectorbase::union()
, intersect()
, setdiff()
, and setequal()
: set operationsdplyr::bind_rows()
and bind_cols()
: bind data frames by row and columndplyr::left_join()
, right_join()
, inner_join()
, and full_join()
: mutating joins on relational datadplyr::semi_join()
and anti_join()
: filtering joins on relational datadplyr::mutate(across())
: applies the same transformation to multiple columnstidyr::pivot_longer()
and pivot_wider()
to reshape datatidyr::unite()
and separate()
: create new columns by pasting strings together and separate character columns with regular expressionsreadr::parse_factors()
: parse factorslubridate
: tidyverse package to work with dates and timesstringr
: tidyverse package to work with stringsforcats
: tidyverse package for working with categorical variablesjanitor::clean_names()
: clean column names, package offers more data cleaning functionsSimilar to what we covered today:
Expanded coverage of data wrangling:
Previous UVA Library workshops on data wrangling using the Albemarle County homes data:
This workshop was designed in part using Data Carpentry’s “R for Social Scientists - Introducing dplyr and tidyr” workshop, licensed under CC-BY 4.0 2018–2020 by The Carpentries, as well as R-Ladies Sydney RYouWithMe CleanItUp series.