Lab 3: Data Transformation with dplyr

Learning objectives

  • Data Transformation using dplyr

Load libraries

R code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
R code
library(nycflights13)

Introduction to Data Transformation

Tables

How tables are displayed in your qmd file differs from how they are rendered in HTML, PDF, and other output files.

Pipes

The native pipe, |>, was introduced in R 4.1.0 as a simpler alternative to the %>% pipe from the magrittr package, which has been used in R and the tidyverse for about 10 years. You will still see %>% in many online examples. For most uses in this class the two are interchangeable.

In RStudio, Ctrl/Cmd + Shift + M inserts a pipe.

Ctrl + Alt + I (Cmd + Option + I on a Mac) inserts a new code chunk.
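
As a small illustration of how the two pipes can be swapped, the sketch below writes one pipeline twice, once with each pipe, and compares the results. It assumes R 4.1 or later (needed for |>) and that dplyr and nycflights13 are installed. The names with_native and with_magrittr are made up for this example.

```r
library(dplyr)         # provides filter(), count(), and re-exports %>%
library(nycflights13)  # provides the flights data set

# The same pipeline written with the native pipe
with_native <- flights |>
  filter(dest == "IAH") |>
  count(month)

# ... and with the magrittr pipe
with_magrittr <- flights %>%
  filter(dest == "IAH") %>%
  count(month)

# Check whether the two results are the same object-for-object
identical(with_native, with_magrittr)
```

The pipes do differ in some more advanced uses (for example, how a placeholder is written), which is why the text says "most uses" rather than all.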

Checking each line of code as you write it

Today we will see the following code chunk in Chapter 3:

R code
flights |>
  filter(dest == "IAH") |> 
  group_by(year, month) |> 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 12 × 3
# Groups:   year [1]
    year month arr_delay
   <int> <int>     <dbl>
 1  2013     1     4.16 
 2  2013     2     5.40 
 3  2013     3    -1.19 
 4  2013     4    14.8  
 5  2013     5     0.972
 6  2013     6    11.1  
 7  2013     7    11    
 8  2013     8     0.705
 9  2013     9   -10.6  
10  2013    10     1.81 
11  2013    11    -1.78 
12  2013    12    14.5  

If I were writing this code, I would check each line as I wrote it, both to make sure I was getting the right result and to make error messages easier to troubleshoot.

R code
flights |>
  filter(dest == "IAH") 
# A tibble: 7,198 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      623            627        -4      933            932
 4  2013     1     1      728            732        -4     1041           1038
 5  2013     1     1      739            739         0     1104           1038
 6  2013     1     1      908            908         0     1228           1219
 7  2013     1     1     1028           1026         2     1350           1339
 8  2013     1     1     1044           1045        -1     1352           1351
 9  2013     1     1     1114            900       134     1447           1222
10  2013     1     1     1205           1200         5     1503           1505
# ℹ 7,188 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
R code
flights |>
  filter(dest == "IAH") |> 
  group_by(year, month)
# A tibble: 7,198 × 19
# Groups:   year, month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      623            627        -4      933            932
 4  2013     1     1      728            732        -4     1041           1038
 5  2013     1     1      739            739         0     1104           1038
 6  2013     1     1      908            908         0     1228           1219
 7  2013     1     1     1028           1026         2     1350           1339
 8  2013     1     1     1044           1045        -1     1352           1351
 9  2013     1     1     1114            900       134     1447           1222
10  2013     1     1     1205           1200         5     1503           1505
# ℹ 7,188 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
R code
flights |>
  filter(dest == "IAH") |> 
  group_by(year, month) |> 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 12 × 3
# Groups:   year [1]
    year month arr_delay
   <int> <int>     <dbl>
 1  2013     1     4.16 
 2  2013     2     5.40 
 3  2013     3    -1.19 
 4  2013     4    14.8  
 5  2013     5     0.972
 6  2013     6    11.1  
 7  2013     7    11    
 8  2013     8     0.705
 9  2013     9   -10.6  
10  2013    10     1.81 
11  2013    11    -1.78 
12  2013    12    14.5  

Assignment

In the first lab we went over assigning a number or a character string to a variable:

x <- 2

In the same way, we can assign the result of the pipeline above to a new variable, IAH_arr_delay_by_month:

R code
IAH_arr_delay_by_month <- flights |>
  filter(dest == "IAH") |> 
  group_by(year, month) |> 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Notice that no table prints out. The new table is stored in the data object IAH_arr_delay_by_month. You can now use this object repeatedly in your code without rerunning the larger code chunk above each time. You can view IAH_arr_delay_by_month by using view(IAH_arr_delay_by_month) or by clicking on the object in the Environment window.
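
A minimal sketch of reusing the stored object is shown here. It rebuilds IAH_arr_delay_by_month so that it runs on its own, and it adds .groups = "drop", which is one way to respond to the summarise() message about grouped output. The 10-minute cutoff and the name late_months are arbitrary choices for illustration.

```r
library(dplyr)
library(nycflights13)

# Build the summary table once and store it
IAH_arr_delay_by_month <- flights |>
  filter(dest == "IAH") |>
  group_by(year, month) |>
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble, so no grouping message
  )

# Reuse the stored table: sort months from longest to shortest average delay
IAH_arr_delay_by_month |>
  arrange(desc(arr_delay))

# Reuse it again: keep months whose average arrival delay exceeds 10 minutes
late_months <- IAH_arr_delay_by_month |>
  filter(arr_delay > 10)
late_months
```

Neither of the last two pipelines changes IAH_arr_delay_by_month itself; each one only reads from it.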

Writing pseudo code

Was there a flight in every month of 2013?

Before writing any code, it is best to break the question down into the tasks we need to accomplish:

  1. filter the flights data set to the year 2013
  2. show only 1 row for each month
  3. display the table to see if each month is present, or count the rows to see if there are 12

This is actually the hard part of solving a coding challenge. Writing the code is relatively easy once you know the steps.

R code
flights |> 
  filter(year == 2013) |> 
  distinct(month)
# A tibble: 12 × 1
   month
   <int>
 1     1
 2    10
 3    11
 4    12
 5     2
 6     3
 7     4
 8     5
 9     6
10     7
11     8
12     9
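
Step 3 also mentions counting the rows instead of reading the table. One possible way to do that is sketched below. The name n_months is made up for this example, and arrange() is added only so the months print in calendar order.

```r
library(dplyr)
library(nycflights13)

# Steps 1 and 2, with the months sorted so the table is easier to scan
flights |>
  filter(year == 2013) |>
  distinct(month) |>
  arrange(month)

# Step 3 as a count: how many distinct months have at least one flight?
n_months <- flights |>
  filter(year == 2013) |>
  distinct(month) |>
  nrow()

# TRUE when all 12 months are present
n_months == 12
```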

Exercises

R for Data Science Chapter 3.

Today we will walk through Chapter 3, Data Transformation, in R for Data Science. As we did last week, by putting the examples and exercises in our own Quarto Markdown file, we can create our own personal path through the chapter.

What to upload to Canvas

After you Render the qmd file to an HTML file, export the file to your computer and upload it to Canvas.