Lab 3: Data Transformation with dplyr

Learning objectives

  • Data Transformation using dplyr

Load libraries

R code
library(tidyverse)
library(nycflights13)

Introduction to Data Transformation

Tables

How tables are displayed in your qmd file is different from how they appear once the file is rendered to HTML, PDF, or another output format.
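
One way to see the difference, as a rough sketch: the chunk below shows the same five rows twice, once as the default tibble printout and once passed through knitr::kable(), which produces a formatted table when the file is rendered. The object name small_flights is hypothetical, and kable() is only one of several options for formatted tables.

```r
library(dplyr)
library(knitr)
library(nycflights13)

# A small slice of flights, just for display (hypothetical object name)
small_flights <- flights |>
  select(year, month, day, dep_delay) |>
  head(5)

# Default tibble printout, as you see it while working in the qmd file
small_flights

# The same rows as a formatted table in the rendered document
kable(small_flights)
```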

Pipes and shortcuts

In the last few years the native pipe, |>, was introduced to base R (version 4.1.0) as a simpler alternative to the %>% pipe, which has been used in R and the tidyverse for about the last 10 years. You will still see %>% in many online examples and, at times, in code from generative AI. For most uses in this class the two are interchangeable.
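
As a minimal sketch of that interchangeability, the chunk below runs the same filter with both pipes and compares the results. It assumes dplyr and nycflights13 are installed; the object names native_pipe and magrittr_pipe are made up for this illustration.

```r
library(dplyr)        # also makes %>% available
library(nycflights13)

# Count flights to IAH using the native pipe
native_pipe <- flights |>
  filter(dest == "IAH") |>
  nrow()

# The same count using the magrittr pipe
magrittr_pipe <- flights %>%
  filter(dest == "IAH") %>%
  nrow()

# TRUE when both pipes give the same answer
identical(native_pipe, magrittr_pipe)
```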

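The keyboard shortcut for inserting the |> pipe is Ctrl/Cmd + Shift + M. If the shortcut inserts %>% instead, turn on "Use native pipe operator, |>" under Tools > Global Options > Code in RStudio.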

The keyboard shortcut for a new R code chunk is Ctrl + Alt + I (Cmd + Option + I on a Mac).

Checking each line of code as you write it

Today we will see the following code chunk in Chapter 3:

R code
flights |>
  filter(dest == "IAH") |> 
  group_by(year, month) |> 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
# A tibble: 12 × 3
# Groups:   year [1]
    year month arr_delay
   <int> <int>     <dbl>
 1  2013     1     4.16 
 2  2013     2     5.40 
 3  2013     3    -1.19 
 4  2013     4    14.8  
 5  2013     5     0.972
 6  2013     6    11.1  
 7  2013     7    11    
 8  2013     8     0.705
 9  2013     9   -10.6  
10  2013    10     1.81 
11  2013    11    -1.78 
12  2013    12    14.5  

If I were writing this code, I would run the code chunk after adding each line to make sure I was getting the expected result. Building the pipeline one step at a time also simplifies troubleshooting error messages.

R code
flights |>
  filter(dest == "IAH") 
# A tibble: 7,198 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      623            627        -4      933            932
 4  2013     1     1      728            732        -4     1041           1038
 5  2013     1     1      739            739         0     1104           1038
 6  2013     1     1      908            908         0     1228           1219
 7  2013     1     1     1028           1026         2     1350           1339
 8  2013     1     1     1044           1045        -1     1352           1351
 9  2013     1     1     1114            900       134     1447           1222
10  2013     1     1     1205           1200         5     1503           1505
# ℹ 7,188 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

R code
flights |>
  filter(dest == "IAH") |> 
  group_by(year, month)
# A tibble: 7,198 × 19
# Groups:   year, month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      623            627        -4      933            932
 4  2013     1     1      728            732        -4     1041           1038
 5  2013     1     1      739            739         0     1104           1038
 6  2013     1     1      908            908         0     1228           1219
 7  2013     1     1     1028           1026         2     1350           1339
 8  2013     1     1     1044           1045        -1     1352           1351
 9  2013     1     1     1114            900       134     1447           1222
10  2013     1     1     1205           1200         5     1503           1505
# ℹ 7,188 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

R code
flights |>
  filter(dest == "IAH") |> 
  group_by(year, month) |> 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
# A tibble: 12 × 3
# Groups:   year [1]
    year month arr_delay
   <int> <int>     <dbl>
 1  2013     1     4.16 
 2  2013     2     5.40 
 3  2013     3    -1.19 
 4  2013     4    14.8  
 5  2013     5     0.972
 6  2013     6    11.1  
 7  2013     7    11    
 8  2013     8     0.705
 9  2013     9   -10.6  
10  2013    10     1.81 
11  2013    11    -1.78 
12  2013    12    14.5  
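
Beyond running each line, it can help to spot-check one value of the final table by computing it a second way. The sketch below is one possible check, not the only one: it counts the IAH flights per month and recomputes the January average on its own. The object names rows_per_month and jan_check are hypothetical.

```r
library(dplyr)
library(nycflights13)

# How many IAH flights feed into each monthly average?
rows_per_month <- flights |>
  filter(dest == "IAH") |>
  count(year, month)

rows_per_month

# Recompute the January average directly, without group_by()
jan_check <- flights |>
  filter(dest == "IAH", month == 1) |>
  summarize(arr_delay = mean(arr_delay, na.rm = TRUE))

# Should agree with the month 1 row of the summary table (about 4.16)
jan_check
```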

Assignment

In the first lab we went over assigning a number or a character string to a variable:

x <- 2

The flights pipeline in the previous section, by contrast, does not create a new variable. After running that code, flights is unchanged. This is good in many situations when working with large data, because we don’t want to create new variables that use up more computer memory, and it is easier to keep track of fewer variables. If we wish to save the end result, we can assign it to a new variable (e.g. IAH_arr_delay_by_month):

R code
IAH_arr_delay_by_month <- flights |>
  filter(dest == "IAH") |> 
  group_by(year, month) |> 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )

Notice that nothing prints out. The new table is stored in the data object IAH_arr_delay_by_month. You can now use this object repeatedly in your code without rerunning the larger code chunk above each time. You can view it by using view(IAH_arr_delay_by_month) or by clicking on the object in the Environment window.
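
As a small sketch of reusing the saved object, the chunk below repeats the assignment so it runs on its own, confirms that flights still has its original shape, and then sorts the saved table. The name worst_months is hypothetical, and sorting is only one example of a follow-up step.

```r
library(dplyr)
library(nycflights13)

# Same assignment as above, repeated so this chunk is self-contained
IAH_arr_delay_by_month <- flights |>
  filter(dest == "IAH") |>
  group_by(year, month) |>
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )

# flights itself is untouched: still the full data set with 19 columns
dim(flights)

# Reuse the saved table: months with the largest average arrival delay first
worst_months <- IAH_arr_delay_by_month |>
  arrange(desc(arr_delay))

worst_months
```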

Writing pseudo code

Was there a flight in every month of 2013?

Before writing any code, it is best to break the question down into the tasks we need to accomplish:

  1. Filter the flights data set to the year 2013
  2. Keep only one row for each month
  3. Display the table to see if each month is present, or count the rows to see if there are 12

Breaking the problem into steps is actually the hard part of solving a coding challenge. Writing the code is relatively easy once you know the steps. Defining the steps clearly is also the greatest challenge when using generative AI to assist you with coding.

R code
flights |> 
  filter(year == 2013) |> 
  distinct(month)
# A tibble: 12 × 1
   month
   <int>
 1     1
 2    10
 3    11
 4    12
 5     2
 6     3
 7     4
 8     5
 9     6
10     7
11     8
12     9
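
The table above answers the question by eye. Step 3 also mentions counting the rows, which could be sketched as follows. The pseudo code steps appear as comments, and the object name n_months is hypothetical.

```r
library(dplyr)
library(nycflights13)

n_months <- flights |>
  filter(year == 2013) |>   # 1. keep only flights from 2013
  distinct(month) |>        # 2. one row per month
  nrow()                    # 3. count the rows

# TRUE if every month of 2013 has at least one flight
n_months == 12

# Optional: the same table sorted so the months read 1 to 12
flights |>
  filter(year == 2013) |>
  distinct(month) |>
  arrange(month)
```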

Exercises

R for Data Science Chapter 3.

Today we will walk through Chapter 3, Data Transformation, in R for Data Science. As we did last week, by putting the examples and exercises in our own Quarto Markdown file, we can create our own personal path through the chapter.

What to upload to Canvas

After you render the qmd file to an HTML file, export the HTML file to your computer and upload it to Canvas.