Learning R with the help of AI tools: starting graphing using ggplot2

Learning objectives

Generative AI
Installing R packages
Built-in R data sets and data set packages
ggplot2

Generative AI

Climate Change

AI and your future jobs

AI won’t take your job, but someone who knows how to use AI might. Think of AI as a force multiplier. You have to learn to code first before you can use AI to help you. Google recently reported that about 25% of its new code is AI-generated.

UMass and Generative AI

Copilot

Microsoft designed Copilot to work off of the latest version of OpenAI’s GPT model, GPT-4. GPT-5 is coming soon.

AI and R

RStudio GitHub Copilot
AI Assisted Coding in RStudio
Integrating OpenAI’s ChatGPT into RStudio is now possible with “Chattr”, “GPT Studio” and “GitHub Copilot”. These new tools will help you find the right functions and commands and quickly generate code snippets to save you time.

Vibe coding

GitHub, Git and AI

Bioinformatics and AI

AI and Scholarly Publishing

Introduction to R Graphics

R provides comprehensive graphics utilities for visualizing and exploring scientific data. So far we have made a few plots using R Base Graphics. In addition, several more recent graphics environments extend these utilities. These include the grid, lattice and ggplot2 packages. All have their roles, but the ggplot2 environment, which is part of the tidyverse, has become popular and is now used in many R packages and scientific publications.

ggplot2 and the Grammar of Graphics

ggplot2 is meant to be an implementation of the Grammar of Graphics, hence the gg in ggplot. The basic notion is that there is a grammar to the composition of graphical components in statistical graphics. By directly controlling that grammar, you can generate a large set of carefully constructed graphics from a relatively small set of operations. As Hadley Wickham (2010), the author of ggplot2, said:

“A good grammar will allow us to gain insight into the composition of complicated graphics, and reveal unexpected connections between seemingly different graphics.”
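As a minimal sketch of that idea (not taken from the course materials), the code below builds one plot from three grammar components: a data set, an aesthetic mapping, and layers. It assumes ggplot2 is installed and uses the built-in iris data; the object names base, scatter and scatter_trend are made up for illustration.

```r
library(ggplot2)

# Data + aesthetic mapping: which columns map to x, y and color
base <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species))

# Add a layer of points to the same data and mapping
scatter <- base + geom_point()

# Add a second layer: one straight-line fit per species
scatter_trend <- scatter + geom_smooth(method = "lm", se = FALSE)

# Printing the object draws the plot
scatter_trend
```

Swapping geom_point() for another geom changes the type of graph while the data and mapping stay the same, which is the "small set of operations" the grammar refers to.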

Tutorials and resources

You can make amazing graphs with ggplot, but there is a long learning curve so we will have multiple lab sessions on ggplot and graphing. Here are a few different resources for ggplot.

Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund released the second edition of R for Data Science.
Data Carpentry’s Data Analysis and Visualization in R for Ecologists
For those with a visual learning style there are Maria Nattestad’s YouTube videos
The ggplot cheatsheet

On the Computer

Create and save your Quarto Markdown (qmd) file

Just like last week, we will be writing our code in a Quarto Markdown (qmd) file. Remember to use the following formatting in your YAML block. You can add different themes or change the parameters below, but you need to include the embed-resources: true line in the YAML block.

---
title: "Lab 2 Data Visualization"
author: "You"
format:
  html:
    toc: true
    toc_float: true
    embed-resources: true
execute: 
  warning: false
  message: false
---

Installing and loading R packages

In this course we will work with many different R packages that will need to be installed on your computer. I have already installed most of these packages for students on Posit Cloud. If you are working on your own computer or on Unity, you can install them using Tools > Install Packages. You only need to install a package once!

To work with an R package, load it with the library command. I always load my packages at the beginning of my files.

R code

library(tidyverse)
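If you prefer the Console to the Tools > Install Packages menu, the sketch below checks which of today's packages are missing before installing anything. The helper name missing_packages is hypothetical (it is not part of any package), and the install line is commented out so that rendering your qmd file never triggers an installation.

```r
# Hypothetical helper: return the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this lab
lab_packages <- c("tidyverse", "palmerpenguins", "ggthemes", "ratdat")
to_install <- missing_packages(lab_packages)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # Uncomment and run once in the Console to install them:
  # install.packages(to_install)
}
```

Installing happens once per computer; library() still has to be called in every new R session or qmd file.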

Data for today’s lab

In most labs we will be loading in data from files (e.g. our 23andMe SNP data). Today and next week, for simplicity, we will work with data sets that come with R or are available as R packages.

Data sets (data frames) that come with R

R contains pre-loaded data sets that you will see in many examples posted on the internet. The mtcars and iris data sets are very popular. You can see the whole list by typing data(). This will pop up a window with a list of the data sets. Include #| eval: false within your R code chunk if you want to show but not run code. You can also use the older R Markdown style of including it in the header ```{r eval = FALSE}

R code

data()
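Because data() opens a separate window, it is not very useful inside a rendered report. One possible alternative, sketched here, is to capture the list as an object and print part of it; the variable name ds is only for illustration.

```r
# data() returns an object whose $results element is a character matrix
ds <- data(package = "datasets")$results

# Show the name and description of the first few built-in data sets
head(ds[, c("Item", "Title")])

# Check that the two popular examples are in the list
c("iris", "mtcars") %in% ds[, "Item"]
```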

In class we will talk more about the structure of a data set, which can be summarized using the glimpse command from the tidyverse (the base R str command gives a similar summary).

R code

glimpse(iris)

Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
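For comparison, here is a small sketch of the base R route to the same information, assuming nothing beyond a standard R installation:

```r
# Base R counterpart of glimpse(): type and first values of each column
str(iris)

# Number of rows and columns
dim(iris)

# Species is a factor; list its categories
levels(iris$Species)
```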

You can see the whole data set by typing the name iris or by typing view(iris), which will pop up a window with the data set. However, we don’t want to show all 150 observations (rows) of the iris data set in this document. We can use the head command to show just the first 6 rows.

R code

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
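head() returns six rows unless told otherwise. As a short sketch, the n argument changes that number, and tail() does the same from the bottom of the data frame. The object names first_five and last_three are just for illustration.

```r
# n sets how many rows head() returns (the default is 6)
first_five <- head(iris, n = 5)
first_five

# tail() works the same way from the end of the data frame
last_three <- tail(iris, n = 3)
last_three
```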

Data sets that are part of R packages

R for Data Science uses the palmerpenguins package, “which includes the penguins dataset containing body measurements for penguins on three islands in the Palmer Archipelago, and the ggthemes package, which offers a colorblind safe color palette.” We will load these for our work today. You will likely need to install these packages before loading the libraries.

R code

library(palmerpenguins)
library(ggthemes)
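To see how the two packages fit together, here is a sketch in the spirit of the Chapter 1 examples. It assumes ggplot2, palmerpenguins and ggthemes are installed; the object name penguin_plot and the axis label text are my own choices.

```r
library(ggplot2)
library(palmerpenguins)
library(ggthemes)

# Flipper length vs. body mass, with species shown by color and shape
penguin_plot <- ggplot(
  data = palmerpenguins::penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species), na.rm = TRUE) +
  scale_color_colorblind() +  # colorblind-safe palette from ggthemes
  labs(
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  )

penguin_plot
```

Setting na.rm = TRUE drops the penguins with missing measurements without printing a warning; leave it out if you want to be reminded that rows were removed.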

Data Analysis and Visualization in R for Ecologists uses the ratdat package, which contains a long-term dataset from Portal, Arizona, in the Chihuahuan desert.

R code

library(ratdat)
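As a quick check that the package loaded, the sketch below looks at complete_old, the data frame I understand the Data Carpentry lesson to use, and draws a first scatterplot. It assumes ratdat and ggplot2 are installed; treat the data frame and column names as assumptions to confirm in the package documentation, and rodent_plot as an illustrative name.

```r
library(ggplot2)
library(ratdat)

# First rows of the survey data (assumed name: complete_old)
head(complete_old)

# Weight vs. hindfoot length; alpha makes overlapping points easier to see
rodent_plot <- ggplot(complete_old, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.2, na.rm = TRUE)

rodent_plot
```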

The help command can be used to learn more about the palmerpenguins and ratdat packages. After running the commands below, you can view the package documentation under the Help tab in the bottom-right pane. I used #| eval: false in the code chunk below.

R code

help(package="palmerpenguins")
help(package="ratdat")

Exercises

R for Data Science Chapter 1

Today we will walk through Chapter 1 of R for Data Science. By putting the examples and exercises in our own Quarto Markdown file, we can create our own personal path through the chapter. Make a readable report by delineating the sections (e.g. 1.2.3 Creating a ggplot) with hashtags so they are visible in your report outline. Include all of the example code in the chapter in your report (in addition to the exercises).

Working through the exercises is a great time to explore changing the code with or without Copilot! Answers to all the questions are available online thanks to Martin Lukic and others. I recommend not using these. Instead, learn how to use Copilot to help when you are not sure, and ask me questions during class, in help sessions or by email.

In your report include notes on the places you used Copilot and your prompts. One way to do this would be to have

Exercise 1

Ex 1 Copilot notes

There are probably better ways to do this. Think of one that works for you and clearly communicates to me your strategies.

Ex 1 code chunk

What to upload to Canvas

After you Render the qmd file to an HTML file, export the file to your computer and upload it to Canvas.