In recent years, the field of genomic analysis has shifted towards requiring some knowledge of R, Python/Perl/C and the use of high performance computers (often requiring some fundamental Unix skills) available at national computing centers for working with large data sets. While there are many great software packages available for particular computational problems in evolutionary biology, many of them do not have a graphical user interface (e.g. drop-down menus and such) and are run from the command line. The lab sessions in this course have been designed to give students an introduction to working with R and packages used for Human Genome Analysis.
The lab course is divided into 3 parts
Reproducibility is the hallmark of science, which is based on empirical observations coupled with explanatory models. While reproducibility encompasses the full science lifecycle, and includes issues such as methodological consistency and treatment of bias, in this course we will focus on computational reproducibility: the ability to document data, analyses, and models sufficiently for other researchers to be able to understand and ideally re-execute the computations that lead to scientific results and conclusions. With current publishing practices, this can be difficult because data are typically unavailable, the method sections of papers do not detail the computational approaches used, and analyses and models are often conducted in graphical programs, or, when scripted analyses are employed, the code is not available. In this course we will learn how to write code that is integrated into reproducible reports.
Here are a few links that I will go over in lab:
R is one of the largest and most comprehensive freely available statistical computing environments. The core R distribution is enhanced by thousands of user-contributed add-on packages, including many for gene expression analysis, in the Comprehensive R Archive Network (CRAN) and the Omegahat Project for Statistical Computing. Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data and is based primarily on the R programming language. R and Bioconductor are free, open source and available for Windows, macOS and a wide variety of UNIX platforms.
The most popular way to write R programs, run code interactively and create graphs is the RStudio Integrated Development Environment (IDE). It is open source software that is available for free. There are other ways to write and run R code, such as text editors, VS Code, Neovim or Jupyter Notebooks, but we will focus on RStudio in this class.
Many introductory and advanced tutorials have been developed for R. Here are a few
There are also many workshops and online R courses that you could take to follow up what you learn in this class.
Create an account on Posit Cloud and log in. I will share the link for our Workspace in an Announcement on Canvas. The steps are
If you are a graduate student in 678, please set up an account (you will need your PI’s approval) on Unity https://unity.rc.umass.edu/, the UMass High Performance Computing cluster. You will be using R and RStudio from the Unity HPC. You are also welcome to use R and RStudio from your own computer.
Unity staff maintain a Slack channel to help solve bioinformatics-related issues. To join the Unity Slack community, please sign up with your UMass email here. If you’re unable to register with your school email, please contact hpc@umass.edu with your preferred email address and they’ll send you a direct invite.
I run R and RStudio on my computer. You can too. Everything we do in this class you should be able to do from your laptop.
Install the latest release of R, R-4.4.1 (2024-06-14, Race for Your Life), from CRAN and follow the installation instructions. If you have an older version of R on your computer, please update to this release, as I can’t guarantee the labs will work on older versions.
Install RStudio, a nice graphical interface for working with R.
Open RStudio and install tidyverse under Tools > Install Packages. You will also need to install other packages for this and future labs.
The default RStudio appearance includes 4 windows.
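The same installation can also be done from the console instead of the menu. The sketch below is one possible way to do it: `install_if_missing` is a hypothetical helper name made up for this example (it is not part of the lab or of any package), and it only calls `install.packages()` for packages that are not already installed.

```r
# Hypothetical helper (an assumption for illustration, not part of the lab):
# install only those packages that are not already available.
install_if_missing <- function(pkgs) {
  # TRUE for each package that can already be loaded
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  # Return (invisibly) the names of the packages that had to be installed
  invisible(missing_pkgs)
}

# Uncomment to run; this downloads from CRAN and can take several minutes.
# install_if_missing("tidyverse")
```

Running installation commands in the console rather than in a Quarto document keeps the download from being repeated every time the report is rendered.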
In this window you can type directly into a file, run code and save the file for reuse. In this class we will work with Quarto files (discussed below).
The console is where you can type R commands and see output.
Type 3 + 3 in the console and press Enter to see the result.
To better document and save your code write it in the Quarto documents rather than the console. On occasion we will use the console to access documentation and for other purposes.
The Environment tab shows all the active objects. If you have a data frame loaded, clicking on the object will let you view the table. The History tab shows a list of commands used so far.
There are data sets that come with R and are used in tutorials. If you run the command plot(cars) in the console, you will see a graph of the cars data set in the Plots window.
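Before plotting a built-in data set it can help to look at what it contains. The lines below are a small sketch using only base R functions on the cars data set; the choice of which functions to show is mine, not something the lab prescribes.

```r
# cars ships with base R (datasets package), so no loading step is needed
dim(cars)      # number of rows and columns: 50 2
names(cars)    # column names: "speed" "dist"
head(cars, 3)  # first three rows of the table
summary(cars)  # basic statistics for each column
```

Typing ?cars in the console opens the documentation page that describes where the data came from.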
Quarto is a scientific publishing system. In this class we will use one of its simplest features, producing a report with the code and resulting output (graphs, tables, statistical analysis). Quarto can also be used to produce slides, web sites, scientific manuscripts and books. For example, all the labs for this course and my research laboratory website were made using Quarto. Quarto wraps together many previous packages used for publishing with R.
To use Quarto with R, the rmarkdown R package must be installed. There are some differences between Quarto and R Markdown documents, but overall they are very similar.
In RStudio select File > New File > Quarto Document. Add a title and your name, then create the document. Notice that your file says untitled with an asterisk. Save your file (e.g. lab1). This will automatically add the .qmd extension to your file (lab1.qmd). ALWAYS SAVE YOUR FILE BEFORE YOU START WORKING AND OFTEN WHILE WORKING.
Click on the Render icon. This will run the code, show the output and create an HTML file that is automatically saved to your directory (look for the template.html file) and automatically opened in your browser. Now let’s go back to the template.qmd file. The top section of the document, delineated by ---, is called the YAML block. In this template it contains the title, your name, the output type (html) and the editor preference (visual). You can also work directly with the source code by clicking the Source icon, which is often quicker once you’ve learned R Markdown.
We use the YAML block more fully below when we create the lab report.
The text with the white background is in R Markdown. With the icons in the same section as the Visual icon you can easily make text bold or italic, change text from normal to a header, create bulleted or numbered lists, insert HTML links, add images, insert tables and more.
The text with the gray background is in R code chunks. Click on the green play icon in the top right corner of the code chunk to run the code.
Create a new code chunk by clicking on the green +C icon to the right of the Render icon. In the code chunk type plot(cars). Then run the code to see a graph of the cars data set that comes preloaded in R.
plot(cars)
The following lines of code in your YAML block will generate a table of contents (toc) as shown at the top of this lab. The line with embed-resources creates a standalone HTML file. If this line is not present, a folder called template_files will be created in addition to the template.html file. In that case your template.html will not have the proper format when you turn it in.
---
title: "Lab 1"
author: "Jeff Blanchard"
format:
  html:
    toc: true
    toc_float: true
    embed-resources: true
editor: visual
---
Each exercise in your lab report must be labeled with a header so that it appears in the table of contents.
If you are working on Posit Cloud or Unity, you will first need to export (download) the Lab1_yourname.html file to your computer, then upload the file to Canvas. In the bottom right corner, click on the wheel icon, then select Export.
All R statements where you create objects are called assignment statements and have the form “object_name <- value”.
x <- 3
Simply typing x will give the value of x
x
[1] 3
You will make lots of assignments and <- is a pain to type. Instead, use RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automagically surrounds <- with spaces, which is a good code formatting practice. An equals sign = will work in place of <-, but it will cause confusion later so keep to the convention of using <- to make assignments
Object names must start with a letter, and can only contain letters, numbers, underscores and periods. You want your object names to be descriptive, so you’ll need a convention for multiple words. I recommend snake_case where you separate lowercase words with an underscore. Note that R is case sensitive, e.g., object names gene, GENE, Gene are all different.
genome_size <- 3100000000
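The naming rules and case sensitivity described above can be checked directly in R. This is only a sketch: the candidate names are made up for illustration, and comparing each name with the output of the base function `make.names()` is one convenient way (my choice, not the lab's) to test whether a name is syntactically valid.

```r
# Made-up candidate names, some valid and some not
candidate_names <- c("genome_size", "gene.count2", "2nd_gene", "gene-count", "_gene")

# make.names() rewrites invalid names, so a name is valid if it comes back unchanged
is_valid <- make.names(candidate_names) == candidate_names
is_valid  # TRUE TRUE FALSE FALSE FALSE

# R is case sensitive: these are three different objects
gene <- 1
Gene <- 2
GENE <- 3
c(gene, Gene, GENE)  # 1 2 3
```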
Important note: since there are many built-in functions in R, make sure that the new object names you assign are not already used by the system. A simple way of checking this is to type in the name you want to use. If the system returns an error message telling you that such object is not found, it is safe to use the name.
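As an alternative to typing the name and waiting for an error, the base function `exists()` reports whether a name is already in use. In this sketch `codon_table` is a made-up name assumed not to be defined in your session.

```r
# TRUE: mean is a built-in function, so avoid reusing this name
exists("mean")

# FALSE in a fresh session: the (hypothetical) name codon_table is free to use
exists("codon_table")
```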
A character object is used to represent string values in R. It is defined by enclosing the text in double quotes.
DNA <- "ATGAAA"
DNA
[1] "ATGAAA"
A vector is a sequence of data elements of the same basic type. The data elements in a vector are officially called components. The assignment operator (<-) stores the value on its right side in the object named on its left side. Once assigned, the object can be used like any other value in a computation. The c function concatenates its components into a vector.
random_numbers <- c(1,10,100)
random_numbers
[1] 1 10 100
Now you can do scalar computations on a vector
random_numbers * 2
[1] 2 20 200
or use sum, sort, min, max, length and many other operations. For example
sort(random_numbers)
[1] 1 10 100
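The other operations mentioned above work the same way. The lines below are a short sketch applying each of them to the same vector; `decreasing = TRUE` is a standard argument of `sort()` added here to show a result that differs from the input.

```r
random_numbers <- c(1, 10, 100)

sum(random_numbers)     # 111
min(random_numbers)     # 1
max(random_numbers)     # 100
length(random_numbers)  # 3

# Sort from largest to smallest instead of the default ascending order
sort(random_numbers, decreasing = TRUE)  # 100 10 1
```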
You can also do vector arithmetic
random_numbers <- c(1,10,100)
y <- c(1,2,3)
random_numbers * y
[1] 1 20 300
Vectors can also be made of characters
codons <- c("AUG", "UAU", "UGA")
codons
[1] "AUG" "UAU" "UGA"
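Many functions accept character vectors too. As a small sketch using base R only, the lines below count the components, count the letters in each component, and join the codons into a single string; the choice of functions is mine and goes slightly beyond what the lab covers.

```r
codons <- c("AUG", "UAU", "UGA")

length(codons)                # 3 components in the vector
nchar(codons)                 # 3 3 3 letters in each codon
paste(codons, collapse = "")  # "AUGUAUUGA"
```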
The main goal for today’s lab is to create the lab report so we need a few exercises to fill it out
For x = 2 and y = 15, compute the sum and difference of x and y
Create a vector of the values 22, 62, 148, 43 and 129. Multiply the vector by 5.
Create a vector of the nucleotides A, T, C and G. Remember to put double quotes around each letter. Arrange the nucleotides alphabetically using the sort function sort(vector_name)