Lab 1 - Overview & Getting Started

Overview

Learning objectives

  • What is bioinformatics?
  • What is reproducible research?
  • Why learn bioinformatics and data science skills?
  • High Performance Computing (Unity)
  • Overview of the R statistical programming language

Overview

In recent years, the field of genomic analysis and bioinformatics has shifted towards requiring some knowledge of R, Python, Perl or C and the use of high performance computers (often requiring some fundamental Unix skills), such as those available at national computing centers, for working with large data sets. While there are many great software packages available for particular computational problems in evolutionary biology, many of these programs do not have a graphical user interface (e.g. drop-down menus) and are run from the command line. The lab sessions in this course have been designed to give students an introduction to working with R and packages used for genome and metagenome analyses. We are using recently released National Ecological Observatory Network data to design a Course-based Undergraduate Research Experience (CURE). In the first 4 weeks we will discuss the project space, discuss research ideas, formulate testable hypotheses or discovery-driven approaches, and design the experimental approaches. As a preview, today I will give an overview of the project space.

What are Bioinformatics and Data Science?

Bioinformatics is the field of science in which biology, computer science, statistics and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics:

  • The development of new algorithms and statistics with which to assess relationships among members of large data sets.
  • The development and implementation of tools that enable efficient access and management of different types of information.
  • The analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures.

Bioinformatics…

  • is a term coined in response to the high demand of techniques and resources for handling the explosion of molecular data.
  • is a buzzword to describe a growing field.
  • benefits from the physicists, chemists and mathematicians crossing over into biology.
  • is a collection of tools.
  • is a way of thinking about a problem!

The Development and Implementation of Tools

In order to make new algorithms and data sources available to biologists, someone needs to write applications that include these algorithms and create new databases. Often this is first done by academic research groups and later redone by private companies once the market is large and profitable enough. There is a large gap between what is done by research groups and what is done by companies. Sometimes this gap is filled by large government-funded projects, but usually not in time for most researchers. This is why bioinformatics and programming skills have become very valuable.

Data Science

The field of data science has grown tremendously over the last decade, and the two programming languages most used in analyzing genomic data, R and Python, are also the most popular languages for data science. This makes it easy to transfer bioinformatics skills to diverse fields.

Here are a few links that I will go over in lab:

R

R is the largest and most comprehensive freely available statistical computing environment. The core R distribution is enhanced by thousands of user-supplied add-on packages, including many for gene expression analysis, in the Comprehensive R Archive Network (CRAN) and the Omegahat Project for Statistical Computing. Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data and is based primarily on the R programming language. R and Bioconductor are free, open source and available for Windows, macOS and a wide variety of Unix platforms.

Reproducible Research

Reproducibility is the hallmark of science, which is based on empirical observations coupled with explanatory models. While reproducibility encompasses the full science lifecycle, and includes issues such as methodological consistency and treatment of bias, in this course we will focus on computational reproducibility: the ability to document data, analyses, and models sufficiently for other researchers to be able to understand and ideally re-execute the computations that lead to scientific results and conclusions. With current publishing practices, this can be difficult because data are typically unavailable, the method sections of papers do not detail the computational approaches used, and analyses and models are often conducted in graphical programs, or, when scripted analyses are employed, the code is not available. In this course we will learn how to write code that is integrated into reproducible reports.

R manuals, help and tutorials

Many introductory and advanced tutorials have been developed for R. Here are a few:

There are also many workshops and online R courses that you could take to follow up what you learn in this class.

GitHub

GitHub has become a popular way to manage, share and view code for open source projects. The tutorials created for this course will be written in Quarto and posted on GitHub. Thus, you will be able to continue to see course materials after the end of the semester. You will use and make a GitHub web site for your research project.

Generative AI

You all are part of the first generation of generative AI users. The saying goes “AI won’t take your job, but someone who knows how to use AI might.” Think of AI as a force multiplier. You have to learn to code and clearly state your problems before AI can help you. This class will fully use generative AI in hopes that it will challenge us to think more creatively about problems and not stress out about syntax. Think first…ask questions…code…solve problem! We will use the UMass version of Microsoft Copilot Chat and Copilot integrated into RStudio. We will embrace “Vibe coding” (see What is Vibe Coding, Exactly?) and ride the waves.

High Performance Computing (Unity) and Unix

In this course we will learn to write basic Unix commands, run bioinformatics software from the command line and allocate computer resources for submitting large jobs. UMass has a modern High Performance Computing system, Unity, and excellent staff members to help get you going and troubleshoot issues. Working on HPCs has become much easier with the advent of web interfaces that look and work much like software running on your computer.

DIYers

You can do all of the R-based labs on your own computer. Follow these directions by the makers of RStudio, Posit. You will need to download the lab files from Unity or the course GitHub site.

On the computer

Accessing the course Unity resources

If you haven’t already done so, request a Unity HPC account and access to our course directory:

  • Go to Unity
  • Request an account
  • Go to MyPIs, click on the + button and enter pi_bio678_umass_edu

R and RStudio

RStudio using Open OnDemand

Open OnDemand makes supercomputing accessible through a web portal.

  • Go to Unity
  • On the left menu select OpenOnDemand
  • In the top menu select Interactive Apps then RStudio
  • Set the job duration for 4 hrs to cover the length of the lab. Otherwise set the time to what you anticipate needing.
  • Unless otherwise suggested, set CPU Core Count to 2 and the memory to 8 GB.
  • Click Launch. It takes about a minute for the job to start, and then you can launch the RStudio interface.

RStudio Interface

The default RStudio appearance includes 4 windows.

  1. The R script(s) and data view (upper left window).
  2. Console (bottom left window).
  3. Environment and history (upper right window).
  4. Files, plots, packages and help (bottom right window).

RStudio Screenshot

The R script(s) and data view window (upper left window)

In this window you can type directly into a file, run code and save the file for reuse. In this class we will work with Quarto files (discussed below).

Console Window (bottom left window)

The console is where you can type R commands and see output.

Type

3 + 3

To better document and save your code, write it in Quarto documents rather than in the console. On occasion we will use the console to access documentation and for other purposes.

Environment and History tabs (upper right window)

The Environment tab shows all the active objects. If you have a data frame loaded, clicking on the object will let you view the table. The History tab shows a list of the commands used so far.

Files, Plots, Packages and Help (bottom right window)

There are data sets that come with R and are used in many tutorials. If you run plot(cars) in the console, you will see a graph of the cars data set in the Plots tab of this window.
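As a minimal sketch of taking a first look at one of these built-in data sets, the commands below inspect cars (part of the datasets package that ships with R) before plotting it. The choice of head, str and summary is just one reasonable way to explore a table, not a required workflow.

```r
# cars ships with R in the datasets package, so nothing needs to be installed
head(cars)      # first six rows of the table
str(cars)       # column names and data types
summary(cars)   # basic summary statistics for each column

# Draw the scatterplot; in RStudio it appears in the Plots tab
plot(cars)

# Open the documentation for the data set; in RStudio it appears in the Help tab
?cars
```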

Quarto

Quarto is a scientific publishing system. In this class we will use one of its simplest features, producing a report with the code and resulting output (graphs, tables, statistical analysis). Quarto can also be used to produce slides, web sites, scientific manuscripts and books. For example, all the labs for this course and my research laboratory website were made using Quarto. Quarto wraps together many previous packages used for publishing with R.

To use Quarto with R, the rmarkdown R package must be installed. There are some differences between a Quarto and an R Markdown document, but overall they are very similar.

Producing Lab Reports with Quarto

In RStudio select File > New File > Quarto Document. Add a title (e.g. Lab 1) and your name, then create the document. Notice that your file says untitled with an asterisk. Save your file (e.g. lab1). This will automatically add the .qmd extension to your file (lab1.qmd). ALWAYS SAVE YOUR FILE BEFORE YOU START WORKING AND OFTEN WHILE WORKING.

The top section of the document, delineated by the --- lines, is called the YAML block. In this template it contains the title, your name, the output type (html) and the editor preference (visual). You can also work directly with the source code of your file by clicking the Source icon.

The following lines of code in your YAML block will generate a table of contents (toc) as shown at the top of this lab. The line with embed-resources creates a stand-alone html file. A template with these lines is also available as lab_template.qmd in our course directory /work/pi_bio678_umass_edu

---
title: "Lab 1"
author: "Your name"
format:
  html:
    toc: true
    embed-resources: true
editor: visual
execute: 
  warning: false
  message: false
---

The text with the white background is Markdown. Using the icons in the same toolbar as the Visual icon, you can easily make text bold or italic, change text from normal to a header, create bulleted or numbered lists, insert html links, add images, insert tables and more.

The text with the gray background is in R code chunks. Click on the green play icon in the top right corner of the code chunk to run the code.

Click on the Render icon. This will run the code, show the output and create an html file that is automatically saved to your directory (look for the lab1.html file) and automatically opened in your browser.

Create a new code chunk by clicking on the green +C icon to the right of the Render icon. In the code chunk type plot(cars). Then run the code to see a graph of the cars data set that comes preloaded in R.

R code
plot(cars)

Writing R code

Assignment statements

All R statements where you create objects are called assignment statements and have the form “object_name <- value”.

R code
x <- 3

Simply typing x will give the value of x

R code
x
[1] 3

You will make lots of assignments and <- is a pain to type. Instead, use RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automagically surrounds <- with spaces, which is a good code formatting practice. An equals sign = will work in place of <-, but it will cause confusion later, so keep to the convention of using <- to make assignments.
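The short sketch below illustrates one source of that confusion. At the top level of a script both operators create an object, but inside a function call = only names an argument. The object names x_copy and rounded are made up for this illustration.

```r
x <- 3        # preferred: the assignment arrow
x_copy = 3    # also works at the top level, but avoid it

identical(x, x_copy)   # TRUE: both objects hold the same value

# Inside a function call, = sets an argument (here digits) rather than
# creating a new object, which is why <- is kept for assignments
rounded <- round(3.14159, digits = 2)
rounded   # 3.14
```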

Object Names

Object names must start with a letter, and can only contain letters, numbers, underscores and periods. You want your object names to be descriptive, so you’ll need a convention for multiple words. I recommend snake_case where you separate lowercase words with an underscore. Note that R is case sensitive, e.g., object names gene, GENE, Gene are all different.

R code
genome_size <- 3100000000
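To make the point about case sensitivity concrete, here is a small sketch. The gene symbols are arbitrary values chosen for illustration and are not part of the course data.

```r
# Three names that differ only in capitalization are three separate objects
gene <- "BRCA1"
Gene <- "TP53"
GENE <- "MYC"

gene   # "BRCA1"
Gene   # "TP53"
GENE   # "MYC"

# snake_case keeps multi-word names readable
gene_count <- 3
```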

Important note: since there are many built-in functions in R, make sure that the new object names you assign are not already used by the system. A simple way of checking this is to type in the name you want to use. If the system returns an error message telling you that such object is not found, it is safe to use the name.
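Another way to run this check, sketched here, is the base R function exists(), which returns TRUE when a name is already in use and FALSE when it is free. The name sequence_count is hypothetical and only serves as an example.

```r
exists("mean")             # TRUE: mean is a built-in function, so avoid this name
exists("sequence_count")   # FALSE in a fresh session, so the name is free

sequence_count <- 12
exists("sequence_count")   # TRUE now that the object has been created
```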

Characters

A character object is used to represent string values in R. It is defined by enclosing the text in double quotes ("").

R code
DNA <- "ATGAAA"
DNA
[1] "ATGAAA"
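A few base R functions, shown here as a minimal sketch, can be used to confirm that an object is a character string and to work with it. Which functions you will need depends on the analysis; these are only common examples.

```r
DNA <- "ATGAAA"

class(DNA)     # "character"
nchar(DNA)     # 6, the number of letters in the string
tolower(DNA)   # "atgaaa"

# A number placed in quotes is stored as text, not as a number
class("3")     # "character"
class(3)       # "numeric"
```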

Vectors

A vector is a sequence of data elements of the same basic type. The data elements in a vector are officially called components. The assignment operator (<-) stores the value on its right side in the object named on its left side. Once assigned, the object can be used like any other value in a computation. The c function concatenates the components into a vector.

R code
random_numbers <- c(1, 10, 100)
random_numbers
[1]   1  10 100

Now you can do scalar computations on a vector

R code
random_numbers * 2
[1]   2  20 200

or use sum, sort, min, max, length and many other operations. For example

R code
sort(random_numbers)
[1]   1  10 100
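The other operations mentioned above work the same way. The sketch below applies each one to the same vector, with the expected result noted in a comment.

```r
random_numbers <- c(1, 10, 100)

sum(random_numbers)      # 111
min(random_numbers)      # 1
max(random_numbers)      # 100
length(random_numbers)   # 3, the number of components

# sort can also return the values from largest to smallest
sort(random_numbers, decreasing = TRUE)   # 100 10 1
```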

You can also do vector arithmetic

R code
random_numbers <- c(1, 10, 100)
y <- c(1, 2, 3)
random_numbers * y
[1]   1  20 300
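Vector arithmetic works component by component: the first component of one vector is combined with the first component of the other, the second with the second, and so on. A brief sketch with the same two vectors and other operators:

```r
random_numbers <- c(1, 10, 100)
y <- c(1, 2, 3)

random_numbers + y   # 2 12 103
random_numbers - y   # 0 8 97
random_numbers / y   # 1, 5 and 100/3 (about 33.3)
```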

Vectors can also be made of characters

R code
codons <- c("AUG", "UAU", "UGA")
codons
[1] "AUG" "UAU" "UGA"
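Many of the same functions also work on character vectors. The following is a small sketch using the codons vector from above; the comparison in the last line is included only to show that tests are applied to each component in turn.

```r
codons <- c("AUG", "UAU", "UGA")

length(codons)                    # 3 components
nchar(codons)                     # 3 3 3, the number of letters in each codon
sort(codons, decreasing = TRUE)   # "UGA" "UAU" "AUG"

# Which component is the start codon?
codons == "AUG"                   # TRUE FALSE FALSE
```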

Exercises

Your lab report must have each exercise labeled with a header (e.g. ## Exercise 1) so that each one appears in the table of contents.

You will need to first export (download) the Lab1_yourname.html file to your computer, then upload the file to Canvas. In the bottom right corner click on the wheel icon then select Export.

Exporting html file

The main goal for today’s lab is to create the lab report, so we need a few exercises to fill it out.

Exercise 1

For x = 2 and y = 15, compute the sum and difference of x and y.

Exercise 2

Create a vector of the values 22, 62, 148, 43 and 129. Multiply the vector by 5.

Exercise 3

Create a vector of the nucleotides A, T, C and G. Remember to put double quotes ("") around each letter. Arrange the nucleotides alphabetically using the sort function, sort(vector_name).