Learning objectives

Why HPC

Early steps in genome data processing, particularly assembly and read mapping are computational intensive. Even with a modest data set of a few dozen samples it is difficult/impossible to do on your lap. And who wants to bog down there or another computer for long periods of time (only to run out of memory). There are several options available for using high performance computers at minimal or no cost.

MMassachusetts Green High Performance Computing Center (MGHPCC)

The MGHPCC is an intercollegiate high-performance computing facility located in Holyoke, Massachusetts. MGHPCC is for research computing, only Principal Investigators (PIs) as defined by the local school and their staff or authorized collaborators may receive accounts on MGHPCC. PI authorization is required for all new account requests. You can easily request access if you meet the requirements. If you are in this class and are not part of an active research group you can list me as the PI and I will add you to our lab account for the purpose of this class.

See the MGHPCC wiki for instructions on how to log in and transfer data. Most of what we do in this class can be run using interactive mode.

Other options

These others options can work as well depending on your needs and

Using the lessons with Amazon Web Services (AWS)

The Data Carpentry Genomic lessons are designed to be run on pre-imaged Amazon Web Services (AWS) instances. With the exception of a spreadsheet program, all of the software and data used in their genomic workshop are hosted on an Amazon Machine Image (AMI). To work on this lessons independently outside of a Data Carpentry workshop, you will need to start your own AMI instance. Follow these instructions on creating an Amazon instance. Use the AMI ami-0985860a69ae4cb3d (Data Carpentry Genomics Beta 2.0 (April 2019)) listed on the Community AMIs page. Please note that you must set your location as N. Virginia in order to access this community AMI. You can change your location in the upper right corner of the main AWS menu bar. The cost of using this AMI for a few days, with the t2.medium instance type is very low (about USD $1.50 per user, per day).

Galaxy

Galaxy an open, web-based platform for accessible, reproducible, and transparent computational research. I have used it in the past in teaching this course and has many strengths. It has all the tools needed for this tutorial and for some people will be valuable to learn for their research. We do not use it currently because the course is focused on learning R and associated packages rather than learning a particularly platform. We can’t do everything in one semester!

Jet Stream

https://jetstream-cloud.org/ is part of XSEDE, a collection of integrated advanced digital resources and services funded by NSF. You need to be eligible for an XSEDE allocation to use Jetstream, which means must be based at a U.S. institution. And although XSEDE is NSF-funded, “projects need not be supported by NSF grants” to receive an allocation. Jetstream enables researchers to launch, use, and shutdown their own Galaxy servers that have been pre-configured similar to the Main Galaxy server. We may use Jetstream in the future for the class as they now allow faculty to submit an application for a course allocation.

National Center for Genome Analysis Support (NCGAS)

The mission of the National Center for Genome Analysis Support is to enable the biological research community of the US to analyze, understand, and make use of the vast amount of genomic information now available. NCGAS focuses particularly on transcriptome- and genome-level assembly, phylogenetics, metagenomics/transcriptomics, and community genomics. My lab has accounts at NCGAS that we have used for aspects of microbiome assembly, annotation and analysis.

Setting it up on a local computer

I don’t recommend setting up your computer to do these analyses. However, if your lab has a Unix computer with more CPUs than it work fine. We were able to run this tutorial on our lab computer.

Exercises

There are no exercises for this Lab but you need to be able to login into MGHPCC and use unix commands in the terminal.