Lab g2: Connecting a GitHub repo site with a new RStudio project

Much of the material for this lesson was borrowed from or inspired by Matt Jones’ NCEAS Reproducible Research Techniques for Synthesis workshop

Learning Objectives

In this lesson, you will learn:

  • What computational reproducibility is and why it is useful
  • How version control can increase computational reproducibility
  • How to start a GitHub repo and Pages site
  • How to set up your own RStudio Project and sync it with your GitHub repo
  • How to make a simple web page for your site

Background

Reproducible Research

Reproducibility is the hallmark of science, which is based on empirical observations coupled with explanatory models. While reproducibility encompasses the full science lifecycle, and includes issues such as methodological consistency and treatment of bias, in this course we will focus on computational reproducibility: the ability to document data, analyses, and models sufficiently for other researchers to be able to understand and ideally re-execute the computations that led to scientific results and conclusions.

What is needed for computational reproducibility?

The first step towards addressing these issues is to be able to evaluate the data, analyses, and models on which conclusions are drawn. Under current practice, this can be difficult because data are typically unavailable, the method sections of papers do not detail the computational approaches used, and analyses and models are often conducted in graphical programs, or, when scripted analyses are employed, the code is not available.

And yet, this is easily remedied. Researchers can achieve computational reproducibility through open science approaches, including straightforward steps for archiving data and code openly along with the scientific workflows describing the provenance of scientific results (e.g., Hampton et al. 2015; Munafò et al. 2017).

Conceptualizing workflows

Scientific workflows encapsulate all of the steps from data acquisition, cleaning, transformation, integration, analysis, and visualization.

Workflows can range in detail from simple flowcharts to fully executable scripts. R scripts and Python scripts are a textual form of a workflow, and when researchers publish the specific versions of the scripts and data used in an analysis, it becomes far easier to repeat their computations and understand the provenance of their conclusions.

The problem with filenames

Every file in the scientific process changes. Manuscripts are edited. Figures get revised. Code gets fixed when problems are discovered. Data files get combined together, then errors are fixed, and then they are split and combined again. In the course of a single analysis, one can expect thousands of changes to files. And yet, all we use to track this are simplistic filenames. You might think there is a better way, and you’d be right: version control.

Version control systems help you track all of the changes to your files, without the spaghetti mess that ensues from simple file renaming. In version control systems like git, the system tracks not just the name of the file, but also its contents, so that when contents change, it can tell you which pieces went where. It tracks which version of a file a new version came from. So it's easy to draw a graph showing all of the versions of a file, like this one:

Version control systems assign an identifier to every version of every file, and track their relationships. They also allow branches in those versions, and merging those branches back into the main line of work. They also support having multiple copies on multiple computers for backup and for collaboration. And finally, they let you tag particular versions, so that it is easy to return to a set of files exactly as they were when you tagged them. For example, the exact versions of data, code, and narrative that were used when a manuscript was originally submitted might be eco-ms-1 in the graph above, and then when it was revised and resubmitted, it was done with tag eco-ms-2. A different paper was started and submitted with tag dens-ms-1, showing that you can be working on multiple manuscripts with closely related but not identical sets of code and data being used for each, and keep track of it all.
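The tagging idea above can be tried out in a throwaway folder. The following is a minimal sketch, not part of the original workshop material: the file name analysis.R, its contents, and the "Jane Doe" identity are placeholders, and only the tag names eco-ms-1 and eco-ms-2 come from the text.

```shell
# Work in a throwaway folder so nothing in your real projects is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"            # placeholder identity for this demo only
git config user.email "jane@example.org"

# First version of the analysis, tagged as the original submission.
echo "model <- lm(y ~ x)" > analysis.R     # hypothetical file and contents
git add analysis.R
git commit -q -m "Analysis as first submitted"
git tag eco-ms-1

# Revised version, tagged as the resubmission.
echo "model <- lm(y ~ x + z)" > analysis.R
git commit -q -am "Analysis as revised"
git tag eco-ms-2

# List the tags, then print the file exactly as it was at the first tag.
git tag
git show eco-ms-1:analysis.R
```

The last command prints the first version of the file even though the working copy now holds the revised one, which is what makes a tag useful for returning to the state of a submission.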

Version control and Collaboration using git and GitHub

First, just what are git and GitHub?

  • git: version control software used to track files in a folder (a repository)
    • git creates the versioned history of a repository
  • GitHub: web site that allows users to store their git repositories and share them with others

The Git lifecycle

As a git user, you’ll need to understand the basic concepts associated with versioned sets of changes, and how they are stored and moved across repositories. Any given git repository can be cloned so that it exists both locally and remotely. But each of these cloned repositories is simply a copy of all of the files and change history for those files, stored in git’s particular format. For our purposes, we can consider a git repository to be just a folder with a bunch of additional version-related metadata.

In a local git-enabled folder, the folder contains a workspace containing the current version of all files in the repository. These working files are linked to a hidden folder containing the ‘Local repository’, which contains all of the other changes made to the files, along with the version metadata.

So, when working with files using git, you can use git commands to indicate specifically which changes to the local working files should be staged for versioning (using the git add command), and when to record those changes as a version in the local repository (using the command git commit).
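The difference between staging and committing is easiest to see by asking git for its status at each step. This is a small sketch in a temporary folder; the file script.R and the "Jane Doe" identity are placeholders chosen for illustration.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"            # placeholder identity for this demo only
git config user.email "jane@example.org"

echo "x <- 1" > script.R                   # hypothetical working file
git status --short          # '?? script.R': the file exists but is untracked

git add script.R
git status --short          # 'A  script.R': the change is staged

git commit -q -m "Add script.R"
git status --short          # no output: the working files match the last commit
```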

The remaining concepts are involved in synchronizing the changes in your local repository with changes in a remote repository. The git push command is used to send local changes up to a remote repository (possibly on GitHub), and the git pull command is used to fetch changes from a remote repository and merge them into the local repository.

  • git clone: to copy a whole remote repository to local
  • git add (stage): notify git to track particular changes
  • git commit: store those changes as a version
  • git pull: merge changes from a remote repository to our local repository
  • git push: copy changes from our local repository to a remote repository
  • git status: determine the state of all files in the local repository
  • git log: print the history of changes in a repository

Those seven commands are the majority of what you need to successfully use git. But this is all super abstract, so let’s explore with some real examples.
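Before involving GitHub, the seven commands can be practiced entirely on one machine. In the sketch below, a bare repository on disk stands in for the remote; the folder names remote.git and local and the "Jane Doe" identity are assumptions made for the demo, not part of the lab setup.

```shell
base="$(mktemp -d)"
cd "$base"

# A bare repository on disk plays the role of the remote (e.g., GitHub).
git init -q --bare remote.git

# git clone: copy the remote to a local working folder.
# (git warns that the repository is empty; that is expected here.)
git clone -q remote.git local
cd local
git config user.name "Jane Doe"            # placeholder identity for this demo only
git config user.email "jane@example.org"

# git add + git commit: stage a change and record it as a version.
echo "# test-project" > README.md
git add README.md
git commit -q -m "first commit"

# git push: copy the new version up to the remote.
git push -q origin HEAD

# git pull: fetch and merge anything new from the remote (nothing new here).
git pull -q origin "$(git branch --show-current)"

# git status and git log: inspect the current state and the history.
git status --short
git log --oneline
```

With a real GitHub repository, the only difference is that the clone address is an https link instead of a local folder.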

On the Computer

Installing git

Git is already installed on Posit Cloud, Unity, and most Unix systems. If you are working on your own computer using macOS or Windows, you will need to install git. Follow the directions on the GitHub site.
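If you are unsure whether git is already available, a quick check from the Terminal looks like this (a minimal sketch; the message printed when git is missing is our own wording):

```shell
# Print the installed version if git is on the PATH; otherwise say so.
if command -v git >/dev/null 2>&1; then
  git --version
else
  echo "git is not installed"
fi
```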

On the GitHub website

  • Register for a GitHub account on https://github.com/
  • Create a new repo by clicking on the new repository button.
  • Give it a name, make it public, add a README file, add an R .gitignore file, and add an Apache license.
  • Enable a web page for the repo: click on the Settings wheel, select Pages in the left menu, then under Branch select main and save.
  • Go back to your main repo page and click the wheel in the About section in the right corner. Select Use your GitHub Pages website and save changes.
  • Click on the green Code button and copy the https link.

Make a local version of your GitHub repo in Posit Cloud

  • In Your Workspace (not the EvoGeno Workspace) click on the New Project button then New Project from Git Repository

  • Paste in the link to your GitHub repository (e.g., https://github.com/jeffreyblanchard/posit-test.git)

If you are using RStudio on Unity or your own computer

  • In RStudio click on File > New Project (you cannot do this on Posit Cloud, only on Unity or with RStudio installed on your own computer)
  • Select Version Control
  • Select Git
  • Paste in the link to your GitHub repo
  • Choose the directory you want to use for the project

Configure your local repo to connect with the GitHub repo

  • Install the R package usethis using Tools > Install Packages
  • Configure git with your username and email. This must be the username and email associated with your GitHub account.
R code
library(usethis)
use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")
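The use_git_config() call above stores your name and email in git's configuration, and the same settings can be written and read back with git config in the Terminal. The sketch below is only an illustration: it sets the values inside a throwaway repository so that your real configuration is left alone, and "Jane Doe" is the placeholder from the example above.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Set the identity for this repository only (stored in .git/config).
git config user.name "Jane Doe"
git config user.email "jane@example.org"

# Read the values back.
git config --get user.name
git config --get user.email
```

In your own project, running just the two git config --get commands is a quick way to confirm that the name and email match your GitHub account.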

  • Create a GitHub token

R code
usethis::create_github_token()

This will open a web page on your GitHub account. The recommended scopes will be pre-selected. These will be fine for now, and you can change them later if needed.

  • Click “Generate token”.

  • Copy the generated PAT (beginning with ghp_) to your clipboard. Provide this PAT the next time a Git operation asks for your password.

  • To store your PAT so that git can use it with your repo, run

R code
gitcreds::gitcreds_set()

This opens a prompt where you can paste your PAT: ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Make an index file as the homepage for your GitHub repo

  • In RStudio open a new R Markdown template

  • Save it as index.Rmd

  • Add text and images to make your homepage

  • Knit it to make index.html

  • Under the Git tab in the upper right corner, select Commit

  • Add a description of your changes in the commit message

  • Click on Commit

  • Close the window

  • Click on Push to copy the index files to your GitHub repo

  • Once pushed, index.html replaces README.md as the homepage of your GitHub Pages site. It takes about 5 minutes before the page updates.

  • You can quickly spice up your page by adding a theme to the YAML block of your index.Rmd (available themes are “default”, “bootstrap”, “cerulean”, “cosmo”, “darkly”, “flatly”, “journal”, “lumen”, “paper”, “readable”, “sandstone”, “simplex”, “spacelab”, “united”, “yeti”)

  • You could also try prettydoc (https://prettydoc.statr.me/)
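As an illustration of where the theme goes, the sketch below writes a small index.Rmd from the Terminal with the theme set in its YAML block. The title, the body text, and the choice of "cerulean" are placeholders; in practice you would edit the YAML block of your own index.Rmd in RStudio and knit as described above.

```shell
site="$(mktemp -d)"          # throwaway folder standing in for your project
cd "$site"

# Write a small index.Rmd whose YAML block selects the "cerulean" theme.
cat > index.Rmd <<'EOF'
---
title: "My Lab Site"
output:
  html_document:
    theme: cerulean
---

Welcome to my course page.
EOF

# Show the YAML block (the first six lines of the file).
head -n 6 index.Rmd
```

Note that the theme line is nested under html_document, and the indentation matters in YAML.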

Linking an existing R project with a new GitHub repo

On the GitHub website

  • Create a new repo by clicking on the new repository button.
  • Give it a name and make it public. Do not add README, .gitignore, or license files.
  • Copy the link to the repo.

Enable git in RStudio

  • Open your project in RStudio and navigate to Tools -> Version Control -> Project Setup
  • Click the Git/SVN tab and select Git as the version control system. RStudio will ask you to initialize a new git repo and then restart.
  • After RStudio reopens, confirm that there is a Git tab in the environment pane
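Initializing a repository through Project Setup should have the same effect as running git init in the project folder. The sketch below shows that step and two ways to confirm it from the Terminal; the temporary folder is a stand-in for your project folder.

```shell
proj="$(mktemp -d)"     # stand-in for your project folder
cd "$proj"
git init -q

# The hidden .git folder holds the local repository.
ls -a

# Prints "true" when the current folder is under git version control.
git rev-parse --is-inside-work-tree
```

In your real project, running only the last command in the RStudio Terminal is enough to check that git is enabled.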

In the R Console (bottom left)

  • In the R Console, add your git credentials with the usethis package and your token beginning with ghp_
R code
library(usethis)
use_git_config(user.name = "jeffreyblanchard", user.email = "jlb@umass.edu")
gitcreds::gitcreds_set() 

If you have forgotten your token, create a new one using

R code
library(usethis)
usethis::create_github_token()

In the Terminal Window (bottom left)

Now, in the Terminal window, establish a connection to your GitHub repo and make your first commit

Shell code
echo "# test-project" >> README.md
git add README.md
git commit -m "first commit"
git branch -M main
git remote add origin https://github.com/jeffreyblanchard/test-project.git
git push -u origin main

This only pushes the README to GitHub. Now, under the Git tab in RStudio (top right pane), check the boxes next to your remaining files, commit, and then push.
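If the push fails, it helps to check what the Terminal commands above actually set up. The following sketch repeats them in a throwaway folder, leaving out the push, and then inspects the result. The repo address is the example one from above, and the "Jane Doe" identity is a placeholder.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"            # placeholder identity for this demo only
git config user.email "jane@example.org"

echo "# test-project" >> README.md
git add README.md
git commit -q -m "first commit"
git branch -M main                         # name the current branch 'main'
git remote add origin https://github.com/jeffreyblanchard/test-project.git

# Confirm the branch name and where 'origin' points.
# Neither command contacts GitHub.
git branch --show-current
git remote get-url origin
```

In your own project, the last two commands should print main and the https link you copied from your new repo.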

To enable a web page for your repo

  • Enable a web page for the repo: click on the Settings wheel, select Pages in the left menu, then under Branch select main and save.

  • Go back to your main repo page and click the wheel in the About section in the right corner. Select Use your GitHub Pages website and save changes.

  • Click on the green Code button and copy the https link.