6 October 2015

What is reproducible research?

"The final product of research is not only the paper itself, but also the full computation environment used to produce the results in the paper such as the code and data necessary for reproduction of the results and building upon the research." (Xie, 2014).

Articles submitted for journals should include:

  • Manuscript
  • Experimental design
  • Code
  • Data

Obviously, this is not always possible!

OSF

  • The Open Science Framework provide cloud-based management for your project
  • This can be either an open or closed project
  • Integrates with many tools
  • "Digital notebook"
  • Various reproducibility projects

Hurdles

What do you think are the biggest hurdles to reproducible research?

What can we do to make our research, and consulting work, 100% reproducible?

Why is this important as a consultant?

Reproducing research

  • Reproducing research should be easy but it isn't

  • knitr makes creating dynamic reports very simple.
  • You should be able to share data and an Rmd or an Rnw file with a friend, colleague, or reviewer and they should be able to replicate your findings.

Tools for reproducibilty

Which one should you use, see Yihue Xie's post.

The developers of Rstudio are often the first to integrate the latest and greatest from R. Else, ESS and LyX are fully integrated with knitr or add a custom command to your favorite editor.

Knitting

Why knitr?

install.packages("knitr", dependencies = TRUE)

How can knitr help us achieve reproducibility?

  1. We never need to copy and paste results into reports.
  2. If the data changes, our models, figures, and tables are automatically updated*.
  3. From a knitr document, automatically generate a report using knit() or extract the R code using purl().
  4. Generate a LaTeX or Markdown report from an R script with stitch() and add text with spin()
  5. It is much more feature rich than Sweave.

knitr basics

Generates PDF
knit('knit_toy/knit_toy.Rnw')

Generates R script
purl('purl_toy/purl_toy.Rnw')

Generates PDF
stitch('stitch_toy/stitch_toy.R')

Generates Markdown
spin('spin_toy/spin_toy.R')

Markdown and Shiny demonstration

if(!require("shiny"))
  install.packages("shiny")
demo("notebook", package = "knitr")

Chunks

Input is evaluated in chunks. Either code chunks

For knitr, chunks are what we write R code in.

```{r}
<insert R code for Markdown>
```

<<>>=
<insert R code for LaTeX>
@

Throughout I use Markdown syntax as I've created an ioslide presentation using RMarkdown. However, the chunk.begin and chunk.end syntax can always be interchanged. (In fact, you can roll your own syntax for these if you hate the above!)

More on chunks

  • Chunks have a plethora of options available by default
  • You can also 'roll your own' chunk options provided they are valid R code.
length(opts_chunk$get())
## [1] 53
opts_chunk$get("engine")
## [1] "R"
  • knitr works with other languagues too (python, ruby, etc)

knitr output

  • knitr output may be inline
`r <insert R code for Markdown>`

\Sexpr{<insert R code for LaTeX>}
  • A realization of a \(\chi_2^2\) is 0.99.
A realization of a $\chi_2^2$ is `r rchisq(1, df = 2)`.
  • Chunk option can for text, tabular, or graphical output.

Chunk output

  • What will this generate?
```{r, cool_chunk, eval = -1, echo = c(1, 3), warning = FALSE, 
message = FALSE,align ='center'}

coef(lm(dist ~ speed, data = cars))[1]

ggplot(aes(x=speed, y = dist), data = cars) + geom_point(col = "#56B4E9") +
geom_smooth(col = "999999") + theme_bw() + ylab("Driving Speed") +
xlab("Distance to Stop")

rnorm(0, sd = -1) 
```

Answer

## coef(lm(dist ~ speed, data = cars))[1]

rnorm(0, sd = -1)
## numeric(0)

Helpful chunk output options

  • eval = TRUE : Evaluate all or part of the current chunk
  • echo = TRUE : Show all or part of the source code
  • results = 'asis' : Writes raw output from R to the output document without markup. Helpful for creating tables with xtable. markup is the default.
  • include = TRUE : Code chunk will be included in output. If you don't want a chunk in the output but still evaluated set this to FALSE

Perhaps helpful?

foo <- 2
bar <- foo
## they are the same

The code

```{r}
foo <- 2
bar <- foo
```
```{r eval = foo < bar, echo = FALSE}
cat("foo is greater than bar")
```
```{r eval = foo == bar, echo = FALSE}
cat("they are the same")
```

Local vs. global settings

  • Often we want to set options for our entire document rather than for every chunk
  • Set figure width equal to 6 with a center alignment and hide all R code

Local (tedious)

```{r fig.width = 6, fig.align = 'center', echo = FALSE}
plot(rnorm(10))
```

Global (smart)

```{r}
opts_chunk$set(fig.width=6, fig.align = 'center', echo = FALSE))
```

Tables

Tables are easily handled with xtable. Make sure to specify results = "asis" to render the table.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.60 0.0123
speed 3.9324 0.4155 9.46 0.0000

Finer control of tables with LaTeX. Finally, we can insert our fitted model using inline code:

\(\hat{dist}_i = -17.58 + 3.93 speed_i\)

Code used

```r
library(xtable)
mod1 <- lm(dist ~ speed, data = cars)
coef_tab <- summary(mod1)$coef
print(xtable(mod1), type = "html")
```

For the inline code:

$\hat{dist_i} = `r coef_tab[1,1]` + `r coef_tab[2,1] ` speed_i$

Using an ioslides table

Coefficients Estimate Standard Error t-value Pr(>|t|)
(Intercept) -17.58 6.76 -2.6 0.01
speed 3.93 0.42 9.46 1.4910^{-12}

Interactive figures (ggvis)

Can embed interactive plots

Figures options

  • dev = 'png' : Sets the default graphical device. tikz has nicer font rendering for LaTeX. Can interact with ggobi, graphviz, rgl, etc.
  • fig.width and fig.height : Sets width and height of the device.
  • fig.cap and fig.align : Set a caption and alignment
  • Able to set encoding (for Icelandic characters) and dingbat font (reduces the size of pdfs).
  • Can send additional graphic device specific arguments via dev.args = list(option1 = "foo", option2 = "bar")

cache = TRUE

  • Caching is helpful if you have a large document or R takes a long time to evaluate certain chunks.
  • Caching compares the MD5 hash of a cached chunk with the MD5 hash of the same cache when knit() is re-run
  • Manually set cache dependencies (i.e Chunk B depends on A) or do this automatically (autodep = TRUE)
  • Adding new data won't update a cache. One way to enable this is file.info() function in the chunk. For example,
```{r, foo_time = file.info('foo.csv')$mtime}
foo <- read.csv("foo")
...

Embed code chunks

```{r, A}
y <- rcauchy(1)
```
```{r, B}
y
<<A>>
y
```
  • Chunks can be nest recursively with each other as long as the recursion is finite.

The answer

y <- rcauchy(1)
y
## [1] -2.6
y <- rcauchy(1)
y
## [1] 0.19

Reusing whole chunks

```{r, cau, include = FALSE, eval = FALSE}
y <- rcauchy(1)
```
```{r, norm, include = FALSE, eval = FALSE}
y
x <- y + rnorm(1)
x
```
```{r, C, ref.label = c('cau','norm')
```

And the output

y <- rcauchy(1)
y
## [1] 2.8
x <- y + rnorm(1)
x
## [1] 4.1

External code

  • External chunk code can be kept in R scripts and can be referenced by chunk label or the line number.
## @knitr nitrogen_conversion
tons_mgN <- function(tons, unit){
mgN <- (tons * 1e9) / 20 / 5.7 / unit
return(mgN)
}
read_chunk("nitrogen_conversions.R")
```{r, nitrogen_conversion}
```
  • specify from and to arguments to use lines numbers in read_chunk().

Child documents

  • Just like LaTeX, child documents (\include{foo.tex}) consisting of smaller parts can be used with knitr.
  • Consist of plain chunks
  • To use this just specify the child ='foo.Rnw' or 'foo.Rmd' if using Rmarkdown.
  • These could be called conditionally. For example, if you were doing a report and only ran an additional set of analyses conditional on some output.
  • child = if(bar > foo) 'foo.Rnw'

Hooks

  • Hooks allow you to expand the capability of knitr.
  • Chunk hooks can be called before (say want to crop the margins of a graph via par() or after the chunk (if you want to insert commands into output like LaTeX or Markdown commands).
  • See Yihui's hooks page for more details.

Packrat

  • Packrat is an R package that helps you manage your project’s R package dependencies in an isolated, reproducible and portable way.
  • init(), initialize a new packrat project.
  • snapshot(), take a snapshot of the installed packages.
  • bundle(), bundles (tarball) a packrat project for sharing.
  • unbundle(), unbundles a packrat project.
  • This will modify .Rprofile!

GitHub

  • Github is a place to store your code.
  • Free and has version control using git.
  • Great way to share code, data, and teaching materials.

  • Travis-CI will automatically run your code on an external computer that you are not using in your development.
  • Free for open-source projects
  • Seemlessly links with GitHub
  • Easy to set up for R
  • Save the following in .travis.yml

    language: r
    sudo: required

Learning more