Creating reproducible research with knitr

6 October 2015

What is reproducible research?

"The final product of research is not only the paper itself, but also the full computation environment used to produce the results in the paper such as the code and data necessary for reproduction of the results and building upon the research." (Xie, 2014).

Articles submitted for journals should include:

Manuscript
Experimental design
Code
Data

Obviously, this is not always possible!

OSF

The Open Science Framework provide cloud-based management for your project
This can be either an open or closed project
Integrates with many tools
"Digital notebook"
Various reproducibility projects

Hurdles

What do you think are the biggest hurdles to reproducible research?

What can we do to make our research, and consulting work, 100% reproducible?

Why is this important as a consultant?

Reproducing research

Reproducing research should be easy but it isn't
knitr makes creating dynamic reports very simple.
You should be able to share data and an Rmd or an Rnw file with a friend, colleague, or reviewer and they should be able to replicate your findings.

Tools for reproducibilty

R
knitr
Markdown or LaTeX (others exist (e.g. HTML) and you can create your own too )

Which one should you use, see Yihue Xie's post.

GitHub
Travis-CI
Rstudio (recommended, if you're not wedded to an IDE)

The developers of Rstudio are often the first to integrate the latest and greatest from R. Else, ESS and LyX are fully integrated with knitr or add a custom command to your favorite editor.

Knitting

Why `knitr`?

install.packages("knitr", dependencies = TRUE)

How can knitr help us achieve reproducibility?

We never need to copy and paste results into reports.
If the data changes, our models, figures, and tables are automatically updated*.
From a knitr document, automatically generate a report using knit() or extract the R code using purl().
Generate a LaTeX or Markdown report from an R script with stitch() and add text with spin()
It is much more feature rich than Sweave.

`knitr` basics

Generates PDF
knit('knit_toy/knit_toy.Rnw')

Generates R script
purl('purl_toy/purl_toy.Rnw')

Generates PDF
stitch('stitch_toy/stitch_toy.R')

Generates Markdown
spin('spin_toy/spin_toy.R')

Markdown and Shiny demonstration

if(!require("shiny"))
  install.packages("shiny")
demo("notebook", package = "knitr")

Chunks

Input is evaluated in chunks. Either code chunks

For knitr, chunks are what we write R code in.

```{r}
<insert R code for Markdown>
```

<<>>=
<insert R code for LaTeX>
@

Throughout I use Markdown syntax as I've created an ioslide presentation using RMarkdown. However, the chunk.begin and chunk.end syntax can always be interchanged. (In fact, you can roll your own syntax for these if you hate the above!)

More on chunks

Chunks have a plethora of options available by default
You can also 'roll your own' chunk options provided they are valid R code.

length(opts_chunk$get())

## [1] 53

opts_chunk$get("engine")

## [1] "R"

knitr works with other languagues too (python, ruby, etc)

`knitr` output

knitr output may be inline

`r <insert R code for Markdown>`

\Sexpr{<insert R code for LaTeX>}

A realization of a \(\chi_2^2\) is 0.99.

A realization of a $\chi_2^2$ is `r rchisq(1, df = 2)`.

Chunk option can for text, tabular, or graphical output.

Chunk output

What will this generate?

```{r, cool_chunk, eval = -1, echo = c(1, 3), warning = FALSE, 
message = FALSE,align ='center'}

coef(lm(dist ~ speed, data = cars))[1]

ggplot(aes(x=speed, y = dist), data = cars) + geom_point(col = "#56B4E9") +
geom_smooth(col = "999999") + theme_bw() + ylab("Driving Speed") +
xlab("Distance to Stop")

rnorm(0, sd = -1) 
```

Answer

## coef(lm(dist ~ speed, data = cars))[1]

rnorm(0, sd = -1)

## numeric(0)

Helpful chunk output options

eval = TRUE : Evaluate all or part of the current chunk
echo = TRUE : Show all or part of the source code
results = 'asis' : Writes raw output from R to the output document without markup. Helpful for creating tables with xtable. markup is the default.
include = TRUE : Code chunk will be included in output. If you don't want a chunk in the output but still evaluated set this to FALSE

Perhaps helpful?

foo <- 2
bar <- foo

## they are the same

The code

```{r}
foo <- 2
bar <- foo
```
```{r eval = foo < bar, echo = FALSE}
cat("foo is greater than bar")
```
```{r eval = foo == bar, echo = FALSE}
cat("they are the same")
```

Local vs. global settings

Often we want to set options for our entire document rather than for every chunk
Set figure width equal to 6 with a center alignment and hide all R code

Local (tedious)

```{r fig.width = 6, fig.align = 'center', echo = FALSE}
plot(rnorm(10))
```

Global (smart)

```{r}
opts_chunk$set(fig.width=6, fig.align = 'center', echo = FALSE))
```

Tables

Tables are easily handled with xtable. Make sure to specify results = "asis" to render the table.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-17.5791	6.7584	-2.60	0.0123
speed	3.9324	0.4155	9.46	0.0000

Finer control of tables with LaTeX. Finally, we can insert our fitted model using inline code:

\(\hat{dist}_i = -17.58 + 3.93 speed_i\)

Code used

```r
library(xtable)
mod1 <- lm(dist ~ speed, data = cars)
coef_tab <- summary(mod1)$coef
print(xtable(mod1), type = "html")
```

For the inline code:

$\hat{dist_i} = `r coef_tab[1,1]` + `r coef_tab[2,1] ` speed_i$

Using an ioslides table

Coefficients	Estimate	Standard Error	t-value	Pr(>\|t\|)
(Intercept)	-17.58	6.76	-2.6	0.01
speed	3.93	0.42	9.46	1.4910^{-12}

Interactive figures (`ggvis`)

Can embed interactive plots

https://lundinn.shinyapps.io/rasch

Figures options

dev = 'png' : Sets the default graphical device. tikz has nicer font rendering for LaTeX. Can interact with ggobi, graphviz, rgl, etc.
fig.width and fig.height : Sets width and height of the device.
fig.cap and fig.align : Set a caption and alignment
Able to set encoding (for Icelandic characters) and dingbat font (reduces the size of pdfs).
Can send additional graphic device specific arguments via dev.args = list(option1 = "foo", option2 = "bar")

`cache = TRUE`

Caching is helpful if you have a large document or R takes a long time to evaluate certain chunks.
Caching compares the MD5 hash of a cached chunk with the MD5 hash of the same cache when knit() is re-run
Manually set cache dependencies (i.e Chunk B depends on A) or do this automatically (autodep = TRUE)
Adding new data won't update a cache. One way to enable this is file.info() function in the chunk. For example,

```{r, foo_time = file.info('foo.csv')$mtime}
foo <- read.csv("foo")
...

Embed code chunks

```{r, A}
y <- rcauchy(1)
```
```{r, B}
y
<<A>>
y
```

Chunks can be nest recursively with each other as long as the recursion is finite.

The answer

y <- rcauchy(1)

## [1] -2.6

y <- rcauchy(1)
y

## [1] 0.19

Reusing whole chunks

```{r, cau, include = FALSE, eval = FALSE}
y <- rcauchy(1)
```
```{r, norm, include = FALSE, eval = FALSE}
y
x <- y + rnorm(1)
x
```
```{r, C, ref.label = c('cau','norm')
```

And the output

y <- rcauchy(1)
y

## [1] 2.8

x <- y + rnorm(1)
x

## [1] 4.1

External code

External chunk code can be kept in R scripts and can be referenced by chunk label or the line number.

## @knitr nitrogen_conversion
tons_mgN <- function(tons, unit){
mgN <- (tons * 1e9) / 20 / 5.7 / unit
return(mgN)
}

read_chunk("nitrogen_conversions.R")

```{r, nitrogen_conversion}
```

specify from and to arguments to use lines numbers in read_chunk().

Child documents

Just like LaTeX, child documents (\include{foo.tex}) consisting of smaller parts can be used with knitr.
Consist of plain chunks
To use this just specify the child ='foo.Rnw' or 'foo.Rmd' if using Rmarkdown.
These could be called conditionally. For example, if you were doing a report and only ran an additional set of analyses conditional on some output.
child = if(bar > foo) 'foo.Rnw'

Hooks

Hooks allow you to expand the capability of knitr.
Chunk hooks can be called before (say want to crop the margins of a graph via par() or after the chunk (if you want to insert commands into output like LaTeX or Markdown commands).
See Yihui's hooks page for more details.

Packrat

Packrat is an R package that helps you manage your project’s R package dependencies in an isolated, reproducible and portable way.
init(), initialize a new packrat project.
snapshot(), take a snapshot of the installed packages.
bundle(), bundles (tarball) a packrat project for sharing.
unbundle(), unbundles a packrat project.
This will modify .Rprofile!

GitHub

Github is a place to store your code.
Free and has version control using git.
Great way to share code, data, and teaching materials.

Travis-CI will automatically run your code on an external computer that you are not using in your development.
Free for open-source projects
Seemlessly links with GitHub
Easy to set up for R
Save the following in .travis.yml
```
language: r
sudo: required
```

Learning more

knitr has an excellent website
Purchase Yihui's book from Amazon (recommended)
Visit StackOverflow and search on the tag knitr
Rstudio blog. Lots of interesting things happen here.
Other reproduciblility talk