An Introduction to Docker for R Users
What is Docker?
Docker is “a computer program that performs
operating-system-level
virtualization, also known as ‘containerization’”
Wikipedia. As any
first line of a Wikipedia article about tech, this sentence
is obscure
to anyone not already familiar with the content of the
article.
So, to put it more simply, Docker is a program that allows
to
manipulate (launch and stop) multiple operating systems
(called
containers) on your machine (your machine will be called the
host).
Just imagine having 10 RaspberryPi with different flavors of
Linux, each
focused on doing one simple thing, that you can turn on and
off whenever
you need to ; but all of this happens on your computer.
Why Docker & R?
Docker is designed to enclose environments inside an image /
a
container. What this allows, for example, is to have a Linux
machine on
a Macbook, or a machine with R 3.3 when your main computer
has R 3.5.
Also, this means that you can use older versions of a
package for a
specific task, while still keeping the package on your
machine
up-to-date.
This way, you can “solve” dependencies issues: if ever you
are afraid
dependencies will break your analysis when packages are
updated, build a
container that will always have the software versions you
desire: be
it Linux, R, or any package.
Docker images vs Docker containers
On your machine, you’re going to need two things: images,
and
containers. Images are the definition of the OS, while the
containers
are the actual running instances of the images. You’ll need
to install
the image just once, while the containers are to be launched
whenever
you need this instance. And of course, multiple containers
of the same
images can be run at the same time.
To compare with R, this is the same principle as installing
vs loading a
package: a package is to be downloaded once, while it has to
be launched
every time you need it. And a package can be launched in
several R
sessions at the same time easily.
Dockerfile
A Docker image is built from a Dockerfile. This file is the
configuration file, and describes several things: from what
previous
docker image you are building this one, how to configure the
OS, and
what happens when you run the container. In a sense, it’s a
little bit
like the DESCRIPTION + NAMESPACE files of an R package,
which
describes which are the dependencies to your package, gives
meta
information, and states which functions and data are to be
available to
the users library()ing the package.
So, let’s build a very basic Dockerfile for R, focused on
reproducibility. The idea is this one: I have today an
analysis that
works (for example contained in a .R file), and I want to be
sure this
analysis will always work in the future, regardless of any
update to the
packages used.
So first, create a folder for your analysis, and a
Dockerfile:
mkdir ~/mydocker
cd ~/mydocker
touch Dockerfile
FROM
Every Dockerfile starts with a FROM, which describes what
image we
are building our image from. There are a lot of official
images, and you
can also build from a local one.
This FROM is, in a way, describing the dependency of your
image ; just
as in R, when building a package, you always rely on another
package (be
it only the {base} package).
If you’re going for an R based image, Dirk Eddelbuettel
& Carl Boettiger
are maintaining rocker, a collection
of Docker images for R you can use. We’ll use the
rocker/r-base in
this blogpost.
FROM rocker/r-base
RUN
Once we’ve got that, we’ll add some RUN statements: these
are commands
which mimic command line commands. Remember what we want: an
image that
will, ad vitam aeternam, run an analysis as if we were still
today. So
what we’ll do is use the {checkpoint} package.
The command to make R execute something, from the terminal,
is R -e "my
code". Let’s add a {checkpoint} installation.
FROM rocker/r-base
RUN R -e "install.packages('checkpoint')"
We need a /root/.checkpoint folder to use {checkpoint},
let’s create
that one with mkdir (make directory).
FROM rocker/r-base
RUN R -e "install.packages('checkpoint')"
RUN mkdir /root/.checkpoint
COPY
Now, I need to get the script for my analysis from my
machine (host) to
the container. For that, we’ll need to use COPY localfile
pathinthecontainer. I’ll first create a folder to receive
everything,
with mkdir. Note that here, the myscript.R has to be in the
same
folder as the Dockerfile on your computer.
Let’s say this is the content of myscript.R:
library(checkpoint)
checkpoint("2019-01-06")
library(tidystringdist)
df <- tidy_comb_all(iris, Species)
p <- tidy_stringdist(df)
write.csv(p, "p.csv")
Here, the {tidystringdist} that will be installed in the
machine will
be the one from the date of today, even if I build this image
in one
year, or two, or four.
FROM rocker/r-base
RUN R -e "install.packages('checkpoint')"
RUN mkdir /home/analysis
COPY myscript.R /home/analysis/myscript.R
CMD
CMD is the command to be run every time you’ll launch the
docker. What
we want is myscript.R to be sourced.
FROM rocker/r-base
RUN R -e "install.packages('checkpoint')"
RUN mkdir /home/analysis
COPY myscript.R /home/analysis/myscript.R
CMD R -e "source('/home/analysis/myscript.R')"
Build, and run
Build
Now, go and build your image. From your terminal, in the
directory where
the Dockerfile is located, run:
docker build -t analysis .
-t name is the name of the image (here analysis), and .
means it
will build the Dockerfile in the current working directory.
run
Then, just launch with:
docker run analysis
And your analysis will be run 🎉!
Export container content
One thing to do now: you want to access what is created by
your analysis
(here p.csv) outside your container ; i.e, on the host.
Because yes,
as for now, everything that happens in the container stays
in the
container. So what we need is to make the docker container
share a
folder with the host. For this, we’ll use what is called
Volume, which
are (roughly speaking), a way to tell the Docker container
to use a
folder from the host as a folder inside the container.
That way, everything that will be created in the folder by
the container
will persist after the container is turned off. To do this,
we’ll use
the -v flag when running the container, with
path/from/host:/path/in/container. Also, create a folder to
receive
the results in both :
FROM rocker/r-base
RUN R -e "install.packages('checkpoint')"
RUN mkdir /home/analysis && mkdir /home/results
COPY myscript.R /home/analysis/myscript.R
CMD cd /home/analysis && R -e
"source('myscript.R')" && mv /home/analysis/p.csv
/home/results/p.csv
mkdir ~/mydocker/results
docker run -v ~/mydocker/results:/home/results analysis
Wait for the computation to be done, and…
ls ~/mydocker/results
p.csv
🤘
What to do next?
So now, every time you’ll launch this Docker image, the
analysis will be
performed and you’ll get the result back. With no problem of
dependencies: the packages will always be installed from the
day you
desire. Although, this can be a little bit long to run as
the packages
are installed each time you run the container. But as I said
in the
Disclaimer, this is a basic introduction to Docker, R and
reproducibility, so the goal was more to get beginners on
board with
Docker 🙂
Other things you can do would be:
Using {packrat}, and get the
library bundle in the container.
Use remotes::install_version() if you want your analysis to
be
based on package version instead of a time based
installation.
FROM rocker/r-base
RUN R -e "install.packages('remotes');
remotes::install_version('tidystringdist', '0.1.2')"
...
Use the Volume trick to bring data into your container, so
that
any data will be analysed in the very same
environment.[Source]-https://www.r-bloggers.com/an-introduction-to-docker-for-r-users/
Beginners
& Advanced level Docker Training in Mumbai. Asterix Solution's 25 Hour Docker Training gives
broad hands-on practicals.
Comments
Post a Comment