Data Science On Arch Linux

November 21, 2017
Data Science Setup Archlinux

This post explains my setup of Python, R and other related things a data scientist might need.

Update on 26th Nov 2017: updated the section about Jupyter and added a section about Spark and SparkR.

Python

On Arch Linux it is advisable to install your dependencies with pacman, or with an AUR helper for packages that only exist in the AUR (except inside virtual environments, which you should definitely prefer). If a package is available in neither the official repositories nor the AUR, I use pip2pkgbuild to create a PKGBUILD on the fly and install the resulting package via pacman. Note that Arch has already made the official switch from Python 2 to Python 3, so packages named python-foo are for Python 3 and packages named python2-foo are for Python 2.
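The naming rule can be sketched as a small shell helper. This is only an illustration: arch_pkg_name is a hypothetical function, real package names occasionally deviate from the pattern, and the pip2pkgbuild call in the comments assumes its default behaviour of writing a PKGBUILD into the current directory.

```shell
#!/bin/sh
# Hypothetical helper: map a PyPI module name to the conventional Arch
# package name. $1 = module name, $2 = Python major version (default 3).
arch_pkg_name() {
    module=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "${2:-3}" in
        2) printf 'python2-%s\n' "$module" ;;
        *) printf 'python-%s\n' "$module" ;;
    esac
}

arch_pkg_name requests      # prints python-requests
arch_pkg_name requests 2    # prints python2-requests

# Fallback for a module "foo" that is packaged nowhere (assumed workflow):
#   mkdir foo && cd foo
#   pip2pkgbuild foo    # assumed to write ./PKGBUILD
#   makepkg -si         # build the package, then install it through pacman
```

Searching for the printed name with pacman -Ss is a quick way to see whether the package already exists before building one yourself.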

Speeding Up R

Installing the r package pulls in R and, if it is not already present, blas, the reference linear algebra library. blas is not multithreaded and can therefore be slower than, for example, openblas, so replacing blas with openblas from the AUR can improve performance. If you run an Intel processor, you can use the Intel Math Kernel Library (MKL) instead: install intel-mkl from the AUR, then r-mkl, also from the AUR. If you already have an R installation, it has to be removed first. Afterwards, run update.packages(checkBuilt=TRUE) within an R session. (If you have problems selecting a mirror because of the missing tk dependency, see below and repeat the step after setting up your .Rprofile.)
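To see whether switching the BLAS implementation made a difference, a rough check like the following can help. It is a sketch, not a proper benchmark: r_blas_check is a hypothetical helper, the matrix size of 2000 is an arbitrary default, and it assumes an R version (3.4 or later) whose sessionInfo() output lists the BLAS and LAPACK libraries in use.

```shell
#!/bin/sh
# Hypothetical helper: show which BLAS/LAPACK libraries R is linked against,
# then time one dense matrix product. $1 = matrix dimension (default 2000).
r_blas_check() {
    n="${1:-2000}"
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found; install the r (or r-mkl) package first" >&2
        return 127
    fi
    # Lines mentioning BLAS/LAPACK reveal the library paths R actually loads.
    Rscript -e 'sessionInfo()' | grep -E 'BLAS|LAPACK'
    # Elapsed time of an n x n matrix product; compare before and after the switch.
    Rscript -e "n <- ${n}; a <- matrix(rnorm(n * n), n); print(system.time(a %*% a))"
}

# Usage: r_blas_check 3000
```

Run it once with the stock blas and once after the replacement; the elapsed time of the matrix product is the number to compare.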

Faster Mirrors

Use the following ~/.Rprofile to pick a fast CRAN mirror automatically (cloud.r-project.org redirects to a mirror near you):

## Set CRAN mirror:
local({
  r <- getOption("repos")
  r["CRAN"] <- "https://cloud.r-project.org/"
  options(repos = r)
})
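If you set up machines repeatedly, the snippet can be added from the shell without duplicating it. The following is a minimal sketch; ensure_cran_mirror is a hypothetical helper that appends the block only when the mirror URL is not yet mentioned in the file.

```shell
#!/bin/sh
# Hypothetical helper: append the CRAN mirror block to an .Rprofile
# unless the file already mentions the mirror URL.
# $1 = profile file (defaults to ~/.Rprofile).
ensure_cran_mirror() {
    profile="${1:-$HOME/.Rprofile}"
    mirror="https://cloud.r-project.org/"
    if [ -f "$profile" ] && grep -qF "$mirror" "$profile"; then
        return 0    # already configured, nothing to do
    fi
    cat >> "$profile" <<EOF
## Set CRAN mirror:
local({
  r <- getOption("repos")
  r["CRAN"] <- "$mirror"
  options(repos = r)
})
EOF
}

# Usage: ensure_cran_mirror
# Check: Rscript -e 'getOption("repos")'
```

Calling the helper a second time leaves the file unchanged, so it is safe to keep in a setup script.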

Coloring R Output

Install devtools via install.packages("devtools") (in an R session) if you have not already done so. Then install colorout with devtools::install_github('jalvesaq/colorout').

Automatic Upgrade Of R Packages

New R versions might break some installed packages. The following hook automatically updates the installed packages (/etc/pacman.d/hooks/r-upgrade.hook):

[Trigger]
Operation = Upgrade
Type = Package
Target = r

[Action]
Description = Updates R packages and rebuilds them if necessary after R upgrade
# Rscript is provided by the r package
Depends = r
When = PostTransaction
Exec = /usr/bin/Rscript -e 'update.packages(checkBuilt = TRUE, ask = FALSE)'
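As far as I can tell, pacman runs the Exec command of a hook directly rather than through a shell, so it is safest to name the executable by its absolute path (for example /usr/bin/Rscript). The sketch below checks a hook file for that; hook_exec_path and check_hook_exec are hypothetical helpers, not part of pacman.

```shell
#!/bin/sh
# Hypothetical helper: print the executable named on the first Exec line
# of a hook file. $1 = path to the hook file.
hook_exec_path() {
    sed -n 's/^Exec[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical check: succeed only if that executable is given as an
# absolute path and points to an existing, executable file.
check_hook_exec() {
    exe=$(hook_exec_path "$1")
    case "$exe" in
        /*) [ -x "$exe" ] ;;
        *)  echo "Exec is not an absolute path: $exe" >&2; return 1 ;;
    esac
}

# Usage: check_hook_exec /etc/pacman.d/hooks/r-upgrade.hook
```

It is also worth running the Exec command once by hand as root, since the hook runs as root and does not read your personal ~/.Rprofile.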

Jupyter Notebooks

Install the jupyter-notebook package. To start the web client, run the binary of the same name. jupyter-console provides a terminal-based client.

Kernels

  • Python: The Python 3 kernel is included in the jupyter package. A Python 2 kernel can be installed via pacman -S python2-ipykernel.

  • R: Run the following to install the kernel:

    devtools::install_github('IRkernel/IRkernel')
    IRkernel::installspec()
    

    Note: If you get an error saying that jupyter-client might be missing, reinstall python-jupyter_core as per this issue.
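To confirm that the kernels were registered, you can ask Jupyter for its list of kernel specs. A minimal sketch, with list_kernels being a hypothetical wrapper around the jupyter kernelspec list command:

```shell
#!/bin/sh
# Hypothetical wrapper: list the kernels Jupyter knows about.
# "ir" should appear after IRkernel::installspec(), and "python2"
# after installing python2-ipykernel.
list_kernels() {
    if ! command -v jupyter >/dev/null 2>&1; then
        echo "jupyter not found; install jupyter-notebook first" >&2
        return 127
    fi
    jupyter kernelspec list
}

# Usage: list_kernels
```

If a kernel is missing from the output, it will not show up in the notebook's kernel menu either.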

Spark & SparkR

Spark can be installed from the AUR package apache-spark. The instructions on the Spark wiki page did not work for me. Installing SparkR as described here did work:

cd /opt/apache-spark/R/lib/SparkR
R -e "devtools::install('.')"

Note that $SPARK_HOME is defined in /etc/profile.d/apache-spark.sh and points to /opt/apache-spark. If this is not the case for you, adjust the path in the command above.
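To avoid hard-coding the path, the install command can be derived from $SPARK_HOME. This is a sketch under the assumption that SparkR lives in R/lib/SparkR below the Spark directory, as in the command above; sparkr_dir and install_sparkr are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: resolve the SparkR source directory from $SPARK_HOME,
# falling back to the location used by the apache-spark AUR package.
sparkr_dir() {
    printf '%s/R/lib/SparkR\n' "${SPARK_HOME:-/opt/apache-spark}"
}

# Hypothetical wrapper around the install command from this post.
install_sparkr() {
    dir=$(sparkr_dir)
    [ -d "$dir" ] || { echo "SparkR not found in $dir" >&2; return 1; }
    # Subshell keeps the caller's working directory untouched.
    (cd "$dir" && R -e "devtools::install('.')")
}

# Usage: install_sparkr
```

The wrapper stops with a message if the directory does not exist, which is usually a sign that $SPARK_HOME points somewhere else.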

