Useful packages

Author

Phillip Alday

Published

2024-06-28

Unlike R, Julia does not immediately expose a huge number of functions, but instead requires loading packages (whether from the standard library or from the broader package ecosystem) for a lot of relevant functionality for statistical analysis. There are technical reasons for this, such as the ease of using the Julia package system. One further motivation is that Julia is aimed at a broader “technical computing” audience (like MATLAB or perhaps Python) and less at a “statistical analysis” audience.

This has two important implications:

  1. Even relatively simple programs will often load several packages.
  2. Packages are often focused on adding a relatively narrow set of functionality, which means that “basic” functionality (e.g. reading a CSV file and manipulating it as a DataFrame) is often split across multiple packages. In other words, see the the first point!

This notebook is not intended to be an exhaustive list of packages, but rather to highlight a few packages that I suspect will be particularly useful. Before getting onto the packages, I have one final hint: take advantage of how easy and first-class package management in Julia is. Having good package management makes reproducible analyses much easier and avoids breaking old analyses when you start a new one. The package-manager REPL mode (activated by typing ] at the julia> prompt) is very useful.

1 Data wrangling

1.1 Reading data

  • Arrow.jl a high performance format for data storage, accessible in R via the arrow package and in Python via pyarrow. (Confusingly, the function for reading and writing Arrow format files in R is called read_feather and write_feather, but the modern Arrow format is distinct from the older Feather format provided by the feather package.) This is the format that we store the example and test datasets in for MixedModels.jl.

  • CSV.jl useful for reading comma-separated values, tab-separated values and basically everything handled by the read.csv and read.table family of functions in R.

Note that by default both Arrow.jl and CSV.jl do not return a DataFrame, but rather “column tables” – named tuples of column vectors.

1.2 DataFrames

Unlike in R, DataFrames are not part of the base language, nor the standard library.

DataFrames.jl provides the basic infrastructure around DataFrames, as well as its own mini language for doing the split-apply-combine approach that underlies R’s dplyr and much of the tidyverse. The DataFrames.jl documentation is the place for looking at how to e.g. read in a CSV or Arrow file as a DataFrame. Note that DataFrames.jl by default depends on CategoricalArrays.jl to handle the equivalent of factor in the R world, but there is an alternative package for factor-like array type in Julia, PooledArrays.jl. PooledArrays are simpler, but more limited than CategoricalArrays and we (Phillip and Doug) sometimes use them in our examples and simulations. The tables produced by reading an Arrow file have their own representation of factor-like data as DictEncoded arrays.

DataFrame.jl’s mini language can be a bit daunting, if you’re used to manipulations in the style of base R or the tidyverse. For that, there are several options; recently, we’e had particularly nice experiences with DataFrameMacros.jl and Chain.jl for a convenient syntax to connect or “pipe” together successive operations. It’s your choice whether and which of these add-ons you want to use! Phillip tends to write his code using raw DataFrames.jl, but Doug really enjoys DataFrameMacros.jl.

The recently added Tidier collection of Julia packages is popular with those coming from the tidyverse.

1.3 Regression

Unlike in R, neither formula processing nor basic regression are part of the base language or the standard library.

The formula syntax and basic contrast-coding schemes in Julia is provided by StatsModels.jl. By default, MixedModels.jl re-exports the @formula macro and most commonly used contrast schemes from StatsModels.jl, so you often don’t have to worry about loading StatsModels.jl directly. The same is true for GLM.jl, which provides basic linear and generalized linear models, such as ordinary least squares (OLS) regression and logistic regression, i.e. the classical, non mixed regression models.

The basic functionality looks quite similar to R, e.g.

julia > lm(@formula(y ~ 1 + x), data)
julia > glm(@formula(y ~ 1 + x), data, Binomial(), LogitLink())

but the more general modelling API (also used by MixedModels.jl) is also supported:

julia > fit(LinearModel, @formula(y ~ 1 + x), mydata)
julia > fit(
  GeneralizedLinearModel,
  @formula(y ~ 1 + x),
  data,
  Binomial(),
  LogitLink(),
)

(You can also specify your model matrices directly and skip the formula interface, but we don’t recommend this as it’s easy to mess up in really subtle but very problematic ways.)

1.4 @formula, macros and domain-specific languages

As a sidebar: why is @formula a macro and not a normal function? Well, that’s because formulas are essentially their own domain-specific language (a variant of Wilkinson-Rogers notation) and macros are used for manipulating the language itself – or in this case, handling an entirely new, embedded language! This is also why macros are used by packages like Turing.jl and Soss.jl that define a language for Bayesian probabilistic programming like PyMC3 or Stan.

1.5 Extensions to the formula syntax

There are several ongoing efforts to extend the formula syntax to include some of the “extras” available in R, e.g. RegressionFormulae.jl to use the caret (^) notation to limit interactions to a certain order ((a+b+c)^2 generates a + b + c + a&b + a&c + b&c, but not a&b&c). Note also that Julia uses & to express interactions, not : like in R.

1.6 Standardizing Predictors

Although function calls such as log can be used within Julia formulae, they must act on a rowwise basis, i.e. on observations. Transformations such as z-scoring or centering (often done with scale in R) require knowledge of the entire column. StandardizedPredictors.jl provides functions for centering, scaling, and z-scoring within the formula. These are treated as pseudo-contrasts and computed on demand, meaning that predict and effects (see next) computations will handle these transformations on new data (e.g. centering new data around the mean computed during fitting the original data) correctly and automatically.

1.7 Effects

John Fox’s effects package in R (and the related ggeffects package for plotting these using ggplot2) provides a nice way to visualize a model’s overall view of the data. This functionality is provided by Effects.jl and works out-of-the-box with most regression model packages in Julia (including MixedModels.jl). Support for formulae with embedded functions (such as log) is not yet complete, but we’re working on it!

1.8 Estimated Marginal / Least Square Means

Effects.jl provides a subset of the functionality (basic estimated-marginal means and exhaustive pairwise comparisons) of the R package emmeans package. However, it is often better to use sensible, hypothesis-driven contrast coding than to compute all pairwise comparisons after the fact. 😃

2 Hypothesis Testing

Classical statistical tests such as the t-test can be found in the package HypothesisTests.jl.

3 Plotting ecosystem

Throughout this course, we have used the Makie ecosystem for plotting, but there are several alternatives in Julia.

3.1 Makie

The Makie ecosystem is a relatively new take on graphics that aims to be both powerful and easy to use. Makie.jl itself only provides abstract definitions for many components (and is used in e.g. MixedModelsMakie.jl to define plot types for MixedModels.jl). The actual plotting and rendering is handled by a backend package such as CairoMakie.jl (good for Quarto notebooks or rending static 2D images) and GLMakie.jl (good for dynamic, interactive visuals and 3D images). AlgebraOfGraphics.jl builds a grammar of graphics upon the Makie framework. It’s a great way to get good plots very quickly, but extensive customization is still best achieved by using Makie directly.

3.2 Plots.jl

Plots.jl is the original plotting package in Julia, but we often find it difficult to work with compared to some of the other alternatives. StatsPlots.jl builds on this, adding common statistical plots, while UnicodePlots.jl renders plots as Unicode characters directly in the REPL.

PGFPlotsX.jl is a very new package that writes directly to PGF (the format used by LaTeX’s tikz framework) and can stand alone or be used as a rendering backend for the Plots.jl ecosystem.

3.3 Gadfly

Gadfly.jl was the original attempt to create a plotting system in Julia based on the grammar of graphics (the “gg” in ggplot2). Development has largely stalled, but some functionality still exceeds AlgebraOfGraphics.jl, which has taken up the grammar of graphics mantle. Notably, the MixedModels.jl documentation still uses Gadfly as of this writing (early September 2021).

3.4 Others

There are many other graphics packages available in Julia, often wrapping well-established frameworks such as VegaLite.

4 Connecting to Other Languages

Using Julia doesn’t mean you have to leave all the packages you knew in other languages behind. In Julia, it’s often possible to even easily and quickly invoke code from other languages from within Julia.

RCall.jl provides a very convenient interface for interacting with R. JellyMe4.jl adds support for moving MixedModels.jl and lme4 models back and forth between the languages (which means that you can use emmeans, sjtools, DHARMa, car, etc. to examine MixedModels.jl models!). RData.jl provides support for reading .rds and .rda files from Julia, while RDatasets.jl provides convenient access to many of the standard datasets provided by R and various R packages.

PyCall.jl provides a very convenient way for interacting with Python code and packages. PyPlot.jl builds upon this foundation to provide support for Python’s matplotlib. Similarly, PyMNE.jl and PyFOOOF.jl provide some additional functionality to make interacting with MNE-Python and FOOOF from within Julia even easier than with vanilla PyCall. More recently, PythonCall.jl has proven to be a populat alternative to PyCall.jl.

For MATLAB users, there is also MATLAB.jl

Cxx.jl provides interoperability with C++. It also provides a C++ REPL mode, making it possible to treating C++ much more like a dynamic language than the traditional compiler toolchain would allow.

Support for calling C and Fortran is part of the Julia standard library.

Back to top