A quick tour of StatsModels.jl

Dave Kleinschmidt

18 February 2020

using DataFrames, Pipe
using MixedModels, GLM
using LinearAlgebra, Statistics

using StatsModels
using StatsModels: pretty_mat

Contrast coding

What and why?

To fit any kind of statistical model you need some kind of numerical representation of your data. Data often comes in a table, a named collection of variables of different types of data. Some of that data is "continuous", or basically numeric. But often our data is not numeric (or continuous), but "categorical", having a finite number of distinct levels.

For instance, let's look at the KB07 dataset:

kb07 = MixedModels.dataset(:kb07)
first(kb07, 5)

5 rows × 7 columns

	subj	item	spkr	prec	load	rt_trunc	rt_raw
	String	String	String	String	String	Int16	Int16
1	S030	I01	new	break	yes	2267	2267
2	S030	I02	old	maintain	no	3856	3856
3	S030	I03	old	break	no	1567	1567
4	S030	I04	new	maintain	no	1732	1732
5	S030	I05	new	break	no	2660	2660

Here :spkr, :prec, and :load are categortical variables, each of which takes on two different values. If we fit a regression using this dataset, we end up with predictors that refer to specific levels:

f = @formula(rt_trunc ~ 1 + spkr + prec + spkr&prec + (1 | subj))
mod = fit(MixedModel, f, kb07)

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────────────────────
                             Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────────────────────
(Intercept)                 2425.32      56.5224    42.91   <1e-99
spkr: old                    179.992     54.4214     3.31   0.0009
prec: maintain              -623.347     54.4214   -11.45   <1e-29
spkr: old & prec: maintain   -86.7763    76.9856    -1.13   0.2597
──────────────────────────────────────────────────────────────────

Let's look at a few rows of the fixed effects design matrix that's generated for this model:

pretty_mat(mod.X[1:5, :])

5×4 Array{Int64,2}:
 1  0  0  0
 1  1  1  1
 1  1  0  0
 1  0  1  0
 1  0  0  0

A few things to note: all the values are 0 or 1, and there's one column of all 1s at the start (that's the (Intercept) term). Columns 2 and 3 correspond to spkr and prec: there's a 0 where spkr == "new" and a 1 for "old". Note that the coefficient name for this column is spkr: old, which indicates that this predictor indicates the presence of "old", relative to the (implicit) baseline of "new". Similarly for prec: maintain.

The last column is the interaction term spkr&prec, and it's the elementwise product of the columns for spkr: new and pred: maintain.

How to take control

You can set your own contrasts via the contrasts= keyword argument in fit, with the variable you want to code as the key and contrasts as the value:

using StatsModels

contrasts = Dict(
    :spkr => EffectsCoding(base = "old"),
    :prec => DummyCoding(levels = ["maintain", "break"])
)

mod2 = fit(MixedModel, f, kb07, contrasts=contrasts)

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
───────────────────────────────────────────────────────────────
                          Estimate  Std.Error  z value  P(>|z|)
───────────────────────────────────────────────────────────────
(Intercept)              1848.58      49.5329    37.32   <1e-99
spkr: new                 -46.6081    27.2263    -1.71   0.0869
prec: break               666.735     38.4928    17.32   <1e-66
spkr: new & prec: break   -43.3882    38.4928    -1.13   0.2597
───────────────────────────────────────────────────────────────

This example illustrates two ways to control the ordering of levels used to compute the contrasts:

you can use base= to determine the baseline level
you can use levels= to indicate all the levels that are used in the contrasts, the first of which is automatically set as the baseline.

Reversed Helmert coding

Let's say you want to use reverse Helmert coding. It's easy using reverse to flip the order of the levels. Here's the original:

spkr_levels = ["old","new"]
fit(MixedModel,
    f,
    kb07,
    contrasts = Dict(:spkr => HelmertCoding(levels=spkr_levels)))

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────────────────────
                             Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────────────────────
(Intercept)                 2515.31      49.5243    50.79   <1e-99
spkr: new                    -89.9962    27.2107    -3.31   0.0009
prec: maintain              -666.735     38.4928   -17.32   <1e-66
spkr: new & prec: maintain    43.3882    38.4928     1.13   0.2597
──────────────────────────────────────────────────────────────────

And reversed:

fit(MixedModel,
    f,
    kb07,
    contrasts = Dict(:spkr => HelmertCoding(levels=reverse(spkr_levels))))

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────────────────────
                             Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────────────────────
(Intercept)                 2515.31      49.5243    50.79   <1e-99
spkr: old                     89.9962    27.2107     3.31   0.0009
prec: maintain              -666.735     38.4928   -17.32   <1e-66
spkr: old & prec: maintain   -43.3882    38.4928    -1.13   0.2597
──────────────────────────────────────────────────────────────────

Built in contrast coding schemes

StatsModels.jl provides a few commonly used contrast coding schemes, some less-commonly used schemes, and structs that allow you to manually specify your own, custom schemes.

All are subtypes of the AbstractContrasts type:

using InteractiveUtils
subtypes(AbstractContrasts)

7-element Array{Any,1}:
 StatsModels.ContrastsCoding
 StatsModels.DummyCoding
 StatsModels.EffectsCoding
 StatsModels.FullDummyCoding
 StatsModels.HelmertCoding
 StatsModels.HypothesisCoding
 StatsModels.SeqDiffCoding

And all have fairly extensive documentation via the normal help system. For instance:

# use ?SeqDiffCoding in the REPL

Standard contrasts

The most commonly used contrasts are DummyCoding and EffectsCoding (which are similar to contr.treatment and contr.sum in R, respectively).

"Exotic" contrasts

We also provide HelmertCoding and SeqDiffCoding (corresponding to base R's contr.helmert and MASS's contr.sdiff).

Manual contrasts

There are two ways to manually specify contrasts. First, you can specify them directly via ContrastsCoding. If you do, it's good practice to specify the levels corresponding to the rows of the matrix, although they can be omitted in which case they'll be inferred from the data.

For instance, here's a weird set of contrasts for :spkr:

cs = Matrix([-1/3 2/3]')
contr_manual = Dict(:spkr => StatsModels.ContrastsCoding(cs, levels=["old", "new"]))
mod3 = fit(MixedModel, f, kb07, contrasts=contr_manual)

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────────────────────
                             Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────────────────────
(Intercept)                 2545.31      50.3425    50.56   <1e-99
spkr: new                   -179.992     54.4214    -3.31   0.0009
prec: maintain              -681.198     40.582    -16.79   <1e-62
spkr: new & prec: maintain    86.7763    76.9856     1.13   0.2597
──────────────────────────────────────────────────────────────────

(Note that the estimates and even the signs of the fixed effect βs change when we change the contrasts, but the overall log-likelihood doesn't).

We can see that the values from the contrasts matrix we specified are plugged directly in to the fixed effects matrix, and are also used in computing the predictor for the interaction:

mod3.X[1:5, :]

5×4 Array{Float64,2}:
 1.0   0.666667  0.0   0.0
 1.0  -0.333333  1.0  -0.333333
 1.0  -0.333333  0.0  -0.0
 1.0   0.666667  1.0   0.666667
 1.0   0.666667  0.0   0.0

Example: manual Helmert contrasts

Let's say you want Helmert contrasts but you always forget what it's called. Here's how you can manually specify them using StatsModels.ContrastsCoding.

Because this isn't very interesting with only two levels, let's combine :spkr and :prec into a single, 4-level variable:

kb07ex = transform(kb07, AsTable([:spkr, :prec]) => (x -> x.spkr .* "-" .* x.prec) => :spkr_prec);
levels = ["new-break", "new-maintain", "old-break", "old-maintain"]
f2 = @formula(rt_trunc ~ 1 + spkr_prec + (1 | subj))

FormulaTerm
Response:
  rt_trunc(unknown)
Predictors:
  1
  spkr_prec(unknown)
  (subj)->1 | subj

using StatsModels: ContrastsCoding
man_helm = [-1 -1 -1
             1 -1 -1
             0  2 -1
             0  0  3]
contr_helm_man = ContrastsCoding(man_helm[:,1:3], levels=levels)
fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => contr_helm_man))

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr_prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
───────────────────────────────────────────────────────────────
                          Estimate  Std.Error  z value  P(>|z|)
───────────────────────────────────────────────────────────────
(Intercept)              2181.94      45.6362    47.81   <1e-99
spkr_prec: new-maintain  -311.673     27.2107   -11.45   <1e-29
spkr_prec: old-break      163.889     15.7041    10.44   <1e-24
spkr_prec: old-maintain   -95.5865    11.1225    -8.59   <1e-17
───────────────────────────────────────────────────────────────

We can see that this is equivalent to HelmertCoding:

fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => HelmertCoding(levels=levels)))

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr_prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
───────────────────────────────────────────────────────────────
                          Estimate  Std.Error  z value  P(>|z|)
───────────────────────────────────────────────────────────────
(Intercept)              2181.94      45.6362    47.81   <1e-99
spkr_prec: new-maintain  -311.673     27.2107   -11.45   <1e-29
spkr_prec: old-break      163.889     15.7041    10.44   <1e-24
spkr_prec: old-maintain   -95.5865    11.1225    -8.59   <1e-17
───────────────────────────────────────────────────────────────

Contrasts from hypotheses

A better way to specify manual contrasts is via HypothesisCoding, where each row of the matrix corresponds to the weights given to the cell means of the levels corresponding to each column (see Schad et al. 2020 for more information). As before with manual contrasts, this is less interesting with only two levels, so we'll again look at a scenario where we combine :spkr and :prec into a single, 4-level predictor, and want to test some strange hypotheses.

Here's the model fit with the default (dummy/treatment-coded contrasts):

mod4 = fit(MixedModel, f2, kb07ex)

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr_prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────────────────
                         Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────────────────
(Intercept)              2425.32     56.5224    42.91   <1e-99
spkr_prec: new-maintain  -623.347    54.4214   -11.45   <1e-29
spkr_prec: old-break      179.992    54.4214     3.31   0.0009
spkr_prec: old-maintain  -530.131    54.4839    -9.73   <1e-21
──────────────────────────────────────────────────────────────

Let's see how you could generate custom contrasts for a number of different a priori hypotheses.

Example: Sequential differences coding

One hypothesis you might want to test is that the first condition is different from the second, the second from the third, the third from the fourth, etc. First we have to turn these hypotheses into a numeric form. The null hypothesis that condition 1 is not different from condition 2 can be expressed by saying the difference between the two mean responses in these conditions is zero. Mathematically, we can write that as $\mu_2 - \mu_1 = 0$, or equivalently:

\[ -1 \cdot \mu_1 + 1 \cdot \mu_2 + 0 \cdot \mu_3 + ... + 0 \cdot \mu_n = 0 \]

The weights for each of the means are the entries in our hypothesis vector for this hypothesis. So the first hypothesis vector is [-1, 1, 0, 0]. Likewise, the second is [0, -1, 1, 0] ($\mu_3 - \mu_2 = 0$), and the third is [0, 0, -1, 1] ($\mu_4 - \mu_3 = 0$). Putting these together we get:

seq_diff_hyps = [-1  1  0  0
                  0 -1  1  0
                  0  0 -1  1]

3×4 Array{Int64,2}:
 -1   1   0  0
  0  -1   1  0
  0   0  -1  1

These hypotheses correspond to the following contrasts (using the StatsModels.pretty_mat function to make pretty fractions; this is based on rationalize function, which is like the fractions function in R):

seq_diff_contrs = HypothesisCoding(seq_diff_hyps, levels = levels)
pretty_mat(seq_diff_contrs.contrasts)

4×3 Array{Rational{Int64},2}:
 -3//4  -1//2  -1//4
  1//4  -1//2  -1//4
  1//4   1//2  -1//4
  1//4   1//2   3//4

You can see that the contrasts for these hypothese are rather different! It's not immediately obvious just looking at them how they're related (at least not to me), which shows the power of hypothesis coding: you can work in a format that does make intuitive sense (the weights assigned to each group's mean response).

When we fit the model, we should see that the corresponding betas are the same as the differences between the cell means:

fit(MixedModel,
    f2,
    kb07ex,
    contrasts = Dict(:spkr_prec => seq_diff_contrs))

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr_prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────────────────
                         Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────────────────
(Intercept)              2181.94     45.6362    47.81   <1e-99
spkr_prec: new-maintain  -623.347    54.4214   -11.45   <1e-29
spkr_prec: old-break      803.339    54.3902    14.77   <1e-48
spkr_prec: old-maintain  -710.123    54.4527   -13.04   <1e-38
──────────────────────────────────────────────────────────────

Calculating the cell mean differences manually:

cell_means = @pipe kb07ex |>
    groupby(_, :spkr_prec) |>
    combine(_, :rt_trunc => mean => :mean_rt) |>
    innerjoin(DataFrame(spkr_prec=levels), _, on=:spkr_prec)  # make sure ordering is right
diff(cell_means.mean_rt)                                      # compare with betas above...

3-element Array{Float64,1}:
 -623.465254474273
  803.3392857142858
 -709.9560177770661

Note that the intercept corresponds to the grand mean, not to the first level's mean! That's because the hypothesis vectors are zero-mean, so they don't affect the hypothesis for the intercept (as long as the design is balanced).

Example: custom, a priori hypotheses

Let's say we want to test whether the effect of :prec depends on whether :spkr is old vs. new. We need one contrast to test the hypothesis that "maintain" != "break" for "new", and another for "old". That leaves one over, to test the overall difference between "new" and "old".

prec_old = (levels .== "old-break") .- (levels .== "old-maintain")

4-element Array{Int64,1}:
  0
  0
  1
 -1

prec_new = (levels .== "new-break") .- (levels .== "new-maintain")

4-element Array{Int64,1}:
  1
 -1
  0
  0

old_new = (abs.(prec_old) .- abs.(prec_new)) ./ 2

4-element Array{Float64,1}:
 -0.5
 -0.5
  0.5
  0.5

contr_hyp = HypothesisCoding(hcat(old_new, prec_old, prec_new)',
                             labels=["old", "(old) break", "(new) break"])
contr_hyp.hypotheses

3×4 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 -0.5  -0.5  0.5   0.5
  0.0   0.0  1.0  -1.0
  1.0  -1.0  0.0   0.0

These hypotheses correspond to the following contrasts:

pretty_mat(contr_hyp.contrasts)

4×3 Array{Rational{Int64},2}:
 -1//2   0//1   1//2
 -1//2   0//1  -1//2
  1//2   1//2   0//1
  1//2  -1//2   0//1

Notice how the ±1 coding in the hypotheses (which translates into the difference between the mean response in those cells) is transformed into ±½ coding in the contrasts.

mod5 = fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => contr_hyp))

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr_prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
─────────────────────────────────────────────────────────────
                        Estimate  Std.Error  z value  P(>|z|)
─────────────────────────────────────────────────────────────
(Intercept)             2181.94     45.6362    47.81   <1e-99
spkr_prec: old           136.604    38.4928     3.55   0.0004
spkr_prec: (old) break   710.123    54.4527    13.04   <1e-38
spkr_prec: (new) break   623.347    54.4214    11.45   <1e-29
─────────────────────────────────────────────────────────────

Note that this is equivalent to the / "nesting" syntax using EffectsCoding, after adjusting for the 2× factor from the +1/-1 coding:

mod6 = fit(MixedModel,
           @formula(rt_trunc ~ 1 + spkr/prec + (1|subj)), 
           kb07,
           contrasts = Dict(:spkr => EffectsCoding(base="new"),
                            :prec => EffectsCoding(base="maintain")))

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + spkr & prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
───────────────────────────────────────────────────────────────
                          Estimate  Std.Error  z value  P(>|z|)
───────────────────────────────────────────────────────────────
(Intercept)              2181.94      45.6362    47.81   <1e-99
spkr: old                  68.3021    19.2464     3.55   0.0004
spkr: new & prec: break   311.673     27.2107    11.45   <1e-29
spkr: old & prec: break   355.062     27.2263    13.04   <1e-38
───────────────────────────────────────────────────────────────

Example: Helmert contrasts that actually make sense

Let's say you want something like Helmert contrasts, but where the βs are interpretable as the difference between the $n$th level and the average of levels $1\ldots n-1$. Here are the hypotheses that correspond to that:

helmert_hypotheses = [-1 -1/2 -1/3
                       1 -1/2 -1/3
                       0  1   -1/3
                       0  0    1]

4×3 Array{Float64,2}:
 -1.0  -0.5  -0.333333
  1.0  -0.5  -0.333333
  0.0   1.0  -0.333333
  0.0   0.0   1.0

And the resulting contrasts matrix:

contr_helm_hyp = HypothesisCoding(helmert_hypotheses',
                                  levels=levels, labels=levels[2:end])
pretty_mat(contr_helm_hyp.contrasts)

4×3 Array{Rational{Int64},2}:
 -1//2  -1//3  -1//4
  1//2  -1//3  -1//4
  0//1   2//3  -1//4
  0//1   0//1   3//4

Which is similar but not identical to the contrats matrix for HelmertCoding! In a way that I would not be able to derive off the top of my head.

Now we fit the model:

fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => contr_helm_hyp))

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr_prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45767208×10⁴  2.91534417×10⁴  2.91654417×10⁴  2.91983781×10⁴

Variance components:
            Column    Variance  Std.Dev. 
subj     (Intercept)   95885.39 309.65366
Residual              662657.47 814.03776
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────────────────
                         Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────────────────
(Intercept)              2181.94     45.6362    47.81   <1e-99
spkr_prec: new-maintain  -623.347    54.4214   -11.45   <1e-29
spkr_prec: old-break      491.666    47.1123    10.44   <1e-24
spkr_prec: old-maintain  -382.346    44.4902    -8.59   <1e-17
──────────────────────────────────────────────────────────────

...and we can see that the βs for levels are very close to the cumulative means minus the mean for that level, computed manually

lev_means = @pipe kb07ex |>
    groupby(_, :spkr_prec) |>
    combine(_, :rt_trunc => mean => :mean_rt) |>
    innerjoin(DataFrame(spkr_prec=levels), _, on=:spkr_prec) |>
    transform(_, :mean_rt => (x -> cumsum(x) ./ (1:length(x))) => :cum_mean) |>
    transform(_, [:mean_rt, :cum_mean] => ((x,y) -> x - lag(y)) => :diff_with_last_mean)

4 rows × 4 columns

	spkr_prec	mean_rt	cum_mean	diff_with_last_mean
	String	Float64	Float64	Float64?
1	new-break	2425.43	2425.43	missing
2	new-maintain	1801.97	2113.7	-623.465
3	old-break	2605.31	2277.57	491.607
4	old-maintain	1895.35	2182.02	-382.218

The `@formula`

A formula in Julia is created with the @formula macro. Between the macro and fitting a model, the formula goes through a number of steps.

The @formula macro itself does some transformations on the syntax, and creates Terms
Then a Schema is extracted from the data, which says which Terms are ContinuousTerms and which are CategoricalTerms
The Schema is then used to transform the original formula into a "concrete formula".
The concrete formula (with all Terms replaced by continuous/categorical versions) generates model matrix columns when given some data.

The details are described in the documentation, and for the most part modeling packages handle these steps for you. But in the interest of allowing you to do your own weird things, here are a few examples.

A formula is made of terms

f = @formula(y ~ 1 + a + b + a&b)

FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  a(unknown)
  b(unknown)
  a(unknown) & b(unknown)

You can inpsect the internal structure with (it's like str in R):

dump(f)

StatsModels.FormulaTerm{StatsModels.Term,Tuple{StatsModels.ConstantTerm{Int
64},StatsModels.Term,StatsModels.Term,StatsModels.InteractionTerm{Tuple{Sta
tsModels.Term,StatsModels.Term}}}}
  lhs: StatsModels.Term
    sym: Symbol y
  rhs: Tuple{StatsModels.ConstantTerm{Int64},StatsModels.Term,StatsModels.T
erm,StatsModels.InteractionTerm{Tuple{StatsModels.Term,StatsModels.Term}}}
    1: StatsModels.ConstantTerm{Int64}
      n: Int64 1
    2: StatsModels.Term
      sym: Symbol a
    3: StatsModels.Term
      sym: Symbol b
    4: StatsModels.InteractionTerm{Tuple{StatsModels.Term,StatsModels.Term}
}
      terms: Tuple{StatsModels.Term,StatsModels.Term}
        1: StatsModels.Term
          sym: Symbol a
        2: StatsModels.Term
          sym: Symbol b

We can build the same formula directly, using terms:

t_a = term(:a)
t_b = term(:b)
t_1 = term(1)
t_y = term(:y)

dump(t_a)

StatsModels.Term
  sym: Symbol a

f_dir = FormulaTerm(t_y, (t_1, t_a, t_b, InteractionTerm((t_a, t_b))))

FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  a(unknown)
  b(unknown)
  a(unknown) & b(unknown)

Or, using operator overloading:

f_op = t_y ~ t_1 + t_a + t_b + t_a & t_b

FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  a(unknown)
  b(unknown)
  a(unknown) & b(unknown)

These three are all equivalent:

f == f_dir == f_op

true

The schema gives concrete terms

If we have some fake data:

df = DataFrame(y = rand(100), a = rand(100), b = repeat([:Q, :R, :S, :T], 25))
first(df, 5)

5 rows × 3 columns

	y	a	b
	Float64	Float64	Symbol
1	0.307495	0.256856	Q
2	0.670065	0.858086	R
3	0.11368	0.276019	S
4	0.914106	0.298955	T
5	0.7041	0.379508	Q

We can extract a Schema:

sch = schema(df)

StatsModels.Schema with 3 entries:
  y => y
  a => a
  b => b

This maps un-typed Terms to concrete verisons. Now we know that :a is a continuous variable in this dataset:

sch[term(:a)]

a(continuous)

Categorical terms hold the contrasts matrix and levels:

t_b_concrete = sch[term(:b)]

b(StatsModels.DummyCoding:4→3)

cmat = t_b_concrete.contrasts
cmat.matrix

4×3 Array{Float64,2}:
 0.0  0.0  0.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0

cmat.levels

4-element Array{Symbol,1}:
 :Q
 :R
 :S
 :T

`apply_schema` combines terms and schema to get concrete versions

The canonical case is to apply the schema to the whole formula:

apply_schema(f, sch)

FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  a(continuous)
  b(StatsModels.DummyCoding:4→3)
  a(continuous) & b(StatsModels.DummyCoding:4→3)

Note that the schema gets pushed through the interaction term, too.

We can also apply the schema to a single term:

apply_schema(term(:a) & term(:b), sch)

a(continuous) & b(StatsModels.DummyCoding:4→3)

Of course if the schema doesn't have enough information, we'll get an error:

apply_schema(term(:argle_bargle), sch)

ERROR: KeyError: key argle_bargle not found

Concrete terms generate arrays with `modelcols`

Any term can generate model columns with modelcols:

t_ab_concrete = apply_schema(term(:a) & term(:b), sch)
modelcols(t_ab_concrete, first(df, 6))

6×3 Array{Float64,2}:
 0.0       0.0       0.0
 0.858086  0.0       0.0
 0.0       0.276019  0.0
 0.0       0.0       0.298955
 0.0       0.0       0.0
 0.797318  0.0       0.0

Compare with:

f_concrete = apply_schema(f, sch)
@show t_ab_concrete_formula = f_concrete.rhs.terms[end]

t_ab_concrete_formula = f_concrete.rhs.terms[end] = a & b

modelcols(t_ab_concrete_formula, first(df, 6))

6×3 Array{Float64,2}:
 0.0       0.0       0.0
 0.858086  0.0       0.0
 0.0       0.276019  0.0
 0.0       0.0       0.298955
 0.0       0.0       0.0
 0.797318  0.0       0.0

Of course you can generate columns for the whole formula (it returns a tuple of left-hand side, right-hand side columns):

y, X = modelcols(f_concrete, first(df, 6))
X

6×8 Array{Float64,2}:
 1.0  0.256856  0.0  0.0  0.0  0.0       0.0       0.0
 1.0  0.858086  1.0  0.0  0.0  0.858086  0.0       0.0
 1.0  0.276019  0.0  1.0  0.0  0.0       0.276019  0.0
 1.0  0.298955  0.0  0.0  1.0  0.0       0.0       0.298955
 1.0  0.379508  0.0  0.0  0.0  0.0       0.0       0.0
 1.0  0.797318  1.0  0.0  0.0  0.797318  0.0       0.0

Predicting based on new data

Any table with the right columns can be passed to modelcols and the right columns are generated, even if some levels are missing:

df2 = DataFrame(a = rand(5), b = [:R, :R, :Q, :S, :R])
modelcols(f_concrete.rhs, df2)

5×8 Array{Float64,2}:
 1.0  0.358765  1.0  0.0  0.0  0.358765  0.0       0.0
 1.0  0.415072  1.0  0.0  0.0  0.415072  0.0       0.0
 1.0  0.975042  0.0  0.0  0.0  0.0       0.0       0.0
 1.0  0.485075  0.0  1.0  0.0  0.0       0.485075  0.0
 1.0  0.542844  1.0  0.0  0.0  0.542844  0.0       0.0

Use a named tuple for a single row

A single row of the model matrix can be generated from a NamedTuple of data:

data_row = (a = 1.5, b = :T)
modelcols(f_concrete.rhs, data_row)

8-element Array{Float64,1}:
 1.0
 1.5
 0.0
 0.0
 1.0
 0.0
 0.0
 1.5

Get coefficient names for any term with `coefnames`

coefnames(f_concrete)

("y", ["(Intercept)", "a", "b: R", "b: S", "b: T", "a & b: R", "a & b: S", 
"a & b: T"])

coefnames(f_concrete.rhs.terms[end])

3-element Array{String,1}:
 "a & b: R"
 "a & b: S"
 "a & b: T"

coefnames(sch[term(:b)])

3-element Array{String,1}:
 "b: R"
 "b: S"
 "b: T"

Formula syntax

The formula syntax is very similar to R, with the exception that an interaction is specified with &, and that some R syntax is not supported by default (^, / outside of MixedModels.jl).

Non-special calls

Any function calls that are not special syntax (+, &, *, and ~) are treated as normal julia code, so you can write things like

f2 = @formula(log(y) ~ 1 + (a + a^2) * b)

FormulaTerm
Response:
  (y)->log(y)
Predictors:
  1
  a(unknown)
  (a)->a ^ 2
  b(unknown)
  a(unknown) & b(unknown)
  (a)->a ^ 2 & b(unknown)

f2_concrete = apply_schema(f2, sch)

FormulaTerm
Response:
  (y)->log(y)
Predictors:
  1
  a(continuous)
  (a)->a ^ 2
  b(StatsModels.DummyCoding:4→3)
  a(continuous) & b(StatsModels.DummyCoding:4→3)
  (a)->a ^ 2 & b(StatsModels.DummyCoding:4→3)

y2, X2 = modelcols(f2_concrete, first(df, 5))

([-1.1792956706381146, -0.40038061953493465, -2.1743636432356452, -0.089808
88454433106, -0.3508347943014509], [1.0 0.2568563742620089 … 0.0 0.0; 1.0 0
.8580859902013884 … 0.0 0.0; … ; 1.0 0.2989550398201468 … 0.0 0.08937411583
386556; 1.0 0.3795084943411997 … 0.0 0.0])

y2 == log.(df[1:5, :y])

true

X2

5×12 Array{Float64,2}:
 1.0  0.256856  0.0659752  0.0  0.0  0.0  …  0.0       0.0        0.0
 1.0  0.858086  0.736312   1.0  0.0  0.0     0.736312  0.0        0.0
 1.0  0.276019  0.0761864  0.0  1.0  0.0     0.0       0.0761864  0.0
 1.0  0.298955  0.0893741  0.0  0.0  1.0     0.0       0.0        0.0893741
 1.0  0.379508  0.144027   0.0  0.0  0.0     0.0       0.0        0.0

coefnames(f2_concrete.rhs)

12-element Array{String,1}:
 "(Intercept)"
 "a"
 "a ^ 2"
 "b: R"
 "b: S"
 "b: T"
 "a & b: R"
 "a & b: S"
 "a & b: T"
 "a ^ 2 & b: R"
 "a ^ 2 & b: S"
 "a ^ 2 & b: T"

Advanced: making the ordinary special

You may have noticed that zercocorr and | were not included in the list of special syntax above. StatsModels.jl provides a method to add special syntax for the @formula that's specific to certain models. This works using the standard Julia techniques of multiple dispatch, by providing methods that intercept apply_schema for particular combinations of functions, schema, and context (model type), like so:

function StatsModels.apply_schema(
    t::FunctionTerm{typeof(|)},
    schema::StatsModels.FullRank,
    Mod::Type{<:MixedModel},
)
    schema = StatsModels.FullRank(schema.schema)
    lhs, rhs = t.args_parsed
    if !StatsModels.hasintercept(lhs) && !StatsModels.omitsintercept(lhs)
        lhs = InterceptTerm{true}() + lhs
    end
    lhs, rhs = apply_schema.((lhs, rhs), Ref(schema), Mod)
    RandomEffectsTerm(MatrixTerm(lhs), rhs)
end

There's a simpler example in the StatsModels docs which adds a poly(x, n) syntax for polynomial regression.

Example: specifying many different models

Let's see how each of the predictors in the KB07 dataset does on its own.

template = @formula(rt_trunc ~ 1 + (1|subj) + (1|subj))
fits = map([:spkr, :prec, :load]) do p
    f = template.lhs ~ template.rhs + term(p)
    fit(MixedModel, f, kb07)
end

3-element Array{MixedModels.LinearMixedModel{Float64},1}:
 Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.47156041×10⁴  2.94312081×10⁴  2.94392081×10⁴  2.94611658×10⁴

Variance components:
            Column    Variance   Std.Dev. 
subj     (Intercept)   92268.768 303.75775
Residual              777856.310 881.96163
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────
             Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────
(Intercept)  2113.29     50.1677    42.12   <1e-99
spkr: old     137.777    41.7046     3.30   0.0010
──────────────────────────────────────────────────
 Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + prec + (1 | subj)
     logLik        -2 logLik          AIC             BIC       
 -1.45836279×10⁴  2.91672558×10⁴  2.91752558×10⁴  2.91972134×10⁴

Variance components:
            Column    Variance   Std.Dev. 
subj     (Intercept)   95717.665 309.38272
Residual              667961.065 817.28885
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
─────────────────────────────────────────────────────
                Estimate  Std.Error  z value  P(>|z|)
─────────────────────────────────────────────────────
(Intercept)     2515.42     49.5539    50.76   <1e-99
prec: maintain  -666.945    38.6465   -17.26   <1e-66
─────────────────────────────────────────────────────
 Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + load + (1 | subj)
     logLik       -2 logLik         AIC            BIC      
 -1.4713984×10⁴ 2.94279681×10⁴ 2.94359681×10⁴ 2.94579257×10⁴

Variance components:
            Column    Variance Std.Dev. 
subj     (Intercept)   92327.49 303.8544
Residual              776400.55 881.1359
 Number of obs: 1789; levels of grouping factors: 56

  Fixed-effects parameters:
──────────────────────────────────────────────────
             Estimate  Std.Error  z value  P(>|z|)
──────────────────────────────────────────────────
(Intercept)  2103.56     50.1817    41.92   <1e-99
load: yes     156.884    41.6655     3.77   0.0002
──────────────────────────────────────────────────

Which predictor provides the best fit to the data on its own?

sort!(fits, by=objective)
foreach(fits) do fit
    println(round(fit.objective), ": ", fit.formula)
end

29167.0: rt_trunc ~ 1 + prec + (1 | subj)
29428.0: rt_trunc ~ 1 + load + (1 | subj)
29431.0: rt_trunc ~ 1 + spkr + (1 | subj)

Looks like it's prec, followed by load, and then spkr.