using DataFrames, Pipe using MixedModels, GLM using LinearAlgebra, Statistics using StatsModels using StatsModels: pretty_mat
To fit any kind of statistical model you need some kind of numerical representation of your data. Data often comes in a table, a named collection of variables of different types of data. Some of that data is "continuous", or basically numeric. But often our data is not numeric (or continuous), but "categorical", having a finite number of distinct levels.
For instance, let's look at the KB07 dataset:
kb07 = MixedModels.dataset(:kb07) first(kb07, 5)
subj | item | spkr | prec | load | rt_trunc | rt_raw | |
---|---|---|---|---|---|---|---|
String | String | String | String | String | Int16 | Int16 | |
1 | S030 | I01 | new | break | yes | 2267 | 2267 |
2 | S030 | I02 | old | maintain | no | 3856 | 3856 |
3 | S030 | I03 | old | break | no | 1567 | 1567 |
4 | S030 | I04 | new | maintain | no | 1732 | 1732 |
5 | S030 | I05 | new | break | no | 2660 | 2660 |
Here :spkr
, :prec
, and :load
are categortical variables, each of which takes on two different values. If we fit a regression using this dataset, we end up with predictors that refer to specific levels:
f = @formula(rt_trunc ~ 1 + spkr + prec + spkr&prec + (1 | subj)) mod = fit(MixedModel, f, kb07)
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────────────────────── (Intercept) 2425.32 56.5224 42.91 <1e-99 spkr: old 179.992 54.4214 3.31 0.0009 prec: maintain -623.347 54.4214 -11.45 <1e-29 spkr: old & prec: maintain -86.7763 76.9856 -1.13 0.2597 ──────────────────────────────────────────────────────────────────
Let's look at a few rows of the fixed effects design matrix that's generated for this model:
pretty_mat(mod.X[1:5, :])
5×4 Array{Int64,2}: 1 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 0
A few things to note: all the values are 0 or 1, and there's one column of all 1s at the start (that's the (Intercept)
term). Columns 2 and 3 correspond to spkr
and prec
: there's a 0 where spkr == "new"
and a 1 for "old"
. Note that the coefficient name for this column is spkr: old
, which indicates that this predictor indicates the presence of "old", relative to the (implicit) baseline of "new". Similarly for prec: maintain
.
The last column is the interaction term spkr&prec
, and it's the elementwise product of the columns for spkr: new
and pred: maintain
.
You can set your own contrasts via the contrasts=
keyword argument in fit
, with the variable you want to code as the key and contrasts as the value:
using StatsModels contrasts = Dict( :spkr => EffectsCoding(base = "old"), :prec => DummyCoding(levels = ["maintain", "break"]) ) mod2 = fit(MixedModel, f, kb07, contrasts=contrasts)
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ─────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ─────────────────────────────────────────────────────────────── (Intercept) 1848.58 49.5329 37.32 <1e-99 spkr: new -46.6081 27.2263 -1.71 0.0869 prec: break 666.735 38.4928 17.32 <1e-66 spkr: new & prec: break -43.3882 38.4928 -1.13 0.2597 ───────────────────────────────────────────────────────────────
This example illustrates two ways to control the ordering of levels used to compute the contrasts:
you can use base=
to determine the baseline level
you can use levels=
to indicate all the levels that are used in the contrasts, the first of which is automatically set as the baseline.
Let's say you want to use reverse Helmert coding. It's easy using reverse
to flip the order of the levels. Here's the original:
spkr_levels = ["old","new"] fit(MixedModel, f, kb07, contrasts = Dict(:spkr => HelmertCoding(levels=spkr_levels)))
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────────────────────── (Intercept) 2515.31 49.5243 50.79 <1e-99 spkr: new -89.9962 27.2107 -3.31 0.0009 prec: maintain -666.735 38.4928 -17.32 <1e-66 spkr: new & prec: maintain 43.3882 38.4928 1.13 0.2597 ──────────────────────────────────────────────────────────────────
And reversed:
fit(MixedModel, f, kb07, contrasts = Dict(:spkr => HelmertCoding(levels=reverse(spkr_levels))))
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────────────────────── (Intercept) 2515.31 49.5243 50.79 <1e-99 spkr: old 89.9962 27.2107 3.31 0.0009 prec: maintain -666.735 38.4928 -17.32 <1e-66 spkr: old & prec: maintain -43.3882 38.4928 -1.13 0.2597 ──────────────────────────────────────────────────────────────────
StatsModels.jl provides a few commonly used contrast coding schemes, some less-commonly used schemes, and structs that allow you to manually specify your own, custom schemes.
All are subtypes of the AbstractContrasts
type:
using InteractiveUtils subtypes(AbstractContrasts)
7-element Array{Any,1}: StatsModels.ContrastsCoding StatsModels.DummyCoding StatsModels.EffectsCoding StatsModels.FullDummyCoding StatsModels.HelmertCoding StatsModels.HypothesisCoding StatsModels.SeqDiffCoding
And all have fairly extensive documentation via the normal help system. For instance:
# use ?SeqDiffCoding in the REPL
The most commonly used contrasts are DummyCoding
and EffectsCoding
(which are similar to contr.treatment
and contr.sum
in R, respectively).
We also provide HelmertCoding
and SeqDiffCoding
(corresponding to base R's contr.helmert
and MASS's contr.sdiff
).
There are two ways to manually specify contrasts. First, you can specify them directly via ContrastsCoding
. If you do, it's good practice to specify the levels corresponding to the rows of the matrix, although they can be omitted in which case they'll be inferred from the data.
For instance, here's a weird set of contrasts for :spkr
:
cs = Matrix([-1/3 2/3]') contr_manual = Dict(:spkr => StatsModels.ContrastsCoding(cs, levels=["old", "new"])) mod3 = fit(MixedModel, f, kb07, contrasts=contr_manual)
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr + prec + spkr & prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────────────────────── (Intercept) 2545.31 50.3425 50.56 <1e-99 spkr: new -179.992 54.4214 -3.31 0.0009 prec: maintain -681.198 40.582 -16.79 <1e-62 spkr: new & prec: maintain 86.7763 76.9856 1.13 0.2597 ──────────────────────────────────────────────────────────────────
(Note that the estimates and even the signs of the fixed effect βs change when we change the contrasts, but the overall log-likelihood doesn't).
We can see that the values from the contrasts matrix we specified are plugged directly in to the fixed effects matrix, and are also used in computing the predictor for the interaction:
mod3.X[1:5, :]
5×4 Array{Float64,2}: 1.0 0.666667 0.0 0.0 1.0 -0.333333 1.0 -0.333333 1.0 -0.333333 0.0 -0.0 1.0 0.666667 1.0 0.666667 1.0 0.666667 0.0 0.0
Let's say you want Helmert contrasts but you always forget what it's called. Here's how you can manually specify them using StatsModels.ContrastsCoding
.
Because this isn't very interesting with only two levels, let's combine :spkr
and :prec
into a single, 4-level variable:
kb07ex = transform(kb07, AsTable([:spkr, :prec]) => (x -> x.spkr .* "-" .* x.prec) => :spkr_prec); levels = ["new-break", "new-maintain", "old-break", "old-maintain"] f2 = @formula(rt_trunc ~ 1 + spkr_prec + (1 | subj))
FormulaTerm Response: rt_trunc(unknown) Predictors: 1 spkr_prec(unknown) (subj)->1 | subj
using StatsModels: ContrastsCoding man_helm = [-1 -1 -1 1 -1 -1 0 2 -1 0 0 3] contr_helm_man = ContrastsCoding(man_helm[:,1:3], levels=levels) fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => contr_helm_man))
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr_prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ─────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ─────────────────────────────────────────────────────────────── (Intercept) 2181.94 45.6362 47.81 <1e-99 spkr_prec: new-maintain -311.673 27.2107 -11.45 <1e-29 spkr_prec: old-break 163.889 15.7041 10.44 <1e-24 spkr_prec: old-maintain -95.5865 11.1225 -8.59 <1e-17 ───────────────────────────────────────────────────────────────
We can see that this is equivalent to HelmertCoding
:
fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => HelmertCoding(levels=levels)))
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr_prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ─────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ─────────────────────────────────────────────────────────────── (Intercept) 2181.94 45.6362 47.81 <1e-99 spkr_prec: new-maintain -311.673 27.2107 -11.45 <1e-29 spkr_prec: old-break 163.889 15.7041 10.44 <1e-24 spkr_prec: old-maintain -95.5865 11.1225 -8.59 <1e-17 ───────────────────────────────────────────────────────────────
A better way to specify manual contrasts is via HypothesisCoding
, where each row of the matrix corresponds to the weights given to the cell means of the levels corresponding to each column (see Schad et al. 2020 for more information). As before with manual contrasts, this is less interesting with only two levels, so we'll again look at a scenario where we combine :spkr
and :prec
into a single, 4-level predictor, and want to test some strange hypotheses.
Here's the model fit with the default (dummy/treatment-coded contrasts):
mod4 = fit(MixedModel, f2, kb07ex)
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr_prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────────────────── (Intercept) 2425.32 56.5224 42.91 <1e-99 spkr_prec: new-maintain -623.347 54.4214 -11.45 <1e-29 spkr_prec: old-break 179.992 54.4214 3.31 0.0009 spkr_prec: old-maintain -530.131 54.4839 -9.73 <1e-21 ──────────────────────────────────────────────────────────────
Let's see how you could generate custom contrasts for a number of different a priori hypotheses.
One hypothesis you might want to test is that the first condition is different from the second, the second from the third, the third from the fourth, etc. First we have to turn these hypotheses into a numeric form. The null hypothesis that condition 1 is not different from condition 2 can be expressed by saying the difference between the two mean responses in these conditions is zero. Mathematically, we can write that as $\mu_2 - \mu_1 = 0$, or equivalently:
\[ -1 \cdot \mu_1 + 1 \cdot \mu_2 + 0 \cdot \mu_3 + ... + 0 \cdot \mu_n = 0 \]
The weights for each of the means are the entries in our hypothesis vector for this hypothesis. So the first hypothesis vector is [-1, 1, 0, 0]
. Likewise, the second is [0, -1, 1, 0]
($\mu_3 - \mu_2 = 0$), and the third is [0, 0, -1, 1]
($\mu_4 - \mu_3 = 0$). Putting these together we get:
seq_diff_hyps = [-1 1 0 0 0 -1 1 0 0 0 -1 1]
3×4 Array{Int64,2}: -1 1 0 0 0 -1 1 0 0 0 -1 1
These hypotheses correspond to the following contrasts (using the StatsModels.pretty_mat
function to make pretty fractions; this is based on rationalize
function, which is like the fractions
function in R):
seq_diff_contrs = HypothesisCoding(seq_diff_hyps, levels = levels) pretty_mat(seq_diff_contrs.contrasts)
4×3 Array{Rational{Int64},2}: -3//4 -1//2 -1//4 1//4 -1//2 -1//4 1//4 1//2 -1//4 1//4 1//2 3//4
You can see that the contrasts for these hypothese are rather different! It's not immediately obvious just looking at them how they're related (at least not to me), which shows the power of hypothesis coding: you can work in a format that does make intuitive sense (the weights assigned to each group's mean response).
When we fit the model, we should see that the corresponding betas are the same as the differences between the cell means:
fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => seq_diff_contrs))
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr_prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────────────────── (Intercept) 2181.94 45.6362 47.81 <1e-99 spkr_prec: new-maintain -623.347 54.4214 -11.45 <1e-29 spkr_prec: old-break 803.339 54.3902 14.77 <1e-48 spkr_prec: old-maintain -710.123 54.4527 -13.04 <1e-38 ──────────────────────────────────────────────────────────────
Calculating the cell mean differences manually:
cell_means = @pipe kb07ex |> groupby(_, :spkr_prec) |> combine(_, :rt_trunc => mean => :mean_rt) |> innerjoin(DataFrame(spkr_prec=levels), _, on=:spkr_prec) # make sure ordering is right diff(cell_means.mean_rt) # compare with betas above...
3-element Array{Float64,1}: -623.465254474273 803.3392857142858 -709.9560177770661
Note that the intercept corresponds to the grand mean, not to the first level's mean! That's because the hypothesis vectors are zero-mean, so they don't affect the hypothesis for the intercept (as long as the design is balanced).
Let's say we want to test whether the effect of :prec
depends on whether :spkr
is old vs. new. We need one contrast to test the hypothesis that "maintain" != "break"
for "new", and another for "old". That leaves one over, to test the overall difference between "new" and "old".
prec_old = (levels .== "old-break") .- (levels .== "old-maintain")
4-element Array{Int64,1}: 0 0 1 -1
prec_new = (levels .== "new-break") .- (levels .== "new-maintain")
4-element Array{Int64,1}: 1 -1 0 0
old_new = (abs.(prec_old) .- abs.(prec_new)) ./ 2
4-element Array{Float64,1}: -0.5 -0.5 0.5 0.5
contr_hyp = HypothesisCoding(hcat(old_new, prec_old, prec_new)', labels=["old", "(old) break", "(new) break"]) contr_hyp.hypotheses
3×4 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}: -0.5 -0.5 0.5 0.5 0.0 0.0 1.0 -1.0 1.0 -1.0 0.0 0.0
These hypotheses correspond to the following contrasts:
pretty_mat(contr_hyp.contrasts)
4×3 Array{Rational{Int64},2}: -1//2 0//1 1//2 -1//2 0//1 -1//2 1//2 1//2 0//1 1//2 -1//2 0//1
Notice how the ±1 coding in the hypotheses (which translates into the difference between the mean response in those cells) is transformed into ±½ coding in the contrasts.
mod5 = fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => contr_hyp))
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr_prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ───────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ───────────────────────────────────────────────────────────── (Intercept) 2181.94 45.6362 47.81 <1e-99 spkr_prec: old 136.604 38.4928 3.55 0.0004 spkr_prec: (old) break 710.123 54.4527 13.04 <1e-38 spkr_prec: (new) break 623.347 54.4214 11.45 <1e-29 ─────────────────────────────────────────────────────────────
Note that this is equivalent to the /
"nesting" syntax using EffectsCoding
, after adjusting for the 2× factor from the +1/-1 coding:
mod6 = fit(MixedModel, @formula(rt_trunc ~ 1 + spkr/prec + (1|subj)), kb07, contrasts = Dict(:spkr => EffectsCoding(base="new"), :prec => EffectsCoding(base="maintain")))
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr + spkr & prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ─────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ─────────────────────────────────────────────────────────────── (Intercept) 2181.94 45.6362 47.81 <1e-99 spkr: old 68.3021 19.2464 3.55 0.0004 spkr: new & prec: break 311.673 27.2107 11.45 <1e-29 spkr: old & prec: break 355.062 27.2263 13.04 <1e-38 ───────────────────────────────────────────────────────────────
Let's say you want something like Helmert contrasts, but where the βs are interpretable as the difference between the $n$th level and the average of levels $1\ldots n-1$. Here are the hypotheses that correspond to that:
helmert_hypotheses = [-1 -1/2 -1/3 1 -1/2 -1/3 0 1 -1/3 0 0 1]
4×3 Array{Float64,2}: -1.0 -0.5 -0.333333 1.0 -0.5 -0.333333 0.0 1.0 -0.333333 0.0 0.0 1.0
And the resulting contrasts matrix:
contr_helm_hyp = HypothesisCoding(helmert_hypotheses', levels=levels, labels=levels[2:end]) pretty_mat(contr_helm_hyp.contrasts)
4×3 Array{Rational{Int64},2}: -1//2 -1//3 -1//4 1//2 -1//3 -1//4 0//1 2//3 -1//4 0//1 0//1 3//4
Which is similar but not identical to the contrats matrix for HelmertCoding! In a way that I would not be able to derive off the top of my head.
Now we fit the model:
fit(MixedModel, f2, kb07ex, contrasts = Dict(:spkr_prec => contr_helm_hyp))
Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr_prec + (1 | subj) logLik -2 logLik AIC BIC -1.45767208×10⁴ 2.91534417×10⁴ 2.91654417×10⁴ 2.91983781×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95885.39 309.65366 Residual 662657.47 814.03776 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────────────────── (Intercept) 2181.94 45.6362 47.81 <1e-99 spkr_prec: new-maintain -623.347 54.4214 -11.45 <1e-29 spkr_prec: old-break 491.666 47.1123 10.44 <1e-24 spkr_prec: old-maintain -382.346 44.4902 -8.59 <1e-17 ──────────────────────────────────────────────────────────────
...and we can see that the βs for levels are very close to the cumulative means minus the mean for that level, computed manually
lev_means = @pipe kb07ex |> groupby(_, :spkr_prec) |> combine(_, :rt_trunc => mean => :mean_rt) |> innerjoin(DataFrame(spkr_prec=levels), _, on=:spkr_prec) |> transform(_, :mean_rt => (x -> cumsum(x) ./ (1:length(x))) => :cum_mean) |> transform(_, [:mean_rt, :cum_mean] => ((x,y) -> x - lag(y)) => :diff_with_last_mean)
spkr_prec | mean_rt | cum_mean | diff_with_last_mean | |
---|---|---|---|---|
String | Float64 | Float64 | Float64? | |
1 | new-break | 2425.43 | 2425.43 | missing |
2 | new-maintain | 1801.97 | 2113.7 | -623.465 |
3 | old-break | 2605.31 | 2277.57 | 491.607 |
4 | old-maintain | 1895.35 | 2182.02 | -382.218 |
@formula
A formula in Julia is created with the @formula
macro. Between the macro and fitting a model, the formula goes through a number of steps.
The @formula
macro itself does some transformations on the syntax, and creates Term
s
Then a Schema
is extracted from the data, which says which Term
s are ContinuousTerm
s and which are CategoricalTerm
s
The Schema
is then used to transform the original formula into a "concrete formula".
The concrete formula (with all Term
s replaced by continuous/categorical versions) generates model matrix columns when given some data.
The details are described in the documentation, and for the most part modeling packages handle these steps for you. But in the interest of allowing you to do your own weird things, here are a few examples.
f = @formula(y ~ 1 + a + b + a&b)
FormulaTerm Response: y(unknown) Predictors: 1 a(unknown) b(unknown) a(unknown) & b(unknown)
You can inpsect the internal structure with (it's like str
in R):
dump(f)
StatsModels.FormulaTerm{StatsModels.Term,Tuple{StatsModels.ConstantTerm{Int 64},StatsModels.Term,StatsModels.Term,StatsModels.InteractionTerm{Tuple{Sta tsModels.Term,StatsModels.Term}}}} lhs: StatsModels.Term sym: Symbol y rhs: Tuple{StatsModels.ConstantTerm{Int64},StatsModels.Term,StatsModels.T erm,StatsModels.InteractionTerm{Tuple{StatsModels.Term,StatsModels.Term}}} 1: StatsModels.ConstantTerm{Int64} n: Int64 1 2: StatsModels.Term sym: Symbol a 3: StatsModels.Term sym: Symbol b 4: StatsModels.InteractionTerm{Tuple{StatsModels.Term,StatsModels.Term} } terms: Tuple{StatsModels.Term,StatsModels.Term} 1: StatsModels.Term sym: Symbol a 2: StatsModels.Term sym: Symbol b
We can build the same formula directly, using terms:
t_a = term(:a) t_b = term(:b) t_1 = term(1) t_y = term(:y) dump(t_a)
StatsModels.Term sym: Symbol a
f_dir = FormulaTerm(t_y, (t_1, t_a, t_b, InteractionTerm((t_a, t_b))))
FormulaTerm Response: y(unknown) Predictors: 1 a(unknown) b(unknown) a(unknown) & b(unknown)
Or, using operator overloading:
f_op = t_y ~ t_1 + t_a + t_b + t_a & t_b
FormulaTerm Response: y(unknown) Predictors: 1 a(unknown) b(unknown) a(unknown) & b(unknown)
These three are all equivalent:
f == f_dir == f_op
true
If we have some fake data:
df = DataFrame(y = rand(100), a = rand(100), b = repeat([:Q, :R, :S, :T], 25)) first(df, 5)
y | a | b | |
---|---|---|---|
Float64 | Float64 | Symbol | |
1 | 0.307495 | 0.256856 | Q |
2 | 0.670065 | 0.858086 | R |
3 | 0.11368 | 0.276019 | S |
4 | 0.914106 | 0.298955 | T |
5 | 0.7041 | 0.379508 | Q |
We can extract a Schema
:
sch = schema(df)
StatsModels.Schema with 3 entries: y => y a => a b => b
This maps un-typed Term
s to concrete verisons. Now we know that :a
is a continuous variable in this dataset:
sch[term(:a)]
a(continuous)
Categorical terms hold the contrasts matrix and levels:
t_b_concrete = sch[term(:b)]
b(StatsModels.DummyCoding:4→3)
cmat = t_b_concrete.contrasts cmat.matrix
4×3 Array{Float64,2}: 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
cmat.levels
4-element Array{Symbol,1}: :Q :R :S :T
apply_schema
combines terms and schema to get concrete versionsThe canonical case is to apply the schema to the whole formula:
apply_schema(f, sch)
FormulaTerm Response: y(continuous) Predictors: 1 a(continuous) b(StatsModels.DummyCoding:4→3) a(continuous) & b(StatsModels.DummyCoding:4→3)
Note that the schema gets pushed through the interaction term, too.
We can also apply the schema to a single term:
apply_schema(term(:a) & term(:b), sch)
a(continuous) & b(StatsModels.DummyCoding:4→3)
Of course if the schema doesn't have enough information, we'll get an error:
apply_schema(term(:argle_bargle), sch)
ERROR: KeyError: key argle_bargle not found
modelcols
Any term can generate model columns with modelcols
:
t_ab_concrete = apply_schema(term(:a) & term(:b), sch) modelcols(t_ab_concrete, first(df, 6))
6×3 Array{Float64,2}: 0.0 0.0 0.0 0.858086 0.0 0.0 0.0 0.276019 0.0 0.0 0.0 0.298955 0.0 0.0 0.0 0.797318 0.0 0.0
Compare with:
f_concrete = apply_schema(f, sch) @show t_ab_concrete_formula = f_concrete.rhs.terms[end]
t_ab_concrete_formula = f_concrete.rhs.terms[end] = a & b
modelcols(t_ab_concrete_formula, first(df, 6))
6×3 Array{Float64,2}: 0.0 0.0 0.0 0.858086 0.0 0.0 0.0 0.276019 0.0 0.0 0.0 0.298955 0.0 0.0 0.0 0.797318 0.0 0.0
Of course you can generate columns for the whole formula (it returns a tuple of left-hand side, right-hand side columns):
y, X = modelcols(f_concrete, first(df, 6)) X
6×8 Array{Float64,2}: 1.0 0.256856 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.858086 1.0 0.0 0.0 0.858086 0.0 0.0 1.0 0.276019 0.0 1.0 0.0 0.0 0.276019 0.0 1.0 0.298955 0.0 0.0 1.0 0.0 0.0 0.298955 1.0 0.379508 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.797318 1.0 0.0 0.0 0.797318 0.0 0.0
Any table with the right columns can be passed to modelcols
and the right columns are generated, even if some levels are missing:
df2 = DataFrame(a = rand(5), b = [:R, :R, :Q, :S, :R]) modelcols(f_concrete.rhs, df2)
5×8 Array{Float64,2}: 1.0 0.358765 1.0 0.0 0.0 0.358765 0.0 0.0 1.0 0.415072 1.0 0.0 0.0 0.415072 0.0 0.0 1.0 0.975042 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.485075 0.0 1.0 0.0 0.0 0.485075 0.0 1.0 0.542844 1.0 0.0 0.0 0.542844 0.0 0.0
A single row of the model matrix can be generated from a NamedTuple
of data:
data_row = (a = 1.5, b = :T) modelcols(f_concrete.rhs, data_row)
8-element Array{Float64,1}: 1.0 1.5 0.0 0.0 1.0 0.0 0.0 1.5
coefnames
coefnames(f_concrete)
("y", ["(Intercept)", "a", "b: R", "b: S", "b: T", "a & b: R", "a & b: S", "a & b: T"])
coefnames(f_concrete.rhs.terms[end])
3-element Array{String,1}: "a & b: R" "a & b: S" "a & b: T"
coefnames(sch[term(:b)])
3-element Array{String,1}: "b: R" "b: S" "b: T"
The formula syntax is very similar to R, with the exception that an interaction is specified with &
, and that some R syntax is not supported by default (^
, /
outside of MixedModels.jl).
Any function calls that are not special syntax (+
, &
, *
, and ~
) are treated as normal julia code, so you can write things like
f2 = @formula(log(y) ~ 1 + (a + a^2) * b)
FormulaTerm Response: (y)->log(y) Predictors: 1 a(unknown) (a)->a ^ 2 b(unknown) a(unknown) & b(unknown) (a)->a ^ 2 & b(unknown)
f2_concrete = apply_schema(f2, sch)
FormulaTerm Response: (y)->log(y) Predictors: 1 a(continuous) (a)->a ^ 2 b(StatsModels.DummyCoding:4→3) a(continuous) & b(StatsModels.DummyCoding:4→3) (a)->a ^ 2 & b(StatsModels.DummyCoding:4→3)
y2, X2 = modelcols(f2_concrete, first(df, 5))
([-1.1792956706381146, -0.40038061953493465, -2.1743636432356452, -0.089808 88454433106, -0.3508347943014509], [1.0 0.2568563742620089 … 0.0 0.0; 1.0 0 .8580859902013884 … 0.0 0.0; … ; 1.0 0.2989550398201468 … 0.0 0.08937411583 386556; 1.0 0.3795084943411997 … 0.0 0.0])
y2 == log.(df[1:5, :y])
true
X2
5×12 Array{Float64,2}: 1.0 0.256856 0.0659752 0.0 0.0 0.0 … 0.0 0.0 0.0 1.0 0.858086 0.736312 1.0 0.0 0.0 0.736312 0.0 0.0 1.0 0.276019 0.0761864 0.0 1.0 0.0 0.0 0.0761864 0.0 1.0 0.298955 0.0893741 0.0 0.0 1.0 0.0 0.0 0.0893741 1.0 0.379508 0.144027 0.0 0.0 0.0 0.0 0.0 0.0
coefnames(f2_concrete.rhs)
12-element Array{String,1}: "(Intercept)" "a" "a ^ 2" "b: R" "b: S" "b: T" "a & b: R" "a & b: S" "a & b: T" "a ^ 2 & b: R" "a ^ 2 & b: S" "a ^ 2 & b: T"
You may have noticed that zercocorr
and |
were not included in the list of special syntax above. StatsModels.jl provides a method to add special syntax for the @formula
that's specific to certain models. This works using the standard Julia techniques of multiple dispatch, by providing methods that intercept apply_schema
for particular combinations of functions, schema, and context (model type), like so:
function StatsModels.apply_schema(
t::FunctionTerm{typeof(|)},
schema::StatsModels.FullRank,
Mod::Type{<:MixedModel},
)
schema = StatsModels.FullRank(schema.schema)
lhs, rhs = t.args_parsed
if !StatsModels.hasintercept(lhs) && !StatsModels.omitsintercept(lhs)
lhs = InterceptTerm{true}() + lhs
end
lhs, rhs = apply_schema.((lhs, rhs), Ref(schema), Mod)
RandomEffectsTerm(MatrixTerm(lhs), rhs)
end
There's a simpler example in the StatsModels docs which adds a poly(x, n)
syntax for polynomial regression.
Let's see how each of the predictors in the KB07 dataset does on its own.
template = @formula(rt_trunc ~ 1 + (1|subj) + (1|subj)) fits = map([:spkr, :prec, :load]) do p f = template.lhs ~ template.rhs + term(p) fit(MixedModel, f, kb07) end
3-element Array{MixedModels.LinearMixedModel{Float64},1}: Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + spkr + (1 | subj) logLik -2 logLik AIC BIC -1.47156041×10⁴ 2.94312081×10⁴ 2.94392081×10⁴ 2.94611658×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 92268.768 303.75775 Residual 777856.310 881.96163 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────── (Intercept) 2113.29 50.1677 42.12 <1e-99 spkr: old 137.777 41.7046 3.30 0.0010 ────────────────────────────────────────────────── Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + prec + (1 | subj) logLik -2 logLik AIC BIC -1.45836279×10⁴ 2.91672558×10⁴ 2.91752558×10⁴ 2.91972134×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 95717.665 309.38272 Residual 667961.065 817.28885 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ───────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ───────────────────────────────────────────────────── (Intercept) 2515.42 49.5539 50.76 <1e-99 prec: maintain -666.945 38.6465 -17.26 <1e-66 ───────────────────────────────────────────────────── Linear mixed model fit by maximum likelihood rt_trunc ~ 1 + load + (1 | subj) logLik -2 logLik AIC BIC -1.4713984×10⁴ 2.94279681×10⁴ 2.94359681×10⁴ 2.94579257×10⁴ Variance components: Column Variance Std.Dev. subj (Intercept) 92327.49 303.8544 Residual 776400.55 881.1359 Number of obs: 1789; levels of grouping factors: 56 Fixed-effects parameters: ────────────────────────────────────────────────── Estimate Std.Error z value P(>|z|) ────────────────────────────────────────────────── (Intercept) 2103.56 50.1817 41.92 <1e-99 load: yes 156.884 41.6655 3.77 0.0002 ──────────────────────────────────────────────────
Which predictor provides the best fit to the data on its own?
sort!(fits, by=objective) foreach(fits) do fit println(round(fit.objective), ": ", fit.formula) end
29167.0: rt_trunc ~ 1 + prec + (1 | subj) 29428.0: rt_trunc ~ 1 + load + (1 | subj) 29431.0: rt_trunc ~ 1 + spkr + (1 | subj)
Looks like it's prec
, followed by load
, and then spkr
.