More on shrinkage plots

Author

Phillip Alday, Douglas Bates, and Reinhold Kliegl

Published

2024-06-27

I have stated that the likelihood criterion used to fit linear mixed-effects can be considered as balancing fidelity to the data (i.e. fits the observed data well) versus model complexity.
This is similar to some of the criterion used in Machine Learning (ML), except that the criterion for LMMs has a rigorous mathematical basis.
In the shrinkage plot we consider the values of the random-effects coefficients for the fitted values of the model versus those from a model in which there is no penalty for model complexity.
If there is strong subject-to-subject variation then the model fit will tend to values of the random effects similar to those without a penalty on complexity.
If the random effects term is not contributing much (i.e. it is “inert”) then the random effects will be shrunk considerably towards zero in some directions.

Code

using CairoMakie
using DataFrames
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using Random
using ProgressMeter

ProgressMeter.ijulia_behavior(:clear);

Load the kb07 data set (don’t tell Reinhold that I used these data).

kb07 = MixedModels.dataset(:kb07)

Arrow.Table with 1789 rows, 7 columns, and schema:
 :subj      String
 :item      String
 :spkr      String
 :prec      String
 :load      String
 :rt_trunc  Int16
 :rt_raw    Int16

contrasts = Dict(
  :subj => Grouping(),
  :item => Grouping(),
  :spkr => HelmertCoding(),
  :prec => HelmertCoding(),
  :load => HelmertCoding(),
)
m1 = let
  form = @formula(
    rt_trunc ~
      1 +
      spkr * prec * load +
      (1 + spkr + prec + load | subj) +
      (1 + spkr + prec + load | item)
  )
  fit(MixedModel, form, kb07; contrasts)
end

Minimizing 874    Time: 0:00:01 ( 1.43 ms/it)
  objective:  28637.971010073215

	Est.	SE	z	p	σ_subj	σ_item
(Intercept)	2181.6424	77.3515	28.20	<1e-99	301.8721	362.4695
spkr: old	67.7496	17.9607	3.77	0.0002	33.0588	41.1159
prec: maintain	-333.9200	47.1563	-7.08	<1e-11	58.8512	247.3299
load: yes	78.8007	19.7269	3.99	<1e-04	66.9526	43.3991
spkr: old & prec: maintain	-21.9960	15.8191	-1.39	0.1644
spkr: old & load: yes	18.3832	15.8191	1.16	0.2452
prec: maintain & load: yes	4.5327	15.8191	0.29	0.7745
spkr: old & prec: maintain & load: yes	23.6377	15.8191	1.49	0.1351
Residual	669.0515

VarCorr(m1)

	Column	Variance	Std.Dev	Corr.
subj	(Intercept)	91126.7545	301.8721
	spkr: old	1092.8862	33.0588	+1.00
	prec: maintain	3463.4603	58.8512	-0.62	-0.62
	load: yes	4482.6493	66.9526	+0.36	+0.36	+0.51
item	(Intercept)	131384.1084	362.4695
	spkr: old	1690.5210	41.1159	+0.42
	prec: maintain	61172.0781	247.3299	-0.69	+0.37
	load: yes	1883.4785	43.3991	+0.29	+0.14	-0.13
Residual		447629.9270	669.0515

issingular(m1)

true

print(m1)

Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + spkr & prec + spkr & load + prec & load + spkr & prec & load + (1 + spkr + prec + load | subj) + (1 + spkr + prec + load | item)
    logLik   -2 logLik      AIC         AICc        BIC     
 -14318.9855  28637.9710  28695.9710  28696.9602  28855.1640

Variance components:
             Column       Variance  Std.Dev.   Corr.
subj     (Intercept)      91126.7545 301.8721
         spkr: old         1092.8862  33.0588 +1.00
         prec: maintain    3463.4603  58.8512 -0.62 -0.62
         load: yes         4482.6493  66.9526 +0.36 +0.36 +0.51
item     (Intercept)     131384.1084 362.4695
         spkr: old         1690.5210  41.1159 +0.42
         prec: maintain   61172.0781 247.3299 -0.69 +0.37
         load: yes         1883.4785  43.3991 +0.29 +0.14 -0.13
Residual                 447629.9270 669.0515
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
───────────────────────────────────────────────────────────────────────────────
                                             Coef.  Std. Error      z  Pr(>|z|)
───────────────────────────────────────────────────────────────────────────────
(Intercept)                             2181.64        77.3515  28.20    <1e-99
spkr: old                                 67.7496      17.9607   3.77    0.0002
prec: maintain                          -333.92        47.1563  -7.08    <1e-11
load: yes                                 78.8007      19.7269   3.99    <1e-04
spkr: old & prec: maintain               -21.996       15.8191  -1.39    0.1644
spkr: old & load: yes                     18.3832      15.8191   1.16    0.2452
prec: maintain & load: yes                 4.53273     15.8191   0.29    0.7745
spkr: old & prec: maintain & load: yes    23.6377      15.8191   1.49    0.1351
───────────────────────────────────────────────────────────────────────────────

1 Expressing the covariance of random effects

Earlier today we mentioned that the parameters being optimized are from a “matrix square root” of the covariance matrix for the random effects. There is one such lower triangular matrix for each grouping factor.

l1 = first(m1.λ)   # Cholesky factor of relative covariance for subj

4×4 LowerTriangular{Float64, Matrix{Float64}}:
  0.451194     ⋅          ⋅          ⋅ 
  0.0494115   0.0         ⋅          ⋅ 
 -0.0547474  -0.0357256  0.0588535   ⋅ 
  0.0359075  -0.0484787  0.0798414  0.0

Notice the zero on the diagonal. A triangular matrix with zeros on the diagonal is singular.

l2 = last(m1.λ)    # this one is also singular

4×4 LowerTriangular{Float64, Matrix{Float64}}:
  0.541766    ⋅            ⋅          ⋅ 
  0.0259255  0.0557178     ⋅          ⋅ 
 -0.253795   0.26783      0.0226493   ⋅ 
  0.0189316  0.000867924  0.0620364  0.0

To regenerate the covariance matrix we need to know that the covariance is not the square of l1, it is l1 * l1' (so that the result is symmetric) and multiplied by σ̂²

Σ₁ = varest(m1) .* (l1 * l1')

4×4 Matrix{Float64}:
  91126.8    9979.54   -11057.2   7252.16
   9979.54   1092.89    -1210.91   794.203
 -11057.2   -1210.91     3463.46  1998.68
   7252.16    794.203    1998.68  4482.65

diag(Σ₁)  # compare to the variance column in the VarCorr output

4-element Vector{Float64}:
 91126.75451781915
  1092.8862206704753
  3463.4602904990634
  4482.649272250181

sqrt.(diag(Σ₁))

4-element Vector{Float64}:
 301.8720830381954
  33.05882969299542
  58.85117068078649
  66.95258973520129

2 Shrinkage plots

Code

shrinkageplot(m1)

The upper left panel shows the perfect negative correlation for those two components of the random effects.

shrinkageplot(m1, :item)

X1 = Int.(m1.X')

8×1789 Matrix{Int64}:
  1   1   1   1   1  1   1   1   1   1  …   1   1   1   1   1   1   1  1   1
 -1   1   1  -1  -1  1   1  -1  -1   1      1  -1  -1   1   1  -1  -1  1   1
 -1   1  -1   1  -1  1  -1   1  -1   1     -1   1  -1   1  -1   1  -1  1  -1
  1  -1  -1  -1  -1  1   1   1   1  -1      1   1   1  -1  -1  -1  -1  1   1
  1   1  -1  -1   1  1  -1  -1   1   1     -1  -1   1   1  -1  -1   1  1  -1
 -1  -1  -1   1   1  1   1  -1  -1  -1  …   1  -1  -1  -1  -1   1   1  1   1
 -1  -1   1  -1   1  1  -1   1  -1  -1     -1   1  -1  -1   1  -1   1  1  -1
  1  -1   1   1  -1  1  -1  -1   1  -1     -1  -1   1  -1   1   1  -1  1  -1

X1 * X1'

8×8 Matrix{Int64}:
 1789    -1    -1     3    -3     1     1     3
   -1  1789    -3     1    -1     3     3     1
   -1    -3  1789     1    -1     3     3     1
    3     1     1  1789     3    -1    -1    -3
   -3    -1    -1     3  1789     1     1     3
    1     3     3    -1     1  1789    -3    -1
    1     3     3    -1     1    -3  1789    -1
    3     1     1    -3     3    -1    -1  1789

3 How to interpret a shrinkage plot

Extreme shrinkage (shrunk to a line or to a point) is easy to interpret - the term is not providing benefit and can be removed.
When the range of the blue dots (shrunk values) is comparable to those of the red dots (unshrunk) it indicates that the term after shrinkage is about as strong as without shrinkage.
By itself, this doesn’t mean that the term is important. In some ways you need to get a feeling for the absolute magnitude of the random effects in addition to the relative magnitude.
Small magnitude and small relative magnitude indicate you can drop that term

4 Conclusions from these plots

Only the intercept for the subj appears to be contributing explanatory power
For the item both the intercept and the spkr appear to be contributing

m2 = let
  form = @formula(
    rt_trunc ~
      1 + prec * spkr * load + (1 | subj) + (1 + prec | item)
  )
  fit(MixedModel, form, kb07; contrasts)
end

	Est.	SE	z	p	σ_item	σ_subj
(Intercept)	2181.7582	77.4709	28.16	<1e-99	364.7286	298.1109
prec: maintain	-333.8582	47.4629	-7.03	<1e-11	252.6687
spkr: old	67.8114	16.0526	4.22	<1e-04
load: yes	78.6849	16.0525	4.90	<1e-06
prec: maintain & spkr: old	-21.8802	16.0525	-1.36	0.1729
prec: maintain & load: yes	4.4710	16.0526	0.28	0.7806
spkr: old & load: yes	18.3214	16.0526	1.14	0.2537
prec: maintain & spkr: old & load: yes	23.5219	16.0525	1.47	0.1428
Residual	678.9318

VarCorr(m2)

	Column	Variance	Std.Dev	Corr.
item	(Intercept)	133026.918	364.729
	prec: maintain	63841.496	252.669	-0.70
subj	(Intercept)	88870.080	298.111
Residual		460948.432	678.932

Code

shrinkageplot(m2)

m3 = let
  form = @formula(
    rt_trunc ~
      1 + prec + spkr + load + (1 | subj) + (1 + prec | item)
  )
  fit(MixedModel, form, kb07; contrasts)
end

	Est.	SE	z	p	σ_item	σ_subj
(Intercept)	2181.8526	77.4681	28.16	<1e-99	364.7126	298.0259
prec: maintain	-333.7906	47.4472	-7.03	<1e-11	252.5212
spkr: old	67.8790	16.0785	4.22	<1e-04
load: yes	78.5904	16.0785	4.89	<1e-05
Residual	680.0319

VarCorr(m3)

	Column	Variance	Std.Dev	Corr.
item	(Intercept)	133015.244	364.713
	prec: maintain	63766.937	252.521	-0.70
subj	(Intercept)	88819.436	298.026
Residual		462443.388	680.032

rng = Random.seed!(1234321);

m3btstrp = parametricbootstrap(rng, 2000, m3);

DataFrame(shortestcovint(m3btstrp))

9×5 DataFrame

Row	type	group	names	lower	upper
	String	String?	String?	Float64	Float64
1	β	missing	(Intercept)	2013.95	2319.53
2	β	missing	prec: maintain	-429.807	-241.429
3	β	missing	spkr: old	35.3339	95.7272
4	β	missing	load: yes	47.067	111.04
5	σ	item	(Intercept)	267.788	452.9
6	σ	item	prec: maintain	171.547	314.7
7	ρ	item	(Intercept), prec: maintain	-0.893081	-0.457084
8	σ	subj	(Intercept)	235.921	364.717
9	σ	residual	missing	657.736	703.054

ridgeplot(m3btstrp)

Figure 3: Ridge plot of the fixed-effects coefficients from the bootstrap sample

ridgeplot(m3btstrp; show_intercept=false)

Figure 4: Ridge plot of the fixed-effects coefficients from the bootstrap sample (with the intercept)

m4 = let
  form = @formula(
    rt_trunc ~
      1 + prec + spkr + load + (1 + prec | item) + (1 | subj)
  )
  fit(MixedModel, form, kb07; contrasts)
end

	Est.	SE	z	p	σ_item	σ_subj
(Intercept)	2181.8526	77.4681	28.16	<1e-99	364.7126	298.0259
prec: maintain	-333.7906	47.4472	-7.03	<1e-11	252.5212
spkr: old	67.8790	16.0785	4.22	<1e-04
load: yes	78.5904	16.0785	4.89	<1e-05
Residual	680.0319

m4bstrp = parametricbootstrap(rng, 2000, m4);

ridgeplot(m4bstrp; show_intercept=false)

DataFrame(shortestcovint(m4bstrp))

9×5 DataFrame

Row	type	group	names	lower	upper
	String	String?	String?	Float64	Float64
1	β	missing	(Intercept)	2030.75	2342.66
2	β	missing	prec: maintain	-432.619	-249.063
3	β	missing	spkr: old	35.4729	97.9576
4	β	missing	load: yes	47.0469	108.272
5	σ	item	(Intercept)	261.52	444.426
6	σ	item	prec: maintain	177.803	318.382
7	ρ	item	(Intercept), prec: maintain	-0.904897	-0.477346
8	σ	subj	(Intercept)	234.301	361.964
9	σ	residual	missing	657.11	701.879

VarCorr(m4)

	Column	Variance	Std.Dev	Corr.
item	(Intercept)	133015.244	364.713
	prec: maintain	63766.937	252.521	-0.70
subj	(Intercept)	88819.436	298.026
Residual		462443.388	680.032

Code

let mods = [m1, m2, m4]
  DataFrame(;
    geomdof=(sum ∘ leverage).(mods),
    npar=dof.(mods),
    deviance=deviance.(mods),
    AIC=aic.(mods),
    BIC=bic.(mods),
    AICc=aicc.(mods),
  )
end

3×6 DataFrame

Row	geomdof	npar	deviance	AIC	BIC	AICc
	Float64	Int64	Float64	Float64	Float64	Float64
1	130.126	29	28638.0	28696.0	28855.2	28697.0
2	107.543	13	28658.5	28684.5	28755.8	28684.7
3	103.478	9	28663.9	28681.9	28731.3	28682.0

scatter(fitted(m4), residuals(m4))

Figure 5: Residuals versus fitted values for model m4