I have stated that the likelihood criterion used to fit linear mixed-effects models can be considered as balancing fidelity to the data (i.e. how well the model fits the observed data) against model complexity.
This is similar to some of the criteria used in Machine Learning (ML), except that the criterion for LMMs has a rigorous mathematical basis.
In a shrinkage plot we compare the values of the random-effects coefficients from the fitted model against those from a model in which there is no penalty for model complexity.
If there is strong subject-to-subject variation then the fitted model will tend toward values of the random effects similar to those without a penalty on complexity.
If the random-effects term is not contributing much (i.e. it is “inert”) then the random effects will be shrunk considerably towards zero in some directions.
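The balance can be seen in a minimal numeric sketch (hypothetical values, not the kb07 data): in a balanced one-way random-effects model the penalized (BLUP) estimate of a group effect equals the unpenalized group-mean deviation times a shrinkage factor that goes to zero as the between-group variance goes to zero.

```julia
# Hypothetical illustration of shrinkage, not the MixedModels machinery.
unpenalized = [10.0, -4.0, 6.0]   # per-group deviations from the grand mean
n = 5                             # observations per group
var_resid = 4.0                   # residual variance
var_between = 1.0                 # between-group variance
shrink = var_between / (var_between + var_resid / n)
penalized = shrink .* unpenalized # shrinkage toward zero is strong when var_between is small
```

An “inert” term corresponds to `var_between` near zero, in which case `shrink` is near zero and the penalized coefficients collapse toward the origin.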
```julia
using CairoMakie
using DataFrames
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using Random
using ProgressMeter

const progress = false
```
Load the kb07 data set (don’t tell Reinhold that I used these data).
```julia
kb07 = MixedModels.dataset(:kb07)
```

```
Arrow.Table with 1789 rows, 7 columns, and schema:
 :subj      String
 :item      String
 :spkr      String
 :prec      String
 :load      String
 :rt_trunc  Int16
 :rt_raw    Int16
```
```julia
contrasts = Dict(
    :spkr => HelmertCoding(),
    :prec => HelmertCoding(),
    :load => HelmertCoding(),
)
```
```julia
m1 = let
    form = @formula(
        rt_trunc ~
            1 +
            spkr * prec * load +
            (1 + spkr + prec + load | subj) +
            (1 + spkr + prec + load | item)
    )
    fit(MixedModel, form, kb07; contrasts, progress)
end
```
| | Est. | SE | z | p | σ_subj | σ_item |
|---|---|---|---|---|---|---|
| (Intercept) | 2181.6729 | 77.3131 | 28.22 | <1e-99 | 301.6736 | 362.3182 |
| spkr: old | 67.7490 | 18.2747 | 3.71 | 0.0002 | 42.5441 | 40.7073 |
| prec: maintain | -333.9206 | 47.1388 | -7.08 | <1e-11 | 61.9978 | 246.8078 |
| load: yes | 78.7702 | 19.5387 | 4.03 | <1e-04 | 64.9874 | 42.5139 |
| spkr: old & prec: maintain | -21.9655 | 15.8070 | -1.39 | 0.1646 | | |
| spkr: old & load: yes | 18.3838 | 15.8070 | 1.16 | 0.2448 | | |
| prec: maintain & load: yes | 4.5333 | 15.8070 | 0.29 | 0.7743 | | |
| spkr: old & prec: maintain & load: yes | 23.6072 | 15.8070 | 1.49 | 0.1353 | | |
| Residual | 668.5371 | | | | | |
| | Column | Variance | Std.Dev | Corr. | | |
|---|---|---|---|---|---|---|
| subj | (Intercept) | 91006.9887 | 301.6736 | | | |
| | spkr: old | 1810.0012 | 42.5441 | +0.79 | | |
| | prec: maintain | 3843.7268 | 61.9978 | -0.59 | +0.02 | |
| | load: yes | 4223.3647 | 64.9874 | +0.36 | +0.85 | +0.54 |
| item | (Intercept) | 131274.4558 | 362.3182 | | | |
| | spkr: old | 1657.0830 | 40.7073 | +0.44 | | |
| | prec: maintain | 60914.1141 | 246.8078 | -0.69 | +0.35 | |
| | load: yes | 1807.4303 | 42.5139 | +0.32 | +0.15 | -0.14 |
| Residual | | 446941.7901 | 668.5371 | | | |
```
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + spkr & prec + spkr & load + prec & load + spkr & prec & load + (1 + spkr + prec + load | subj) + (1 + spkr + prec + load | item)
     logLik   -2 logLik        AIC        AICc         BIC
 -14318.5619  28637.1239  28695.1239  28696.1131  28854.3168

Variance components:
            Column         Variance    Std.Dev.  Corr.
subj     (Intercept)      91006.9887  301.6736
         spkr: old         1810.0012   42.5441  +0.79
         prec: maintain    3843.7268   61.9978  -0.59 +0.02
         load: yes         4223.3647   64.9874  +0.36 +0.85 +0.54
item     (Intercept)     131274.4558  362.3182
         spkr: old         1657.0830   40.7073  +0.44
         prec: maintain   60914.1141  246.8078  -0.69 +0.35
         load: yes         1807.4303   42.5139  +0.32 +0.15 -0.14
Residual                 446941.7901  668.5371
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
───────────────────────────────────────────────────────────────────────────────
                                            Coef.  Std. Error      z  Pr(>|z|)
───────────────────────────────────────────────────────────────────────────────
(Intercept)                             2181.67       77.3131   28.22    <1e-99
spkr: old                                 67.749      18.2747    3.71    0.0002
prec: maintain                          -333.921      47.1388   -7.08    <1e-11
load: yes                                 78.7702     19.5387    4.03    <1e-04
spkr: old & prec: maintain               -21.9655     15.807    -1.39    0.1646
spkr: old & load: yes                     18.3838     15.807     1.16    0.2448
prec: maintain & load: yes                 4.53334    15.807     0.29    0.7743
spkr: old & prec: maintain & load: yes    23.6072     15.807     1.49    0.1353
───────────────────────────────────────────────────────────────────────────────
```
Expressing the covariance of random effects
Earlier today we mentioned that the parameters being optimized are from a “matrix square root” of the covariance matrix for the random effects. There is one such lower triangular matrix for each grouping factor.
```julia
l1 = first(m1.λ)  # Cholesky factor of relative covariance for subj
```

```
4×4 LowerTriangular{Float64, Matrix{Float64}}:
  0.451244     ⋅          ⋅           ⋅
  0.0502516   0.0390451   ⋅           ⋅
 -0.0549019   0.074379   0.00732171   ⋅
  0.0352204   0.0894228  0.0145792   0.0
```
Notice the zero in the last position of the diagonal. A triangular matrix with a zero on the diagonal is singular.
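To see this numerically, a tiny example unrelated to the fitted model: the determinant of a triangular matrix is the product of its diagonal elements, so one zero on the diagonal forces singularity.

```julia
using LinearAlgebra

# A lower-triangular matrix with a zero diagonal element is singular.
L = LowerTriangular([1.0 0.0; 0.5 0.0])
det(L)            # 0.0 — product of the diagonal
rank(Matrix(L))   # 1 — rank-deficient
```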
```julia
l2 = last(m1.λ)  # this one is also singular
```

```
4×4 LowerTriangular{Float64, Matrix{Float64}}:
  0.541957    ⋅           ⋅          ⋅
  0.0268894  0.0546312    ⋅          ⋅
 -0.252892   0.267823    0.024643    ⋅
  0.0200364  0.00105896  0.0595843  0.0095462
```
To regenerate the covariance matrix we need to know that the covariance is not the square of l1; it is l1 * l1' (so that the result is symmetric), multiplied by the estimated residual variance σ̂².
```julia
Σ₁ = varest(m1) .* (l1 * l1')
```

```
4×4 Matrix{Float64}:
  91007.0    10134.8    -11072.6    7103.25
  10134.8     1810.0       64.9058  2351.54
 -11072.6       64.9058   3843.73   2156.16
   7103.25    2351.54     2156.16   4223.36
```
```julia
diag(Σ₁)  # compare to the variance column in the VarCorr output
```

```
4-element Vector{Float64}:
 91006.98870968402
  1810.0012414879016
  3843.7267826016114
  4223.364731548873
```
```julia
sqrt.(diag(Σ₁))  # compare to the standard deviation column
```

```
4-element Vector{Float64}:
 301.6736460310778
  42.54410936296471
  61.99779659473078
  64.98741979451772
```
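This reconstruction works for any lower-triangular factor, not just the fitted one: σ̂² · λλ' is always symmetric and positive semidefinite, even when λ is singular. A minimal self-contained check (toy numbers, not model output):

```julia
using LinearAlgebra

# For any lower-triangular λ and scale σ², σ² * (λ * λ') is symmetric PSD.
λ = LowerTriangular([0.5 0.0; 0.2 0.0])  # singular, like l1 above
σ² = 4.0
Σ = σ² .* (λ * λ')
issymmetric(Σ)                       # true
all(eigvals(Symmetric(Σ)) .>= -1e-12)  # true — PSD up to roundoff
```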
Shrinkage plots
A shrinkage plot for this model can be produced with `shrinkageplot(m1)` from MixedModelsMakie. Its upper-left panel shows the nearly perfect negative correlation between those two components of the random effects.
The design behind these random effects is nearly balanced, as can be seen from the transposed fixed-effects model matrix, whose ±1 entries are the Helmert-coded contrasts,

```
8×1789 Matrix{Int64}:
  1   1   1   1   1   1   1   1   1   1  …   1   1   1   1   1   1   1   1   1
 -1   1   1  -1  -1   1   1  -1  -1   1      1  -1  -1   1   1  -1  -1   1   1
 -1   1  -1   1  -1   1  -1   1  -1   1     -1   1  -1   1  -1   1  -1   1  -1
  1  -1  -1  -1  -1   1   1   1   1  -1      1   1   1  -1  -1  -1  -1   1   1
  1   1  -1  -1   1   1  -1  -1   1   1     -1  -1   1   1  -1  -1   1   1  -1
 -1  -1  -1   1   1   1   1  -1  -1  -1  …   1  -1  -1  -1  -1   1   1   1   1
 -1  -1   1  -1   1   1  -1   1  -1  -1     -1   1  -1  -1   1  -1   1   1  -1
  1  -1   1   1  -1   1  -1  -1   1  -1     -1  -1   1  -1   1   1  -1   1  -1
```

and from its cross-product, which is close to 1789 times the identity: the off-diagonal elements are tiny relative to the diagonal.

```
8×8 Matrix{Int64}:
 1789    -1    -1     3    -3     1     1     3
   -1  1789    -3     1    -1     3     3     1
   -1    -3  1789     1    -1     3     3     1
    3     1     1  1789     3    -1    -1    -3
   -3    -1    -1     3  1789     1     1     3
    1     3     3    -1     1  1789    -3    -1
    1     3     3    -1     1    -3  1789    -1
    3     1     1    -3     3    -1    -1  1789
```
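The same structure can be reproduced in miniature: with ±1 coding, a perfectly balanced design gives a model-matrix cross-product that is a multiple of the identity. A self-contained toy example (not the kb07 design):

```julia
# 4 trials of a balanced 2×2 design; columns: intercept, factor A, factor B (±1 coded)
X = [1 -1 -1;
     1  1 -1;
     1 -1  1;
     1  1  1]
X' * X   # 4 on the diagonal, 0 off the diagonal — perfect balance
```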
How to interpret a shrinkage plot
- Extreme shrinkage (shrunk to a line or to a point) is easy to interpret: the term is not providing benefit and can be removed.
- When the range of the blue dots (shrunk values) is comparable to that of the red dots (unshrunk values), the term is about as strong after shrinkage as before.
- By itself this does not mean the term is important. You need a feeling for the absolute magnitude of the random effects in addition to the relative magnitude.
- Small absolute magnitude combined with small relative magnitude indicates that you can drop the term.
Conclusions from these plots
- Only the intercept for subj appears to be contributing explanatory power.
- For item, both the intercept and the coefficient for prec appear to be contributing.
```julia
m2 = let
    form = @formula(
        rt_trunc ~
            1 + prec * spkr * load + (1 | subj) + (1 + prec | item)
    )
    fit(MixedModel, form, kb07; contrasts, progress)
end
```
| | Est. | SE | z | p | σ_item | σ_subj |
|---|---|---|---|---|---|---|
| (Intercept) | 2181.7582 | 77.4709 | 28.16 | <1e-99 | 364.7286 | 298.1109 |
| prec: maintain | -333.8582 | 47.4629 | -7.03 | <1e-11 | 252.6687 | |
| spkr: old | 67.8114 | 16.0526 | 4.22 | <1e-04 | | |
| load: yes | 78.6849 | 16.0525 | 4.90 | <1e-06 | | |
| prec: maintain & spkr: old | -21.8802 | 16.0525 | -1.36 | 0.1729 | | |
| prec: maintain & load: yes | 4.4710 | 16.0526 | 0.28 | 0.7806 | | |
| spkr: old & load: yes | 18.3214 | 16.0526 | 1.14 | 0.2537 | | |
| prec: maintain & spkr: old & load: yes | 23.5219 | 16.0525 | 1.47 | 0.1428 | | |
| Residual | 678.9318 | | | | | |
| | Column | Variance | Std.Dev | Corr. |
|---|---|---|---|---|
| item | (Intercept) | 133026.918 | 364.729 | |
| | prec: maintain | 63841.496 | 252.669 | -0.70 |
| subj | (Intercept) | 88870.080 | 298.111 | |
| Residual | | 460948.432 | 678.932 | |
```julia
m3 = let
    form = @formula(
        rt_trunc ~
            1 + prec + spkr + load + (1 | subj) + (1 + prec | item)
    )
    fit(MixedModel, form, kb07; contrasts, progress)
end
```
| | Est. | SE | z | p | σ_item | σ_subj |
|---|---|---|---|---|---|---|
| (Intercept) | 2181.8526 | 77.4681 | 28.16 | <1e-99 | 364.7125 | 298.0259 |
| prec: maintain | -333.7906 | 47.4472 | -7.03 | <1e-11 | 252.5212 | |
| spkr: old | 67.8790 | 16.0785 | 4.22 | <1e-04 | | |
| load: yes | 78.5904 | 16.0785 | 4.89 | <1e-05 | | |
| Residual | 680.0319 | | | | | |
| | Column | Variance | Std.Dev | Corr. |
|---|---|---|---|---|
| item | (Intercept) | 133015.242 | 364.713 | |
| | prec: maintain | 63766.935 | 252.521 | -0.70 |
| subj | (Intercept) | 88819.438 | 298.026 | |
| Residual | | 462443.388 | 680.032 | |
```julia
rng = Random.seed!(1234321);
m3btstrp = parametricbootstrap(rng, 2000, m3);
DataFrame(shortestcovint(m3btstrp))
```
| | type | group | names | lower | upper |
|---|---|---|---|---|---|
| 1 | β | missing | (Intercept) | 2035.14 | 2326.99 |
| 2 | β | missing | prec: maintain | -426.345 | -243.517 |
| 3 | β | missing | spkr: old | 36.6549 | 98.5666 |
| 4 | β | missing | load: yes | 44.5338 | 108.327 |
| 5 | σ | item | (Intercept) | 260.63 | 453.848 |
| 6 | σ | item | prec: maintain | 181.992 | 318.503 |
| 7 | ρ | item | (Intercept), prec: maintain | -0.888974 | -0.439872 |
| 8 | σ | subj | (Intercept) | 232.996 | 359.237 |
| 9 | σ | residual | missing | 656.681 | 702.679 |
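shortestcovint returns, for each parameter, the shortest interval containing a given proportion (95% by default) of the bootstrap replicates. A minimal self-contained version of that computation (a sketch, not the MixedModels implementation):

```julia
# Shortest interval covering `level` of the sample: slide a window of
# k = ceil(level * n) consecutive order statistics and keep the narrowest one.
function shortest_interval(x::AbstractVector, level::Real=0.95)
    s = sort(x)
    n = length(s)
    k = ceil(Int, level * n)           # points the interval must cover
    widths = s[k:n] .- s[1:(n - k + 1)]
    i = argmin(widths)                 # left endpoint of the narrowest window
    (s[i], s[i + k - 1])
end

shortest_interval([1.0, 2.0, 3.0, 4.0, 100.0], 0.8)  # (1.0, 4.0): excludes the outlier
```

Unlike an equal-tailed quantile interval, the shortest interval adapts to skewed bootstrap distributions, which is why it is preferred for variance components.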
```julia
ridgeplot(m3btstrp; show_intercept=false)
```
```julia
m4 = let
    form = @formula(
        rt_trunc ~
            1 + prec + spkr + load + (1 + prec | item) + (1 | subj)
    )
    fit(MixedModel, form, kb07; contrasts, progress)
end
```
| | Est. | SE | z | p | σ_item | σ_subj |
|---|---|---|---|---|---|---|
| (Intercept) | 2181.8526 | 77.4681 | 28.16 | <1e-99 | 364.7125 | 298.0259 |
| prec: maintain | -333.7906 | 47.4472 | -7.03 | <1e-11 | 252.5212 | |
| spkr: old | 67.8790 | 16.0785 | 4.22 | <1e-04 | | |
| load: yes | 78.5904 | 16.0785 | 4.89 | <1e-05 | | |
| Residual | 680.0319 | | | | | |
```julia
m4bstrp = parametricbootstrap(rng, 2000, m4);
ridgeplot(m4bstrp; show_intercept=false)
```

```julia
DataFrame(shortestcovint(m4bstrp))
```
| | type | group | names | lower | upper |
|---|---|---|---|---|---|
| 1 | β | missing | (Intercept) | 2036.61 | 2326.9 |
| 2 | β | missing | prec: maintain | -428.062 | -248.716 |
| 3 | β | missing | spkr: old | 33.4999 | 95.5996 |
| 4 | β | missing | load: yes | 47.776 | 109.14 |
| 5 | σ | item | (Intercept) | 257.632 | 446.425 |
| 6 | σ | item | prec: maintain | 177.382 | 318.478 |
| 7 | ρ | item | (Intercept), prec: maintain | -0.903413 | -0.46731 |
| 8 | σ | subj | (Intercept) | 226.312 | 359.338 |
| 9 | σ | residual | missing | 655.65 | 700.819 |
| | Column | Variance | Std.Dev | Corr. |
|---|---|---|---|---|
| item | (Intercept) | 133015.242 | 364.713 | |
| | prec: maintain | 63766.935 | 252.521 | -0.70 |
| subj | (Intercept) | 88819.438 | 298.026 | |
| Residual | | 462443.388 | 680.032 | |
```julia
let mods = [m1, m2, m4]
    DataFrame(;
        geomdof=(sum ∘ leverage).(mods),
        npar=dof.(mods),
        deviance=deviance.(mods),
        AIC=aic.(mods),
        BIC=bic.(mods),
        AICc=aicc.(mods),
    )
end
```
| | geomdof | npar | deviance | AIC | BIC | AICc |
|---|---|---|---|---|---|---|
| 1 | 131.391 | 29 | 28637.1 | 28695.1 | 28854.3 | 28696.1 |
| 2 | 107.543 | 13 | 28658.5 | 28684.5 | 28755.8 | 28684.7 |
| 3 | 103.478 | 9 | 28663.9 | 28681.9 | 28731.3 | 28682.0 |
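The geomdof column is the sum of the leverage values, i.e. the trace of the hat matrix: a "geometric" degrees of freedom that reflects how shrinkage reduces effective model complexity. For ordinary least squares, with no shrinkage, that trace is exactly the number of coefficients, which a small self-contained check confirms:

```julia
using LinearAlgebra

# OLS hat matrix H = X (X'X)⁻¹ X'; its trace equals the number of coefficients.
X = [ones(10) collect(1.0:10.0)]  # intercept + slope
H = X * ((X' * X) \ X')
sum(diag(H))                      # ≈ 2.0, one per fixed-effects coefficient
```

For the mixed models above, geomdof falls between npar of the reduced model and that of the full model because the shrunken random effects count as fractional parameters.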
```julia
scatter(fitted(m4), residuals(m4))
```