I have stated that the likelihood criterion used to fit linear mixed-effects models can be considered as balancing fidelity to the data (i.e. fitting the observed data well) against model complexity.
This is similar to some of the criteria used in Machine Learning (ML), except that the criterion for LMMs has a rigorous mathematical basis.
In a shrinkage plot we compare the values of the random-effects coefficients from the fitted model with those from a model in which there is no penalty for model complexity.
If there is strong subject-to-subject variation, the model fit will tend toward values of the random effects similar to those without a penalty on complexity.
If a random-effects term is not contributing much (i.e. it is “inert”), then the random effects will be shrunk considerably towards zero in some directions.
Code
using CairoMakie
using DataFrames
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using Random
using ProgressMeter
const progress = false
Load the kb07 data set (don’t tell Reinhold that I used these data).
kb07 = MixedModels.dataset(:kb07)
Arrow.Table with 1789 rows, 7 columns, and schema:
:subj String
:item String
:spkr String
:prec String
:load String
:rt_trunc Int16
:rt_raw Int16
contrasts = Dict(
    :spkr => HelmertCoding(),
    :prec => HelmertCoding(),
    :load => HelmertCoding(),
)
m1 = let
    form = @formula(
        rt_trunc ~
            1 +
            spkr * prec * load +
            (1 + spkr + prec + load | subj) +
            (1 + spkr + prec + load | item)
    )
    fit(MixedModel, form, kb07; contrasts, progress)
end
                                             Est.       SE      z       p    σ_subj    σ_item
(Intercept)                             2181.6728  77.3025  28.22  <1e-99  301.7795  362.1969
spkr: old                                 67.7486  18.2889   3.70  0.0002   42.9237   40.6934
prec: maintain                          -333.9211  47.1525  -7.08  <1e-11   61.9966  246.8936
load: yes                                 78.7703  19.5367   4.03  <1e-04   65.1301   42.3692
spkr: old & prec: maintain               -21.9656  15.8062  -1.39  0.1646
spkr: old & load: yes                     18.3842  15.8062   1.16  0.2448
prec: maintain & load: yes                 4.5338  15.8062   0.29  0.7742
spkr: old & prec: maintain & load: yes    23.6074  15.8062   1.49  0.1353
Residual                                 668.5033
         Column             Variance   Std.Dev   Corr.
subj     (Intercept)      91070.8712  301.7795
         spkr: old         1842.4411   42.9237  +0.78
         prec: maintain    3843.5766   61.9966  -0.59  +0.03
         load: yes         4241.9308   65.1301  +0.36  +0.83  +0.53
item     (Intercept)     131186.5676  362.1969
         spkr: old         1655.9495   40.6934  +0.44
         prec: maintain   60956.4434  246.8936  -0.69  +0.35
         load: yes         1795.1488   42.3692  +0.32  +0.16  -0.14
Residual                 446896.6953  668.5033
Linear mixed model fit by maximum likelihood
rt_trunc ~ 1 + spkr + prec + load + spkr & prec + spkr & load + prec & load + spkr & prec & load + (1 + spkr + prec + load | subj) + (1 + spkr + prec + load | item)
logLik -2 logLik AIC AICc BIC
-14318.5614 28637.1227 28695.1227 28696.1119 28854.3157
Variance components:
Column Variance Std.Dev. Corr.
subj (Intercept) 91070.8712 301.7795
spkr: old 1842.4411 42.9237 +0.78
prec: maintain 3843.5766 61.9966 -0.59 +0.03
load: yes 4241.9308 65.1301 +0.36 +0.83 +0.53
item (Intercept) 131186.5676 362.1969
spkr: old 1655.9495 40.6934 +0.44
prec: maintain 60956.4434 246.8936 -0.69 +0.35
load: yes 1795.1488 42.3692 +0.32 +0.16 -0.14
Residual 446896.6953 668.5033
Number of obs: 1789; levels of grouping factors: 56, 32
Fixed-effects parameters:
──────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|)
──────────────────────────────────────────────────────────────────────────────
(Intercept) 2181.67 77.3025 28.22 <1e-99
spkr: old 67.7486 18.2889 3.70 0.0002
prec: maintain -333.921 47.1525 -7.08 <1e-11
load: yes 78.7703 19.5367 4.03 <1e-04
spkr: old & prec: maintain -21.9656 15.8062 -1.39 0.1646
spkr: old & load: yes 18.3842 15.8062 1.16 0.2448
prec: maintain & load: yes 4.5338 15.8062 0.29 0.7742
spkr: old & prec: maintain & load: yes 23.6074 15.8062 1.49 0.1353
──────────────────────────────────────────────────────────────────────────────
Expressing the covariance of random effects
Earlier today we mentioned that the parameters being optimized are from a “matrix square root” of the covariance matrix for the random effects. There is one such lower triangular matrix for each grouping factor.
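In symbols, writing λᵢ for the lower-triangular relative Cholesky factor of the i-th grouping factor and σ̂² for the estimated residual variance, the covariance matrix of the random effects for that factor is recovered as

```latex
\Sigma_i = \hat{\sigma}^2 \, \lambda_i \lambda_i^{\top}
```

which is the computation carried out numerically below.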
l1 = first(m1.λ) # Cholesky factor of relative covariance for subj
4×4 LowerTriangular{Float64, Matrix{Float64}}:
0.451426 ⋅ ⋅ ⋅
0.0502539 0.0399661 ⋅ ⋅
-0.0550269 0.0729273 0.0159447 ⋅
0.0351603 0.0857539 0.0300331 0.0
Notice the zero on the diagonal. A triangular matrix with zeros on the diagonal is singular.
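Because the determinant of a triangular matrix is the product of its diagonal elements, a zero anywhere on the diagonal forces singularity. A quick check with a made-up matrix (not one of m1's factors):

```julia
using LinearAlgebra

L = LowerTriangular([1.0 0.0 0.0; 0.5 2.0 0.0; 0.3 0.1 0.0])
det(L)   # product of the diagonal: 1.0 * 2.0 * 0.0 = 0.0
rank(L)  # 2, less than full rank, confirming singularity
```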
l2 = last(m1.λ) # this one is close to singular as well
4×4 LowerTriangular{Float64, Matrix{Float64}}:
0.541803 ⋅ ⋅ ⋅
0.0268569 0.0546274 ⋅ ⋅
-0.253088 0.267988 0.0229877 ⋅
0.0200493 0.00126987 0.0600951 0.00138194
To regenerate the covariance matrix we need to know that the covariance is not the square of l1; it is l1 * l1' (so that the result is symmetric), scaled by the estimated residual variance, σ̂².
Σ₁ = varest(m1) .* (l1 * l1')
4×4 Matrix{Float64}:
91070.9 10138.3 -11101.2 7093.26
10138.3 1842.44 66.7243 2321.27
-11101.2 66.7243 3843.58 2144.17
7093.26 2321.27 2144.17 4241.93
diag(Σ₁) # compare to the variance column in the VarCorr output
4-element Vector{Float64}:
91070.87115964481
1842.441054170415
3843.5765510736305
4241.930807293007
sqrt.(diag(Σ₁)) # and the square roots match the Std.Dev column
4-element Vector{Float64}:
301.7795075210456
42.92366543260739
61.99658499525301
65.13010676555818
Shrinkage plots
The upper left panel shows the perfect negative correlation for those two components of the random effects.
The design itself is nearly balanced, as the transposed fixed-effects model matrix (entries ±1 under the Helmert coding) and its near-diagonal crossproduct show:
8×1789 Matrix{Int64}:
1 1 1 1 1 1 1 1 1 1 … 1 1 1 1 1 1 1 1 1
-1 1 1 -1 -1 1 1 -1 -1 1 1 -1 -1 1 1 -1 -1 1 1
-1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1
1 -1 -1 -1 -1 1 1 1 1 -1 1 1 1 -1 -1 -1 -1 1 1
1 1 -1 -1 1 1 -1 -1 1 1 -1 -1 1 1 -1 -1 1 1 -1
-1 -1 -1 1 1 1 1 -1 -1 -1 … 1 -1 -1 -1 -1 1 1 1 1
-1 -1 1 -1 1 1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 1 -1
1 -1 1 1 -1 1 -1 -1 1 -1 -1 -1 1 -1 1 1 -1 1 -1
8×8 Matrix{Int64}:
1789 -1 -1 3 -3 1 1 3
-1 1789 -3 1 -1 3 3 1
-1 -3 1789 1 -1 3 3 1
3 1 1 1789 3 -1 -1 -3
-3 -1 -1 3 1789 1 1 3
1 3 3 -1 1 1789 -3 -1
1 3 3 -1 1 -3 1789 -1
3 1 1 -3 3 -1 -1 1789
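The plots themselves are generated with MixedModelsMakie's shrinkageplot function, which draws one panel per pair of random-effects components, with the unshrunk values in red and the shrunken conditional means in blue:

```julia
shrinkageplot(m1, :subj)  # panels for the subj random effects
shrinkageplot(m1, :item)  # panels for the item random effects
```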
How to interpret a shrinkage plot
Extreme shrinkage (shrunk to a line or to a point) is easy to interpret: the term is not providing benefit and can be removed.
When the range of the blue dots (shrunk values) is comparable to that of the red dots (unshrunk), the term remains about as strong after shrinkage as before.
By itself this doesn’t mean the term is important. You need to get a feeling for the absolute magnitude of the random effects in addition to the relative magnitude.
Small magnitude and small relative magnitude indicate you can drop that term.
Conclusions from these plots
Only the intercept for the subj appears to be contributing explanatory power
For the item both the intercept and prec appear to be contributing
m2 = let
    form = @formula(
        rt_trunc ~
            1 + prec * spkr * load + (1 | subj) + (1 + prec | item)
    )
    fit(MixedModel, form, kb07; contrasts, progress)
end
                                             Est.       SE      z       p    σ_item    σ_subj
(Intercept)                             2181.7582  77.4710  28.16  <1e-99  364.7293  298.1107
prec: maintain                          -333.8582  47.4631  -7.03  <1e-11  252.6694
spkr: old                                 67.8114  16.0526   4.22  <1e-04
load: yes                                 78.6849  16.0525   4.90  <1e-06
prec: maintain & spkr: old               -21.8802  16.0525  -1.36  0.1729
prec: maintain & load: yes                 4.4710  16.0526   0.28  0.7806
spkr: old & load: yes                     18.3214  16.0526   1.14  0.2537
prec: maintain & spkr: old & load: yes    23.5219  16.0525   1.47  0.1428
Residual                                 678.9319
         Column            Variance  Std.Dev  Corr.
item     (Intercept)     133027.439  364.729
         prec: maintain   63841.834  252.669  -0.70
subj     (Intercept)      88870.013  298.111
Residual                 460948.573  678.932
m3 = let
    form = @formula(
        rt_trunc ~
            1 + prec + spkr + load + (1 | subj) + (1 + prec | item)
    )
    fit(MixedModel, form, kb07; contrasts, progress)
end
                     Est.       SE      z       p    σ_item    σ_subj
(Intercept)     2181.8526  77.4680  28.16  <1e-99  364.7121  298.0257
prec: maintain  -333.7906  47.4476  -7.03  <1e-11  252.5236
spkr: old         67.8790  16.0785   4.22  <1e-04
load: yes         78.5904  16.0785   4.89  <1e-05
Residual         680.0318
         Column            Variance  Std.Dev  Corr.
item     (Intercept)     133014.901  364.712
         prec: maintain   63768.182  252.524  -0.70
subj     (Intercept)      88819.305  298.026
Residual                 462443.261  680.032
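Since m3 is nested within m2, which is in turn nested within m1, the three fits can also be compared with likelihood-ratio tests (output not shown here):

```julia
MixedModels.likelihoodratiotest(m3, m2, m1)
```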
rng = Random.seed!(1234321);
m3btstrp = parametricbootstrap(rng, 2000, m3);
DataFrame(shortestcovint(m3btstrp))
 Row │ type    group     names                          lower       upper
     │ String  String?   String?                        Float64     Float64
─────┼──────────────────────────────────────────────────────────────────────
   1 │ β       missing   (Intercept)                   2022.91     2334.0
   2 │ β       missing   prec: maintain                -430.239    -239.802
   3 │ β       missing   spkr: old                       34.0592     96.72
   4 │ β       missing   load: yes                       46.5349    109.526
   5 │ σ       item      (Intercept)                    270.065     451.957
   6 │ σ       item      prec: maintain                 181.74      325.125
   7 │ ρ       item      (Intercept), prec: maintain     -0.907255   -0.490079
   8 │ σ       subj      (Intercept)                    233.835     364.504
   9 │ σ       residual  missing                        657.341     702.655
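The coverage intervals summarize the bootstrap sample, but the replicates themselves can be inspected directly. This sketch assumes the β property of the bootstrap object provided by MixedModels.jl, which holds one row per replicate and coefficient:

```julia
using Statistics

df = DataFrame(m3btstrp.β)  # columns iter, coefname, β
combine(groupby(df, :coefname), :β => mean => :mean_β)
```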
ridgeplot(m3btstrp; show_intercept=false)
The model m4 below is the same as m3 except for the order of the random-effects terms; the fit is identical and only the order of display changes.
m4 = let
    form = @formula(
        rt_trunc ~
            1 + prec + spkr + load + (1 + prec | item) + (1 | subj)
    )
    fit(MixedModel, form, kb07; contrasts, progress)
end
                     Est.       SE      z       p    σ_item    σ_subj
(Intercept)     2181.8526  77.4680  28.16  <1e-99  364.7121  298.0257
prec: maintain  -333.7906  47.4476  -7.03  <1e-11  252.5236
spkr: old         67.8790  16.0785   4.22  <1e-04
load: yes         78.5904  16.0785   4.89  <1e-05
Residual         680.0318
m4bstrp = parametricbootstrap(rng, 2000, m4);
ridgeplot(m4bstrp; show_intercept=false)
DataFrame(shortestcovint(m4bstrp))
 Row │ type    group     names                          lower       upper
     │ String  String?   String?                        Float64     Float64
─────┼──────────────────────────────────────────────────────────────────────
   1 │ β       missing   (Intercept)                   2034.15     2335.45
   2 │ β       missing   prec: maintain                -427.769    -248.337
   3 │ β       missing   spkr: old                       35.6938     97.8081
   4 │ β       missing   load: yes                       45.3095    107.368
   5 │ σ       item      (Intercept)                    260.498     451.649
   6 │ σ       item      prec: maintain                 179.477     315.415
   7 │ ρ       item      (Intercept), prec: maintain     -0.901827   -0.471356
   8 │ σ       subj      (Intercept)                    236.684     360.363
   9 │ σ       residual  missing                        657.178     702.739
         Column            Variance  Std.Dev  Corr.
item     (Intercept)     133014.901  364.712
         prec: maintain   63768.182  252.524  -0.70
subj     (Intercept)      88819.305  298.026
Residual                 462443.261  680.032
Code
let mods = [m1, m2, m4]
    DataFrame(;
        geomdof=(sum ∘ leverage).(mods),
        npar=dof.(mods),
        deviance=deviance.(mods),
        AIC=aic.(mods),
        BIC=bic.(mods),
        AICc=aicc.(mods),
    )
end
 Row │ geomdof  npar   deviance  AIC      BIC      AICc
     │ Float64  Int64  Float64   Float64  Float64  Float64
─────┼────────────────────────────────────────────────────
   1 │ 131.555     29   28637.1  28695.1  28854.3  28696.1
   2 │ 107.543     13   28658.5  28684.5  28755.8  28684.7
   3 │ 103.478      9   28663.9  28681.9  28731.3  28682.0
The simplest model, m4 (the last row), has the smallest AIC, BIC, and AICc.
scatter(fitted(m4), residuals(m4))
This page was rendered from git revision 60d6e27
using Quarto 1.7.33.