I have stated that the likelihood criterion used to fit linear mixed-effects models can be considered as balancing fidelity to the data (i.e. how well the model fits the observed data) against model complexity.
This is similar to some of the criteria used in Machine Learning (ML), except that the criterion for LMMs has a rigorous mathematical basis.
In a shrinkage plot we compare the values of the random-effects coefficients from the fitted model to those from a model in which there is no penalty for model complexity.
If there is strong subject-to-subject variation then the fitted random effects will tend toward values similar to those from the unpenalized model.
If the random-effects term is not contributing much (i.e. it is "inert") then the random effects will be shrunk considerably towards zero in some directions.
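For intuition, in the simplest balanced one-way random-effects model the predicted (BLUP) random effect for a group is the group's mean deviation from the grand mean multiplied by a shrinkage factor. The helper below is a hypothetical illustration of that factor, not part of MixedModels.jl:

```julia
# Shrinkage factor for a balanced one-way random-effects model:
# the predicted random effect for group i is its mean deviation
# times σ²_b / (σ²_b + σ²/n), where σ²_b is the between-group
# variance, σ² the residual variance, and n the group size.
shrinkage_factor(σ²b, σ², n) = σ²b / (σ²b + σ² / n)

shrinkage_factor(100.0, 400.0, 20)  # strong grouping: ≈ 0.83, little shrinkage
shrinkage_factor(1.0, 400.0, 20)    # "inert" term: ≈ 0.05, heavy shrinkage
```

A factor near 1 leaves the unpenalized estimates almost unchanged; a factor near 0 pulls them towards zero.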
Code

```julia
using CairoMakie
using DataFrames
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using ProgressMeter
using Random

ProgressMeter.ijulia_behavior(:clear);
```
Load the kb07 data set (don’t tell Reinhold that I used these data).
```julia
kb07 = MixedModels.dataset(:kb07)
```

```
Arrow.Table with 1789 rows, 7 columns, and schema:
 :subj      String
 :item      String
 :spkr      String
 :prec      String
 :load      String
 :rt_trunc  Int16
 :rt_raw    Int16
```
```julia
contrasts = Dict(
    :subj => Grouping(),
    :item => Grouping(),
    :spkr => HelmertCoding(),
    :prec => HelmertCoding(),
    :load => HelmertCoding(),
)
```
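As an aside, with only two levels HelmertCoding reduces to a ±1 coding. This can be checked directly with StatsModels (the package on which MixedModels builds its formula handling); the levels "new" and "old" here are just an example:

```julia
using StatsModels

# With two levels, Helmert contrasts assign -1 to the first level
# and +1 to the second, so the model-matrix columns are all ±1.
cm = StatsModels.ContrastsMatrix(HelmertCoding(), ["new", "old"])
cm.matrix  # 2×1 matrix: -1.0 for "new", +1.0 for "old"
```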
```julia
m1 = let
    form = @formula(
        rt_trunc ~
            1 +
            spkr * prec * load +
            (1 + spkr + prec + load | subj) +
            (1 + spkr + prec + load | item)
    )
    fit(MixedModel, form, kb07; contrasts)
end
```

```
Minimizing 934    Time: 0:00:01 ( 1.52 ms/it)
  objective:  28637.97101024215
```
```
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + spkr & prec + spkr & load + prec & load + spkr & prec & load + (1 + spkr + prec + load | subj) + (1 + spkr + prec + load | item)
    logLik   -2 logLik       AIC         AICc          BIC
-14318.9855  28637.9710  28695.9710  28696.9602  28855.1640

Variance components:
            Column          Variance    Std.Dev.   Corr.
subj     (Intercept)      91124.0779  301.8676
         spkr: old         1092.4628   33.0524  +1.00
         prec: maintain    3464.5848   58.8607  -0.62  -0.62
         load: yes         4482.3662   66.9505  +0.36  +0.36  +0.51
item     (Intercept)     131384.0733  362.4694
         spkr: old         1690.7347   41.1185  +0.42
         prec: maintain   61171.7458  247.3292  -0.69  +0.37
         load: yes         1882.5594   43.3885  +0.29  +0.14  -0.13
Residual                 447630.0111  669.0516
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
───────────────────────────────────────────────────────────────────────────────
                                             Coef.  Std. Error      z  Pr(>|z|)
───────────────────────────────────────────────────────────────────────────────
(Intercept)                             2181.64        77.3512   28.20    <1e-99
spkr: old                                 67.7496      17.9607    3.77    0.0002
prec: maintain                          -333.92        47.1564   -7.08    <1e-11
load: yes                                 78.8007      19.726     3.99    <1e-04
spkr: old & prec: maintain               -21.996       15.8191   -1.39    0.1644
spkr: old & load: yes                     18.3832      15.8191    1.16    0.2452
prec: maintain & load: yes                 4.53274     15.8191    0.29    0.7745
spkr: old & prec: maintain & load: yes    23.6377      15.8191    1.49    0.1351
───────────────────────────────────────────────────────────────────────────────
```
Expressing the covariance of random effects
Earlier today we mentioned that the parameters being optimized are from a “matrix square root” of the covariance matrix for the random effects. There is one such lower triangular matrix for each grouping factor.
```julia
l1 = first(m1.λ)  # Cholesky factor of relative covariance for subj
```

```
4×4 LowerTriangular{Float64, Matrix{Float64}}:
  0.451187      ⋅            ⋅          ⋅
  0.0494019    0.0           ⋅          ⋅
 -0.0547554   -0.00478924   0.0686931   ⋅
  0.0359018   -0.00650109   0.0931791  0.0
```
Notice the zeros on the diagonal. A triangular matrix with a zero on the diagonal is singular.
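A quick way to see why: the determinant of a triangular matrix is the product of its diagonal elements, so a single zero there forces singularity. A small illustrative check with made-up numbers, not the fitted factor:

```julia
using LinearAlgebra

# A lower-triangular matrix with a zero on the diagonal has
# determinant zero (the product of its diagonal) and reduced rank.
L = LowerTriangular([0.45 0.0 0.0; 0.05 0.0 0.0; -0.05 0.01 0.07])
det(L)   # 0.0
rank(L)  # 2, one less than the dimension
```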
```julia
l2 = last(m1.λ)  # this one is not singular
```

```
4×4 LowerTriangular{Float64, Matrix{Float64}}:
  0.541766      ⋅            ⋅          ⋅
  0.0259253    0.0557222     ⋅          ⋅
 -0.253781     0.267843     0.0226331   ⋅
  0.0189249    0.000871109  0.0620218  0.0
```
To regenerate the covariance matrix we need to know that the covariance is not the square of l1; it is l1 * l1' (so that the result is symmetric), multiplied by the estimated residual variance σ̂².
```julia
Σ₁ = varest(m1) .* (l1 * l1')
```

```
4×4 Matrix{Float64}:
  91124.1    9977.46  -11058.7    7250.91
   9977.46   1092.46   -1210.85    793.924
 -11058.7   -1210.85    3464.58   1999.15
   7250.91    793.924   1999.15   4482.37
```
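The same construction can be checked on a toy lower-triangular factor (the numbers below are arbitrary, chosen only for illustration):

```julia
using LinearAlgebra

# σ² * (λ * λ') is symmetric positive semi-definite by construction,
# which λ^2 (the matrix square) generally is not.
λ = LowerTriangular([0.5 0.0; 0.1 0.3])
σ² = 4.0
Σ = σ² .* (λ * λ')
issymmetric(Σ)  # true
diag(Σ)         # variances: 4 * 0.25 ≈ 1.0 and 4 * (0.01 + 0.09) ≈ 0.4
```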
```julia
diag(Σ₁)  # compare to the variance column in the VarCorr output
```

```
4-element Vector{Float64}:
 91124.07791836817
  1092.4628467755617
  3464.584800597221
  4482.366224216353
```

Taking square roots gives the standard deviations:

```julia
sqrt.(diag(Σ₁))  # compare to the Std.Dev. column
```

```
4-element Vector{Float64}:
 301.86764967178607
  33.05242573209358
  58.86072375189776
  66.95047590731788
```
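The correlations shown in the VarCorr display come from scaling such a covariance matrix by its standard deviations. A toy example with arbitrary numbers, not the fitted values:

```julia
using LinearAlgebra

# Convert a covariance matrix to a correlation matrix by dividing
# each element by the product of the row and column std. deviations.
Σ = [4.0 1.2; 1.2 9.0]
sd = sqrt.(diag(Σ))   # standard deviations 2.0 and 3.0
Σ ./ (sd * sd')       # unit diagonal; off-diagonal correlation 0.2
```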
Shrinkage plots
The upper left panel shows the perfect correlation (the +1.00 in the variance components for subj) for those two components of the random effects, which is why the shrunken values fall exactly on a line.
The fixed-effects model matrix (shown transposed) is entirely ±1, a consequence of the Helmert coding:

```
8×1789 Matrix{Int64}:
  1   1   1   1   1   1   1   1   1   1  …   1   1   1   1   1   1   1   1   1
 -1   1   1  -1  -1   1   1  -1  -1   1      1  -1  -1   1   1  -1  -1   1   1
 -1   1  -1   1  -1   1  -1   1  -1   1     -1   1  -1   1  -1   1  -1   1  -1
  1  -1  -1  -1  -1   1   1   1   1  -1      1   1   1  -1  -1  -1  -1   1   1
  1   1  -1  -1   1   1  -1  -1   1   1     -1  -1   1   1  -1  -1   1   1  -1
 -1  -1  -1   1   1   1   1  -1  -1  -1  …   1  -1  -1  -1  -1   1   1   1   1
 -1  -1   1  -1   1   1  -1   1  -1  -1     -1   1  -1  -1   1  -1   1   1  -1
  1  -1   1   1  -1   1  -1  -1   1  -1     -1  -1   1  -1   1   1  -1   1  -1
```

and its crossproduct is dominated by the diagonal (1789, the number of observations), indicating a nearly balanced design:

```
8×8 Matrix{Int64}:
 1789    -1    -1     3    -3     1     1     3
   -1  1789    -3     1    -1     3     3     1
   -1    -3  1789     1    -1     3     3     1
    3     1     1  1789     3    -1    -1    -3
   -3    -1    -1     3  1789     1     1     3
    1     3     3    -1     1  1789    -3    -1
    1     3     3    -1     1    -3  1789    -1
    3     1     1    -3     3    -1    -1  1789
```
How to interpret a shrinkage plot
- Extreme shrinkage (shrunk to a line or to a point) is easy to interpret: the term is not providing benefit and can be removed.
- When the range of the blue dots (shrunk values) is comparable to that of the red dots (unshrunk values), the term is about as strong after shrinkage as without it.
- By itself, this doesn't mean that the term is important. You also need to develop a feeling for the absolute magnitude of the random effects, in addition to the relative magnitude.
- Small absolute magnitude together with small relative magnitude indicates that you can drop that term.
Conclusions from these plots
- Only the intercept for subj appears to be contributing explanatory power.
- For item, both the intercept and the prec coefficient appear to be contributing.
```julia
m2 = let
    form = @formula(
        rt_trunc ~
            1 + prec * spkr * load + (1 | subj) + (1 + prec | item)
    )
    fit(MixedModel, form, kb07; contrasts)
end
```
```
                                            Est.       SE        z  p        σ_item    σ_subj
(Intercept)                              2181.7582  77.4709  28.16   <1e-99  364.7286  298.1109
prec: maintain                           -333.8582  47.4629  -7.03   <1e-11  252.6687
spkr: old                                  67.8114  16.0526   4.22   <1e-04
load: yes                                  78.6849  16.0525   4.90   <1e-06
prec: maintain & spkr: old                -21.8802  16.0525  -1.36   0.1729
prec: maintain & load: yes                  4.4710  16.0526   0.28   0.7806
spkr: old & load: yes                      18.3214  16.0526   1.14   0.2537
prec: maintain & spkr: old & load: yes     23.5219  16.0525   1.47   0.1428
Residual                                  678.9318
```

```
Variance components:
            Column          Variance    Std.Dev.  Corr.
item     (Intercept)     133026.916   364.729
         prec: maintain   63841.497   252.669   -0.70
subj     (Intercept)      88870.081   298.111
Residual                 460948.432   678.932
```
```julia
m3 = let
    form = @formula(
        rt_trunc ~
            1 + prec + spkr + load + (1 | subj) + (1 + prec | item)
    )
    fit(MixedModel, form, kb07; contrasts)
end
```
```
                  Est.       SE        z  p        σ_item    σ_subj
(Intercept)    2181.8526  77.4681  28.16   <1e-99  364.7125  298.0259
prec: maintain -333.7906  47.4472  -7.03   <1e-11  252.5212
spkr: old        67.8790  16.0785   4.22   <1e-04
load: yes        78.5904  16.0785   4.89   <1e-05
Residual        680.0319
```

```
Variance components:
            Column          Variance    Std.Dev.  Corr.
item     (Intercept)     133015.243   364.713
         prec: maintain   63766.936   252.521   -0.70
subj     (Intercept)      88819.436   298.026
Residual                 462443.388   680.032
```
```julia
rng = Random.seed!(1234321);
m3btstrp = parametricbootstrap(rng, 2000, m3);
DataFrame(shortestcovint(m3btstrp))
```
9 rows × 5 columns

```
   type  group     names                         lower       upper
 1 β     missing   (Intercept)                 2013.95     2319.53
 2 β     missing   prec: maintain              -429.807    -241.429
 3 β     missing   spkr: old                     35.3339     95.7272
 4 β     missing   load: yes                     47.067     111.04
 5 σ     item      (Intercept)                  267.788     452.9
 6 σ     item      prec: maintain               171.547     314.702
 7 ρ     item      (Intercept), prec: maintain   -0.893081   -0.457084
 8 σ     subj      (Intercept)                  235.922     364.717
 9 σ     residual  missing                      657.736     703.054
```
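The shortest coverage interval differs from the usual equal-tail quantile interval: it is the narrowest interval containing the requested fraction of the bootstrap sample. A minimal sketch of the idea, as a hypothetical helper rather than the MixedModels implementation:

```julia
# Narrowest interval covering `level` of the sample: sort, then slide
# a window of the required length and keep the shortest one.
function shortest_interval(x::AbstractVector{<:Real}; level=0.95)
    s = sort(x)
    n = length(s)
    k = ceil(Int, level * n)             # number of points to cover
    widths = s[k:n] .- s[1:(n - k + 1)]  # width of each candidate window
    i = argmin(widths)
    return (s[i], s[i + k - 1])
end

shortest_interval(collect(1.0:100.0); level=0.9)  # (1.0, 90.0)
```

For skewed bootstrap distributions (such as those of variance components) the shortest interval is pulled towards the mode, which is why it is preferred here over equal-tail quantiles.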
```julia
m4 = let
    form = @formula(
        rt_trunc ~
            1 + prec + spkr + load + (1 + prec | item) + (1 | subj)
    )
    fit(MixedModel, form, kb07; contrasts)
end
```
```
                  Est.       SE        z  p        σ_item    σ_subj
(Intercept)    2181.8526  77.4681  28.16   <1e-99  364.7125  298.0259
prec: maintain -333.7906  47.4472  -7.03   <1e-11  252.5212
spkr: old        67.8790  16.0785   4.22   <1e-04
load: yes        78.5904  16.0785   4.89   <1e-05
Residual        680.0319
```
```julia
m4bstrp = parametricbootstrap(rng, 2000, m4);
ridgeplot(m4bstrp; show_intercept=false)
DataFrame(shortestcovint(m4bstrp))
```

9 rows × 5 columns

```
   type  group     names                         lower       upper
 1 β     missing   (Intercept)                 2008.72     2319.8
 2 β     missing   prec: maintain              -433.808    -248.858
 3 β     missing   spkr: old                     35.4729     97.9576
 4 β     missing   load: yes                     47.0078    108.437
 5 σ     item      (Intercept)                  261.52      444.426
 6 σ     item      prec: maintain               177.438     318.854
 7 ρ     item      (Intercept), prec: maintain   -0.898522   -0.477346
 8 σ     subj      (Intercept)                  229.031     356.407
 9 σ     residual  missing                      656.918     701.946
```
```
Variance components:
            Column          Variance    Std.Dev.  Corr.
item     (Intercept)     133015.243   364.713
         prec: maintain   63766.936   252.521   -0.70
subj     (Intercept)      88819.436   298.026
Residual                 462443.388   680.032
```
Code

```julia
let mods = [m1, m2, m4]
    DataFrame(;
        geomdof=(sum ∘ leverage).(mods),
        npar=dof.(mods),
        deviance=deviance.(mods),
        AIC=aic.(mods),
        BIC=bic.(mods),
        AICc=aicc.(mods),
    )
end
```
3 rows × 6 columns

```
   geomdof  npar  deviance      AIC      BIC     AICc
 1 130.125    29   28638.0  28696.0  28855.2  28697.0
 2 107.543    13   28658.5  28684.5  28755.8  28684.7
 3 103.478     9   28663.9  28681.9  28731.3  28682.0
```
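The geomdof column, the sum of the leverage values, is the trace of the hat matrix: a geometric measure of model complexity that need not equal the parameter count for a mixed model. For ordinary least squares the two coincide, which a toy check (not using the models above) makes concrete:

```julia
using LinearAlgebra

# For OLS the hat matrix is H = X * (X'X)⁻¹ * X', and its trace
# (the sum of the leverage values) equals the number of coefficients.
X = [ones(5) collect(1.0:5.0)]   # intercept + slope: 2 coefficients
H = X * ((X'X) \ X')
sum(diag(H))  # ≈ 2.0
```

For the mixed models above, shrinkage makes the geometric degrees of freedom larger than npar, and the gap narrows as inert random-effects terms are removed.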
```julia
scatter(fitted(m4), residuals(m4))
```