Convergence, singularity and all that

Author

Douglas Bates

Published

2024-06-27

0.1 What does it mean for a converged model to be singular?

Add the packages to be used

using CairoMakie
using DataFrames
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using ProgressMeter

ProgressMeter.ijulia_behavior(:clear)
CairoMakie.activate!(; type="svg")

Fit a model for reaction time in the sleepstudy example, preserving information on the estimation progress (the thin=1 optional argument)

m01 = let f = @formula reaction ~ 1 + days + (1 + days|subj)
  fit(MixedModel, f, MixedModels.dataset(:sleepstudy); thin=1)
end
print(m01)
Linear mixed model fit by maximum likelihood
 reaction ~ 1 + days + (1 + days | subj)
   logLik   -2 logLik     AIC       AICc        BIC    
  -875.9697  1751.9393  1763.9393  1764.4249  1783.0971

Variance components:
            Column    Variance Std.Dev.   Corr.
subj     (Intercept)  565.51067 23.78047
         days          32.68212  5.71683 +0.08
Residual              654.94145 25.59182
 Number of obs: 180; levels of grouping factors: 18

  Fixed-effects parameters:
──────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)
──────────────────────────────────────────────────
(Intercept)  251.405      6.63226  37.91    <1e-99
days          10.4673     1.50224   6.97    <1e-11
──────────────────────────────────────────────────

The covariance matrix for each subject’s random effects is evaluated from its “matrix square root”, called the Cholesky factor.

λ = only(m01.λ)
2×2 LowerTriangular{Float64, Matrix{Float64}}:
 0.929221    ⋅ 
 0.0181684  0.222645

The transpose of \(\lambda\), written \(\lambda'\), is an upper triangular matrix generated by “flipping” \(\lambda\) about the main diagonal.

λ'
2×2 UpperTriangular{Float64, Adjoint{Float64, Matrix{Float64}}}:
 0.929221  0.0181684
  ⋅        0.222645

The product \(\lambda\lambda'\) is symmetric. The covariance matrix of the random effects, \(\Sigma\), is this symmetric matrix scaled by \(\sigma^2\).

Σ = m01.σ^2 * λ * λ'
2×2 Matrix{Float64}:
 565.511  11.057
  11.057  32.6821

The estimated variances of the random effects, which are the diagonal elements of \(\Sigma\), correspond to the values shown in the variance components table above. To evaluate the covariance, isolate the correlation

# m01.σρs extracts the `σρs` property of the model.
# This property is a NamedTuple where the names
# correspond to grouping factors - in this case, `subj`.
# So `m01.σρs.subj.ρ` is the estimated correlation(s) for
# this grouping factor.  Because there is only one such correlation
# we can extract it with `only()`, which also verifies that
# there is exactly one.
ρ = only(m01.σρs.subj.ρ)
0.08133219589180817

and multiply by the standard deviations

ρ * sqrt(first(Σ) * last(Σ))
11.057017778812778
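
As a quick check (an addition, not in the original post), this product reproduces the off-diagonal element of \(\Sigma\):

ρ * sqrt(Σ[1, 1] * Σ[2, 2]) ≈ Σ[1, 2]  # true, up to round-off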

0.2 The factor is generated from a parameter vector

In practice we optimize the log-likelihood with respect to a parameter vector called \(\theta\) that generates \(\lambda\).

m01.θ
3-element Vector{Float64}:
 0.9292213124664976
 0.01816838673115054
 0.22264486480463275

The elements of this parameter vector are subject to constraints. In particular, two of the three elements have a lower bound of zero.

m01.lowerbd
3-element Vector{Float64}:
   0.0
 -Inf
   0.0

That is, the first and third elements of \(\theta\), corresponding to diagonal elements of \(\lambda\), must be non-negative, whereas the second component is unconstrained (has a lower bound of \(-\infty\)).
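
To make the generation of \(\lambda\) from \(\theta\) concrete, here is a minimal sketch (the helper lambda_from_theta is hypothetical, not part of MixedModels) that fills the lower triangle in column-major order and reproduces the factor shown above.

# hypothetical helper, not the MixedModels internals: fill a 2×2
# lower-triangular λ from θ = [λ₁₁, λ₂₁, λ₂₂] (column-major order)
lambda_from_theta(θ) = LowerTriangular([θ[1] 0.0; θ[2] θ[3]])

lambda_from_theta(m01.θ) ≈ only(m01.λ)  # true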

0.3 Progress of the iterations

The optsum.fitlog property of the model is a vector of tuples, where each tuple contains the value of the \(\theta\) vector and the value of the objective at that \(\theta\). The fitlog always contains the first and the last evaluation. When the thin named argument is set, this property contains an entry for every thin’th evaluation.

m01.optsum.fitlog
57-element Vector{Tuple{Vector{Float64}, Float64}}:
 ([1.0, 0.0, 1.0], 1784.642296192456)
 ([1.75, 0.0, 1.0], 1790.1256369894486)
 ([1.0, 1.0, 1.0], 1798.9996244965819)
 ([1.0, 0.0, 1.75], 1803.8532002844051)
 ([0.25, 0.0, 1.0], 1800.6139807455452)
 ([1.0, -1.0, 1.0], 1798.6046308389245)
 ([1.0, 0.0, 0.25], 1752.2607369909213)
 ([1.1832612965368465, -0.008661887958066326, 0.0], 1797.5876920198457)
 ([1.075, 0.0, 0.32499999999999996], 1754.9541095798602)
 ([0.8166315695343028, 0.01116725445704456, 0.28823768689703677], 1753.695681656805)
 ([1.0, -0.07071067811865475, 0.19696699141100893], 1754.8169985163813)
 ([0.9436827046395643, 0.06383542916407543, 0.2626963029646229], 1753.1067335474727)
 ([0.9801419885633994, -0.026656844944244907, 0.27474275609535526], 1752.939376719002)
 ⋮
 ([0.9289855999666455, 0.01823660250666579, 0.22248440076567516], 1751.93935485824)
 ([0.9286970797848914, 0.018293694810667324, 0.2231753585201247], 1751.939489774018)
 ([0.9282426145734711, 0.01826952230111312, 0.22258371399259966], 1751.9393630914042)
 ([0.9291127489262951, 0.018179125424596244, 0.22262388984119313], 1751.9393448296826)
 ([0.9291905988237738, 0.01816575408997575, 0.22264320543969665], 1751.9393444889993)
 ([0.9292543047658205, 0.018209270421704643, 0.2226208143044214], 1751.9393450338166)
 ([0.929189229695291, 0.01812979185998945, 0.22257323648679392], 1751.9393475283666)
 ([0.9292535809765234, 0.018167625515921226, 0.22264990504608415], 1751.939344490362)
 ([0.9292145006688363, 0.01817173915920612, 0.22264674299586876], 1751.93934447441)
 ([0.9292083468967692, 0.018171477280421077, 0.22264619293728768], 1751.939344475018)
 ([0.9292092940780088, 0.018172966348127092, 0.22265206223369866], 1751.939344506563)
 ([0.9292213124664976, 0.01816838673115054, 0.22264486480463275], 1751.9393444646873)

There were 57 evaluations of the objective before convergence was declared, according to rather stringent convergence criteria. The last 10 or so evaluations changed the objective only in the fourth decimal place or beyond. That is, effective convergence occurred after about 40 to 45 evaluations.
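
A minimal sketch (an addition to the original post, using the already-loaded CairoMakie) of how this flat tail can be seen by plotting the objective value at each recorded evaluation:

let objective = last.(m01.optsum.fitlog)
    lines(eachindex(objective), objective;
          axis=(; xlabel="evaluation number", ylabel="objective"))
end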

1 A model that converges to a degenerate distribution

When the iterations converge to a value of \(\theta\) that is on the boundary, which means that one of the diagonal elements of \(\lambda\) is exactly zero, we say that \(\lambda\) is singular, in the sense that its inverse does not exist.

Another, more evocative, characterization is that the distribution of the random effects is degenerate, in the sense that the values of the random effects are constrained to a plane (technically, a linear subspace) in the space of all possible random effects.

This is shown in section 3.5.3 of Embrace Uncertainty, especially Figure 3.13.
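
As a concrete example (the dyestuff2 case from the MixedModels.jl documentation, not from this post), fitting a simple scalar random-effects model to these data drives the single element of \(\theta\) to its lower bound of zero, and issingular reports the boundary fit:

# the between-batch variance component converges to exactly zero,
# so the single element of θ is at its lower bound
m02 = fit(MixedModel, @formula(yield ~ 1 + (1 | batch)),
          MixedModels.dataset(:dyestuff2))
issingular(m02)  # true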

2 Appendix

These are some appendices, mostly so that I can keep track of them.

2.1 Evaluating the random effects correlation from \(\theta\)

There is a short-cut for evaluating the correlation, which is to “normalize” the second row of \(\lambda\); that is, to scale the row so that it has unit length.

normed = normalize!(λ[2, :])
2-element Vector{Float64}:
 0.08133219589180815
 0.9966870491339879

providing the correlation as

first(normed)
0.08133219589180815
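
As a quick check (an addition, not in the original post), this agrees with the value extracted from m01.σρs earlier:

first(normed) ≈ ρ  # true, up to round-off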

2.2 Optimizing with a fixed correlation

To profile the correlation we need to optimize the objective while holding the correlation fixed. The way we will do this is to determine \(\theta_2\) as a function of the fixed \(\rho\) and \(\theta_3\).

We need to solve \[ \rho = \frac{\theta_2}{\sqrt{\theta_2^2 + \theta_3^2}} \]

for \(\theta_2\) as a function of \(\rho\) and \(\theta_3\).

Notice that \(\theta_2\) and \(\rho\) have the same sign, so it is sufficient to determine the absolute value of \(\theta_2\) and then transfer the sign from \(\rho\). Squaring both sides gives \[ \theta_2^2=\rho^2(\theta_2^2 + \theta_3^2), \] which implies \[ \theta_2^2 = \frac{\rho^2}{1-\rho^2}\theta_3^2, \quad \theta_3\ge 0, \] and thus \[ \theta_2=\frac{\rho}{\sqrt{1-\rho^2}}\theta_3 . \]
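
A minimal sketch (the helper theta2_from_rho is hypothetical, not part of MixedModels) implementing this relation and checking it against the fitted model:

# θ₂ as a function of a fixed ρ and θ₃ ≥ 0, from the formula above
theta2_from_rho(ρ, θ₃) = (ρ / sqrt(1 - abs2(ρ))) * θ₃

# plugging in the estimated ρ and θ₃ recovers the estimated θ₂
theta2_from_rho(ρ, last(m01.θ)) ≈ m01.θ[2]  # true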
