Simulating an Experiment from Scratch

Here is a worked example of simulating a partially crossed design from scratch.

First, some setup:

using DataFrames
using MixedModels, MixedModelsSim
using PrettyTables  # provides pretty_table, used for display below
using Random

Assemble the Design

We're going to do a 2 x 2 x 2 design. For concreteness, let's think of this as a linguistic design:

  • age: old vs. young, between subjects
  • frequency: high vs. low, between items
  • context: matched vs. unmatched, within both.

Further, let's assume we want 40 subjects and 80 items. We can specify this design as follows:

n_subj = 40
n_item = 80
subj_btwn = Dict(:age => ["old", "young"])
item_btwn = Dict(:frequency => ["high", "low"])
both_win = Dict(:context => ["matched", "unmatched"])
Dict{Symbol, Vector{String}} with 1 entry:
  :context => ["matched", "unmatched"]

and then generate it:

rng = MersenneTwister(42)  # specify our random number generator for reproducibility
design = simdat_crossed(rng, n_subj, n_item;
                        subj_btwn = subj_btwn,
                        item_btwn = item_btwn,
                        both_win = both_win)
6400-element Vector{@NamedTuple{subj::String, age::String, item::String, frequency::String, context::String, dv::Float64}}:
 (subj = "S01", age = "old", item = "I01", frequency = "high", context = "matched", dv = 1.210282022049037)
 (subj = "S02", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.07899860438858432)
 (subj = "S03", age = "old", item = "I01", frequency = "high", context = "matched", dv = 0.4035118074894221)
 (subj = "S04", age = "young", item = "I01", frequency = "high", context = "matched", dv = 0.289951117313061)
 (subj = "S05", age = "old", item = "I01", frequency = "high", context = "matched", dv = -0.06704223413021088)
 (subj = "S06", age = "young", item = "I01", frequency = "high", context = "matched", dv = 0.613979770863605)
 (subj = "S07", age = "old", item = "I01", frequency = "high", context = "matched", dv = 1.0169446776646136)
 (subj = "S08", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.05386399409001742)
 (subj = "S09", age = "old", item = "I01", frequency = "high", context = "matched", dv = -0.5703872372611605)
 (subj = "S10", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.8577367570688208)
 ⋮
 (subj = "S32", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 0.4899567497698174)
 (subj = "S33", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = -1.3026758228769602)
 (subj = "S34", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 0.9246823304854404)
 (subj = "S35", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = -2.0723054534294856)
 (subj = "S36", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = -0.6743250590998087)
 (subj = "S37", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = 0.29409929753559283)
 (subj = "S38", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 0.3219137697456554)
 (subj = "S39", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = 0.6900970863582011)
 (subj = "S40", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 1.5271989375021289)

Note that simdat_crossed returns a row table, which MixedModels.jl can process directly. For a nicer display, we can use pretty_table:

pretty_table(first(design, 5))
┌────────┬────────┬────────┬───────────┬─────────┬────────────┐
│   subj │    age │   item │ frequency │ context │         dv │
│ String │ String │ String │    String │  String │    Float64 │
├────────┼────────┼────────┼───────────┼─────────┼────────────┤
│    S01 │    old │    I01 │      high │ matched │    1.21028 │
│    S02 │  young │    I01 │      high │ matched │ -0.0789986 │
│    S03 │    old │    I01 │      high │ matched │   0.403512 │
│    S04 │  young │    I01 │      high │ matched │   0.289951 │
│    S05 │    old │    I01 │      high │ matched │ -0.0670422 │
└────────┴────────┴────────┴───────────┴─────────┴────────────┘

We can also convert it to a DataFrame and change the factors to use pooled arrays to save a bit of memory.

df = pooled!(DataFrame(design))
first(df, 5)
5×6 DataFrame
 Row │ subj    age     item    frequency  context  dv
     │ String  String  String  String     String   Float64
─────┼───────────────────────────────────────────────────────
   1 │ S01     old     I01     high       matched     1.21028
   2 │ S02     young   I01     high       matched    -0.0789986
   3 │ S03     old     I01     high       matched     0.403512
   4 │ S04     young   I01     high       matched     0.289951
   5 │ S05     old     I01     high       matched    -0.0670422

Note that simdat_crossed generated a column dv for our dependent variable that has been pre-populated with noise from a standard normal distribution ($N(0,1)$). Typically, we will want to scale that, but we can do that in the simulation step. Also, this dependent variable is pure noise: we haven't yet added in effects. Adding in effects also comes in the simulation step.
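To convince ourselves that dv really is just unscaled noise at this point, we can check its sample mean and standard deviation, which should be close to 0 and 1 (a quick sanity check using the standard library's Statistics; the exact values depend on the seed):

using Statistics
# dv was filled with draws from N(0, 1), so these should be approximately 0 and 1
mean(df.dv), std(df.dv)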

But before we get to simulating, let's fit the model to the noise, just to see how things look. We're going to use effects coding for our contrasts.

contrasts = Dict(:age => EffectsCoding(base="young"),
                 :frequency => EffectsCoding(base="high"),
                 :context => EffectsCoding(base="matched"))
form = @formula(dv ~ 1 + age * frequency * context +
                    (1 + frequency + context | subj) +
                    (1 + age + context | item))
m0 = fit(MixedModel, form, design; contrasts=contrasts)
                                                   Est.      SE      z       p  σ_item  σ_subj
(Intercept)                                     -0.0147  0.0141  -1.04  0.2967  0.0595  0.0000
age: old                                         0.0073  0.0129   0.56  0.5737  0.0320
frequency: low                                   0.0043  0.0162   0.27  0.7906          0.0500
context: unmatched                              -0.0108  0.0141  -0.76  0.4457  0.0515  0.0210
age: old & frequency: low                       -0.0248  0.0152  -1.63  0.1021
age: old & context: unmatched                   -0.0100  0.0129  -0.78  0.4374
frequency: low & context: unmatched             -0.0115  0.0137  -0.84  0.4029
age: old & frequency: low & context: unmatched  -0.0183  0.0124  -1.47  0.1407
Residual                                         0.9950

Assemble the Random Effects

At the moment, the hard part of simulating is specifying the random effects. We're working on making this easier, but for now you need to specify the variance-covariance matrix of the random effects yourself. You can see what this looks like for the model we just fit:

vc = VarCorr(m0)
         Column               Variance   Std.Dev    Corr.
item     (Intercept)          0.0035360  0.0594639
         age: old             0.0010269  0.0320457  +0.43
         context: unmatched   0.0026542  0.0515188  -0.29  -0.99
subj     (Intercept)          0.0000000  0.0000000
         frequency: low       0.0025018  0.0500184   +NaN
         context: unmatched   0.0004421  0.0210273   +NaN  +1.00
Residual                      0.9900871  0.9950312

For each grouping variable (subjects and items), there are two major components: the standard deviations and the correlations.

For this example, we'll just assume all the correlations and thus the covariances are 0 in order to make things simple. Then we only have to worry about the standard deviations.

Let's assume that the variability

  • between items
    • in the intercept is 1.3 times the residual variability
    • in age is 0.35 times the residual variability
    • in context is 0.75 times the residual variability
  • between subjects
    • in the intercept is 1.5 times the residual variability
    • in frequency is 0.5 times the residual variability
    • in context is 0.75 times the residual variability

Note these are always specified relative to the residual standard deviation. In other words, we think about how big the between-subject and between-item differences are relative to the between-observation differences.

We can now create the associated covariance matrices.[cholesky]

re_item = create_re(1.3, 0.35, 0.75)
3×3 LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}}:
 1.3   ⋅     ⋅ 
 0.0  0.35   ⋅ 
 0.0  0.0   0.75
re_subj = create_re(1.5, 0.5, 0.75)
3×3 LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}}:
 1.5   ⋅    ⋅ 
 0.0  0.5   ⋅ 
 0.0  0.0  0.75

We can check that we got these right by installing these parameter values into the model.

update!(m0; subj=re_subj, item=re_item)
VarCorr(m0)
         Column               Variance  Std.Dev   Corr.
item     (Intercept)          1.586719  1.259650
         age: old             0.115014  0.339137  +0.00
         context: unmatched   0.528124  0.726721  +0.00  +0.00
subj     (Intercept)          2.112496  1.453443
         frequency: low       0.234722  0.484481  +0.00
         context: unmatched   0.528124  0.726721  +0.00  +0.00
Residual                      0.938887  0.968962

Looks good. The values don't exactly match the values in our parameter vector because the residual standard deviation isn't exactly 1.0.
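To see the connection, we can rescale the relative values by the model's current residual standard deviation (m0.σ); the products should match the standard deviations shown above (a quick sanity check, not needed for the simulation):

m0.σ .* [1.3, 0.35, 0.75]   # ≈ the item std. devs. shown in VarCorr(m0)
m0.σ .* [1.5, 0.5, 0.75]    # ≈ the subj std. devs. shown in VarCorr(m0)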

For the actual simulation, we'll need the compact form of these covariance matrices that MixedModels.jl uses internally. This compact form is the parameter vector θ and we can get it back out of the model where we just installed it:

show(m0.θ)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]

Alternatively, we can also just generate θ directly from the random-effects matrices:

θ = createθ(m0; subj=re_subj, item=re_item)
show(θ)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]

We could also create it directly from the covariance matrices we created, but in this case we need to make sure they're in the same order as in the VarCorr output:

θhand = vcat( flatlowertri(re_item), flatlowertri(re_subj) )
show(θhand)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]
Warning

In storing the parameter vector θ, MixedModels.jl uses an ordering that yields maximum sparseness, which enables better computational efficiency. The ordering is thus dependent on the entirety of the design – both the choice of the random effects and the relative number of subjects and items. For this reason, we strongly recommend using the helper methods that allow specifying the grouping variable by name.

Assemble the Fixed Effects

The last two components we need are the residual variance and the effect sizes for the fixed effects.

σ = 5;
β = [1.0, -1.0, 2.0, -1.5, 0.3, -1.3, 1.4, 0];

The entries in β correspond, in order, to the model coefficients given by

coefnames(m0)
8-element Vector{String}:
 "(Intercept)"
 "age: old"
 "frequency: low"
 "context: unmatched"
 "age: old & frequency: low"
 "age: old & context: unmatched"
 "frequency: low & context: unmatched"
 "age: old & frequency: low & context: unmatched"

Simulate

Now we're ready to actually simulate our data. We can use parametricbootstrap to do this: the parametric bootstrap works by simulating new data from an existing model and then examining the estimates obtained by refitting the model to that new data. In MixedModels.jl, you can specify different parameter values for the simulation, such as the ones we made up for our fake data.

# typically we would use a lot more simulations
# but we want to be quick in this example
sim = parametricbootstrap(MersenneTwister(12321), 20, m0; β=β, σ=σ, θ=θ)
MixedModelBootstrap with 20 samples
      parameter  min         q25         median      mean        q75        ⋯
    ┌────────────────────────────────────────────────────────────────────────
 1  │ β1         -3.46254    -0.0275183  0.951426    1.05389     2.30673    ⋯
 2  │ β2         -3.15569    -2.21155    -1.65746    -1.40177    -0.725325  ⋯
 3  │ β3         0.0472057   1.47878     1.78613     1.84538     2.20689    ⋯
 4  │ β4         -2.78027    -1.84879    -1.38575    -1.31679    -0.506349  ⋯
 5  │ β5         -0.409183   0.00758897  0.388057    0.367569    0.706757   ⋯
 6  │ β6         -2.54446    -1.7043     -1.27281    -1.31153    -0.760097  ⋯
 7  │ β7         0.963731    1.17698     1.36732     1.44359     1.6929     ⋯
 8  │ β8         -0.0696417  -0.0416162  -0.0185569  0.00739991  0.0592813  ⋯
 9  │ σ          4.92495     4.96082     4.97544     4.99539     5.0203     ⋯
 10 │ σ1         5.64404     6.09127     6.40182     6.38416     6.63392    ⋯
 11 │ σ2         1.51382     1.60649     1.70629     1.71681     1.80014    ⋯
 12 │ σ3         3.29793     3.46935     3.72588     3.71249     3.92566    ⋯
 13 │ ρ1         -0.192243   -0.120512   -0.0325252  -0.0118603  0.0457625  ⋯
 14 │ ρ2         -0.274868   -0.0555564  0.039018    0.00786589  0.0880654  ⋯
 15 │ ρ3         -0.198017   -0.0877441  -0.0367911  -0.0258897  0.0270686  ⋯
 16 │ σ4         5.88261     6.33686     7.16773     7.21947     8.15155    ⋯
 17 │ σ5         2.01433     2.2599      2.49648     2.47639     2.69413    ⋯
 ⋮  │     ⋮          ⋮           ⋮           ⋮           ⋮           ⋮      ⋱

As mentioned above, the ordering within θ is dependent on the entire design, so if you embed the simulation code in a loop iterating over different numbers of subjects and items, it's probably better to write it as:

sim = parametricbootstrap(MersenneTwister(12321), 20, m0;
                          β=β, σ=σ, θ=createθ(m0; subj=re_subj, item=re_item))
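For example, a loop over a few subject counts might look like the following sketch; the counts, the number of simulations, and the way results are collected are all just illustrative choices:

# sketch: power for the same effects at several (hypothetical) subject counts
results = DataFrame[]
for n in [20, 40, 60]
    # rebuild the design and refit the "template" model for this design size
    des = simdat_crossed(MersenneTwister(42), n, n_item;
                         subj_btwn=subj_btwn, item_btwn=item_btwn, both_win=both_win)
    model = fit(MixedModel, form, des; contrasts=contrasts)
    # createθ is called on the new model, so the ordering of θ is always correct
    boot = parametricbootstrap(MersenneTwister(12321), 20, model;
                               β=β, σ=σ, θ=createθ(model; subj=re_subj, item=re_item))
    tbl = DataFrame(power_table(boot))
    tbl[!, :n_subj] .= n
    push!(results, tbl)
end
power_by_size = reduce(vcat, results)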

See your power and profit!

Finally, we can turn this into a power table:

ptbl = power_table(sim)
8-element Vector{@NamedTuple{coefname::String, power::Float64}}:
 (coefname = "age: old & frequency: low", power = 0.15)
 (coefname = "frequency: low & context: unmatched", power = 1.0)
 (coefname = "age: old", power = 0.25)
 (coefname = "context: unmatched", power = 0.5)
 (coefname = "(Intercept)", power = 0.25)
 (coefname = "frequency: low", power = 0.6)
 (coefname = "age: old & context: unmatched", power = 0.65)
 (coefname = "age: old & frequency: low & context: unmatched", power = 0.0)

For nicely displaying it, we can again use pretty_table:

pretty_table(ptbl)
┌────────────────────────────────────────────────┬─────────┐
│                                       coefname │   power │
│                                         String │ Float64 │
├────────────────────────────────────────────────┼─────────┤
│                      age: old & frequency: low │    0.15 │
│            frequency: low & context: unmatched │     1.0 │
│                                       age: old │    0.25 │
│                             context: unmatched │     0.5 │
│                                    (Intercept) │    0.25 │
│                                 frequency: low │     0.6 │
│                  age: old & context: unmatched │    0.65 │
│ age: old & frequency: low & context: unmatched │     0.0 │
└────────────────────────────────────────────────┴─────────┘

We can of course convert the row table into a DataFrame:

DataFrame(ptbl)
8×2 DataFrame
 Row │ coefname                                        power
     │ String                                          Float64
─────┼──────────────────────────────────────────────────────────
   1 │ age: old & frequency: low                          0.15
   2 │ frequency: low & context: unmatched                1.0
   3 │ age: old                                           0.25
   4 │ context: unmatched                                 0.5
   5 │ (Intercept)                                        0.25
   6 │ frequency: low                                     0.6
   7 │ age: old & context: unmatched                      0.65
   8 │ age: old & frequency: low & context: unmatched     0.0
[cholesky]: Technically, we're creating the lower Cholesky factor of these matrices, which is a bit like the matrix square root. In other words, we're creating the matrix form of standard deviations instead of the matrix form of the variances.
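As a small numerical illustration of that point (assuming zero correlations, as in this example), the lower Cholesky factor of a diagonal covariance matrix is simply the diagonal matrix of standard deviations:

using LinearAlgebra
σs = [1.3, 0.35, 0.75]      # relative standard deviations passed to create_re
Σ  = Diagonal(σs .^ 2)      # corresponding covariance matrix (zero correlations)
L  = cholesky(Matrix(Σ)).L  # lower Cholesky factor
L * L' ≈ Σ                  # true: L acts like a matrix square root of Σ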