Simulating an Experiment from Scratch
Here is a worked example of simulating a partially crossed design from scratch.
First, some setup:
using DataFrames
using MixedModels, MixedModelsSim
using Random
Assemble the Design
We're going to do a 2 × 2 × 2 design. For concreteness, let's think of this as a linguistic design:
- age (old vs. young), between subjects
- frequency (high vs. low), between items
- context (matched vs. unmatched), within both.
Further, let's assume we want 40 subjects and 80 items. We can specify this design as follows:
n_subj = 40
n_item = 80
subj_btwn = Dict(:age => ["old", "young"])
item_btwn = Dict(:frequency => ["high", "low"])
both_win = Dict(:context => ["matched", "unmatched"])
Dict{Symbol, Vector{String}} with 1 entry:
:context => ["matched", "unmatched"]
and then generate it:
rng = MersenneTwister(42) # specify our random number generator for reproducibility
design = simdat_crossed(rng, n_subj, n_item;
subj_btwn = subj_btwn,
item_btwn = item_btwn,
both_win = both_win)
6400-element Vector{@NamedTuple{subj::String, age::String, item::String, frequency::String, context::String, dv::Float64}}:
(subj = "S01", age = "old", item = "I01", frequency = "high", context = "matched", dv = -0.5560268761463861)
(subj = "S02", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.444383357109696)
(subj = "S03", age = "old", item = "I01", frequency = "high", context = "matched", dv = 0.027155338009193845)
(subj = "S04", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.29948409035891055)
(subj = "S05", age = "old", item = "I01", frequency = "high", context = "matched", dv = 1.7778610980573246)
(subj = "S06", age = "young", item = "I01", frequency = "high", context = "matched", dv = -1.14490153172882)
(subj = "S07", age = "old", item = "I01", frequency = "high", context = "matched", dv = -0.46860588216767457)
(subj = "S08", age = "young", item = "I01", frequency = "high", context = "matched", dv = 0.15614346264074028)
(subj = "S09", age = "old", item = "I01", frequency = "high", context = "matched", dv = -2.641991008076796)
(subj = "S10", age = "young", item = "I01", frequency = "high", context = "matched", dv = 1.0033099014594844)
⋮
(subj = "S32", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = -0.0720635904250138)
(subj = "S33", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = -0.5636260084084584)
(subj = "S34", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = -0.48302653966207154)
(subj = "S35", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = 0.5635515836664501)
(subj = "S36", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = -1.5342115330081054)
(subj = "S37", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = -0.7925886663164663)
(subj = "S38", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 1.1344191438635207)
(subj = "S39", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = 1.3406462534138157)
(subj = "S40", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 0.7969024201220799)
Note that simdat_crossed returns a row table, which MixedModels.jl can process directly. For nicely displaying it, we can use pretty_table:
pretty_table(first(design, 5))
┌────────┬────────┬────────┬───────────┬─────────┬───────────┐
│ subj │ age │ item │ frequency │ context │ dv │
│ String │ String │ String │ String │ String │ Float64 │
├────────┼────────┼────────┼───────────┼─────────┼───────────┤
│ S01 │ old │ I01 │ high │ matched │ -0.556027 │
│ S02 │ young │ I01 │ high │ matched │ -0.444383 │
│ S03 │ old │ I01 │ high │ matched │ 0.0271553 │
│ S04 │ young │ I01 │ high │ matched │ -0.299484 │
│ S05 │ old │ I01 │ high │ matched │ 1.77786 │
└────────┴────────┴────────┴───────────┴─────────┴───────────┘
We can also convert it to a DataFrame and change the factors to use pooled arrays to save a bit of memory.
df = pooled!(DataFrame(design))
first(df, 5)
| Row | subj   | age    | item   | frequency | context | dv        |
|-----|--------|--------|--------|-----------|---------|-----------|
|     | String | String | String | String    | String  | Float64   |
| 1   | S01    | old    | I01    | high      | matched | -0.556027 |
| 2   | S02    | young  | I01    | high      | matched | -0.444383 |
| 3   | S03    | old    | I01    | high      | matched | 0.0271553 |
| 4   | S04    | young  | I01    | high      | matched | -0.299484 |
| 5   | S05    | old    | I01    | high      | matched | 1.77786   |
Note that simdat_crossed generated a column dv for our dependent variable, pre-populated with noise from a standard normal distribution ($N(0,1)$). Typically we will want to scale that, but we can do so in the simulation step. Also, this dependent variable is currently pure noise: we haven't yet added in any effects. Adding in effects also happens in the simulation step.
But before we get to simulating, let's fit the model to the noise, just to see how things look. We're going to use effects coding for our contrasts.
contrasts = Dict(:age => EffectsCoding(base="young"),
:frequency => EffectsCoding(base="high"),
:context => EffectsCoding(base="matched"))
form = @formula(dv ~ 1 + age * frequency * context +
(1 + frequency + context | subj) +
(1 + age + context | item))
m0 = fit(MixedModel, form, design; contrasts=contrasts)
|                                                | Est.    | SE     | z     | p      | σ_item | σ_subj |
|------------------------------------------------|---------|--------|-------|--------|--------|--------|
| (Intercept)                                    | -0.0168 | 0.0128 | -1.31 | 0.1895 | 0.0094 | 0.0111 |
| age: old                                       | -0.0004 | 0.0128 | -0.03 | 0.9734 | 0.0107 |        |
| frequency: low                                 | 0.0180  | 0.0131 | 1.37  | 0.1696 |        | 0.0217 |
| context: unmatched                             | 0.0099  | 0.0132 | 0.75  | 0.4534 | 0.0062 | 0.0245 |
| age: old & frequency: low                      | 0.0014  | 0.0131 | 0.11  | 0.9147 |        |        |
| age: old & context: unmatched                  | 0.0068  | 0.0132 | 0.51  | 0.6089 |        |        |
| frequency: low & context: unmatched            | -0.0084 | 0.0126 | -0.67 | 0.5043 |        |        |
| age: old & frequency: low & context: unmatched | 0.0000  | 0.0126 | 0.00  | 0.9972 |        |        |
| Residual                                       | 1.0095  |        |       |        |        |        |
Assemble the Random Effects
The hard part in simulating right now is specifying the random effects. We're working on making this bit easier, but you need to specify the variance-covariance matrix of the random effects. You can see what this looks like:
vc = VarCorr(m0)
|          | Column             | Variance   | Std.Dev    | Corr. |       |
|----------|--------------------|------------|------------|-------|-------|
| item     | (Intercept)        | 0.00008910 | 0.00943915 |       |       |
|          | age: old           | 0.00011375 | 0.01066532 | +1.00 |       |
|          | context: unmatched | 0.00003830 | 0.00618839 | -1.00 | -1.00 |
| subj     | (Intercept)        | 0.00012356 | 0.01111553 |       |       |
|          | frequency: low     | 0.00047281 | 0.02174427 | +1.00 |       |
|          | context: unmatched | 0.00060098 | 0.02451495 | +1.00 | +1.00 |
| Residual |                    | 1.01911763 | 1.00951356 |       |       |
For each grouping variable (subjects and items), there are two major components: the standard deviations and the correlations.
For this example, we'll just assume all the correlations and thus the covariances are 0 in order to make things simple. Then we only have to worry about the standard deviations.
Let's assume that the variability
- between items
- in the intercept is 1.3 times the residual variability
- in age is 0.35 times the residual variability
- in context is 0.75 times the residual variability
- between subjects
- in the intercept is 1.5 times the residual variability
- in frequency is 0.5 times the residual variability
- in context is 0.75 times the residual variability
Note these are always specified relative to the residual standard deviation. In other words, we think about how big the between-subject and between-item differences are relative to the between-observation differences.
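To make the scaling concrete, here is a small sketch (plain Julia; the residual SD of 5 anticipates the σ we choose below) converting the relative SDs for the item random effects into absolute ones:

```julia
# Relative SDs for the item random effects, as chosen above
rel_item = (intercept = 1.3, age = 0.35, context = 0.75)
σ_resid = 5.0  # residual SD we will use in the simulation below
# Multiply each relative SD by the residual SD to get the absolute SD
abs_item = map(x -> x * σ_resid, rel_item)
# abs_item ≈ (intercept = 6.5, age = 1.75, context = 3.75)
```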
We can now create the associated covariance matrices.[cholesky]
re_item = create_re(1.3, 0.35, 0.75)
3×3 LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}}:
1.3 ⋅ ⋅
0.0 0.35 ⋅
0.0 0.0 0.75
re_subj = create_re(1.5, 0.5, 0.75)
3×3 LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}}:
1.5 ⋅ ⋅
0.0 0.5 ⋅
0.0 0.0 0.75
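If we did want nonzero correlations, create_re also accepts a correlation matrix via a keyword argument (sketched below with an assumed corrmat keyword; check the create_re docstring for the exact signature in your version of MixedModelsSim):

```julia
# Hypothetical: item random effects with a 0.4 correlation between
# the intercept and the age effect (not used in this tutorial)
corrs = [1.0 0.4 0.0
         0.4 1.0 0.0
         0.0 0.0 1.0]
re_item_corr = create_re(1.3, 0.35, 0.75; corrmat = corrs)
```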
We can check that we got these right by installing these parameter values into the model.
update!(m0; subj=re_subj, item=re_item)
VarCorr(m0)
|          | Column             | Variance | Std.Dev  | Corr. |       |
|----------|--------------------|----------|----------|-------|-------|
| item     | (Intercept)        | 1.645411 | 1.282736 |       |       |
|          | age: old           | 0.119268 | 0.345352 | +0.00 |       |
|          | context: unmatched | 0.547659 | 0.740040 | +0.00 | +0.00 |
| subj     | (Intercept)        | 2.190636 | 1.480080 |       |       |
|          | frequency: low     | 0.243404 | 0.493360 | +0.00 |       |
|          | context: unmatched | 0.547659 | 0.740040 | +0.00 | +0.00 |
| Residual |                    | 0.973616 | 0.986720 |       |       |
Looks good. The values don't exactly match the values in our parameter vector because the residual standard deviation isn't exactly 1.0.
For the actual simulation, we'll need the compact form of these covariance matrices that MixedModels.jl uses internally. This compact form is the parameter vector θ and we can get it back out of the model where we just installed it:
show(m0.θ)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]
Alternatively, we can also just generate θ directly from the random-effects matrices:
θ = createθ(m0; subj=re_subj, item=re_item)
show(θ)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]
We could also create it directly from the covariance matrices we created, but in this case we need to make sure they're in the same order as in the VarCorr output:
θhand = vcat( flatlowertri(re_item), flatlowertri(re_subj) )
show(θhand)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]
In storing the parameter vector θ, MixedModels.jl uses an ordering that yields maximum sparseness, which enables better computational efficiency. The ordering is thus dependent on the entirety of the design – both the choice of the random effects and the relative number of subjects and items. For this reason, we strongly recommend using the helper methods that allow specifying the grouping variable by name.
Assemble the Fixed Effects
The last two components we need are the residual variance and the effect sizes for the fixed effects.
σ = 5;
β = [1.0, -1.0, 2.0, -1.5, 0.3, -1.3, 1.4, 0];
The entries in β correspond to the coefficients in the model, as given by
coefnames(m0)
8-element Vector{String}:
"(Intercept)"
"age: old"
"frequency: low"
"context: unmatched"
"age: old & frequency: low"
"age: old & context: unmatched"
"frequency: low & context: unmatched"
"age: old & frequency: low & context: unmatched"
Simulate
Now we're ready to actually simulate our data. We can use parametricbootstrap to do this: the parametric bootstrap works by simulating new data from an existing model and then examining the estimates obtained by refitting the model to that new data. In MixedModels.jl, you can specify different parameter values, such as the ones we made up for our fake data.
# typically we would use a lot more simulations
# but we want to be quick in this example
sim = parametricbootstrap(MersenneTwister(12321), 20, m0; β=β, σ=σ, θ=θ)
MixedModelBootstrap with 20 samples
parameter min q25 median mean q75 ⋯
┌─────────────────────────────────────────────────────────────────────────
1 │ β1 -1.79664 -0.053643 0.312708 0.685845 1.86688 ⋯
2 │ β2 -3.11269 -1.95453 -1.50575 -1.287 -0.404404 ⋯
3 │ β3 0.651713 1.55415 2.10404 2.24626 2.47888 ⋯
4 │ β4 -2.66413 -1.91465 -1.37466 -1.34444 -0.632574 ⋯
5 │ β5 -0.519033 0.041313 0.315207 0.292365 0.544831 ⋯
6 │ β6 -1.93195 -1.32172 -0.914209 -0.990873 -0.735804 ⋯
7 │ β7 0.930514 1.43584 1.55431 1.60366 1.89026 ⋯
8 │ β8 -0.107651 -0.0487993 -0.0122841 -0.000321999 0.054041 ⋯
9 │ σ 4.93184 4.96237 5.00943 5.01465 5.06505 ⋯
10 │ σ1 5.82462 6.244 6.52838 6.51395 6.70905 ⋯
11 │ σ2 1.4006 1.73083 1.81814 1.7934 1.86468 ⋯
12 │ σ3 3.21113 3.41969 3.60164 3.68955 3.9548 ⋯
13 │ ρ1 -0.304399 -0.0673391 -0.0238728 -0.00615448 0.103682 ⋯
14 │ ρ2 -0.280769 -0.133889 -0.025639 -0.0436935 0.0720842 ⋯
15 │ ρ3 -0.17859 -0.0504068 0.022198 0.0146995 0.087991 ⋯
16 │ σ4 5.34141 6.83697 7.21402 7.17868 7.55147 ⋯
17 │ σ5 1.71548 2.26589 2.39412 2.38782 2.50665 ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
As mentioned above, the ordering within θ is dependent on the entire design, so if you embed the simulation code in a loop iterating over different numbers of subjects and items, it's probably better to write it as:
sim = parametricbootstrap(MersenneTwister(12321), 20, m0;
β=β, σ=σ, θ=createθ(m0; subj=re_subj, item=re_item))
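For instance, a power grid over several design sizes might be sketched like this (hypothetical loop; it assumes the contrasts, formula, random-effect matrices, β, and σ defined above, and it refits a model for each design so that createθ always sees the right θ ordering):

```julia
# Sketch only: each design size gets its own model fit and its own θ
results = NamedTuple[]
for n_subj in (20, 40), n_item in (40, 80)
    design = simdat_crossed(rng, n_subj, n_item;
                            subj_btwn=subj_btwn, item_btwn=item_btwn, both_win=both_win)
    m = fit(MixedModel, form, design; contrasts=contrasts)
    sim = parametricbootstrap(rng, 100, m;
                              β=β, σ=σ, θ=createθ(m; subj=re_subj, item=re_item))
    for row in power_table(sim)
        push!(results, (; n_subj, n_item, row...))
    end
end
```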
See your power and profit!
Finally, we can turn this into a power table:
ptbl = power_table(sim)
8-element Vector{@NamedTuple{coefname::String, power::Float64}}:
(coefname = "age: old & frequency: low", power = 0.1)
(coefname = "frequency: low & context: unmatched", power = 1.0)
(coefname = "age: old", power = 0.15)
(coefname = "context: unmatched", power = 0.5)
(coefname = "(Intercept)", power = 0.15)
(coefname = "frequency: low", power = 0.7)
(coefname = "age: old & context: unmatched", power = 0.35)
(coefname = "age: old & frequency: low & context: unmatched", power = 0.0)
For nicely displaying it, we can again use pretty_table:
pretty_table(ptbl)
┌────────────────────────────────────────────────┬─────────┐
│ coefname │ power │
│ String │ Float64 │
├────────────────────────────────────────────────┼─────────┤
│ age: old & frequency: low │ 0.1 │
│ frequency: low & context: unmatched │ 1.0 │
│ age: old │ 0.15 │
│ context: unmatched │ 0.5 │
│ (Intercept) │ 0.15 │
│ frequency: low │ 0.7 │
│ age: old & context: unmatched │ 0.35 │
│ age: old & frequency: low & context: unmatched │ 0.0 │
└────────────────────────────────────────────────┴─────────┘
We can of course convert the row table into a DataFrame:
DataFrame(ptbl)
| Row | coefname                                       | power   |
|-----|------------------------------------------------|---------|
|     | String                                         | Float64 |
| 1   | age: old & frequency: low                      | 0.1     |
| 2   | frequency: low & context: unmatched            | 1.0     |
| 3   | age: old                                       | 0.15    |
| 4   | context: unmatched                             | 0.5     |
| 5   | (Intercept)                                    | 0.15    |
| 6   | frequency: low                                 | 0.7     |
| 7   | age: old & context: unmatched                  | 0.35    |
| 8   | age: old & frequency: low & context: unmatched | 0.0     |
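Under the hood, power_table counts, for each coefficient, the fraction of bootstrap replicates whose p-value falls below the significance threshold (0.05 by default). Assuming the bootstrap object's coefpvalues property (a row table with coefname and p columns, as documented in MixedModels.jl), roughly the same numbers can be computed by hand:

```julia
using Statistics: mean

# Fraction of replicates with p < 0.05, per coefficient (sketch)
pvals = DataFrame(sim.coefpvalues)
by_coef = combine(groupby(pvals, :coefname),
                  :p => (ps -> mean(ps .< 0.05)) => :power)
```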
[cholesky]: Technically, we're creating the lower Cholesky factor of these matrices, which is a bit like a matrix square root. In other words, we're creating the matrix form of standard deviations instead of the matrix form of the variances.