Simulating an Experiment from Scratch

Here is a worked example of simulating a partially crossed design from scratch.

First, some setup:

using DataFrames
using MixedModels, MixedModelsSim
using PrettyTables  # provides pretty_table, used for display below
using Random

Assemble the Design

We're going to do a 2 x 2 x 2 design. For concreteness, let's think of this as a linguistic design:

  • age old vs. young, between subjects
  • frequency high vs. low, between items
  • context matched vs. unmatched, within both.

Further, let's assume we want 40 subjects and 80 items. We can specify this design as follows:

n_subj = 40
n_item = 80
subj_btwn = Dict(:age => ["old", "young"])
item_btwn = Dict(:frequency => ["high", "low"])
both_win = Dict(:context => ["matched", "unmatched"])
Dict{Symbol, Vector{String}} with 1 entry:
  :context => ["matched", "unmatched"]

and then generate it:

rng = MersenneTwister(42)  # specify our random number generator for reproducibility
design = simdat_crossed(rng, n_subj, n_item;
                        subj_btwn = subj_btwn,
                        item_btwn = item_btwn,
                        both_win = both_win)
6400-element Vector{@NamedTuple{subj::String, age::String, item::String, frequency::String, context::String, dv::Float64}}:
 (subj = "S01", age = "old", item = "I01", frequency = "high", context = "matched", dv = -0.5560268761463861)
 (subj = "S02", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.444383357109696)
 (subj = "S03", age = "old", item = "I01", frequency = "high", context = "matched", dv = 0.027155338009193845)
 (subj = "S04", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.29948409035891055)
 (subj = "S05", age = "old", item = "I01", frequency = "high", context = "matched", dv = 1.7778610980573246)
 (subj = "S06", age = "young", item = "I01", frequency = "high", context = "matched", dv = -1.14490153172882)
 (subj = "S07", age = "old", item = "I01", frequency = "high", context = "matched", dv = -0.46860588216767457)
 (subj = "S08", age = "young", item = "I01", frequency = "high", context = "matched", dv = 0.15614346264074028)
 (subj = "S09", age = "old", item = "I01", frequency = "high", context = "matched", dv = -2.641991008076796)
 (subj = "S10", age = "young", item = "I01", frequency = "high", context = "matched", dv = 1.0033099014594844)
 ⋮
 (subj = "S32", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = -0.0720635904250138)
 (subj = "S33", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = -0.5636260084084584)
 (subj = "S34", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = -0.48302653966207154)
 (subj = "S35", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = 0.5635515836664501)
 (subj = "S36", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = -1.5342115330081054)
 (subj = "S37", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = -0.7925886663164663)
 (subj = "S38", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 1.1344191438635207)
 (subj = "S39", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = 1.3406462534138157)
 (subj = "S40", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 0.7969024201220799)

Note that simdat_crossed returns a row table, which MixedModels.jl can process directly. For nicely displaying it, we can again use pretty_table:

pretty_table(first(design, 5))
┌────────┬────────┬────────┬───────────┬─────────┬───────────┐
│   subj │    age │   item │ frequency │ context │        dv │
│ String │ String │ String │    String │  String │   Float64 │
├────────┼────────┼────────┼───────────┼─────────┼───────────┤
│    S01 │    old │    I01 │      high │ matched │ -0.556027 │
│    S02 │  young │    I01 │      high │ matched │ -0.444383 │
│    S03 │    old │    I01 │      high │ matched │ 0.0271553 │
│    S04 │  young │    I01 │      high │ matched │ -0.299484 │
│    S05 │    old │    I01 │      high │ matched │   1.77786 │
└────────┴────────┴────────┴───────────┴─────────┴───────────┘

We can also convert it to a DataFrame and change the factors to use pooled arrays to save a bit of memory.

df = pooled!(DataFrame(design))
first(df, 5)
5×6 DataFrame
 Row │ subj    age     item    frequency  context  dv
     │ String  String  String  String     String   Float64
─────┼──────────────────────────────────────────────────────
   1 │ S01     old     I01     high       matched  -0.556027
   2 │ S02     young   I01     high       matched  -0.444383
   3 │ S03     old     I01     high       matched   0.0271553
   4 │ S04     young   I01     high       matched  -0.299484
   5 │ S05     old     I01     high       matched   1.77786
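
As a quick, purely illustrative sanity check on the crossing, each of the 2 x 2 x 2 cells should contain 20 subjects times 40 items, i.e. 800 observations, for 6400 rows in total:

# each age x frequency x context cell should hold 20 subjects * 40 items = 800 rows
combine(groupby(df, [:age, :frequency, :context]), nrow)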

Note that simdat_crossed generated a column dv for our dependent variable that has been pre-populated with noise from a standard normal distribution ($N(0,1)$). Typically, we will want to scale that, but we can do that in the simulation step. Also, this dependent variable is pure noise: we haven't yet added in effects. Adding in effects also comes in the simulation step.
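
As another illustrative check that dv really is just standard-normal noise at this point, we can look at its mean and standard deviation (Statistics is part of the standard library):

using Statistics

# the pre-filled dv column should have mean ≈ 0 and standard deviation ≈ 1
mean(df.dv), std(df.dv)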

But before we get to simulating, let's fit the model to the noise, just to see how things look. We're going to use effects coding for our contrasts.

contrasts = Dict(:age => EffectsCoding(base="young"),
                 :frequency => EffectsCoding(base="high"),
                 :context => EffectsCoding(base="matched"))
form = @formula(dv ~ 1 + age * frequency * context +
                    (1 + frequency + context | subj) +
                    (1 + age + context | item))
m0 = fit(MixedModel, form, design; contrasts=contrasts)
                                                 │    Est. │     SE │     z │      p │ σ_item │ σ_subj │
(Intercept)                                      │ -0.0168 │ 0.0128 │ -1.31 │ 0.1895 │ 0.0094 │ 0.0111 │
age: old                                         │ -0.0004 │ 0.0128 │ -0.03 │ 0.9734 │ 0.0107 │        │
frequency: low                                   │  0.0180 │ 0.0131 │  1.37 │ 0.1696 │        │ 0.0217 │
context: unmatched                               │  0.0099 │ 0.0132 │  0.75 │ 0.4534 │ 0.0062 │ 0.0245 │
age: old & frequency: low                        │  0.0014 │ 0.0131 │  0.11 │ 0.9147 │        │        │
age: old & context: unmatched                    │  0.0068 │ 0.0132 │  0.51 │ 0.6089 │        │        │
frequency: low & context: unmatched              │ -0.0084 │ 0.0126 │ -0.67 │ 0.5043 │        │        │
age: old & frequency: low & context: unmatched   │  0.0000 │ 0.0126 │  0.00 │ 0.9972 │        │        │
Residual                                         │  1.0095 │        │       │        │        │        │

Assemble the Random Effects

The hard part of simulating right now is specifying the random effects. We're working on making this bit easier, but for now you need to specify their variance-covariance matrix yourself. You can see what this looks like:

vc = VarCorr(m0)
         Column               Variance    Std.Dev     Corr.
item     (Intercept)          0.00008910  0.00943915
         age: old             0.00011375  0.01066532  +1.00
         context: unmatched   0.00003830  0.00618839  -1.00  -1.00
subj     (Intercept)          0.00012356  0.01111553
         frequency: low       0.00047281  0.02174427  +1.00
         context: unmatched   0.00060098  0.02451495  +1.00  +1.00
Residual                      1.01911763  1.00951356

For each grouping variable (subjects and items), there are two major components: the standard deviations and the correlations.

For this example, we'll just assume all the correlations and thus the covariances are 0 in order to make things simple. Then we only have to worry about the standard deviations.

Let's assume that the variability

  • between items
    • in the intercept is 1.3 times the residual variability
    • in age is 0.35 times the residual variability
    • in context is 0.75 times the residual variability
  • between subjects
    • in the intercept is 1.5 times the residual variability
    • in frequency is 0.5 times the residual variability
    • in context is 0.75 times the residual variability

Note these are always specified relative to the residual standard deviation. In other words, we think about how big the between-subject and between-item differences are relative to the between-observation differences.
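
For example, once we choose a residual standard deviation of σ = 5 below, the relative value of 1.3 for the between-item intercept corresponds to an absolute standard deviation of roughly 6.5 (a purely illustrative calculation, not part of the design specification):

1.3 * 5  # absolute between-item intercept SD implied by σ = 5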

We can now create the associated covariance matrices.[cholesky]

re_item = create_re(1.3, 0.35, 0.75)
3×3 LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}}:
 1.3   ⋅     ⋅ 
 0.0  0.35   ⋅ 
 0.0  0.0   0.75
re_subj = create_re(1.5, 0.5, 0.75)
3×3 LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}}:
 1.5   ⋅    ⋅ 
 0.0  0.5   ⋅ 
 0.0  0.0  0.75
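
As an illustrative aside, we can confirm that these really are Cholesky factors (see the footnote below): multiplying one by its transpose recovers the corresponding relative covariance matrix, which is diagonal here because all correlations are zero.

# L * L' reproduces the relative covariance matrix; with zero correlations it is
# diagonal, with the squared relative standard deviations on the diagonal
re_item * re_item'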

We can check that we got these right by installing these parameter values into the model.

update!(m0; subj=re_subj, item=re_item)
VarCorr(m0)
         Column               Variance  Std.Dev   Corr.
item     (Intercept)          1.645411  1.282736
         age: old             0.119268  0.345352  +0.00
         context: unmatched   0.547659  0.740040  +0.00  +0.00
subj     (Intercept)          2.190636  1.480080
         frequency: low       0.243404  0.493360  +0.00
         context: unmatched   0.547659  0.740040  +0.00  +0.00
Residual                      0.973616  0.986720

Looks good. The values don't exactly match our parameter vector because the residual standard deviation isn't exactly 1.0.
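
We can make that relationship explicit, since the installed values are just the relative standard deviations multiplied by the model's residual standard deviation m0.σ (a purely illustrative check):

m0.σ         # the residual standard deviation, ≈ 0.9867 after update!
1.3 * m0.σ   # ≈ 1.2827, the between-item intercept SD shown in VarCorr above
0.75 * m0.σ  # ≈ 0.7400, the context SD for both subjects and items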

For the actual simulation, we'll need the compact form of these covariance matrices that MixedModels.jl uses internally. This compact form is the parameter vector θ and we can get it back out of the model where we just installed it:

show(m0.θ)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]

Alternatively, we can also just generate θ directly from the random-effects matrices:

θ = createθ(m0; subj=re_subj, item=re_item)
show(θ)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]

We could also create it directly from the covariance matrices we created, but in this case we need to make sure they're in the same order as in the VarCorr output:

θhand = vcat( flatlowertri(re_item), flatlowertri(re_subj) )
show(θhand)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]
Warning

In storing the parameter vector θ, MixedModels.jl uses an ordering that yields maximum sparseness, which enables better computational efficiency. The ordering is thus dependent on the entirety of the design – both the choice of the random effects and the relative number of subjects and items. For this reason, we strongly recommend using the helper methods that allow specifying the grouping variable by name.

Assemble the Fixed Effects

The last two components we need are the residual variance and the effect sizes for the fixed effects.

σ = 5;
β = [1.0, -1.0, 2.0, -1.5, 0.3, -1.3, 1.4, 0];

The entries in β correspond to the coefficients in the model, whose names are given by:

coefnames(m0)
8-element Vector{String}:
 "(Intercept)"
 "age: old"
 "frequency: low"
 "context: unmatched"
 "age: old & frequency: low"
 "age: old & context: unmatched"
 "frequency: low & context: unmatched"
 "age: old & frequency: low & context: unmatched"

Simulate

Now we're ready to actually simulate our data. We can use parametricbootstrap to do this: the parametric bootstrap works by simulating new data from an existing model and then examining how the estimates fitted to those new data look. In MixedModels.jl, you can specify different parameter values, such as the ones we made up for our fake data.

# typically we would use a lot more simulations
# but we want to be quick in this example
sim = parametricbootstrap(MersenneTwister(12321), 20, m0; β=β, σ=σ, θ=θ)
MixedModelBootstrap with 20 samples
      parameter  min        q25         median      mean          q75        ⋯
    ┌─────────────────────────────────────────────────────────────────────────
 1  │ β1         -1.79664   -0.053643   0.312708    0.685845      1.86688    ⋯
 2  │ β2         -3.11269   -1.95453    -1.50575    -1.287        -0.404404  ⋯
 3  │ β3         0.651713   1.55415     2.10404     2.24626       2.47888    ⋯
 4  │ β4         -2.66413   -1.91465    -1.37466    -1.34444      -0.632574  ⋯
 5  │ β5         -0.519033  0.041313    0.315207    0.292365      0.544831   ⋯
 6  │ β6         -1.93195   -1.32172    -0.914209   -0.990873     -0.735804  ⋯
 7  │ β7         0.930514   1.43584     1.55431     1.60366       1.89026    ⋯
 8  │ β8         -0.107651  -0.0487993  -0.0122841  -0.000321999  0.054041   ⋯
 9  │ σ          4.93184    4.96237     5.00943     5.01465       5.06505    ⋯
 10 │ σ1         5.82462    6.244       6.52838     6.51395       6.70905    ⋯
 11 │ σ2         1.4006     1.73083     1.81814     1.7934        1.86468    ⋯
 12 │ σ3         3.21113    3.41969     3.60164     3.68955       3.9548     ⋯
 13 │ ρ1         -0.304399  -0.0673391  -0.0238728  -0.00615448   0.103682   ⋯
 14 │ ρ2         -0.280769  -0.133889   -0.025639   -0.0436935    0.0720842  ⋯
 15 │ ρ3         -0.17859   -0.0504068  0.022198    0.0146995     0.087991   ⋯
 16 │ σ4         5.34141    6.83697     7.21402     7.17868       7.55147    ⋯
 17 │ σ5         1.71548    2.26589     2.39412     2.38782       2.50665    ⋯
 ⋮  │     ⋮          ⋮          ⋮           ⋮            ⋮            ⋮      ⋱

As mentioned above, the ordering within θ is dependent on the entire design, so if you embed the simulation code in a loop iterating over different numbers of subjects and items, it's probably better to write it as:

sim = parametricbootstrap(MersenneTwister(12321), 20, m0;
                          β=β, σ=σ, θ=createθ(m0; subj=re_subj, item=re_item))
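
For illustration, such a loop might look like the following sketch. The particular size ranges and the results collection are our own choices and not part of this example; the key point is that the design is regenerated, the noise model refit, and θ recreated by name on every iteration, so the ordering is always correct.

# illustrative sketch: iterate over hypothetical design sizes and collect power
results = DataFrame[]
for n_subj in (20, 40), n_item in (40, 80)
    dat = simdat_crossed(MersenneTwister(42), n_subj, n_item;
                         subj_btwn=subj_btwn, item_btwn=item_btwn, both_win=both_win)
    mod = fit(MixedModel, form, dat; contrasts=contrasts)
    boot = parametricbootstrap(MersenneTwister(12321), 20, mod;
                               β=β, σ=σ, θ=createθ(mod; subj=re_subj, item=re_item))
    ptbl_i = DataFrame(power_table(boot))
    ptbl_i[!, :n_subj] .= n_subj
    ptbl_i[!, :n_item] .= n_item
    push!(results, ptbl_i)
end
power_by_size = vcat(results...)  # one combined table of power by design size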

See your power and profit!

Finally, we can turn this into a power table:

ptbl = power_table(sim)
8-element Vector{@NamedTuple{coefname::String, power::Float64}}:
 (coefname = "age: old & frequency: low", power = 0.1)
 (coefname = "frequency: low & context: unmatched", power = 1.0)
 (coefname = "age: old", power = 0.15)
 (coefname = "context: unmatched", power = 0.5)
 (coefname = "(Intercept)", power = 0.15)
 (coefname = "frequency: low", power = 0.7)
 (coefname = "age: old & context: unmatched", power = 0.35)
 (coefname = "age: old & frequency: low & context: unmatched", power = 0.0)

For nicely displaying it, we can again use pretty_table:

pretty_table(ptbl)
┌────────────────────────────────────────────────┬─────────┐
│                                       coefname │   power │
│                                         String │ Float64 │
├────────────────────────────────────────────────┼─────────┤
│                      age: old & frequency: low │     0.1 │
│            frequency: low & context: unmatched │     1.0 │
│                                       age: old │    0.15 │
│                             context: unmatched │     0.5 │
│                                    (Intercept) │    0.15 │
│                                 frequency: low │     0.7 │
│                  age: old & context: unmatched │    0.35 │
│ age: old & frequency: low & context: unmatched │     0.0 │
└────────────────────────────────────────────────┴─────────┘
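
Under the hood, the power for each coefficient is simply the proportion of simulations in which its p-value falls below the significance threshold (0.05 by default). As an illustrative check, we can reproduce one entry by hand from the bootstrap's coefpvalues table:

# illustrative: recompute the power for "age: old" from the simulated p-values
pvals = DataFrame(sim.coefpvalues)
p_age = pvals[pvals.coefname .== Symbol("age: old"), :p]
count(<(0.05), p_age) / length(p_age)  # should match the age: old row above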

We can of course convert the row table into a DataFrame:

DataFrame(ptbl)
8×2 DataFrame
 Row │ coefname                                         power
     │ String                                           Float64
─────┼───────────────────────────────────────────────────────────
   1 │ age: old & frequency: low                           0.1
   2 │ frequency: low & context: unmatched                 1.0
   3 │ age: old                                            0.15
   4 │ context: unmatched                                  0.5
   5 │ (Intercept)                                         0.15
   6 │ frequency: low                                      0.7
   7 │ age: old & context: unmatched                       0.35
   8 │ age: old & frequency: low & context: unmatched      0.0
  • cholesky: Technically, we're creating the lower Cholesky factor of these matrices, which is a bit like the matrix square root. In other words, we're creating the matrix form of standard deviations instead of the matrix form of the variances.