Simulating an Experiment from Scratch

Here is a worked example of simulating a partially crossed design from scratch.

First, some setup:

using DataFrames
using MixedModels, MixedModelsSim
using PrettyTables  # provides pretty_table, used for display below
using Random

Assemble the Design

We're going to do a 2 x 2 x 2 design. For concreteness, let's think of this as a linguistic design:

  • age: old vs. young, between subjects
  • frequency: high vs. low, between items
  • context: matched vs. unmatched, within both.

Further, let's assume we want 40 subjects and 80 items. We can specify this design as follows:

n_subj = 40
n_item = 80
subj_btwn = Dict(:age => ["old", "young"])
item_btwn = Dict(:frequency => ["high", "low"])
both_win = Dict(:context => ["matched", "unmatched"])
Dict{Symbol, Vector{String}} with 1 entry:
  :context => ["matched", "unmatched"]

and then generate it:

rng = MersenneTwister(42)  # specify our random number generator for reproducibility
design = simdat_crossed(rng, n_subj, n_item;
                        subj_btwn = subj_btwn,
                        item_btwn = item_btwn,
                        both_win = both_win)
6400-element Vector{@NamedTuple{subj::String, age::String, item::String, frequency::String, context::String, dv::Float64}}:
 (subj = "S01", age = "old", item = "I01", frequency = "high", context = "matched", dv = 1.210282022049037)
 (subj = "S02", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.07899860438858432)
 (subj = "S03", age = "old", item = "I01", frequency = "high", context = "matched", dv = 0.4035118074894221)
 (subj = "S04", age = "young", item = "I01", frequency = "high", context = "matched", dv = 0.289951117313061)
 (subj = "S05", age = "old", item = "I01", frequency = "high", context = "matched", dv = -0.06704223413021088)
 (subj = "S06", age = "young", item = "I01", frequency = "high", context = "matched", dv = 0.613979770863605)
 (subj = "S07", age = "old", item = "I01", frequency = "high", context = "matched", dv = 1.0169446776646136)
 (subj = "S08", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.05386399409001742)
 (subj = "S09", age = "old", item = "I01", frequency = "high", context = "matched", dv = -0.5703872372611605)
 (subj = "S10", age = "young", item = "I01", frequency = "high", context = "matched", dv = -0.8577367570688208)
 ⋮
 (subj = "S32", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 0.4899567497698174)
 (subj = "S33", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = -1.3026758228769602)
 (subj = "S34", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 0.9246823304854404)
 (subj = "S35", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = -2.0723054534294856)
 (subj = "S36", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = -0.6743250590998087)
 (subj = "S37", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = 0.29409929753559283)
 (subj = "S38", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 0.3219137697456554)
 (subj = "S39", age = "old", item = "I80", frequency = "low", context = "unmatched", dv = 0.6900970863582011)
 (subj = "S40", age = "young", item = "I80", frequency = "low", context = "unmatched", dv = 1.5271989375021289)

Note that simdat_crossed returns a row table, which MixedModels.jl can process directly. For a nicer display, we can use pretty_table:

pretty_table(first(design, 5))
┌────────┬────────┬────────┬───────────┬─────────┬────────────┐
│   subj │    age │   item │ frequency │ context │         dv │
│ String │ String │ String │    String │  String │    Float64 │
├────────┼────────┼────────┼───────────┼─────────┼────────────┤
│    S01 │    old │    I01 │      high │ matched │    1.21028 │
│    S02 │  young │    I01 │      high │ matched │ -0.0789986 │
│    S03 │    old │    I01 │      high │ matched │   0.403512 │
│    S04 │  young │    I01 │      high │ matched │   0.289951 │
│    S05 │    old │    I01 │      high │ matched │ -0.0670422 │
└────────┴────────┴────────┴───────────┴─────────┴────────────┘

We can also convert it to a DataFrame and change the factors to use pooled arrays to save a bit of memory.

df = pooled!(DataFrame(design))
first(df, 5)
5×6 DataFrame
 Row │ subj    age     item    frequency  context  dv
     │ String  String  String  String     String   Float64
─────┼───────────────────────────────────────────────────────
   1 │ S01     old     I01     high       matched     1.21028
   2 │ S02     young   I01     high       matched    -0.0789986
   3 │ S03     old     I01     high       matched     0.403512
   4 │ S04     young   I01     high       matched     0.289951
   5 │ S05     old     I01     high       matched    -0.0670422

Note that simdat_crossed generated a column dv for our dependent variable that has been pre-populated with noise from a standard normal distribution ($N(0,1)$). Typically, we will want to scale that, but we can do that in the simulation step. Also, this dependent variable is pure noise: we haven't yet added in effects. Adding in effects also comes in the simulation step.
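To convince ourselves that dv really is just unscaled noise at this point, we can check its sample mean and standard deviation, which should be close to 0 and 1 (a quick sanity check using the standard library's Statistics; the exact values depend on the seed):

using Statistics
# dv was filled with draws from N(0, 1), so these should be approximately 0 and 1
mean(df.dv), std(df.dv)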

But before we get to simulating, let's fit the model to the noise, just to see how things look. We're going to use effects coding for our contrasts.

contrasts = Dict(:age => EffectsCoding(base="young"),
                 :frequency => EffectsCoding(base="high"),
                 :context => EffectsCoding(base="matched"))
form = @formula(dv ~ 1 + age * frequency * context +
                    (1 + frequency + context | subj) +
                    (1 + age + context | item))
m0 = fit(MixedModel, form, design; contrasts=contrasts)
                                                   Est.      SE      z       p  σ_item  σ_subj
(Intercept)                                     -0.0147  0.0141  -1.04  0.2967  0.0595  0.0000
age: old                                         0.0073  0.0129   0.56  0.5737  0.0320
frequency: low                                   0.0043  0.0162   0.27  0.7906          0.0500
context: unmatched                              -0.0108  0.0141  -0.76  0.4457  0.0515  0.0210
age: old & frequency: low                       -0.0248  0.0152  -1.63  0.1021
age: old & context: unmatched                   -0.0100  0.0129  -0.78  0.4374
frequency: low & context: unmatched             -0.0115  0.0137  -0.84  0.4029
age: old & frequency: low & context: unmatched  -0.0183  0.0124  -1.47  0.1407
Residual                                         0.9950

Assemble the Random Effects

At the moment, the hard part of simulating is specifying the random effects. We're working on making this easier, but for now you need to specify the variance-covariance matrix of the random effects yourself. You can see what this looks like for the model we just fit:

vc = VarCorr(m0)
         Column               Variance   Std.Dev    Corr.
item     (Intercept)          0.0035360  0.0594639
         age: old             0.0010269  0.0320457  +0.43
         context: unmatched   0.0026542  0.0515188  -0.29  -0.99
subj     (Intercept)          0.0000000  0.0000000
         frequency: low       0.0025018  0.0500184   +NaN
         context: unmatched   0.0004421  0.0210273   +NaN  +1.00
Residual                      0.9900871  0.9950312

For each grouping variable (subjects and items), there are two major components: the standard deviations and the correlations.

For this example, we'll just assume all the correlations and thus the covariances are 0 in order to make things simple. Then we only have to worry about the standard deviations.

Let's assume that the variability

  • between items
    • in the intercept is 1.3 times the residual variability
    • in age is 0.35 times the residual variability
    • in context is 0.75 times the residual variability
  • between subjects
    • in the intercept is 1.5 times the residual variability
    • in frequency is 0.5 times the residual variability
    • in context is 0.75 times the residual variability

Note these are always specified relative to the residual standard deviation. In other words, we think about how big the between-subject and between-item differences are relative to the between-observation differences.

We can now create the associated covariance matrices.[cholesky]

re_item = create_re(1.3, 0.35, 0.75)
3×3 LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}}:
 1.3   ⋅     ⋅ 
 0.0  0.35   ⋅ 
 0.0  0.0   0.75
re_subj = create_re(1.5, 0.5, 0.75)
3×3 LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}}:
 1.5   ⋅    ⋅ 
 0.0  0.5   ⋅ 
 0.0  0.0  0.75

We can check that we got these right by installing these parameter values into the model.

update!(m0; subj=re_subj, item=re_item)
VarCorr(m0)
         Column               Variance  Std.Dev   Corr.
item     (Intercept)          1.586719  1.259650
         age: old             0.115014  0.339137  +0.00
         context: unmatched   0.528124  0.726721  +0.00  +0.00
subj     (Intercept)          2.112496  1.453443
         frequency: low       0.234722  0.484481  +0.00
         context: unmatched   0.528124  0.726721  +0.00  +0.00
Residual                      0.938887  0.968962

Looks good. The values don't exactly match the values in our parameter vector because the residual standard deviation isn't exactly 1.0.
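To see the connection, we can rescale the relative values by the model's current residual standard deviation (m0.σ); the products should match the standard deviations shown above (a quick sanity check, not needed for the simulation):

m0.σ .* [1.3, 0.35, 0.75]   # ≈ the item std. devs. shown in VarCorr(m0)
m0.σ .* [1.5, 0.5, 0.75]    # ≈ the subj std. devs. shown in VarCorr(m0)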

For the actual simulation, we'll need the compact form of these covariance matrices that MixedModels.jl uses internally. This compact form is the parameter vector θ and we can get it back out of the model where we just installed it:

show(m0.θ)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]

Alternatively, we can also just generate θ directly from the random-effects matrices:

θ = createθ(m0; subj=re_subj, item=re_item)
show(θ)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]

We could also create it directly from the covariance matrices we created, but in this case we need to make sure they're in the same order as in the VarCorr output:

θhand = vcat( flatlowertri(re_item), flatlowertri(re_subj) )
show(θhand)
[1.3, 0.0, 0.0, 0.35, 0.0, 0.75, 1.5, 0.0, 0.0, 0.5, 0.0, 0.75]
Warning

In storing the parameter vector θ, MixedModels.jl uses an ordering that yields maximum sparseness, which enables better computational efficiency. The ordering is thus dependent on the entirety of the design – both the choice of the random effects and the relative number of subjects and items. For this reason, we strongly recommend using the helper methods that allow specifying the grouping variable by name.

Assemble the Fixed Effects

The last two components we need are the residual variance and the effect sizes for the fixed effects.

σ = 5;
β = [1.0, -1.0, 2.0, -1.5, 0.3, -1.3, 1.4, 0];

The entries in β correspond, in order, to the model coefficients given by

coefnames(m0)
8-element Vector{String}:
 "(Intercept)"
 "age: old"
 "frequency: low"
 "context: unmatched"
 "age: old & frequency: low"
 "age: old & context: unmatched"
 "frequency: low & context: unmatched"
 "age: old & frequency: low & context: unmatched"

Simulate

Now we're ready to actually simulate our data. We can use parametricbootstrap to do this: the parametric bootstrap works by simulating new data from an existing model and then examining the estimates obtained by refitting the model to that new data. In MixedModels.jl, you can specify different parameter values for the simulation, such as the ones we made up for our fake data.

# typically we would use a lot more simulations
# but we want to be quick in this example
sim = parametricbootstrap(MersenneTwister(12321), 20, m0; β=β, σ=σ, θ=θ)
MixedModelBootstrap with 20 samples
      parameter  min         q25         median      mean        q75        ⋯
    ┌────────────────────────────────────────────────────────────────────────
 1  │ β1         -3.46254    -0.0275183  0.951426    1.05389     2.30673    ⋯
 2  │ β2         -3.15569    -2.21155    -1.65746    -1.40177    -0.725325  ⋯
 3  │ β3         0.0472057   1.47878     1.78613     1.84538     2.20689    ⋯
 4  │ β4         -2.78027    -1.84879    -1.38575    -1.31679    -0.506349  ⋯
 5  │ β5         -0.409183   0.00758897  0.388057    0.367569    0.706757   ⋯
 6  │ β6         -2.54446    -1.7043     -1.27281    -1.31153    -0.760097  ⋯
 7  │ β7         0.963731    1.17698     1.36732     1.44359     1.6929     ⋯
 8  │ β8         -0.0696417  -0.0416162  -0.0185569  0.00739991  0.0592813  ⋯
 9  │ σ          4.92495     4.96082     4.97544     4.99539     5.0203     ⋯
 10 │ σ1         5.64404     6.09127     6.40182     6.38416     6.63392    ⋯
 11 │ σ2         1.51382     1.60649     1.70629     1.71681     1.80014    ⋯
 12 │ σ3         3.29793     3.46935     3.72588     3.71249     3.92566    ⋯
 13 │ ρ1         -0.192243   -0.120512   -0.0325252  -0.0118603  0.0457625  ⋯
 14 │ ρ2         -0.274868   -0.0555564  0.039018    0.00786589  0.0880654  ⋯
 15 │ ρ3         -0.198017   -0.0877441  -0.0367911  -0.0258897  0.0270686  ⋯
 16 │ σ4         5.88261     6.33686     7.16773     7.21947     8.15155    ⋯
 17 │ σ5         2.01433     2.2599      2.49648     2.47639     2.69413    ⋯
 ⋮  │     ⋮          ⋮           ⋮           ⋮           ⋮           ⋮      ⋱

As mentioned above, the ordering within θ is dependent on the entire design, so if you embed the simulation code in a loop iterating over different numbers of subjects and items, it's probably better to write it as:

sim = parametricbootstrap(MersenneTwister(12321), 20, m0;
                          β=β, σ=σ, θ=createθ(m0; subj=re_subj, item=re_item))
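For example, a loop over a few subject counts might look like the following sketch; the counts, the number of simulations, and the way results are collected are all just illustrative choices:

# sketch: power for the same effects at several (hypothetical) subject counts
results = DataFrame[]
for n in [20, 40, 60]
    # rebuild the design and refit the "template" model for this design size
    des = simdat_crossed(MersenneTwister(42), n, n_item;
                         subj_btwn=subj_btwn, item_btwn=item_btwn, both_win=both_win)
    model = fit(MixedModel, form, des; contrasts=contrasts)
    # createθ is called on the new model, so the ordering of θ is always correct
    boot = parametricbootstrap(MersenneTwister(12321), 20, model;
                               β=β, σ=σ, θ=createθ(model; subj=re_subj, item=re_item))
    tbl = DataFrame(power_table(boot))
    tbl[!, :n_subj] .= n
    push!(results, tbl)
end
power_by_size = reduce(vcat, results)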

See your power and profit!

Finally, we can turn this into a power table:

ptbl = power_table(sim)
8-element Vector{@NamedTuple{coefname::String, power::Float64}}:
 (coefname = "age: old & frequency: low", power = 0.15)
 (coefname = "frequency: low & context: unmatched", power = 1.0)
 (coefname = "age: old", power = 0.25)
 (coefname = "context: unmatched", power = 0.5)
 (coefname = "(Intercept)", power = 0.25)
 (coefname = "frequency: low", power = 0.6)
 (coefname = "age: old & context: unmatched", power = 0.65)
 (coefname = "age: old & frequency: low & context: unmatched", power = 0.0)

For nicely displaying it, we can again use pretty_table:

pretty_table(ptbl)
┌────────────────────────────────────────────────┬─────────┐
│                                       coefname │   power │
│                                         String │ Float64 │
├────────────────────────────────────────────────┼─────────┤
│                      age: old & frequency: low │    0.15 │
│            frequency: low & context: unmatched │     1.0 │
│                                       age: old │    0.25 │
│                             context: unmatched │     0.5 │
│                                    (Intercept) │    0.25 │
│                                 frequency: low │     0.6 │
│                  age: old & context: unmatched │    0.65 │
│ age: old & frequency: low & context: unmatched │     0.0 │
└────────────────────────────────────────────────┴─────────┘

We can of course convert the row table into a DataFrame:

DataFrame(ptbl)
8×2 DataFrame
 Row │ coefname                                        power
     │ String                                          Float64
─────┼──────────────────────────────────────────────────────────
   1 │ age: old & frequency: low                          0.15
   2 │ frequency: low & context: unmatched                1.0
   3 │ age: old                                           0.25
   4 │ context: unmatched                                 0.5
   5 │ (Intercept)                                        0.25
   6 │ frequency: low                                     0.6
   7 │ age: old & context: unmatched                      0.65
   8 │ age: old & frequency: low & context: unmatched     0.0
[cholesky]: Technically, we're creating the lower Cholesky factor of these matrices, which is a bit like the matrix square root. In other words, we're creating the matrix form of standard deviations instead of the matrix form of the variances.
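As a small numerical illustration of that point (assuming zero correlations, as in this example), the lower Cholesky factor of a diagonal covariance matrix is simply the diagonal matrix of standard deviations:

using LinearAlgebra
σs = [1.3, 0.35, 0.75]      # relative standard deviations passed to create_re
Σ  = Diagonal(σs .^ 2)      # corresponding covariance matrix (zero correlations)
L  = cholesky(Matrix(Σ)).L  # lower Cholesky factor
L * L' ≈ Σ                  # true: L acts like a matrix square root of Σ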