Dataset catalog
using DataFrames
using SMLP2026: dataset
This is a reference for the datasets that recur throughout these materials. For each one you’ll find what it is, where it came from, how to load it, and which pages use it. Skim it now to get a feel for the recurring examples, and come back whenever you need the details for a particular analysis.
1 How datasets are loaded
All example datasets are loaded through a single function re-exported by the course package:
sleepstudy = DataFrame(dataset(:sleepstudy))
first(sleepstudy, 5)
| Row | subj | days | reaction |
|---|---|---|---|
| | String | Int8 | Float64 |
| 1 | S308 | 0 | 249.56 |
| 2 | S308 | 1 | 258.705 |
| 3 | S308 | 2 | 250.801 |
| 4 | S308 | 3 | 321.44 |
| 5 | S308 | 4 | 356.852 |
dataset(name) resolves a name from one of two sources:
- Datasets re-exported from MixedModelsDatasets.jl: the standard collection bundled with the Julia mixed-models ecosystem (e.g. sleepstudy, kb07, mrk17_exp1).
- Course-specific datasets hosted on OSF: downloaded on first use, checksum-verified, and cached locally (e.g. fggk21, kwdyz11, kkl15). The download is automatic; you don't need to fetch anything by hand.
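The download-on-first-use, checksum-verified, cached behaviour can be sketched with the standard library alone. This is an illustration under stated assumptions, not the SMLP2026 implementation: `cached_file`, `fake_download`, and the file name `demo.arrow` are hypothetical, and the "download" is a stand-in that writes a few bytes locally.

```julia
using SHA  # standard library

# Hypothetical helper (NOT the SMLP2026 code): return the path to a cached
# file. `fetch(path)` is called only when the file is missing, and the
# SHA-256 digest is checked on every call.
function cached_file(fetch::Function, path::AbstractString, expected_sha256::AbstractString)
    isfile(path) || fetch(path)                 # fetch on first use only
    actual = bytes2hex(sha256(read(path)))      # digest of the cached bytes
    if actual != lowercase(expected_sha256)
        rm(path; force=true)                    # discard the bad copy
        error("checksum mismatch for $path")
    end
    return path
end

# Stand-in for a real download: writes fixed bytes and counts its calls.
dir = mktempdir()
path = joinpath(dir, "demo.arrow")
calls = Ref(0)
fake_download(p) = (calls[] += 1; write(p, "demo bytes"))
digest = bytes2hex(sha256("demo bytes"))

cached_file(fake_download, path, digest)   # first call fetches the file
cached_file(fake_download, path, digest)   # second call reuses the cache
```

Passing the fetch step in as a function keeps the caching logic testable without network access.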
To list everything available in this project:
using SMLP2026
sort(SMLP2026.datasets())
32-element Vector{String}:
"ELP_ldt_item"
"ELP_ldt_subj"
"ELP_ldt_trial"
"box"
"cbpp"
"contra"
"d3"
"dyestuff"
"dyestuff2"
"elstongrizzle"
⋮
"mrk17_exp1"
"oxboys"
"oxide"
"pastes"
"penicillin"
"ratings"
"sizespeed"
"sleepstudy"
"verbagg"
A few datasets in these materials are also distributed as files under data/ (e.g. the raw Emotikon data as emotikon.rds). Those are used in the data-wrangling example Saving data with Arrow. Outside of that tutorial, prefer dataset(...).
2 Datasets bundled with MixedModelsDatasets.jl
These four are documented in full upstream — design, every column, and the primary citation — in the MixedModelsDatasets.jl dataset reference. We summarize their role in this course and link out for the details.
2.1 dyestuff
Yield of dyestuff (Naphthalene Black 12B) from five preparations of each of six batches of an intermediate product. The canonical minimal example of a linear mixed model: an intercept-only model with a single grouping factor (batch), used to introduce variance components in their simplest form. See the upstream entry.
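As a minimal sketch of that intercept-only model, assuming MixedModels.jl is installed alongside the course package and that the columns are named `yield` and `batch` as in the upstream documentation:

```julia
using MixedModels            # provides fit, MixedModel, @formula
using SMLP2026: dataset      # course loader, as used on this page

dyestuff = dataset(:dyestuff)

# Intercept-only fixed effects plus a random intercept for each batch.
# Column names `yield` and `batch` are assumed from the upstream entry.
m = fit(MixedModel, @formula(yield ~ 1 + (1 | batch)), dyestuff; progress=false)

VarCorr(m)   # between-batch and residual variance components
```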
Used in: bootstrap.
2.2 sleepstudy
Reaction times from a sleep-deprivation study: average response time per day for 18 subjects restricted to 3 hours in bed per night. The canonical first example for a linear mixed model with a random slope for time. See the upstream entry.
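A random slope for time can be written as follows. This is a sketch, assuming MixedModels.jl is installed alongside the course package; the column names `reaction`, `days`, and `subj` are the ones shown in the preview table in section 1.

```julia
using MixedModels            # provides fit, MixedModel, @formula
using SMLP2026: dataset      # course loader, as used on this page

sleepstudy = dataset(:sleepstudy)

# Random intercept and random slope for `days`, both varying by `subj`.
m = fit(
    MixedModel,
    @formula(reaction ~ 1 + days + (1 + days | subj)),
    sleepstudy;
    progress=false,   # suppress the optimizer progress display
)

fixef(m)     # population-level intercept and slope
VarCorr(m)   # by-subject variances, their correlation, and the residual
```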
Used in: sleepstudy, singularity, transformations, speed metrics, bootstrap, GLMM.
2.3 kb07
Response times from Kronmüller & Barr (2007), a 2×2×2 visual-world experiment widely used to illustrate maximal versus parsimonious random-effects structures. See the upstream entry.
Used in: shrinkage plots, bootstrap, profiling, power simulation.
2.4 mrk17_exp1
Lexical-decision response times from Masson, Rabe & Kliegl (2017), Experiment 1 — a 2×2×2 design that is a standard benchmark for high-dimensional crossed random effects. The course refers to it as mrk17. See the upstream entry.
Used in: model specification.
3 Course-specific datasets
These datasets are hosted on OSF and downloaded automatically by dataset(...). Column summaries below are generated directly from the data, so they stay accurate as the datasets are updated.
3.1 fggk21 (the Emotikon project)
Full name: Physical fitness of third-graders (Emotikon study)
Physical-fitness component scores from Fühner et al. (2021): 108,295 third-graders from 515 primary schools across 9 cohorts, tested on a battery of fitness tasks (running, jumping, balancing, and so on). The size and nested structure (children within schools within cohorts) make it a realistic large-scale example for contrast coding of a multi-level factor (Test), for PCA-guided model complexification, and for visualization at scale.
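One way to inspect the grouping structure described above is to count the distinct levels of each grouping variable and summarize scores per task. This is a sketch using the column names from the summary table below (`Cohort`, `School`, `Child`, `Test`, `score`); the names `n_levels` and `per_test` are introduced here for illustration only.

```julia
using DataFrames
using Statistics: mean
using SMLP2026: dataset      # course loader, as used on this page

fggk21 = DataFrame(dataset(:fggk21))

# Number of distinct levels of each grouping variable.
n_levels = Dict(
    col => length(unique(fggk21[!, col])) for col in (:Cohort, :School, :Child, :Test)
)

# One row per level of `Test`: observation count and mean score.
per_test = combine(groupby(fggk21, :Test), nrow => :n_obs, :score => mean => :mean_score)
```

Because the tasks are measured on different scales, the per-task means are not comparable without standardizing the scores first.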
Citation: Fühner et al. (2021).
columns of fggk21
describe(DataFrame(dataset(:fggk21)))
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | Cohort | | 2011 | | 2019 | 0 | String |
| 2 | School | | S100043 | | S800200 | 0 | String |
| 3 | Child | | C002352 | | C117966 | 0 | String |
| 4 | Sex | | female | | male | 0 | String |
| 5 | age | 8.56073 | 7.99452 | 8.55852 | 9.10609 | 0 | Float64 |
| 6 | Test | | BPT | | Star_r | 0 | String |
| 7 | score | 226.141 | 1.14152 | 4.65116 | 1530.0 | 0 | Float64 |
Used in: contrast coding, AoG plots, Emotikon capstone, comparing transformed metrics.
Pre-aggregated variants fggk21_Child and fggk21_Score are also available via dataset(...), and the raw emotikon.rds lives under data/. The data-wrangling steps that produce the Arrow files are walked through in Saving data with Arrow.
3.2 kwdyz11
Full name: Kliegl, Wei, Dambacher, Yan & Zhou (2011) visual-attention experiment
Reaction times from a cueing experiment with four cue-target relations (CTRs), used to estimate spatial, object-based, and attractor-like effects of visual attention together with the individual differences in those effects. A key teaching point: the attraction effect is near zero as a fixed effect but has a reliable variance component, and the originally reported model turned out to be singular — motivating the replication in kkl15.
Citation: Kliegl et al. (2011).
columns of kwdyz11
describe(DataFrame(dataset(:kwdyz11)))
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | Item | | I002 | | I603 | 0 | String |
| 2 | Subj | | S01 | | S61 | 0 | String |
| 3 | CTR | | dod | | val | 0 | String |
| 4 | dir | | hor | | ver | 0 | String |
| 5 | rt | 370.426 | 150.1 | 358.6 | 705.7 | 0 | Float32 |
Used in: contrast coding of visual-attention effects, visual-attention capstone.
3.3 kkl15
Full name: Kliegl, Kuschela & Laubrock (2015) replication and extension
A larger conceptual replication of kwdyz11 (Kliegl et al., 2015), run at Potsdam with additional manipulations of target size and rectangle orientation. With more data and more conditions it estimates the theoretically critical spatial–attraction correlation parameter that was the source of the singularity in the original study, and is the running example for principled reduction of model complexity.
Citation: Kliegl et al. (2015).
columns of kkl15
describe(DataFrame(dataset(:kkl15)))
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | Subj | | S001 | | S147 | 0 | String |
| 2 | CTR | | dod | | val | 0 | String |
| 3 | rt | 293.147 | 150.22 | 276.594 | 749.481 | 0 | Float32 |
| 4 | cardinal | | cardinal | | diagonal | 0 | String |
| 5 | size | | big | | small | 0 | String |
Used in: replication & model reduction, and as the larger companion to kwdyz11.
References
This page was rendered from git revision e563d55 using Quarto 1.9.38 and Julia 1.12.6.