Dataset catalog

Authors

Phillip Alday

Reinhold Kliegl

Published

2026-06-21

This is a reference for the datasets that recur throughout these materials. For each one you’ll find what it is, where it came from, how to load it, and which pages use it. Skim it now to get a feel for the recurring examples, and come back whenever you need the details for a particular analysis.

1 How datasets are loaded

All example datasets are loaded through a single function re-exported by the course package:

using DataFrames
using SMLP2026: dataset
sleepstudy = DataFrame(dataset(:sleepstudy))
first(sleepstudy, 5)
5×3 DataFrame
 Row │ subj    days  reaction
     │ String  Int8  Float64
─────┼────────────────────────
   1 │ S308       0   249.56
   2 │ S308       1   258.705
   3 │ S308       2   250.801
   4 │ S308       3   321.44
   5 │ S308       4   356.852

dataset(name) resolves a name from one of two sources:

  1. Datasets re-exported from MixedModelsDatasets.jl — the standard collection bundled with the Julia mixed-models ecosystem (e.g. sleepstudy, kb07, mrk17_exp1).
  2. Course-specific datasets hosted on OSF — downloaded on first use, checksum-verified, and cached locally (e.g. fggk21, kwdyz11, kkl15). The download is automatic; you don’t need to fetch anything by hand.
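The download, verify, and cache steps for the OSF-hosted datasets follow a common pattern, which can be sketched as below. This is an illustration only, not the course package's actual implementation: the function name `cached_fetch`, its arguments, and the use of SHA-256 are all assumptions made for the example.

```julia
import Downloads
using SHA: sha256

# Hypothetical helper, NOT the SMLP2026 implementation: fetch `url` into
# `cache_dir` unless it is already there, then verify its SHA-256 checksum.
function cached_fetch(url::AbstractString, expected_sha256::AbstractString,
                      cache_dir::AbstractString)
    mkpath(cache_dir)
    path = joinpath(cache_dir, basename(url))
    # Download only when the file is not already cached.
    isfile(path) || Downloads.download(url, path)
    # Verify the checksum before handing the file to the caller.
    actual = bytes2hex(open(sha256, path))
    if actual != lowercase(expected_sha256)
        rm(path; force=true)  # drop the bad file so the next call retries
        error("checksum mismatch for $(basename(path))")
    end
    return path
end
```

A second call with the same arguments finds the cached file and skips the download, which is why only the first use of a dataset needs network access.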

To list everything available in this project:

using SMLP2026
sort(SMLP2026.datasets())
32-element Vector{String}:
 "ELP_ldt_item"
 "ELP_ldt_subj"
 "ELP_ldt_trial"
 "box"
 "cbpp"
 "contra"
 "d3"
 "dyestuff"
 "dyestuff2"
 "elstongrizzle"
 ⋮
 "mrk17_exp1"
 "oxboys"
 "oxide"
 "pastes"
 "penicillin"
 "ratings"
 "sizespeed"
 "sleepstudy"
 "verbagg"
Note

A few datasets in these materials are also distributed as files under data/ (e.g. the raw Emotikon data as emotikon.rds). Those are used in the data-wrangling example Saving data with Arrow. Outside of that tutorial, prefer dataset(...).

2 Datasets bundled with MixedModelsDatasets.jl

These four are documented in full upstream — design, every column, and the primary citation — in the MixedModelsDatasets.jl dataset reference. We summarize their role in this course and link out for the details.

2.1 dyestuff

Yield of dyestuff (Naphthalene Black 12B) from five preparations of each of six batches of an intermediate product. The canonical minimal example of a linear mixed model: an intercept-only model with a single grouping factor (batch), used to introduce variance components in their simplest form. See the upstream entry.

Used in: bootstrap.
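A minimal sketch of the intercept-only model described above, assuming MixedModels.jl is available in the project environment and that the columns are named yield and batch, as in the upstream documentation:

```julia
using DataFrames
using MixedModels
using SMLP2026: dataset

dyestuff = DataFrame(dataset(:dyestuff))

# One fixed intercept plus a random intercept for each batch.
# Column names `yield` and `batch` are assumed from the upstream docs.
m_dye = fit(MixedModel, @formula(yield ~ 1 + (1 | batch)), dyestuff)

# Variance components: between-batch and residual.
VarCorr(m_dye)
```

The model has a single covariance parameter, the relative standard deviation of the batch effects, which makes it a convenient starting point before adding slopes or further grouping factors.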

2.2 sleepstudy

Reaction times from a sleep-deprivation study: average response time per day for 18 subjects restricted to 3 hours in bed per night. The canonical first example for a linear mixed model with a random slope for time. See the upstream entry.

Used in: sleepstudy, singularity, transformations, speed metrics, bootstrap, GLMM.
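The random-slope model mentioned above can be sketched as follows, using the column names shown in the loading example (subj, days, reaction) and assuming MixedModels.jl is available in the project environment:

```julia
using DataFrames
using MixedModels
using SMLP2026: dataset

sleepstudy = DataFrame(dataset(:sleepstudy))

# Fixed effects: intercept and linear effect of days.
# Random effects: correlated intercept and days slope for each subject.
m_sleep = fit(
    MixedModel,
    @formula(reaction ~ 1 + days + (1 + days | subj)),
    sleepstudy,
)

# Standard deviations and correlation of the subject-level effects.
VarCorr(m_sleep)
```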

2.3 kb07

Response times from Kronmüller & Barr (2007), a 2×2×2 visual-world experiment widely used to illustrate maximal versus parsimonious random-effects structures. See the upstream entry.

Used in: shrinkage plots, bootstrap, profiling, power simulation.
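To make the contrast between richer and more parsimonious random-effects structures concrete, here is a hedged sketch. The column names (rt_trunc, spkr, prec, load, subj, item) are assumed from the upstream documentation, effects coding is an illustrative choice, and the fuller model includes main-effect slopes only, so it is not the fully maximal structure.

```julia
using DataFrames
using MixedModels
using SMLP2026: dataset

kb07 = DataFrame(dataset(:kb07))

# Assumed column names; effects coding chosen only for illustration.
contrasts = Dict(
    :spkr => EffectsCoding(),
    :prec => EffectsCoding(),
    :load => EffectsCoding(),
)

# Fuller structure: random intercepts and main-effect slopes by subject and item.
m_full = fit(
    MixedModel,
    @formula(rt_trunc ~ 1 + spkr * prec * load +
                        (1 + spkr + prec + load | subj) +
                        (1 + spkr + prec + load | item)),
    kb07;
    contrasts,
)

# Parsimonious structure: random intercepts only.
m_pars = fit(
    MixedModel,
    @formula(rt_trunc ~ 1 + spkr * prec * load + (1 | subj) + (1 | item)),
    kb07;
    contrasts,
)

# Compare complexity and fit, and check whether the fuller model is singular.
comparison = [(; model=name, dof=dof(m), objective=objective(m), singular=issingular(m))
              for (name, m) in (("full", m_full), ("parsimonious", m_pars))]
```

A singular fit of the fuller model would suggest that the data do not support all of its variance components and correlations.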

2.4 mrk17_exp1

Lexical-decision response times from Masson, Rabe & Kliegl (2017), Experiment 1 — a 2×2×2 design that is a standard benchmark for high-dimensional crossed random effects. The course refers to it as mrk17. See the upstream entry.

Used in: model specification.

3 Course-specific datasets

These datasets are hosted on OSF and downloaded automatically by dataset(...). Column summaries below are generated directly from the data, so they stay accurate as the datasets are updated.

3.1 fggk21 (the Emotikon project)

Full name: Physical fitness of third-graders (Emotikon study)

Physical-fitness component scores from Fühner et al. (2021): 108,295 third-graders from 515 primary schools across 9 cohorts, tested on a battery of fitness tasks (running, jumping, balancing, and so on). The size and nested structure (children within schools within cohorts) make it a realistic large-scale example for contrast coding of a multi-level factor (Test), for PCA-guided model complexification, and for visualization at scale.

Citation: Fühner et al. (2021).

columns of fggk21
describe(DataFrame(dataset(:fggk21)))
7×7 DataFrame
 Row │ variable  mean     min      median   max      nmissing  eltype
     │ Symbol    Union…   Any      Union…   Any      Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ Cohort             2011              2019            0  String
   2 │ School             S100043           S800200         0  String
   3 │ Child              C002352           C117966         0  String
   4 │ Sex                female            male            0  String
   5 │ age       8.56073  7.99452  8.55852  9.10609         0  Float64
   6 │ Test               BPT               Star_r          0  String
   7 │ score     226.141  1.14152  4.65116  1530.0          0  Float64

Used in: contrast coding, AoG plots, Emotikon capstone, comparing transformed metrics.
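As a rough sketch of how contrast coding for the multi-level factor Test might be set up, the example below builds a contrasts dictionary from the levels found in the data. The choice of sequential-difference coding and the alphabetical level order are assumptions made for illustration; the contrast-coding page may order the levels differently or use another coding scheme.

```julia
using DataFrames
using MixedModels   # provides SeqDiffCoding, EffectsCoding, and Grouping
using SMLP2026: dataset

fggk21 = DataFrame(dataset(:fggk21))

# Levels taken from the data in alphabetical order; reorder them so that
# neighboring levels correspond to the comparisons you care about.
test_levels = sort(unique(fggk21.Test))

contrasts = Dict(
    # Each contrast compares a level of Test with the preceding level.
    :Test => SeqDiffCoding(; levels=test_levels),
    :Sex => EffectsCoding(),
    # Grouping() avoids building full contrast matrices for large grouping factors.
    :School => Grouping(),
    :Child => Grouping(),
    :Cohort => Grouping(),
)
```

The dictionary can then be passed as the `contrasts` keyword argument when fitting a model with `fit(MixedModel, ...)`.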

Note: Related files

Pre-aggregated variants fggk21_Child and fggk21_Score are also available via dataset(...), and the raw emotikon.rds lives under data/. The data-wrangling steps that produce the Arrow files are walked through in Saving data with Arrow.

3.2 kwdyz11

Full name: Kliegl, Wei, Dambacher, Yan & Zhou (2011) visual-attention experiment

Reaction times from a cueing experiment with four cue-target relations (CTRs), used to estimate spatial, object-based, and attractor-like effects of visual attention together with the individual differences in those effects. A key teaching point: the attraction effect is near zero as a fixed effect but has a reliable variance component, and the originally reported model turned out to be singular — motivating the replication in kkl15.

Citation: Kliegl et al. (2011).

columns of kwdyz11
describe(DataFrame(dataset(:kwdyz11)))
5×7 DataFrame
 Row │ variable  mean     min    median  max    nmissing  eltype
     │ Symbol    Union…   Any    Union…  Any    Int64     DataType
─────┼─────────────────────────────────────────────────────────────
   1 │ Item               I002           I603          0  String
   2 │ Subj               S01            S61           0  String
   3 │ CTR                dod            val           0  String
   4 │ dir                hor            ver           0  String
   5 │ rt        370.426  150.1  358.6   705.7         0  Float32

Used in: contrast coding of visual-attention effects, visual-attention capstone.

3.3 kkl15

Full name: Kliegl, Kushela & Laubrock (2015) replication and extension

A larger conceptual replication of kwdyz11 (Kliegl et al., 2015), run at Potsdam with additional manipulations of target size and rectangle orientation. With more data and more conditions it estimates the theoretically critical spatial–attraction correlation parameter that was the source of the singularity in the original study, and is the running example for principled reduction of model complexity.

Citation: Kliegl et al. (2015).

columns of kkl15
describe(DataFrame(dataset(:kkl15)))
5×7 DataFrame
 Row │ variable  mean     min       median   max       nmissing  eltype
     │ Symbol    Union…   Any       Union…   Any       Int64     DataType
─────┼────────────────────────────────────────────────────────────────────
   1 │ Subj               S001               S147             0  String
   2 │ CTR                dod                val              0  String
   3 │ rt        293.147  150.22    276.594  749.481          0  Float32
   4 │ cardinal           cardinal           diagonal         0  String
   5 │ size               big                small            0  String

Used in: replication & model reduction, and as the larger companion to kwdyz11.


References

Fühner, T., Granacher, U., Golle, K., & Kliegl, R. (2021). Age and sex effects in physical fitness components of 108,295 third graders including 515 primary schools and 9 cohorts. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-97000-4
Kliegl, R., Kushela, J., & Laubrock, J. (2015). Object orientation and target size modulate the speed of visual attention. Department of Psychology, University of Potsdam.
Kliegl, R., Wei, P., Dambacher, M., Yan, M., & Zhou, X. (2011). Experimental effects and individual differences in linear mixed models: Estimating the relationship between spatial, object, and attraction effects in visual attention. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2010.00238

This page was rendered from git revision e563d55 using Quarto 1.9.38 and Julia 1.12.6.
