Dataset catalog

Authors

Phillip Alday

Reinhold Kliegl

Published

2026-06-21

This is a reference for the datasets that recur throughout these materials. For each one you’ll find what it is, where it came from, how to load it, and which pages use it. Skim it now to get a feel for the recurring examples, and come back whenever you need the details for a particular analysis.

1 How datasets are loaded

All example datasets are loaded through a single function re-exported by the course package:

using DataFrames
using SMLP2026: dataset

sleepstudy = DataFrame(dataset(:sleepstudy))
first(sleepstudy, 5)

5×3 DataFrame

Row	subj	days	reaction
	String	Int8	Float64
1	S308	0	249.56
2	S308	1	258.705
3	S308	2	250.801
4	S308	3	321.44
5	S308	4	356.852

dataset(name) resolves a name from one of two sources:

Datasets re-exported from MixedModelsDatasets.jl — the standard collection bundled with the Julia mixed-models ecosystem (e.g. sleepstudy, kb07, mrk17_exp1).
Course-specific datasets hosted on OSF — downloaded on first use, checksum-verified, and cached locally (e.g. fggk21, kwdyz11, kkl15). The download is automatic; you don’t need to fetch anything by hand.
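As a minimal sketch of the two sources in practice, the call looks the same whichever source a name resolves to. This assumes the SMLP2026 course package is installed and, for fggk21, that a network connection is available on first use. The variable names are only illustrative.

```julia
using DataFrames
using SMLP2026: dataset

# Bundled with MixedModelsDatasets.jl: no download needed.
kb07 = DataFrame(dataset(:kb07))

# Hosted on OSF: fetched and cached on the first call (assumes network access),
# then read from the local cache on later calls.
fggk21 = DataFrame(dataset(:fggk21))

# Quick sanity check on what was loaded: (rows, columns) for each table.
size(kb07), size(fggk21)
```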

To list everything available in this project:

using SMLP2026
sort(SMLP2026.datasets())

32-element Vector{String}:
 "ELP_ldt_item"
 "ELP_ldt_subj"
 "ELP_ldt_trial"
 "box"
 "cbpp"
 "contra"
 "d3"
 "dyestuff"
 "dyestuff2"
 "elstongrizzle"
 ⋮
 "mrk17_exp1"
 "oxboys"
 "oxide"
 "pastes"
 "penicillin"
 "ratings"
 "sizespeed"
 "sleepstudy"
 "verbagg"

Note

A few datasets in these materials are also distributed as files under data/ (e.g. the raw Emotikon data as emotikon.rds). Those are used in the data-wrangling example Saving data with Arrow. Outside of that tutorial, prefer dataset(...).

2 Datasets bundled with MixedModelsDatasets.jl

The four bundled datasets used in this course (dyestuff, sleepstudy, kb07, and mrk17_exp1) are documented in full upstream, including the design, every column, and the primary citation, in the MixedModelsDatasets.jl dataset reference. We summarize their role in this course and link out for the details.

2.1 `dyestuff`

Yield of dyestuff (Naphthalene Black 12B) from five preparations of each of six batches of an intermediate product. The canonical minimal example of a linear mixed model: an intercept-only model with a single grouping factor (batch), used to introduce variance components in their simplest form. See the upstream entry.
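The intercept-only model described above can be sketched as follows. This assumes MixedModels.jl is installed alongside the course package; the column names yield and batch are taken from the upstream dataset documentation rather than from this page.

```julia
using MixedModels
using SMLP2026: dataset

dyestuff = dataset(:dyestuff)

# One fixed intercept plus a random intercept for each batch.
m = fit(MixedModel, @formula(yield ~ 1 + (1 | batch)), dyestuff; progress=false)

# Between-batch and residual variance components.
VarCorr(m)
```

The only fixed effect is the overall mean yield, so everything else the model estimates is a variance component.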

Used in: bootstrap.

2.2 `sleepstudy`

Reaction times from a sleep-deprivation study: average response time per day for 18 subjects restricted to 3 hours in bed per night. The canonical first example for a linear mixed model with a random slope for time. See the upstream entry.
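A random-slope model of the kind mentioned above might be written as in the sketch below, assuming MixedModels.jl is installed alongside the course package. The columns reaction, days, and subj are those shown in the sleepstudy preview in Section 1.

```julia
using MixedModels
using SMLP2026: dataset

sleepstudy = dataset(:sleepstudy)

# Fixed intercept and slope for days, plus a correlated random intercept
# and random slope for days within each subject.
m = fit(
    MixedModel,
    @formula(reaction ~ 1 + days + (1 + days | subj)),
    sleepstudy;
    progress=false,
)

# Variance components and the intercept-slope correlation.
VarCorr(m)
```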

Used in: sleepstudy, singularity, transformations, speed metrics, bootstrap, GLMM.

2.3 `kb07`

Response times from Kronmüller & Barr (2007), a 2×2×2 visual-world experiment widely used to illustrate maximal versus parsimonious random-effects structures. See the upstream entry.

Used in: shrinkage plots, bootstrap, profiling, power simulation.

2.4 `mrk17_exp1`

Lexical-decision response times from Masson, Rabe & Kliegl (2017), Experiment 1 — a 2×2×2 design that is a standard benchmark for high-dimensional crossed random effects. The course refers to it as mrk17. See the upstream entry.

Used in: model specification.

3 Course-specific datasets

These datasets are hosted on OSF and downloaded automatically by dataset(...). Column summaries below are generated directly from the data, so they stay accurate as the datasets are updated.

3.1 `fggk21` (the Emotikon project)

Full name: Physical fitness of third-graders (Emotikon study)

Physical-fitness component scores from Fühner et al. (2021): 108,295 third-graders from 515 primary schools across 9 cohorts, tested on a battery of fitness tasks (running, jumping, balancing, and so on). The size and nested structure (children within schools within cohorts) make it a realistic large-scale example for contrast coding of a multi-level factor (Test), for PCA-guided model complexification, and for visualization at scale.

Citation: Fühner et al. (2021).

columns of fggk21

describe(DataFrame(dataset(:fggk21)))

7×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	Cohort		2011		2019	0	String
2	School		S100043		S800200	0	String
3	Child		C002352		C117966	0	String
4	Sex		female		male	0	String
5	age	8.56073	7.99452	8.55852	9.10609	0	Float64
6	Test		BPT		Star_r	0	String
7	score	226.141	1.14152	4.65116	1530.0	0	Float64

Used in: contrast coding, AoG plots, Emotikon capstone, comparing transformed metrics.

Related files

Pre-aggregated variants fggk21_Child and fggk21_Score are also available via dataset(...), and the raw emotikon.rds lives under data/. The data-wrangling steps that produce the Arrow files are walked through in Saving data with Arrow.

3.2 `kwdyz11`

Full name: Kliegl, Wei, Dambacher, Yan & Zhou (2011) visual-attention experiment

Reaction times from a cueing experiment with four cue-target relations (CTRs), used to estimate spatial, object-based, and attractor-like effects of visual attention together with the individual differences in those effects. A key teaching point: the attraction effect is near zero as a fixed effect but has a reliable variance component, and the originally reported model turned out to be singular — motivating the replication in kkl15.

Citation: Kliegl et al. (2011).

columns of kwdyz11

describe(DataFrame(dataset(:kwdyz11)))

5×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	Item		I002		I603	0	String
2	Subj		S01		S61	0	String
3	CTR		dod		val	0	String
4	dir		hor		ver	0	String
5	rt	370.426	150.1	358.6	705.7	0	Float32

Used in: contrast coding of visual-attention effects, visual-attention capstone.

3.3 `kkl15`

Full name: Kliegl, Kushela & Laubrock (2015) replication and extension

A larger conceptual replication of kwdyz11 (Kliegl et al., 2015), run at Potsdam with additional manipulations of target size and rectangle orientation. With more data and more conditions it estimates the theoretically critical spatial–attraction correlation parameter that was the source of the singularity in the original study, and is the running example for principled reduction of model complexity.

Citation: Kliegl et al. (2015).

columns of kkl15

describe(DataFrame(dataset(:kkl15)))

5×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	Subj		S001		S147	0	String
2	CTR		dod		val	0	String
3	rt	293.147	150.22	276.594	749.481	0	Float32
4	cardinal		cardinal		diagonal	0	String
5	size		big		small	0	String

Used in: replication & model reduction, and as the larger companion to kwdyz11.

References

Fühner, T., Granacher, U., Golle, K., & Kliegl, R. (2021). Age and sex effects in physical fitness components of 108,295 third graders including 515 primary schools and 9 cohorts. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-97000-4

Kliegl, R., Kushela, J., & Laubrock, J. (2015). Object orientation and target size modulate the speed of visual attention. Department of Psychology, University of Potsdam.

Kliegl, R., Wei, P., Dambacher, M., Yan, M., & Zhou, X. (2011). Experimental effects and individual differences in linear mixed models: Estimating the relationship between spatial, object, and attraction effects in visual attention. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2010.00238

This page was rendered from git revision e563d55 using Quarto 1.9.38 and Julia 1.12.6.
