Notes on saved data files

Author

Phillip Alday, Douglas Bates, and Reinhold Kliegl

Published

2022-09-27

The Arrow storage format

The Arrow storage format provides a language-agnostic storage and memory specification for columnar data tables, which just means “something that looks like a data frame in R”. That is, an Arrow table is an ordered, named collection of columns, all of the same length.

The columns can be of different types, including numeric values, character strings, and factor-like representations called DictEncoded.

An Arrow file can be read or written from R, Python, Julia, and many other languages. Somewhat confusingly, in R and Python the name feather, which refers to an earlier version of the storage format, is used in some function names, such as read_feather.

The Emotikon data

The SMLP2021 repository contains a version of the data from Fühner et al. (2021) in notebooks/data/fggk21.arrow. After that file was created there were changes in the master RDS file on the osf.io site for the project. We will recreate the Arrow file here then split it into two separate tables, one with a row for each child in the study and one with a row for each test result.

The Arrow package for Julia does not export any function names, which means that the function to read an Arrow file must be called as Arrow.Table. It returns a column table, as described in the Tables package. This is like a read-only data frame, which can be easily converted to a full-fledged DataFrame if desired.

This arrangement means that the Arrow package need not depend on the DataFrames package, which is a heavyweight dependency, while still making it easy to produce a DataFrame when warranted.
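As a rough illustration of what a column table is, the sketch below uses a NamedTuple of equal-length vectors as a stand-in for the result of Arrow.Table; a NamedTuple of vectors is also treated as a column table by the Tables package. The names `coltbl` and `toy` and all values are invented for illustration.

```julia
using DataFrames

# A NamedTuple of equal-length vectors: an ordered, named collection of columns.
# This is only a stand-in for an Arrow.Table; names and values are made up.
coltbl = (Child=["C1", "C2", "C3"], age=[8.0, 8.5, 9.0])

# Columns are accessed by name, in the same way as `tbl.age` on an Arrow.Table
coltbl.age

# Convert to a full-fledged DataFrame when rows need to be added or tables joined
toy = DataFrame(coltbl)
push!(toy, (Child="C4", age=8.2))
nrow(toy)
```

The conversion step is the same call, `DataFrame(tbl)`, that is applied to the Arrow table later in these notes.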

Load the packages to be used.

using AlgebraOfGraphics
using Arrow
using CairoMakie
using Chain
using DataFrameMacros
using DataFrames
using Downloads
using KernelDensity
using RCall   # access R from within Julia
using StatsBase

CairoMakie.activate!(; type="svg")
using AlgebraOfGraphics: density

Downloading and importing the RDS file

This is similar to some of the code shown by Julius Krumbiegel on Monday. In the data directory of the emotikon project on osf.io, under Data, the URL for the RDS data file is https://osf.io/xawdb/. Note that we want version 2 of this file.

fn = Downloads.download("https://osf.io/xawdb/download?version=2");

dfrm = rcopy(R"readRDS($fn)")

525,126 rows × 7 columns

	Cohort	School	Child	Sex	age	Test	score
	Cat…	Cat…	Cat…	Cat…	Float64	Cat…	Float64
1	2013	S100067	C002352	male	7.99452	S20_r	5.26316
2	2013	S100067	C002352	male	7.99452	BPT	3.7
3	2013	S100067	C002352	male	7.99452	SLJ	125.0
4	2013	S100067	C002352	male	7.99452	Star_r	2.47146
5	2013	S100067	C002352	male	7.99452	Run	1053.0
6	2013	S100067	C002353	male	7.99452	S20_r	5.0
7	2013	S100067	C002353	male	7.99452	BPT	4.1
8	2013	S100067	C002353	male	7.99452	SLJ	116.0
9	2013	S100067	C002353	male	7.99452	Star_r	1.76778
10	2013	S100067	C002353	male	7.99452	Run	1089.0
11	2013	S100067	C002354	male	7.99452	S20_r	4.54545
12	2013	S100067	C002354	male	7.99452	BPT	3.9
13	2013	S100067	C002354	male	7.99452	SLJ	111.0
14	2013	S100067	C002354	male	7.99452	Star_r	1.98875
15	2013	S100067	C002354	male	7.99452	Run	864.0
16	2013	S100122	C002355	female	7.99452	S20_r	4.54545
17	2013	S100122	C002355	female	7.99452	BPT	3.0
18	2013	S100122	C002355	female	7.99452	SLJ	114.0
19	2013	S100122	C002355	female	7.99452	Star_r	1.84464
20	2013	S100122	C002355	female	7.99452	Run	835.0
21	2013	S100146	C002356	male	7.99452	S20_r	4.34783
22	2013	S100146	C002356	male	7.99452	BPT	3.3
23	2013	S100146	C002356	male	7.99452	SLJ	118.0
24	2013	S100146	C002356	male	7.99452	Star_r	1.90682
25	2013	S100146	C002356	male	7.99452	Run	860.0
26	2013	S100146	C002357	male	7.99452	S20_r	4.34783
27	2013	S100146	C002357	male	7.99452	BPT	4.3
28	2013	S100146	C002357	male	7.99452	SLJ	130.0
29	2013	S100146	C002357	male	7.99452	Star_r	1.99655
30	2013	S100146	C002357	male	7.99452	Run	960.0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

Now write this table as an Arrow file and read it back in.

arrowfn = joinpath("data", "fggk21.arrow")
Arrow.write(arrowfn, dfrm; compress=:lz4)
tbl = Arrow.Table(arrowfn)

Arrow.Table with 525126 rows, 7 columns, and schema:
 :Cohort  String
 :School  String
 :Child   String
 :Sex     String
 :age     Float64
 :Test    String
 :score   Float64

filesize(arrowfn)

df = DataFrame(tbl)

525,126 rows × 7 columns

	Cohort	School	Child	Sex	age	Test	score
	String	String	String	String	Float64	String	Float64
1	2013	S100067	C002352	male	7.99452	S20_r	5.26316
2	2013	S100067	C002352	male	7.99452	BPT	3.7
3	2013	S100067	C002352	male	7.99452	SLJ	125.0
4	2013	S100067	C002352	male	7.99452	Star_r	2.47146
5	2013	S100067	C002352	male	7.99452	Run	1053.0
6	2013	S100067	C002353	male	7.99452	S20_r	5.0
7	2013	S100067	C002353	male	7.99452	BPT	4.1
8	2013	S100067	C002353	male	7.99452	SLJ	116.0
9	2013	S100067	C002353	male	7.99452	Star_r	1.76778
10	2013	S100067	C002353	male	7.99452	Run	1089.0
11	2013	S100067	C002354	male	7.99452	S20_r	4.54545
12	2013	S100067	C002354	male	7.99452	BPT	3.9
13	2013	S100067	C002354	male	7.99452	SLJ	111.0
14	2013	S100067	C002354	male	7.99452	Star_r	1.98875
15	2013	S100067	C002354	male	7.99452	Run	864.0
16	2013	S100122	C002355	female	7.99452	S20_r	4.54545
17	2013	S100122	C002355	female	7.99452	BPT	3.0
18	2013	S100122	C002355	female	7.99452	SLJ	114.0
19	2013	S100122	C002355	female	7.99452	Star_r	1.84464
20	2013	S100122	C002355	female	7.99452	Run	835.0
21	2013	S100146	C002356	male	7.99452	S20_r	4.34783
22	2013	S100146	C002356	male	7.99452	BPT	3.3
23	2013	S100146	C002356	male	7.99452	SLJ	118.0
24	2013	S100146	C002356	male	7.99452	Star_r	1.90682
25	2013	S100146	C002356	male	7.99452	Run	860.0
26	2013	S100146	C002357	male	7.99452	S20_r	4.34783
27	2013	S100146	C002357	male	7.99452	BPT	4.3
28	2013	S100146	C002357	male	7.99452	SLJ	130.0
29	2013	S100146	C002357	male	7.99452	Star_r	1.99655
30	2013	S100146	C002357	male	7.99452	Run	960.0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

Avoiding needless repetition

One of the principles of relational database design is that information should not be repeated needlessly. Each row of df is determined by a combination of Child and Test, together producing a score, which can be converted to a zScore.

The other columns in the table, Cohort, School, age, and Sex, are properties of the Child.

Storing these values redundantly in the full table takes up space but, more importantly, allows for inconsistency. As it stands, a given Child could be recorded as being in one Cohort for the Run test and in another Cohort for the S20_r test and nothing about the table would detect this as being an error.
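One way to detect this kind of inconsistency, sketched here on a toy table with invented values, is to count the distinct Cohort values recorded for each Child. The names `long`, `ncohort`, and `inconsistent` are hypothetical and not part of the original analysis.

```julia
using DataFrames

# Toy long-format table (values invented): child "C2" is recorded in two cohorts
long = DataFrame(;
  Child=["C1", "C1", "C2", "C2"],
  Cohort=["2013", "2013", "2013", "2014"],
  Test=["Run", "S20_r", "Run", "S20_r"],
)

# Count the distinct Cohort values recorded for each Child
ncohort = combine(
  groupby(long, :Child), :Cohort => (c -> length(unique(c))) => :ncohort
)

# Any child with more than one distinct value is recorded inconsistently
inconsistent = subset(ncohort, :ncohort => ByRow(>(1)))
```

The same check could be repeated for School, Sex, and age; an empty result for each would indicate that the redundant columns agree.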

The approach used in relational databases is to store the information for score in one table that contains only Child, Test, and score, and to store the information for the Child in another table that includes Cohort, School, age, and Sex. These tables can then be joined to create the table to be used for analysis.

The maintainers of the DataFrames package have put in a lot of work over the past few years to make joins quite efficient in Julia. Thus the processing penalty of reassembling the big table from two smaller tables is minimal.

The main advantage of joining smaller tables to produce the analysis table is that the information in the analysis table is consistent by design.
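The split-and-rejoin idea can be sketched on a toy table before applying it to the real data. All values and the names `full`, `child`, `score`, and `rejoined` are invented for illustration; the check at the end compares the rejoined table with the original after sorting, because a join is not guaranteed to preserve row order.

```julia
using DataFrames

# Toy analysis table (values invented for illustration)
full = DataFrame(;
  Child=["C1", "C1", "C2"],
  Sex=["male", "male", "female"],
  Test=["Run", "SLJ", "Run"],
  score=[1053.0, 125.0, 835.0],
)

# One row per child, holding the properties of the Child
child = unique(select(full, :Child, :Sex))
# One row per test result
score = select(full, :Child, :Test, :score)

# Reassemble the analysis table from the two smaller tables
rejoined = disallowmissing!(leftjoin(score, child; on=:Child))

# Same content as the original, up to row and column order
sort(select(rejoined, names(full)), [:Child, :Test]) ==
sort(full, [:Child, :Test])
```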

Creating the smaller tables

Child = unique(select(df, :Child, :School, :Cohort, :Sex, :age))

108,295 rows × 5 columns

	Child	School	Cohort	Sex	age
	String	String	String	String	Float64
1	C002352	S100067	2013	male	7.99452
2	C002353	S100067	2013	male	7.99452
3	C002354	S100067	2013	male	7.99452
4	C002355	S100122	2013	female	7.99452
5	C002356	S100146	2013	male	7.99452
6	C002357	S100146	2013	male	7.99452
7	C002358	S100146	2013	male	7.99452
8	C002359	S100183	2013	female	7.99452
9	C002360	S100195	2013	female	7.99452
10	C002361	S100213	2013	male	7.99452
11	C002362	S100237	2013	female	7.99452
12	C002363	S100237	2013	female	7.99452
13	C002364	S100250	2013	female	7.99452
14	C002365	S100304	2013	male	7.99452
15	C002366	S100304	2013	male	7.99452
16	C002367	S100316	2013	female	7.99452
17	C002368	S100365	2013	male	7.99452
18	C002369	S100365	2013	male	7.99452
19	C002370	S100365	2013	female	7.99452
20	C002371	S100432	2013	female	7.99452
21	C002372	S100432	2013	male	7.99452
22	C002373	S100481	2013	male	7.99452
23	C002374	S100481	2013	male	7.99452
24	C002375	S100481	2013	female	7.99452
25	C002376	S100493	2013	female	7.99452
26	C002377	S100493	2013	female	7.99452
27	C002378	S100547	2013	male	7.99452
28	C002379	S100547	2013	male	7.99452
29	C002380	S100547	2013	male	7.99452
30	C002381	S100547	2013	female	7.99452
⋮	⋮	⋮	⋮	⋮	⋮

length(unique(Child.Child))  # should be 108295

filesize(
  Arrow.write("./data/fggk21_Child.arrow", Child; compress=:lz4)
)

filesize(
  Arrow.write(
    "./data/fggk21_Score.arrow",
    select(df, :Child, :Test, :score);
    compress=:lz4,
  ),
)

Note

A careful examination of the file sizes versus that of ./data/fggk21.arrow will show that the separate tables combined take up more space than the original because of the compression. Compression algorithms are often more successful when applied to larger files.

Now read the Arrow tables in and reassemble the original table.

Score = DataFrame(Arrow.Table("./data/fggk21_Score.arrow"))

525,126 rows × 3 columns

	Child	Test	score
	String	String	Float64
1	C002352	S20_r	5.26316
2	C002352	BPT	3.7
3	C002352	SLJ	125.0
4	C002352	Star_r	2.47146
5	C002352	Run	1053.0
6	C002353	S20_r	5.0
7	C002353	BPT	4.1
8	C002353	SLJ	116.0
9	C002353	Star_r	1.76778
10	C002353	Run	1089.0
11	C002354	S20_r	4.54545
12	C002354	BPT	3.9
13	C002354	SLJ	111.0
14	C002354	Star_r	1.98875
15	C002354	Run	864.0
16	C002355	S20_r	4.54545
17	C002355	BPT	3.0
18	C002355	SLJ	114.0
19	C002355	Star_r	1.84464
20	C002355	Run	835.0
21	C002356	S20_r	4.34783
22	C002356	BPT	3.3
23	C002356	SLJ	118.0
24	C002356	Star_r	1.90682
25	C002356	Run	860.0
26	C002357	S20_r	4.34783
27	C002357	BPT	4.3
28	C002357	SLJ	130.0
29	C002357	Star_r	1.99655
30	C002357	Run	960.0
⋮	⋮	⋮	⋮

At this point we can create the z-score column by standardizing the scores for each Test. The code to do this follows Julius’s presentation on Monday.

@transform!(groupby(Score, :Test), :zScore = @c zscore(:score))

525,126 rows × 4 columns

	Child	Test	score	zScore
	String	String	Float64	Float64
1	C002352	S20_r	5.26316	1.7913
2	C002352	BPT	3.7	-0.0622317
3	C002352	SLJ	125.0	-0.0336567
4	C002352	Star_r	2.47146	1.46874
5	C002352	Run	1053.0	0.331058
6	C002353	S20_r	5.0	1.15471
7	C002353	BPT	4.1	0.498354
8	C002353	SLJ	116.0	-0.498822
9	C002353	Star_r	1.76778	-0.9773
10	C002353	Run	1089.0	0.574056
11	C002354	S20_r	4.54545	0.0551481
12	C002354	BPT	3.9	0.218061
13	C002354	SLJ	111.0	-0.757248
14	C002354	Star_r	1.98875	-0.209186
15	C002354	Run	864.0	-0.944681
16	C002355	S20_r	4.54545	0.0551481
17	C002355	BPT	3.0	-1.04326
18	C002355	SLJ	114.0	-0.602193
19	C002355	Star_r	1.84464	-0.71013
20	C002355	Run	835.0	-1.14043
21	C002356	S20_r	4.34783	-0.422921
22	C002356	BPT	3.3	-0.622817
23	C002356	SLJ	118.0	-0.395452
24	C002356	Star_r	1.90682	-0.493992
25	C002356	Run	860.0	-0.97168
26	C002357	S20_r	4.34783	-0.422921
27	C002357	BPT	4.3	0.778646
28	C002357	SLJ	130.0	0.224769
29	C002357	Star_r	1.99655	-0.182076
30	C002357	Run	960.0	-0.296686
⋮	⋮	⋮	⋮	⋮
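To make the standardization explicit, here is a minimal sketch on toy data, assuming that `zscore` by default centers by the sample mean and scales by the sample standard deviation of its argument. The helper `standardize` and the toy values are hypothetical; the sketch uses the plain DataFrames `transform!` rather than the macro form shown above.

```julia
using DataFrames
using Statistics

# Toy scores for two tests on very different scales (values invented)
toy = DataFrame(;
  Test=["Run", "Run", "Run", "SLJ", "SLJ", "SLJ"],
  score=[835.0, 960.0, 1053.0, 111.0, 118.0, 130.0],
)

# Hand-written standardization: subtract the mean, divide by the standard deviation
standardize(x) = (x .- mean(x)) ./ std(x)

# Apply it separately within each Test, adding a zScore column to `toy`
transform!(groupby(toy, :Test), :score => standardize => :zScore)

# Within each Test the zScore has mean 0 and standard deviation 1, up to rounding
check = combine(groupby(toy, :Test), :zScore => mean => :m, :zScore => std => :s)
```

Standardizing within each Test puts scores measured in seconds, meters, and centimeters on a common scale, which is what makes them comparable across tests.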

Child = DataFrame(Arrow.Table("./data/fggk21_Child.arrow"))

108,295 rows × 5 columns

	Child	School	Cohort	Sex	age
	String	String	String	String	Float64
1	C002352	S100067	2013	male	7.99452
2	C002353	S100067	2013	male	7.99452
3	C002354	S100067	2013	male	7.99452
4	C002355	S100122	2013	female	7.99452
5	C002356	S100146	2013	male	7.99452
6	C002357	S100146	2013	male	7.99452
7	C002358	S100146	2013	male	7.99452
8	C002359	S100183	2013	female	7.99452
9	C002360	S100195	2013	female	7.99452
10	C002361	S100213	2013	male	7.99452
11	C002362	S100237	2013	female	7.99452
12	C002363	S100237	2013	female	7.99452
13	C002364	S100250	2013	female	7.99452
14	C002365	S100304	2013	male	7.99452
15	C002366	S100304	2013	male	7.99452
16	C002367	S100316	2013	female	7.99452
17	C002368	S100365	2013	male	7.99452
18	C002369	S100365	2013	male	7.99452
19	C002370	S100365	2013	female	7.99452
20	C002371	S100432	2013	female	7.99452
21	C002372	S100432	2013	male	7.99452
22	C002373	S100481	2013	male	7.99452
23	C002374	S100481	2013	male	7.99452
24	C002375	S100481	2013	female	7.99452
25	C002376	S100493	2013	female	7.99452
26	C002377	S100493	2013	female	7.99452
27	C002378	S100547	2013	male	7.99452
28	C002379	S100547	2013	male	7.99452
29	C002380	S100547	2013	male	7.99452
30	C002381	S100547	2013	female	7.99452
⋮	⋮	⋮	⋮	⋮	⋮

df1 = disallowmissing!(leftjoin(Score, Child; on=:Child))

525,126 rows × 8 columns

	Child	Test	score	zScore	School	Cohort	Sex	age
	String	String	Float64	Float64	String	String	String	Float64
1	C002352	S20_r	5.26316	1.7913	S100067	2013	male	7.99452
2	C002352	BPT	3.7	-0.0622317	S100067	2013	male	7.99452
3	C002352	SLJ	125.0	-0.0336567	S100067	2013	male	7.99452
4	C002352	Star_r	2.47146	1.46874	S100067	2013	male	7.99452
5	C002352	Run	1053.0	0.331058	S100067	2013	male	7.99452
6	C002353	S20_r	5.0	1.15471	S100067	2013	male	7.99452
7	C002353	BPT	4.1	0.498354	S100067	2013	male	7.99452
8	C002353	SLJ	116.0	-0.498822	S100067	2013	male	7.99452
9	C002353	Star_r	1.76778	-0.9773	S100067	2013	male	7.99452
10	C002353	Run	1089.0	0.574056	S100067	2013	male	7.99452
11	C002354	S20_r	4.54545	0.0551481	S100067	2013	male	7.99452
12	C002354	BPT	3.9	0.218061	S100067	2013	male	7.99452
13	C002354	SLJ	111.0	-0.757248	S100067	2013	male	7.99452
14	C002354	Star_r	1.98875	-0.209186	S100067	2013	male	7.99452
15	C002354	Run	864.0	-0.944681	S100067	2013	male	7.99452
16	C002355	S20_r	4.54545	0.0551481	S100122	2013	female	7.99452
17	C002355	BPT	3.0	-1.04326	S100122	2013	female	7.99452
18	C002355	SLJ	114.0	-0.602193	S100122	2013	female	7.99452
19	C002355	Star_r	1.84464	-0.71013	S100122	2013	female	7.99452
20	C002355	Run	835.0	-1.14043	S100122	2013	female	7.99452
21	C002356	S20_r	4.34783	-0.422921	S100146	2013	male	7.99452
22	C002356	BPT	3.3	-0.622817	S100146	2013	male	7.99452
23	C002356	SLJ	118.0	-0.395452	S100146	2013	male	7.99452
24	C002356	Star_r	1.90682	-0.493992	S100146	2013	male	7.99452
25	C002356	Run	860.0	-0.97168	S100146	2013	male	7.99452
26	C002357	S20_r	4.34783	-0.422921	S100146	2013	male	7.99452
27	C002357	BPT	4.3	0.778646	S100146	2013	male	7.99452
28	C002357	SLJ	130.0	0.224769	S100146	2013	male	7.99452
29	C002357	Star_r	1.99655	-0.182076	S100146	2013	male	7.99452
30	C002357	Run	960.0	-0.296686	S100146	2013	male	7.99452
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

Note

The call to disallowmissing! is needed because the join creates columns that allow for missing values, even though we know that the result should not contain any. This call will fail if, for some reason, missing values were created.
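The failure case can be sketched with two toy tables in which one child has a score but no row in the child table. The values and the names `score`, `child`, `joined`, and `outcome` are invented; the exact exception type may differ between DataFrames versions, so the sketch only records that an exception was thrown.

```julia
using DataFrames

# Toy tables (values invented): child "C3" has a score but no row in `child`
score = DataFrame(; Child=["C1", "C2", "C3"], score=[1.0, 2.0, 3.0])
child = DataFrame(; Child=["C1", "C2"], Sex=["male", "female"])

joined = leftjoin(score, child; on=:Child)
eltype(joined.Sex)             # Union{Missing, String}
count(ismissing, joined.Sex)   # number of unmatched rows

# disallowmissing! refuses to narrow a column that actually contains missing
outcome = try
  disallowmissing!(joined)
  "no missing values"
catch err
  "disallowmissing! threw $(typeof(err))"
end
```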

Discovering patterns in the data

One of the motivations for creating the Child table was to be able to bin the ages according to the age of each child, not the age of each Child-Test combination. Not all children have all 5 test results. We can check the number of results by grouping on :Child and evaluating the number of rows in each group.

nobsChild = combine(groupby(Score, :Child), nrow => :ntest)

108,295 rows × 2 columns

	Child	ntest
	String	Int64
1	C002352	5
2	C002353	5
3	C002354	5
4	C002355	5
5	C002356	5
6	C002357	5
7	C002358	5
8	C002359	4
9	C002360	5
10	C002361	4
11	C002362	5
12	C002363	5
13	C002364	5
14	C002365	5
15	C002366	5
16	C002367	5
17	C002368	5
18	C002369	5
19	C002370	5
20	C002371	4
21	C002372	5
22	C002373	5
23	C002374	5
24	C002375	5
25	C002376	5
26	C002377	5
27	C002378	5
28	C002379	5
29	C002380	5
30	C002381	5
⋮	⋮	⋮

Now create a table of the number of children with 1, 2, …, 5 test scores.

combine(groupby(nobsChild, :ntest), nrow)

5 rows × 2 columns

	ntest	nrow
	Int64	Int64
1	1	462
2	2	729
3	3	1739
4	4	8836
5	5	96529

A natural question at this point is whether there is something about those students who have few observations. For example, are they from only a few schools?

One approach to examining properties like this is to add the number of observations for each child to the Child table. Later we can group the table according to :ntest to look at properties of Child by :ntest.

gdf = groupby(
  disallowmissing!(leftjoin(Child, nobsChild; on=:Child)), :ntest
)

GroupedDataFrame with 5 groups based on key: ntest

First Group (462 rows): ntest = 1

	Child	School	Cohort	Sex	age	ntest
	String	String	String	String	Float64	Int64
1	C002452	S101175	2013	male	7.99452	1
2	C002625	S103329	2013	male	7.99452	1
3	C002754	S104814	2013	female	7.99452	1
4	C003269	S102258	2012	female	7.99726	1
5	C003599	S105843	2012	female	7.99726	1
6	C003807	S100754	2011	male	8.0	1
7	C003985	S102945	2011	male	8.0	1
8	C004086	S104255	2011	male	8.0	1
9	C004657	S101400	2014	male	8.03833	1
10	C005036	S105909	2014	male	8.03833	1
11	C005440	S101023	2019	male	8.05202	1
12	C005523	S101825	2019	female	8.05202	1
13	C005697	S103615	2019	male	8.05202	1
14	C005759	S104632	2019	female	8.05202	1
15	C005810	S104954	2019	female	8.05202	1
16	C005835	S105053	2019	male	8.05202	1
17	C005854	S105405	2019	male	8.05202	1
18	C006550	S103329	2013	male	8.0794	1
19	C006760	S105181	2013	female	8.0794	1
20	C007031	S113244	2013	male	8.0794	1
21	C007050	S100195	2012	female	8.08214	1
22	C007305	S102350	2012	male	8.08214	1
23	C007828	S111405	2012	female	8.08214	1
24	C008698	S104917	2016	female	8.09309	1
25	C008707	S102271	2016	male	8.09582	1
26	C009596	S103421	2014	female	8.1232	1
27	C009651	S103706	2014	female	8.1232	1
28	C009879	S105909	2014	female	8.1232	1
29	C010203	S102660	2016	male	8.12594	1
30	C010204	S102660	2016	male	8.12594	1
⋮	⋮	⋮	⋮	⋮	⋮	⋮

⋮

Last Group (96529 rows): ntest = 5

	Child	School	Cohort	Sex	age	ntest
	String	String	String	String	Float64	Int64
1	C002352	S100067	2013	male	7.99452	5
2	C002353	S100067	2013	male	7.99452	5
3	C002354	S100067	2013	male	7.99452	5
4	C002355	S100122	2013	female	7.99452	5
5	C002356	S100146	2013	male	7.99452	5
6	C002357	S100146	2013	male	7.99452	5
7	C002358	S100146	2013	male	7.99452	5
8	C002360	S100195	2013	female	7.99452	5
9	C002362	S100237	2013	female	7.99452	5
10	C002363	S100237	2013	female	7.99452	5
11	C002364	S100250	2013	female	7.99452	5
12	C002365	S100304	2013	male	7.99452	5
13	C002366	S100304	2013	male	7.99452	5
14	C002367	S100316	2013	female	7.99452	5
15	C002368	S100365	2013	male	7.99452	5
16	C002369	S100365	2013	male	7.99452	5
17	C002370	S100365	2013	female	7.99452	5
18	C002372	S100432	2013	male	7.99452	5
19	C002373	S100481	2013	male	7.99452	5
20	C002374	S100481	2013	male	7.99452	5
21	C002375	S100481	2013	female	7.99452	5
22	C002376	S100493	2013	female	7.99452	5
23	C002377	S100493	2013	female	7.99452	5
24	C002378	S100547	2013	male	7.99452	5
25	C002379	S100547	2013	male	7.99452	5
26	C002380	S100547	2013	male	7.99452	5
27	C002381	S100547	2013	female	7.99452	5
28	C002382	S100547	2013	female	7.99452	5
29	C002383	S100584	2013	female	7.99452	5
30	C002384	S100596	2013	male	7.99452	5
⋮	⋮	⋮	⋮	⋮	⋮	⋮

Are the sexes represented more-or-less equally?

combine(groupby(first(gdf), :Sex), nrow => :nchild)

2 rows × 2 columns

	Sex	nchild
	String	Int64
1	male	230
2	female	232

combine(groupby(last(gdf), :Sex), nrow => :nchild)

2 rows × 2 columns

	Sex	nchild
	String	Int64
1	male	47552
2	female	48977

What about the distribution of ages?

"""
    ridgeplot!(ax::Axis, df::AbstractDataFrame, densvar::Symbol, group::Symbol; normalize=false)
    ridgeplot!(f::Figure, args...; pos=(1,1), kwargs...)
    ridgeplot(args...; kwargs...)

Create a "ridge plot".

A ridge plot is a stacked plot of densities for a given variable (`densvar`) grouped by a different variable (`group`). Because densities can vary widely in scale, it is sometimes useful to `normalize` the densities so that each density has a maximum of 1.

The non-mutating method creates a Figure before calling the method for Figure.
The method for Figure places the ridge plot in the grid position specified by `pos`, default is (1,1).
"""
function ridgeplot!(
  ax::Axis,
  df::AbstractDataFrame,
  densvar::Symbol,
  group::Symbol;
  normalize=false,
)
  # `normalize` makes it so that the max density is always 1
  # `normalize` works on the density not the area/mass
  gdf = groupby(df, group)
  dens = combine(gdf, densvar => kde => :kde)
  sort!(dens, group)
  spacing = normalize ? 1.0 : 0.9 * maximum(dens[!, :kde]) do val
    return maximum(val.density)
  end

  nticks = length(gdf)

  for (idx, row) in enumerate(eachrow(dens))
    dd = if normalize
      row.kde.density ./ maximum(row.kde.density)
    else
      row.kde.density
    end

    offset = idx * spacing

    # plain vectors of points suffice here; the deprecated `Node` wrapper is not needed
    lower = Point2f.(row.kde.x, offset)
    upper = Point2f.(row.kde.x, dd .+ offset)
    band!(ax, lower, upper; color=(:black, 0.3))
    lines!(ax, upper; color=(:black, 1.0))
  end

  # one tick at each density's baseline, i.e. at idx * spacing
  ax.yticks[] = (
    (1:nticks) .* spacing, string.(dens[!, group])
  )
  ylims!(ax, 0, (nticks + 2) * spacing)
  ax.xlabel[] = string(densvar)
  ax.ylabel[] = string(group)

  return ax
end

function ridgeplot!(f::Figure, args...; pos=(1, 1), kwargs...)
  ridgeplot!(Axis(f[pos...]), args...; kwargs...)
  return f
end

"""
    ridgeplot(args...; kwargs...)

See [`ridgeplot!`](@ref).
"""
function ridgeplot(args...; kwargs...)
  return ridgeplot!(Figure(), args...; kwargs...)
end

ridgeplot(parent(gdf), :age, :ntest)

parent(gdf)

108,295 rows × 6 columns

	Child	School	Cohort	Sex	age	ntest
	String	String	String	String	Float64	Int64
1	C002352	S100067	2013	male	7.99452	5
2	C002353	S100067	2013	male	7.99452	5
3	C002354	S100067	2013	male	7.99452	5
4	C002355	S100122	2013	female	7.99452	5
5	C002356	S100146	2013	male	7.99452	5
6	C002357	S100146	2013	male	7.99452	5
7	C002358	S100146	2013	male	7.99452	5
8	C002359	S100183	2013	female	7.99452	4
9	C002360	S100195	2013	female	7.99452	5
10	C002361	S100213	2013	male	7.99452	4
11	C002362	S100237	2013	female	7.99452	5
12	C002363	S100237	2013	female	7.99452	5
13	C002364	S100250	2013	female	7.99452	5
14	C002365	S100304	2013	male	7.99452	5
15	C002366	S100304	2013	male	7.99452	5
16	C002367	S100316	2013	female	7.99452	5
17	C002368	S100365	2013	male	7.99452	5
18	C002369	S100365	2013	male	7.99452	5
19	C002370	S100365	2013	female	7.99452	5
20	C002371	S100432	2013	female	7.99452	4
21	C002372	S100432	2013	male	7.99452	5
22	C002373	S100481	2013	male	7.99452	5
23	C002374	S100481	2013	male	7.99452	5
24	C002375	S100481	2013	female	7.99452	5
25	C002376	S100493	2013	female	7.99452	5
26	C002377	S100493	2013	female	7.99452	5
27	C002378	S100547	2013	male	7.99452	5
28	C002379	S100547	2013	male	7.99452	5
29	C002380	S100547	2013	male	7.99452	5
30	C002381	S100547	2013	female	7.99452	5
⋮	⋮	⋮	⋮	⋮	⋮	⋮

Reading Arrow files in other languages

There are Arrow implementations for R (the arrow package) and for Python (pyarrow).

#| eval: false
from pyarrow.feather import read_table
read_table("./data/fggk21.arrow")

#| eval: false
library("arrow")
fggk21 <- read_feather("./data/fggk21.arrow")
nrow(fggk21)

References

Fühner, T., Granacher, U., Golle, K., & Kliegl, R. (2021). Age and sex effects in physical fitness components of 108,295 third graders including 515 primary schools and 9 cohorts. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-97000-4