Saving data with Arrow

Published

2026-04-10

1 The Arrow storage format

The Arrow storage format provides a language-agnostic storage and memory specification for columnar data tables, which just means “something that looks like a data frame in R”. That is, an arrow table is an ordered, named collection of columns, all of the same length.

The columns can be of different types including numeric values, character strings, and factor-like representations - called DictEncoded.

An Arrow file can be read or written from R, Python, Julia and many other languages. Somewhat confusingly in R and Python the name feather, which refers to an earlier version of the storage format, is used in some function names like read_feather.

Internally, the SMLP2026 package uses Arrow to store all of its datasets.

2 The Emotikon data

The SMLP2021 repository contains a version of the data from @Fuehner2021 in notebooks/data/fggk21.arrow. After that file was created there were changes in the master RDS file on the osf.io site for the project. We will recreate the Arrow file here then split it into two separate tables, one with a row for each child in the study and one with a row for each test result.

The Arrow package for Julia does not export any function names, which means that the function to read an Arrow file must be called as Arrow.Table. It returns a column table, as described in the Tables package. This is like a read-only data frame, which can be easily converted to a full-fledged DataFrame if desired.

This arrangement allows for the Arrow package not to depend on the DataFrames package, which is a heavy-weight dependency, but still easily produce a DataFrame if warranted.

Load the packages to be used.

Code

using AlgebraOfGraphics
using Arrow
using CairoMakie
using Chain
using DataFrameMacros
using DataFrames
using Downloads
using KernelDensity
using RData   # load RDS files
using SMLP2026
using StatsBase

CairoMakie.activate!(; type="svg")
using AlgebraOfGraphics: density

3 Downloading and importing the RDS file

This is similar to some of the code shown by Julius Krumbiegel on Monday. In the data directory of the Emotikon project on osf.io under Data, the url for the rds data file is found to be [https://osf.io/xawdb/]. Note that we want version 2 of this file.

# downloads the file to a temporary directory
fn = joinpath(DATADIR, "emotikon.rds")
if !isfile(fn)
  @info "Downloading Emotikon data"
  Downloads.download("https://osf.io/xawdb/download?version=2", fn)
end

dfrm = load(fn)

525126×7 DataFrame

525101 rows omitted

Row	Cohort	School	Child	Sex	age	Test	score
	Cat…	Cat…	Cat…	Cat…	Float64	Cat…	Float64
1	2013	S100067	C002352	male	7.99452	S20_r	5.26316
2	2013	S100067	C002352	male	7.99452	BPT	3.7
3	2013	S100067	C002352	male	7.99452	SLJ	125.0
4	2013	S100067	C002352	male	7.99452	Star_r	2.47146
5	2013	S100067	C002352	male	7.99452	Run	1053.0
6	2013	S100067	C002353	male	7.99452	S20_r	5.0
7	2013	S100067	C002353	male	7.99452	BPT	4.1
8	2013	S100067	C002353	male	7.99452	SLJ	116.0
9	2013	S100067	C002353	male	7.99452	Star_r	1.76778
10	2013	S100067	C002353	male	7.99452	Run	1089.0
11	2013	S100067	C002354	male	7.99452	S20_r	4.54545
12	2013	S100067	C002354	male	7.99452	BPT	3.9
13	2013	S100067	C002354	male	7.99452	SLJ	111.0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
525115	2018	S401470	C117964	male	9.10609	Star_r	1.63704
525116	2018	S401470	C117964	male	9.10609	Run	864.0
525117	2018	S401470	C117965	female	9.10609	S20_r	4.65116
525118	2018	S401470	C117965	female	9.10609	BPT	3.8
525119	2018	S401470	C117965	female	9.10609	SLJ	123.0
525120	2018	S401470	C117965	female	9.10609	Star_r	1.52889
525121	2018	S401470	C117965	female	9.10609	Run	1080.0
525122	2018	S800200	C117966	male	9.10609	S20_r	4.54545
525123	2018	S800200	C117966	male	9.10609	BPT	3.8
525124	2018	S800200	C117966	male	9.10609	SLJ	100.0
525125	2018	S800200	C117966	male	9.10609	Star_r	2.18506
525126	2018	S800200	C117966	male	9.10609	Run	990.0

Now write this file as a Arrow file and read it back in.

arrowfn = joinpath(DATADIR, "fggk21.arrow")
Arrow.write(arrowfn, dfrm; compress=:lz4)
tbl = Arrow.Table(arrowfn)

Arrow.Table with 525126 rows, 7 columns, and schema:
 :Cohort  CategoricalArrays.CategoricalValue{String, UInt8}
 :School  CategoricalArrays.CategoricalValue{String, UInt16}
 :Child   CategoricalArrays.CategoricalValue{String, UInt32}
 :Sex     CategoricalArrays.CategoricalValue{String, UInt8}
 :age     Float64
 :Test    CategoricalArrays.CategoricalValue{String, UInt8}
 :score   Float64

filesize(arrowfn)

df = DataFrame(tbl)

525126×7 DataFrame

525101 rows omitted

Row	Cohort	School	Child	Sex	age	Test	score
	Cat…	Cat…	Cat…	Cat…	Float64	Cat…	Float64
1	2013	S100067	C002352	male	7.99452	S20_r	5.26316
2	2013	S100067	C002352	male	7.99452	BPT	3.7
3	2013	S100067	C002352	male	7.99452	SLJ	125.0
4	2013	S100067	C002352	male	7.99452	Star_r	2.47146
5	2013	S100067	C002352	male	7.99452	Run	1053.0
6	2013	S100067	C002353	male	7.99452	S20_r	5.0
7	2013	S100067	C002353	male	7.99452	BPT	4.1
8	2013	S100067	C002353	male	7.99452	SLJ	116.0
9	2013	S100067	C002353	male	7.99452	Star_r	1.76778
10	2013	S100067	C002353	male	7.99452	Run	1089.0
11	2013	S100067	C002354	male	7.99452	S20_r	4.54545
12	2013	S100067	C002354	male	7.99452	BPT	3.9
13	2013	S100067	C002354	male	7.99452	SLJ	111.0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
525115	2018	S401470	C117964	male	9.10609	Star_r	1.63704
525116	2018	S401470	C117964	male	9.10609	Run	864.0
525117	2018	S401470	C117965	female	9.10609	S20_r	4.65116
525118	2018	S401470	C117965	female	9.10609	BPT	3.8
525119	2018	S401470	C117965	female	9.10609	SLJ	123.0
525120	2018	S401470	C117965	female	9.10609	Star_r	1.52889
525121	2018	S401470	C117965	female	9.10609	Run	1080.0
525122	2018	S800200	C117966	male	9.10609	S20_r	4.54545
525123	2018	S800200	C117966	male	9.10609	BPT	3.8
525124	2018	S800200	C117966	male	9.10609	SLJ	100.0
525125	2018	S800200	C117966	male	9.10609	Star_r	2.18506
525126	2018	S800200	C117966	male	9.10609	Run	990.0

4 Avoiding needless repetition

One of the principles of relational database design is that information should not be repeated needlessly. Each row of df is determined by a combination of Child and Test, together producing a score, which can be converted to a zScore.

The other columns in the table, Cohort, School, age, and Sex, are properties of the Child.

Storing these values redundantly in the full table takes up space but, more importantly, allows for inconsistency. As it stands, a given Child could be recorded as being in one Cohort for the Run test and in another Cohort for the S20_r test and nothing about the table would detect this as being an error.

The approach used in relational databases is to store the information for score in one table that contains only Child, Test and score, store the information for the Child in another table including Cohort, School, age and Sex. These tables can then be combined to create the table to be used for analysis by joining the different tables together.

The maintainers of the DataFrames package have put in a lot of work over the past few years to make joins quite efficient in Julia. Thus the processing penalty of reassembling the big table from three smaller tables is minimal.

It is important to note that the main advantage of using smaller tables that are joined together to produce the analysis table is the fact that the information in the analysis table is consistent by design.

5 Creating the smaller table

Child = unique(select(df, :Child, :School, :Cohort, :Sex, :age))

108295×5 DataFrame

108270 rows omitted

Row	Child	School	Cohort	Sex	age
	Cat…	Cat…	Cat…	Cat…	Float64
1	C002352	S100067	2013	male	7.99452
2	C002353	S100067	2013	male	7.99452
3	C002354	S100067	2013	male	7.99452
4	C002355	S100122	2013	female	7.99452
5	C002356	S100146	2013	male	7.99452
6	C002357	S100146	2013	male	7.99452
7	C002358	S100146	2013	male	7.99452
8	C002359	S100183	2013	female	7.99452
9	C002360	S100195	2013	female	7.99452
10	C002361	S100213	2013	male	7.99452
11	C002362	S100237	2013	female	7.99452
12	C002363	S100237	2013	female	7.99452
13	C002364	S100250	2013	female	7.99452
⋮	⋮	⋮	⋮	⋮	⋮
108284	C117943	S130539	2018	female	9.10609
108285	C117944	S130539	2018	male	9.10609
108286	C117945	S130539	2018	male	9.10609
108287	C117946	S130539	2018	male	9.10609
108288	C117956	S400580	2018	male	9.10609
108289	C117957	S400919	2018	male	9.10609
108290	C117958	S400919	2018	male	9.10609
108291	C117959	S400919	2018	male	9.10609
108292	C117962	S401250	2018	female	9.10609
108293	C117964	S401470	2018	male	9.10609
108294	C117965	S401470	2018	female	9.10609
108295	C117966	S800200	2018	male	9.10609

length(unique(Child.Child))  # should be 108295

filesize(
  Arrow.write(joinpath(DATADIR, "fggk21_Child.arrow"), Child; compress=:lz4)
)

filesize(
  Arrow.write(
    joinpath(DATADIR, "fggk21_Score.arrow"),
    select(df, :Child, :Test, :score);
    compress=:lz4,
  ),
)

Note

A careful examination of the file sizes versus that of ./data/fggk21.arrow will show that the separate tables combined take up more space than the original because of the compression. Compression algorithms are often more successful when applied to larger files.

Now read the Arrow tables in and reassemble the original table.

Score = DataFrame(Arrow.Table(joinpath(DATADIR, "fggk21_Score.arrow")))

525126×3 DataFrame

525101 rows omitted

Row	Child	Test	score
	Cat…	Cat…	Float64
1	C002352	S20_r	5.26316
2	C002352	BPT	3.7
3	C002352	SLJ	125.0
4	C002352	Star_r	2.47146
5	C002352	Run	1053.0
6	C002353	S20_r	5.0
7	C002353	BPT	4.1
8	C002353	SLJ	116.0
9	C002353	Star_r	1.76778
10	C002353	Run	1089.0
11	C002354	S20_r	4.54545
12	C002354	BPT	3.9
13	C002354	SLJ	111.0
⋮	⋮	⋮	⋮
525115	C117964	Star_r	1.63704
525116	C117964	Run	864.0
525117	C117965	S20_r	4.65116
525118	C117965	BPT	3.8
525119	C117965	SLJ	123.0
525120	C117965	Star_r	1.52889
525121	C117965	Run	1080.0
525122	C117966	S20_r	4.54545
525123	C117966	BPT	3.8
525124	C117966	SLJ	100.0
525125	C117966	Star_r	2.18506
525126	C117966	Run	990.0

At this point we can create the z-score column by standardizing the scores for each Test. The code to do this follows Julius’s presentation on Monday.

@transform!(groupby(Score, :Test), :zScore = @bycol zscore(:score))

525126×4 DataFrame

525101 rows omitted

Row	Child	Test	score	zScore
	Cat…	Cat…	Float64	Float64
1	C002352	S20_r	5.26316	1.7913
2	C002352	BPT	3.7	-0.0622317
3	C002352	SLJ	125.0	-0.0336567
4	C002352	Star_r	2.47146	1.46874
5	C002352	Run	1053.0	0.331058
6	C002353	S20_r	5.0	1.15471
7	C002353	BPT	4.1	0.498354
8	C002353	SLJ	116.0	-0.498822
9	C002353	Star_r	1.76778	-0.9773
10	C002353	Run	1089.0	0.574056
11	C002354	S20_r	4.54545	0.0551481
12	C002354	BPT	3.9	0.218061
13	C002354	SLJ	111.0	-0.757248
⋮	⋮	⋮	⋮	⋮
525115	C117964	Star_r	1.63704	-1.43175
525116	C117964	Run	864.0	-0.944681
525117	C117965	S20_r	4.65116	0.31086
525118	C117965	BPT	3.8	0.0779146
525119	C117965	SLJ	123.0	-0.137027
525120	C117965	Star_r	1.52889	-1.8077
525121	C117965	Run	1080.0	0.513306
525122	C117966	S20_r	4.54545	0.0551481
525123	C117966	BPT	3.8	0.0779146
525124	C117966	SLJ	100.0	-1.32578
525125	C117966	Star_r	2.18506	0.473217
525126	C117966	Run	990.0	-0.0941883

Child = DataFrame(Arrow.Table(joinpath(DATADIR, "fggk21_Child.arrow")))

108295×5 DataFrame

108270 rows omitted

Row	Child	School	Cohort	Sex	age
	Cat…	Cat…	Cat…	Cat…	Float64
1	C002352	S100067	2013	male	7.99452
2	C002353	S100067	2013	male	7.99452
3	C002354	S100067	2013	male	7.99452
4	C002355	S100122	2013	female	7.99452
5	C002356	S100146	2013	male	7.99452
6	C002357	S100146	2013	male	7.99452
7	C002358	S100146	2013	male	7.99452
8	C002359	S100183	2013	female	7.99452
9	C002360	S100195	2013	female	7.99452
10	C002361	S100213	2013	male	7.99452
11	C002362	S100237	2013	female	7.99452
12	C002363	S100237	2013	female	7.99452
13	C002364	S100250	2013	female	7.99452
⋮	⋮	⋮	⋮	⋮	⋮
108284	C117943	S130539	2018	female	9.10609
108285	C117944	S130539	2018	male	9.10609
108286	C117945	S130539	2018	male	9.10609
108287	C117946	S130539	2018	male	9.10609
108288	C117956	S400580	2018	male	9.10609
108289	C117957	S400919	2018	male	9.10609
108290	C117958	S400919	2018	male	9.10609
108291	C117959	S400919	2018	male	9.10609
108292	C117962	S401250	2018	female	9.10609
108293	C117964	S401470	2018	male	9.10609
108294	C117965	S401470	2018	female	9.10609
108295	C117966	S800200	2018	male	9.10609

df1 = disallowmissing!(leftjoin(Score, Child; on=:Child))

525126×8 DataFrame

525101 rows omitted

Row	Child	Test	score	zScore	School	Cohort	Sex	age
	Cat…	Cat…	Float64	Float64	Cat…	Cat…	Cat…	Float64
1	C002352	S20_r	5.26316	1.7913	S100067	2013	male	7.99452
2	C002352	BPT	3.7	-0.0622317	S100067	2013	male	7.99452
3	C002352	SLJ	125.0	-0.0336567	S100067	2013	male	7.99452
4	C002352	Star_r	2.47146	1.46874	S100067	2013	male	7.99452
5	C002352	Run	1053.0	0.331058	S100067	2013	male	7.99452
6	C002353	S20_r	5.0	1.15471	S100067	2013	male	7.99452
7	C002353	BPT	4.1	0.498354	S100067	2013	male	7.99452
8	C002353	SLJ	116.0	-0.498822	S100067	2013	male	7.99452
9	C002353	Star_r	1.76778	-0.9773	S100067	2013	male	7.99452
10	C002353	Run	1089.0	0.574056	S100067	2013	male	7.99452
11	C002354	S20_r	4.54545	0.0551481	S100067	2013	male	7.99452
12	C002354	BPT	3.9	0.218061	S100067	2013	male	7.99452
13	C002354	SLJ	111.0	-0.757248	S100067	2013	male	7.99452
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
525115	C117964	Star_r	1.63704	-1.43175	S401470	2018	male	9.10609
525116	C117964	Run	864.0	-0.944681	S401470	2018	male	9.10609
525117	C117965	S20_r	4.65116	0.31086	S401470	2018	female	9.10609
525118	C117965	BPT	3.8	0.0779146	S401470	2018	female	9.10609
525119	C117965	SLJ	123.0	-0.137027	S401470	2018	female	9.10609
525120	C117965	Star_r	1.52889	-1.8077	S401470	2018	female	9.10609
525121	C117965	Run	1080.0	0.513306	S401470	2018	female	9.10609
525122	C117966	S20_r	4.54545	0.0551481	S800200	2018	male	9.10609
525123	C117966	BPT	3.8	0.0779146	S800200	2018	male	9.10609
525124	C117966	SLJ	100.0	-1.32578	S800200	2018	male	9.10609
525125	C117966	Star_r	2.18506	0.473217	S800200	2018	male	9.10609
525126	C117966	Run	990.0	-0.0941883	S800200	2018	male	9.10609

Note

The call to disallowmissing! is because the join will create columns that allow for missing values but we know that we should not get missing values in the result. This call will fail if, for some reason, missing values were created.

6 Discovering patterns in the data

One of the motivations for creating the Child table was to be able to bin the ages according to the age of each child, not the age of each Child-Test combination. Not all children have all 5 test results. We can check the number of results by grouping on :Child and evaluate the number of rows in each group.

nobsChild = combine(groupby(Score, :Child), nrow => :ntest)

108295×2 DataFrame

108270 rows omitted

Row	Child	ntest
	Cat…	Int64
1	C002352	5
2	C002353	5
3	C002354	5
4	C002355	5
5	C002356	5
6	C002357	5
7	C002358	5
8	C002359	4
9	C002360	5
10	C002361	4
11	C002362	5
12	C002363	5
13	C002364	5
⋮	⋮	⋮
108284	C117943	5
108285	C117944	5
108286	C117945	5
108287	C117946	5
108288	C117956	5
108289	C117957	4
108290	C117958	5
108291	C117959	5
108292	C117962	5
108293	C117964	5
108294	C117965	5
108295	C117966	5

Now create a table of the number of children with 1, 2, …, 5 test scores.

combine(groupby(nobsChild, :ntest), nrow)

5×2 DataFrame

Row	ntest	nrow
	Int64	Int64
1	1	462
2	2	729
3	3	1739
4	4	8836
5	5	96529

A natural question at this point is whether there is something about those students who have few observations. For example, are they from only a few schools?

One approach to examining properties like is to add the number of observations for each child to the :Child table. Later we can group the table according to this :ntest to look at properties of :Child by :ntest.

gdf = groupby(
  disallowmissing!(leftjoin(Child, nobsChild; on=:Child)), :ntest
)

GroupedDataFrame with 5 groups based on key: ntest

First Group (462 rows): ntest = 1

437 rows omitted

Row	Child	School	Cohort	Sex	age	ntest
	Cat…	Cat…	Cat…	Cat…	Float64	Int64
1	C002452	S101175	2013	male	7.99452	1
2	C002625	S103329	2013	male	7.99452	1
3	C002754	S104814	2013	female	7.99452	1
4	C003269	S102258	2012	female	7.99726	1
5	C003599	S105843	2012	female	7.99726	1
6	C003807	S100754	2011	male	8.0	1
7	C003985	S102945	2011	male	8.0	1
8	C004086	S104255	2011	male	8.0	1
9	C004657	S101400	2014	male	8.03833	1
10	C005036	S105909	2014	male	8.03833	1
11	C005440	S101023	2019	male	8.05202	1
12	C005523	S101825	2019	female	8.05202	1
13	C005697	S103615	2019	male	8.05202	1
⋮	⋮	⋮	⋮	⋮	⋮	⋮
451	C112638	S103718	2015	female	9.0486	1
452	C114749	S112938	2017	male	9.06502	1
453	C115460	S101953	2015	male	9.08145	1
454	C115569	S100572	2017	male	9.08419	1
455	C115587	S100754	2017	female	9.08419	1
456	C117108	S103196	2018	female	9.10609	1
457	C117229	S103615	2018	male	9.10609	1
458	C117230	S103615	2018	male	9.10609	1
459	C117419	S104954	2018	female	9.10609	1
460	C117437	S105004	2018	male	9.10609	1
461	C117539	S105636	2018	male	9.10609	1
462	C117659	S106483	2018	female	9.10609	1

⋮

Last Group (96529 rows): ntest = 5

96504 rows omitted

Row	Child	School	Cohort	Sex	age	ntest
	Cat…	Cat…	Cat…	Cat…	Float64	Int64
1	C002352	S100067	2013	male	7.99452	5
2	C002353	S100067	2013	male	7.99452	5
3	C002354	S100067	2013	male	7.99452	5
4	C002355	S100122	2013	female	7.99452	5
5	C002356	S100146	2013	male	7.99452	5
6	C002357	S100146	2013	male	7.99452	5
7	C002358	S100146	2013	male	7.99452	5
8	C002360	S100195	2013	female	7.99452	5
9	C002362	S100237	2013	female	7.99452	5
10	C002363	S100237	2013	female	7.99452	5
11	C002364	S100250	2013	female	7.99452	5
12	C002365	S100304	2013	male	7.99452	5
13	C002366	S100304	2013	male	7.99452	5
⋮	⋮	⋮	⋮	⋮	⋮	⋮
96518	C117942	S130539	2018	male	9.10609	5
96519	C117943	S130539	2018	female	9.10609	5
96520	C117944	S130539	2018	male	9.10609	5
96521	C117945	S130539	2018	male	9.10609	5
96522	C117946	S130539	2018	male	9.10609	5
96523	C117956	S400580	2018	male	9.10609	5
96524	C117958	S400919	2018	male	9.10609	5
96525	C117959	S400919	2018	male	9.10609	5
96526	C117962	S401250	2018	female	9.10609	5
96527	C117964	S401470	2018	male	9.10609	5
96528	C117965	S401470	2018	female	9.10609	5
96529	C117966	S800200	2018	male	9.10609	5

Are the sexes represented more-or-less equally?

combine(groupby(first(gdf), :Sex), nrow => :nchild)

2×2 DataFrame

Row	Sex	nchild
	Cat…	Int64
1	male	230
2	female	232

combine(groupby(last(gdf), :Sex), nrow => :nchild)

2×2 DataFrame

Row	Sex	nchild
	Cat…	Int64
1	male	47552
2	female	48977

7 Reading Arrow files in other languages

There are Arrow implementations for R (the arrow package) and for Python (pyarrow).

import pyarrow.feather: read_table
read_table("fggk21.arrow")

library(arrow)
read_ipc_file("fggk21.arrow")

8 References

This page was rendered from git revision 0e399c4 using Quarto 1.9.36.