The Arrow storage format provides a language-agnostic storage and memory specification for columnar data tables, which just means “something that looks like a data frame in R”. That is, an arrow table is an ordered, named collection of columns, all of the same length.
The columns can be of different types including numeric values, character strings, and factor-like representations - called DictEncoded.
An Arrow file can be read or written from R, Python, Julia and many other languages. Somewhat confusingly in R and Python the name feather, which refers to an earlier version of the storage format, is used in some function names like read_feather.
Internally, the SMLP2026 package uses Arrow to store all of its datasets.
2 The Emotikon data
The SMLP2021 repository contains a version of the data from @Fuehner2021 in notebooks/data/fggk21.arrow. After that file was created there were changes in the master RDS file on the osf.io site for the project. We will recreate the Arrow file here then split it into two separate tables, one with a row for each child in the study and one with a row for each test result.
The Arrow package for Julia does not export any function names, which means that the function to read an Arrow file must be called as Arrow.Table. It returns a column table, as described in the Tables package. This is like a read-only data frame, which can be easily converted to a full-fledged DataFrame if desired.
This arrangement allows for the Arrow package not to depend on the DataFrames package, which is a heavy-weight dependency, but still easily produce a DataFrame if warranted.
Load the packages to be used.
Code
usingAlgebraOfGraphicsusingArrowusingCairoMakieusingChainusingDataFrameMacrosusingDataFramesusingDownloadsusingKernelDensityusingRData# load RDS filesusingSMLP2026usingStatsBaseCairoMakie.activate!(; type="svg")usingAlgebraOfGraphics: density
3 Downloading and importing the RDS file
This is similar to some of the code shown by Julius Krumbiegel on Monday. In the data directory of the Emotikon project on osf.io under Data, the url for the rds data file is found to be [https://osf.io/xawdb/]. Note that we want version 2 of this file.
# downloads the file to a temporary directoryfn =joinpath(DATADIR, "emotikon.rds")if !isfile(fn)@info"Downloading Emotikon data"Downloads.download("https://osf.io/xawdb/download?version=2", fn)end
dfrm =load(fn)
525126×7 DataFrame
525101 rows omitted
Row
Cohort
School
Child
Sex
age
Test
score
Cat…
Cat…
Cat…
Cat…
Float64
Cat…
Float64
1
2013
S100067
C002352
male
7.99452
S20_r
5.26316
2
2013
S100067
C002352
male
7.99452
BPT
3.7
3
2013
S100067
C002352
male
7.99452
SLJ
125.0
4
2013
S100067
C002352
male
7.99452
Star_r
2.47146
5
2013
S100067
C002352
male
7.99452
Run
1053.0
6
2013
S100067
C002353
male
7.99452
S20_r
5.0
7
2013
S100067
C002353
male
7.99452
BPT
4.1
8
2013
S100067
C002353
male
7.99452
SLJ
116.0
9
2013
S100067
C002353
male
7.99452
Star_r
1.76778
10
2013
S100067
C002353
male
7.99452
Run
1089.0
11
2013
S100067
C002354
male
7.99452
S20_r
4.54545
12
2013
S100067
C002354
male
7.99452
BPT
3.9
13
2013
S100067
C002354
male
7.99452
SLJ
111.0
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
525115
2018
S401470
C117964
male
9.10609
Star_r
1.63704
525116
2018
S401470
C117964
male
9.10609
Run
864.0
525117
2018
S401470
C117965
female
9.10609
S20_r
4.65116
525118
2018
S401470
C117965
female
9.10609
BPT
3.8
525119
2018
S401470
C117965
female
9.10609
SLJ
123.0
525120
2018
S401470
C117965
female
9.10609
Star_r
1.52889
525121
2018
S401470
C117965
female
9.10609
Run
1080.0
525122
2018
S800200
C117966
male
9.10609
S20_r
4.54545
525123
2018
S800200
C117966
male
9.10609
BPT
3.8
525124
2018
S800200
C117966
male
9.10609
SLJ
100.0
525125
2018
S800200
C117966
male
9.10609
Star_r
2.18506
525126
2018
S800200
C117966
male
9.10609
Run
990.0
Now write this file as a Arrow file and read it back in.
One of the principles of relational database design is that information should not be repeated needlessly. Each row of df is determined by a combination of Child and Test, together producing a score, which can be converted to a zScore.
The other columns in the table, Cohort, School, age, and Sex, are properties of the Child.
Storing these values redundantly in the full table takes up space but, more importantly, allows for inconsistency. As it stands, a given Child could be recorded as being in one Cohort for the Run test and in another Cohort for the S20_r test and nothing about the table would detect this as being an error.
The approach used in relational databases is to store the information for score in one table that contains only Child, Test and score, store the information for the Child in another table including Cohort, School, age and Sex. These tables can then be combined to create the table to be used for analysis by joining the different tables together.
The maintainers of the DataFrames package have put in a lot of work over the past few years to make joins quite efficient in Julia. Thus the processing penalty of reassembling the big table from three smaller tables is minimal.
It is important to note that the main advantage of using smaller tables that are joined together to produce the analysis table is the fact that the information in the analysis table is consistent by design.
A careful examination of the file sizes versus that of ./data/fggk21.arrow will show that the separate tables combined take up more space than the original because of the compression. Compression algorithms are often more successful when applied to larger files.
Now read the Arrow tables in and reassemble the original table.
The call to disallowmissing! is because the join will create columns that allow for missing values but we know that we should not get missing values in the result. This call will fail if, for some reason, missing values were created.
6 Discovering patterns in the data
One of the motivations for creating the Child table was to be able to bin the ages according to the age of each child, not the age of each Child-Test combination. Not all children have all 5 test results. We can check the number of results by grouping on :Child and evaluate the number of rows in each group.
Now create a table of the number of children with 1, 2, …, 5 test scores.
combine(groupby(nobsChild, :ntest), nrow)
5×2 DataFrame
Row
ntest
nrow
Int64
Int64
1
1
462
2
2
729
3
3
1739
4
4
8836
5
5
96529
A natural question at this point is whether there is something about those students who have few observations. For example, are they from only a few schools?
One approach to examining properties like is to add the number of observations for each child to the :Child table. Later we can group the table according to this :ntest to look at properties of :Child by :ntest.