Saving data with Arrow

Author

Phillip Alday, Douglas Bates, and Reinhold Kliegl

Published

2024-06-27

1 The Arrow storage format

The Arrow storage format provides a language-agnostic storage and memory specification for columnar data tables, which just means “something that looks like a data frame in R”. That is, an arrow table is an ordered, named collection of columns, all of the same length.

The columns can be of different types including numeric values, character strings, and factor-like representations - called DictEncoded.

An Arrow file can be read or written from R, Python, Julia and many other languages. Somewhat confusingly in R and Python the name feather, which refers to an earlier version of the storage format, is used in some function names like read_feather.

Internally, the SMLP2023 package uses Arrow to store all of its datasets.

2 The Emotikon data

The SMLP2021 repository contains a version of the data from Fühner et al. (2021) in notebooks/data/fggk21.arrow. After that file was created there were changes in the master RDS file on the osf.io site for the project. We will recreate the Arrow file here then split it into two separate tables, one with a row for each child in the study and one with a row for each test result.

The Arrow package for Julia does not export any function names, which means that the function to read an Arrow file must be called as Arrow.Table. It returns a column table, as described in the Tables package. This is like a read-only data frame, which can be easily converted to a full-fledged DataFrame if desired.

This arrangement allows for the Arrow package not to depend on the DataFrames package, which is a heavy-weight dependency, but still easily produce a DataFrame if warranted.

Load the packages to be used.

Code
using AlgebraOfGraphics
using Arrow
using CairoMakie
using Chain
using DataFrameMacros
using DataFrames
using Downloads
using KernelDensity
using RCall   # access R from within Julia
using StatsBase

CairoMakie.activate!(; type="svg")
using AlgebraOfGraphics: density

3 Downloading and importing the RDS file

This is similar to some of the code shown by Julius Krumbiegel on Monday. In the data directory of the emotikon project on osf.io under Data, the url for the rds data file is found to be [https://osf.io/xawdb/]. Note that we want version 2 of this file.

fn = Downloads.download("https://osf.io/xawdb/download?version=2");
dfrm = rcopy(R"readRDS($fn)")
525126×7 DataFrame
525101 rows omitted
Row Cohort School Child Sex age Test score
Cat… Cat… Cat… Cat… Float64 Cat… Float64
1 2013 S100067 C002352 male 7.99452 S20_r 5.26316
2 2013 S100067 C002352 male 7.99452 BPT 3.7
3 2013 S100067 C002352 male 7.99452 SLJ 125.0
4 2013 S100067 C002352 male 7.99452 Star_r 2.47146
5 2013 S100067 C002352 male 7.99452 Run 1053.0
6 2013 S100067 C002353 male 7.99452 S20_r 5.0
7 2013 S100067 C002353 male 7.99452 BPT 4.1
8 2013 S100067 C002353 male 7.99452 SLJ 116.0
9 2013 S100067 C002353 male 7.99452 Star_r 1.76778
10 2013 S100067 C002353 male 7.99452 Run 1089.0
11 2013 S100067 C002354 male 7.99452 S20_r 4.54545
12 2013 S100067 C002354 male 7.99452 BPT 3.9
13 2013 S100067 C002354 male 7.99452 SLJ 111.0
525115 2018 S401470 C117964 male 9.10609 Star_r 1.63704
525116 2018 S401470 C117964 male 9.10609 Run 864.0
525117 2018 S401470 C117965 female 9.10609 S20_r 4.65116
525118 2018 S401470 C117965 female 9.10609 BPT 3.8
525119 2018 S401470 C117965 female 9.10609 SLJ 123.0
525120 2018 S401470 C117965 female 9.10609 Star_r 1.52889
525121 2018 S401470 C117965 female 9.10609 Run 1080.0
525122 2018 S800200 C117966 male 9.10609 S20_r 4.54545
525123 2018 S800200 C117966 male 9.10609 BPT 3.8
525124 2018 S800200 C117966 male 9.10609 SLJ 100.0
525125 2018 S800200 C117966 male 9.10609 Star_r 2.18506
525126 2018 S800200 C117966 male 9.10609 Run 990.0

Now write this file as a Arrow file and read it back in.

arrowfn = joinpath("data", "fggk21.arrow")
Arrow.write(arrowfn, dfrm; compress=:lz4)
tbl = Arrow.Table(arrowfn)
Arrow.Table with 525126 rows, 7 columns, and schema:
 :Cohort  String
 :School  String
 :Child   String
 :Sex     String
 :age     Float64
 :Test    String
 :score   Float64
filesize(arrowfn)
3077850
df = DataFrame(tbl)
525126×7 DataFrame
525101 rows omitted
Row Cohort School Child Sex age Test score
String String String String Float64 String Float64
1 2013 S100067 C002352 male 7.99452 S20_r 5.26316
2 2013 S100067 C002352 male 7.99452 BPT 3.7
3 2013 S100067 C002352 male 7.99452 SLJ 125.0
4 2013 S100067 C002352 male 7.99452 Star_r 2.47146
5 2013 S100067 C002352 male 7.99452 Run 1053.0
6 2013 S100067 C002353 male 7.99452 S20_r 5.0
7 2013 S100067 C002353 male 7.99452 BPT 4.1
8 2013 S100067 C002353 male 7.99452 SLJ 116.0
9 2013 S100067 C002353 male 7.99452 Star_r 1.76778
10 2013 S100067 C002353 male 7.99452 Run 1089.0
11 2013 S100067 C002354 male 7.99452 S20_r 4.54545
12 2013 S100067 C002354 male 7.99452 BPT 3.9
13 2013 S100067 C002354 male 7.99452 SLJ 111.0
525115 2018 S401470 C117964 male 9.10609 Star_r 1.63704
525116 2018 S401470 C117964 male 9.10609 Run 864.0
525117 2018 S401470 C117965 female 9.10609 S20_r 4.65116
525118 2018 S401470 C117965 female 9.10609 BPT 3.8
525119 2018 S401470 C117965 female 9.10609 SLJ 123.0
525120 2018 S401470 C117965 female 9.10609 Star_r 1.52889
525121 2018 S401470 C117965 female 9.10609 Run 1080.0
525122 2018 S800200 C117966 male 9.10609 S20_r 4.54545
525123 2018 S800200 C117966 male 9.10609 BPT 3.8
525124 2018 S800200 C117966 male 9.10609 SLJ 100.0
525125 2018 S800200 C117966 male 9.10609 Star_r 2.18506
525126 2018 S800200 C117966 male 9.10609 Run 990.0

4 Avoiding needless repetition

One of the principles of relational database design is that information should not be repeated needlessly. Each row of df is determined by a combination of Child and Test, together producing a score, which can be converted to a zScore.

The other columns in the table, Cohort, School, age, and Sex, are properties of the Child.

Storing these values redundantly in the full table takes up space but, more importantly, allows for inconsistency. As it stands, a given Child could be recorded as being in one Cohort for the Run test and in another Cohort for the S20_r test and nothing about the table would detect this as being an error.

The approach used in relational databases is to store the information for score in one table that contains only Child, Test and score, store the information for the Child in another table including Cohort, School, age and Sex. These tables can then be combined to create the table to be used for analysis by joining the different tables together.

The maintainers of the DataFrames package have put in a lot of work over the past few years to make joins quite efficient in Julia. Thus the processing penalty of reassembling the big table from three smaller tables is minimal.

It is important to note that the main advantage of using smaller tables that are joined together to produce the analysis table is the fact that the information in the analysis table is consistent by design.

5 Creating the smaller table

Child = unique(select(df, :Child, :School, :Cohort, :Sex, :age))
108295×5 DataFrame
108270 rows omitted
Row Child School Cohort Sex age
String String String String Float64
1 C002352 S100067 2013 male 7.99452
2 C002353 S100067 2013 male 7.99452
3 C002354 S100067 2013 male 7.99452
4 C002355 S100122 2013 female 7.99452
5 C002356 S100146 2013 male 7.99452
6 C002357 S100146 2013 male 7.99452
7 C002358 S100146 2013 male 7.99452
8 C002359 S100183 2013 female 7.99452
9 C002360 S100195 2013 female 7.99452
10 C002361 S100213 2013 male 7.99452
11 C002362 S100237 2013 female 7.99452
12 C002363 S100237 2013 female 7.99452
13 C002364 S100250 2013 female 7.99452
108284 C117943 S130539 2018 female 9.10609
108285 C117944 S130539 2018 male 9.10609
108286 C117945 S130539 2018 male 9.10609
108287 C117946 S130539 2018 male 9.10609
108288 C117956 S400580 2018 male 9.10609
108289 C117957 S400919 2018 male 9.10609
108290 C117958 S400919 2018 male 9.10609
108291 C117959 S400919 2018 male 9.10609
108292 C117962 S401250 2018 female 9.10609
108293 C117964 S401470 2018 male 9.10609
108294 C117965 S401470 2018 female 9.10609
108295 C117966 S800200 2018 male 9.10609
length(unique(Child.Child))  # should be 108295
108295
filesize(
  Arrow.write("./data/fggk21_Child.arrow", Child; compress=:lz4)
)
1774946
filesize(
  Arrow.write(
    "./data/fggk21_Score.arrow",
    select(df, :Child, :Test, :score);
    compress=:lz4,
  ),
)
2794058
Note

A careful examination of the file sizes versus that of ./data/fggk21.arrow will show that the separate tables combined take up more space than the original because of the compression. Compression algorithms are often more successful when applied to larger files.

Now read the Arrow tables in and reassemble the original table.

Score = DataFrame(Arrow.Table("./data/fggk21_Score.arrow"))
525126×3 DataFrame
525101 rows omitted
Row Child Test score
String String Float64
1 C002352 S20_r 5.26316
2 C002352 BPT 3.7
3 C002352 SLJ 125.0
4 C002352 Star_r 2.47146
5 C002352 Run 1053.0
6 C002353 S20_r 5.0
7 C002353 BPT 4.1
8 C002353 SLJ 116.0
9 C002353 Star_r 1.76778
10 C002353 Run 1089.0
11 C002354 S20_r 4.54545
12 C002354 BPT 3.9
13 C002354 SLJ 111.0
525115 C117964 Star_r 1.63704
525116 C117964 Run 864.0
525117 C117965 S20_r 4.65116
525118 C117965 BPT 3.8
525119 C117965 SLJ 123.0
525120 C117965 Star_r 1.52889
525121 C117965 Run 1080.0
525122 C117966 S20_r 4.54545
525123 C117966 BPT 3.8
525124 C117966 SLJ 100.0
525125 C117966 Star_r 2.18506
525126 C117966 Run 990.0

At this point we can create the z-score column by standardizing the scores for each Test. The code to do this follows Julius’s presentation on Monday.

@transform!(groupby(Score, :Test), :zScore = @bycol zscore(:score))
525126×4 DataFrame
525101 rows omitted
Row Child Test score zScore
String String Float64 Float64
1 C002352 S20_r 5.26316 1.7913
2 C002352 BPT 3.7 -0.0622317
3 C002352 SLJ 125.0 -0.0336567
4 C002352 Star_r 2.47146 1.46874
5 C002352 Run 1053.0 0.331058
6 C002353 S20_r 5.0 1.15471
7 C002353 BPT 4.1 0.498354
8 C002353 SLJ 116.0 -0.498822
9 C002353 Star_r 1.76778 -0.9773
10 C002353 Run 1089.0 0.574056
11 C002354 S20_r 4.54545 0.0551481
12 C002354 BPT 3.9 0.218061
13 C002354 SLJ 111.0 -0.757248
525115 C117964 Star_r 1.63704 -1.43175
525116 C117964 Run 864.0 -0.944681
525117 C117965 S20_r 4.65116 0.31086
525118 C117965 BPT 3.8 0.0779146
525119 C117965 SLJ 123.0 -0.137027
525120 C117965 Star_r 1.52889 -1.8077
525121 C117965 Run 1080.0 0.513306
525122 C117966 S20_r 4.54545 0.0551481
525123 C117966 BPT 3.8 0.0779146
525124 C117966 SLJ 100.0 -1.32578
525125 C117966 Star_r 2.18506 0.473217
525126 C117966 Run 990.0 -0.0941883
Child = DataFrame(Arrow.Table("./data/fggk21_Child.arrow"))
108295×5 DataFrame
108270 rows omitted
Row Child School Cohort Sex age
String String String String Float64
1 C002352 S100067 2013 male 7.99452
2 C002353 S100067 2013 male 7.99452
3 C002354 S100067 2013 male 7.99452
4 C002355 S100122 2013 female 7.99452
5 C002356 S100146 2013 male 7.99452
6 C002357 S100146 2013 male 7.99452
7 C002358 S100146 2013 male 7.99452
8 C002359 S100183 2013 female 7.99452
9 C002360 S100195 2013 female 7.99452
10 C002361 S100213 2013 male 7.99452
11 C002362 S100237 2013 female 7.99452
12 C002363 S100237 2013 female 7.99452
13 C002364 S100250 2013 female 7.99452
108284 C117943 S130539 2018 female 9.10609
108285 C117944 S130539 2018 male 9.10609
108286 C117945 S130539 2018 male 9.10609
108287 C117946 S130539 2018 male 9.10609
108288 C117956 S400580 2018 male 9.10609
108289 C117957 S400919 2018 male 9.10609
108290 C117958 S400919 2018 male 9.10609
108291 C117959 S400919 2018 male 9.10609
108292 C117962 S401250 2018 female 9.10609
108293 C117964 S401470 2018 male 9.10609
108294 C117965 S401470 2018 female 9.10609
108295 C117966 S800200 2018 male 9.10609
df1 = disallowmissing!(leftjoin(Score, Child; on=:Child))
525126×8 DataFrame
525101 rows omitted
Row Child Test score zScore School Cohort Sex age
String String Float64 Float64 String String String Float64
1 C002352 S20_r 5.26316 1.7913 S100067 2013 male 7.99452
2 C002352 BPT 3.7 -0.0622317 S100067 2013 male 7.99452
3 C002352 SLJ 125.0 -0.0336567 S100067 2013 male 7.99452
4 C002352 Star_r 2.47146 1.46874 S100067 2013 male 7.99452
5 C002352 Run 1053.0 0.331058 S100067 2013 male 7.99452
6 C002353 S20_r 5.0 1.15471 S100067 2013 male 7.99452
7 C002353 BPT 4.1 0.498354 S100067 2013 male 7.99452
8 C002353 SLJ 116.0 -0.498822 S100067 2013 male 7.99452
9 C002353 Star_r 1.76778 -0.9773 S100067 2013 male 7.99452
10 C002353 Run 1089.0 0.574056 S100067 2013 male 7.99452
11 C002354 S20_r 4.54545 0.0551481 S100067 2013 male 7.99452
12 C002354 BPT 3.9 0.218061 S100067 2013 male 7.99452
13 C002354 SLJ 111.0 -0.757248 S100067 2013 male 7.99452
525115 C117964 Star_r 1.63704 -1.43175 S401470 2018 male 9.10609
525116 C117964 Run 864.0 -0.944681 S401470 2018 male 9.10609
525117 C117965 S20_r 4.65116 0.31086 S401470 2018 female 9.10609
525118 C117965 BPT 3.8 0.0779146 S401470 2018 female 9.10609
525119 C117965 SLJ 123.0 -0.137027 S401470 2018 female 9.10609
525120 C117965 Star_r 1.52889 -1.8077 S401470 2018 female 9.10609
525121 C117965 Run 1080.0 0.513306 S401470 2018 female 9.10609
525122 C117966 S20_r 4.54545 0.0551481 S800200 2018 male 9.10609
525123 C117966 BPT 3.8 0.0779146 S800200 2018 male 9.10609
525124 C117966 SLJ 100.0 -1.32578 S800200 2018 male 9.10609
525125 C117966 Star_r 2.18506 0.473217 S800200 2018 male 9.10609
525126 C117966 Run 990.0 -0.0941883 S800200 2018 male 9.10609
Note

The call to disallowmissing! is because the join will create columns that allow for missing values but we know that we should not get missing values in the result. This call will fail if, for some reason, missing values were created.

6 Discovering patterns in the data

One of the motivations for creating the Child table was to be able to bin the ages according to the age of each child, not the age of each Child-Test combination. Not all children have all 5 test results. We can check the number of results by grouping on :Child and evaluate the number of rows in each group.

nobsChild = combine(groupby(Score, :Child), nrow => :ntest)
108295×2 DataFrame
108270 rows omitted
Row Child ntest
String Int64
1 C002352 5
2 C002353 5
3 C002354 5
4 C002355 5
5 C002356 5
6 C002357 5
7 C002358 5
8 C002359 4
9 C002360 5
10 C002361 4
11 C002362 5
12 C002363 5
13 C002364 5
108284 C117943 5
108285 C117944 5
108286 C117945 5
108287 C117946 5
108288 C117956 5
108289 C117957 4
108290 C117958 5
108291 C117959 5
108292 C117962 5
108293 C117964 5
108294 C117965 5
108295 C117966 5

Now create a table of the number of children with 1, 2, …, 5 test scores.

combine(groupby(nobsChild, :ntest), nrow)
5×2 DataFrame
Row ntest nrow
Int64 Int64
1 1 462
2 2 729
3 3 1739
4 4 8836
5 5 96529

A natural question at this point is whether there is something about those students who have few observations. For example, are they from only a few schools?

One approach to examining properties like is to add the number of observations for each child to the :Child table. Later we can group the table according to this :ntest to look at properties of :Child by :ntest.

gdf = groupby(
  disallowmissing!(leftjoin(Child, nobsChild; on=:Child)), :ntest
)

GroupedDataFrame with 5 groups based on key: ntest

First Group (462 rows): ntest = 1
437 rows omitted
Row Child School Cohort Sex age ntest
String String String String Float64 Int64
1 C002452 S101175 2013 male 7.99452 1
2 C002625 S103329 2013 male 7.99452 1
3 C002754 S104814 2013 female 7.99452 1
4 C003269 S102258 2012 female 7.99726 1
5 C003599 S105843 2012 female 7.99726 1
6 C003807 S100754 2011 male 8.0 1
7 C003985 S102945 2011 male 8.0 1
8 C004086 S104255 2011 male 8.0 1
9 C004657 S101400 2014 male 8.03833 1
10 C005036 S105909 2014 male 8.03833 1
11 C005440 S101023 2019 male 8.05202 1
12 C005523 S101825 2019 female 8.05202 1
13 C005697 S103615 2019 male 8.05202 1
451 C112638 S103718 2015 female 9.0486 1
452 C114749 S112938 2017 male 9.06502 1
453 C115460 S101953 2015 male 9.08145 1
454 C115569 S100572 2017 male 9.08419 1
455 C115587 S100754 2017 female 9.08419 1
456 C117108 S103196 2018 female 9.10609 1
457 C117229 S103615 2018 male 9.10609 1
458 C117230 S103615 2018 male 9.10609 1
459 C117419 S104954 2018 female 9.10609 1
460 C117437 S105004 2018 male 9.10609 1
461 C117539 S105636 2018 male 9.10609 1
462 C117659 S106483 2018 female 9.10609 1

Last Group (96529 rows): ntest = 5
96504 rows omitted
Row Child School Cohort Sex age ntest
String String String String Float64 Int64
1 C002352 S100067 2013 male 7.99452 5
2 C002353 S100067 2013 male 7.99452 5
3 C002354 S100067 2013 male 7.99452 5
4 C002355 S100122 2013 female 7.99452 5
5 C002356 S100146 2013 male 7.99452 5
6 C002357 S100146 2013 male 7.99452 5
7 C002358 S100146 2013 male 7.99452 5
8 C002360 S100195 2013 female 7.99452 5
9 C002362 S100237 2013 female 7.99452 5
10 C002363 S100237 2013 female 7.99452 5
11 C002364 S100250 2013 female 7.99452 5
12 C002365 S100304 2013 male 7.99452 5
13 C002366 S100304 2013 male 7.99452 5
96518 C117942 S130539 2018 male 9.10609 5
96519 C117943 S130539 2018 female 9.10609 5
96520 C117944 S130539 2018 male 9.10609 5
96521 C117945 S130539 2018 male 9.10609 5
96522 C117946 S130539 2018 male 9.10609 5
96523 C117956 S400580 2018 male 9.10609 5
96524 C117958 S400919 2018 male 9.10609 5
96525 C117959 S400919 2018 male 9.10609 5
96526 C117962 S401250 2018 female 9.10609 5
96527 C117964 S401470 2018 male 9.10609 5
96528 C117965 S401470 2018 female 9.10609 5
96529 C117966 S800200 2018 male 9.10609 5

Are the sexes represented more-or-less equally?

combine(groupby(first(gdf), :Sex), nrow => :nchild)
2×2 DataFrame
Row Sex nchild
String Int64
1 male 230
2 female 232
combine(groupby(last(gdf), :Sex), nrow => :nchild)
2×2 DataFrame
Row Sex nchild
String Int64
1 male 47552
2 female 48977

7 Reading Arrow files in other languages

There are Arrow implementations for R (the arrow package) and for Python (pyarrow).

#| eval: false
import pyarrow.feather: read_table
read_table("./data/fggk21.arrow")
#| eval: false
library("arrow")
fggk21 <- read_feather("./data/fggk21.arrow")
nrow(fggk21)

8 References

Fühner, T., Granacher, U., Golle, K., & Kliegl, R. (2021). Age and sex effects in physical fitness components of 108,295 third graders including 515 primary schools and 9 cohorts. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-97000-4
Back to top