Notes on saved data files

Author

Phillip Alday, Douglas Bates, and Reinhold Kliegl

Published

2022-09-27

The Arrow storage format

The Arrow storage format provides a language-agnostic storage and memory specification for columnar data tables, which just means “something that looks like a data frame in R”. That is, an arrow table is an ordered, named collection of columns, all of the same length.

The columns can be of different types including numeric values, character strings, and factor-like representations - called DictEncoded.

An Arrow file can be read or written from R, Python, Julia and many other languages. Somewhat confusingly in R and Python the name feather, which refers to an earlier version of the storage format, is used in some function names like read_feather.

The Emotikon data

The SMLP2021 repository contains a version of the data from Fühner et al. (2021) in notebooks/data/fggk21.arrow. After that file was created there were changes in the master RDS file on the osf.io site for the project. We will recreate the Arrow file here then split it into two separate tables, one with a row for each child in the study and one with a row for each test result.

The Arrow package for Julia does not export any function names, which means that the function to read an Arrow file must be called as Arrow.Table. It returns a column table, as described in the Tables package. This is like a read-only data frame, which can be easily converted to a full-fledged DataFrame if desired.

This arrangement allows for the Arrow package not to depend on the DataFrames package, which is a heavy-weight dependency, but still easily produce a DataFrame if warranted.

Load the packages to be used.

Code
using AlgebraOfGraphics
using Arrow
using CairoMakie
using Chain
using DataFrameMacros
using DataFrames
using Downloads
using KernelDensity
using RCall   # access R from within Julia
using StatsBase

CairoMakie.activate!(; type="svg")
using AlgebraOfGraphics: density

Downloading and importing the RDS file

This is similar to some of the code shown by Julius Krumbiegel on Monday. In the data directory of the emotikon project on osf.io under Data, the url for the rds data file is found to be [https://osf.io/xawdb/]. Note that we want version 2 of this file.

fn = Downloads.download("https://osf.io/xawdb/download?version=2");
dfrm = rcopy(R"readRDS($fn)")

525,126 rows × 7 columns

CohortSchoolChildSexageTestscore
Cat…Cat…Cat…Cat…Float64Cat…Float64
12013S100067C002352male7.99452S20_r5.26316
22013S100067C002352male7.99452BPT3.7
32013S100067C002352male7.99452SLJ125.0
42013S100067C002352male7.99452Star_r2.47146
52013S100067C002352male7.99452Run1053.0
62013S100067C002353male7.99452S20_r5.0
72013S100067C002353male7.99452BPT4.1
82013S100067C002353male7.99452SLJ116.0
92013S100067C002353male7.99452Star_r1.76778
102013S100067C002353male7.99452Run1089.0
112013S100067C002354male7.99452S20_r4.54545
122013S100067C002354male7.99452BPT3.9
132013S100067C002354male7.99452SLJ111.0
142013S100067C002354male7.99452Star_r1.98875
152013S100067C002354male7.99452Run864.0
162013S100122C002355female7.99452S20_r4.54545
172013S100122C002355female7.99452BPT3.0
182013S100122C002355female7.99452SLJ114.0
192013S100122C002355female7.99452Star_r1.84464
202013S100122C002355female7.99452Run835.0
212013S100146C002356male7.99452S20_r4.34783
222013S100146C002356male7.99452BPT3.3
232013S100146C002356male7.99452SLJ118.0
242013S100146C002356male7.99452Star_r1.90682
252013S100146C002356male7.99452Run860.0
262013S100146C002357male7.99452S20_r4.34783
272013S100146C002357male7.99452BPT4.3
282013S100146C002357male7.99452SLJ130.0
292013S100146C002357male7.99452Star_r1.99655
302013S100146C002357male7.99452Run960.0

Now write this file as a Arrow file and read it back in.

arrowfn = joinpath("data", "fggk21.arrow")
Arrow.write(arrowfn, dfrm; compress=:lz4)
tbl = Arrow.Table(arrowfn)
Arrow.Table with 525126 rows, 7 columns, and schema:
 :Cohort  String
 :School  String
 :Child   String
 :Sex     String
 :age     Float64
 :Test    String
 :score   Float64
filesize(arrowfn)
3077850
df = DataFrame(tbl)

525,126 rows × 7 columns

CohortSchoolChildSexageTestscore
StringStringStringStringFloat64StringFloat64
12013S100067C002352male7.99452S20_r5.26316
22013S100067C002352male7.99452BPT3.7
32013S100067C002352male7.99452SLJ125.0
42013S100067C002352male7.99452Star_r2.47146
52013S100067C002352male7.99452Run1053.0
62013S100067C002353male7.99452S20_r5.0
72013S100067C002353male7.99452BPT4.1
82013S100067C002353male7.99452SLJ116.0
92013S100067C002353male7.99452Star_r1.76778
102013S100067C002353male7.99452Run1089.0
112013S100067C002354male7.99452S20_r4.54545
122013S100067C002354male7.99452BPT3.9
132013S100067C002354male7.99452SLJ111.0
142013S100067C002354male7.99452Star_r1.98875
152013S100067C002354male7.99452Run864.0
162013S100122C002355female7.99452S20_r4.54545
172013S100122C002355female7.99452BPT3.0
182013S100122C002355female7.99452SLJ114.0
192013S100122C002355female7.99452Star_r1.84464
202013S100122C002355female7.99452Run835.0
212013S100146C002356male7.99452S20_r4.34783
222013S100146C002356male7.99452BPT3.3
232013S100146C002356male7.99452SLJ118.0
242013S100146C002356male7.99452Star_r1.90682
252013S100146C002356male7.99452Run860.0
262013S100146C002357male7.99452S20_r4.34783
272013S100146C002357male7.99452BPT4.3
282013S100146C002357male7.99452SLJ130.0
292013S100146C002357male7.99452Star_r1.99655
302013S100146C002357male7.99452Run960.0

Avoiding needless repetition

One of the principles of relational database design is that information should not be repeated needlessly. Each row of df is determined by a combination of Child and Test, together producing a score, which can be converted to a zScore.

The other columns in the table, Cohort, School, age, and Sex, are properties of the Child.

Storing these values redundantly in the full table takes up space but, more importantly, allows for inconsistency. As it stands, a given Child could be recorded as being in one Cohort for the Run test and in another Cohort for the S20_r test and nothing about the table would detect this as being an error.

The approach used in relational databases is to store the information for score in one table that contains only Child, Test and score, store the information for the Child in another table including Cohort, School, age and Sex. These tables can then be combined to create the table to be used for analysis by joining the different tables together.

The maintainers of the DataFrames package have put in a lot of work over the past few years to make joins quite efficient in Julia. Thus the processing penalty of reassembling the big table from three smaller tables is minimal.

It is important to note that the main advantage of using smaller tables that are joined together to produce the analysis table is the fact that the information in the analysis table is consistent by design.

Creating the smaller table

Child = unique(select(df, :Child, :School, :Cohort, :Sex, :age))

108,295 rows × 5 columns

ChildSchoolCohortSexage
StringStringStringStringFloat64
1C002352S1000672013male7.99452
2C002353S1000672013male7.99452
3C002354S1000672013male7.99452
4C002355S1001222013female7.99452
5C002356S1001462013male7.99452
6C002357S1001462013male7.99452
7C002358S1001462013male7.99452
8C002359S1001832013female7.99452
9C002360S1001952013female7.99452
10C002361S1002132013male7.99452
11C002362S1002372013female7.99452
12C002363S1002372013female7.99452
13C002364S1002502013female7.99452
14C002365S1003042013male7.99452
15C002366S1003042013male7.99452
16C002367S1003162013female7.99452
17C002368S1003652013male7.99452
18C002369S1003652013male7.99452
19C002370S1003652013female7.99452
20C002371S1004322013female7.99452
21C002372S1004322013male7.99452
22C002373S1004812013male7.99452
23C002374S1004812013male7.99452
24C002375S1004812013female7.99452
25C002376S1004932013female7.99452
26C002377S1004932013female7.99452
27C002378S1005472013male7.99452
28C002379S1005472013male7.99452
29C002380S1005472013male7.99452
30C002381S1005472013female7.99452
length(unique(Child.Child))  # should be 108295
108295
filesize(
  Arrow.write("./data/fggk21_Child.arrow", Child; compress=:lz4)
)
1774946
filesize(
  Arrow.write(
    "./data/fggk21_Score.arrow",
    select(df, :Child, :Test, :score);
    compress=:lz4,
  ),
)
2794058
Note

A careful examination of the file sizes versus that of ./data/fggk21.arrow will show that the separate tables combined take up more space than the original because of the compression. Compression algorithms are often more successful when applied to larger files.

Now read the Arrow tables in and reassemble the original table.

Score = DataFrame(Arrow.Table("./data/fggk21_Score.arrow"))

525,126 rows × 3 columns

ChildTestscore
StringStringFloat64
1C002352S20_r5.26316
2C002352BPT3.7
3C002352SLJ125.0
4C002352Star_r2.47146
5C002352Run1053.0
6C002353S20_r5.0
7C002353BPT4.1
8C002353SLJ116.0
9C002353Star_r1.76778
10C002353Run1089.0
11C002354S20_r4.54545
12C002354BPT3.9
13C002354SLJ111.0
14C002354Star_r1.98875
15C002354Run864.0
16C002355S20_r4.54545
17C002355BPT3.0
18C002355SLJ114.0
19C002355Star_r1.84464
20C002355Run835.0
21C002356S20_r4.34783
22C002356BPT3.3
23C002356SLJ118.0
24C002356Star_r1.90682
25C002356Run860.0
26C002357S20_r4.34783
27C002357BPT4.3
28C002357SLJ130.0
29C002357Star_r1.99655
30C002357Run960.0

At this point we can create the z-score column by standardizing the scores for each Test. The code to do this follows Julius’s presentation on Monday.

@transform!(groupby(Score, :Test), :zScore = @c zscore(:score))

525,126 rows × 4 columns

ChildTestscorezScore
StringStringFloat64Float64
1C002352S20_r5.263161.7913
2C002352BPT3.7-0.0622317
3C002352SLJ125.0-0.0336567
4C002352Star_r2.471461.46874
5C002352Run1053.00.331058
6C002353S20_r5.01.15471
7C002353BPT4.10.498354
8C002353SLJ116.0-0.498822
9C002353Star_r1.76778-0.9773
10C002353Run1089.00.574056
11C002354S20_r4.545450.0551481
12C002354BPT3.90.218061
13C002354SLJ111.0-0.757248
14C002354Star_r1.98875-0.209186
15C002354Run864.0-0.944681
16C002355S20_r4.545450.0551481
17C002355BPT3.0-1.04326
18C002355SLJ114.0-0.602193
19C002355Star_r1.84464-0.71013
20C002355Run835.0-1.14043
21C002356S20_r4.34783-0.422921
22C002356BPT3.3-0.622817
23C002356SLJ118.0-0.395452
24C002356Star_r1.90682-0.493992
25C002356Run860.0-0.97168
26C002357S20_r4.34783-0.422921
27C002357BPT4.30.778646
28C002357SLJ130.00.224769
29C002357Star_r1.99655-0.182076
30C002357Run960.0-0.296686
Child = DataFrame(Arrow.Table("./data/fggk21_Child.arrow"))

108,295 rows × 5 columns

ChildSchoolCohortSexage
StringStringStringStringFloat64
1C002352S1000672013male7.99452
2C002353S1000672013male7.99452
3C002354S1000672013male7.99452
4C002355S1001222013female7.99452
5C002356S1001462013male7.99452
6C002357S1001462013male7.99452
7C002358S1001462013male7.99452
8C002359S1001832013female7.99452
9C002360S1001952013female7.99452
10C002361S1002132013male7.99452
11C002362S1002372013female7.99452
12C002363S1002372013female7.99452
13C002364S1002502013female7.99452
14C002365S1003042013male7.99452
15C002366S1003042013male7.99452
16C002367S1003162013female7.99452
17C002368S1003652013male7.99452
18C002369S1003652013male7.99452
19C002370S1003652013female7.99452
20C002371S1004322013female7.99452
21C002372S1004322013male7.99452
22C002373S1004812013male7.99452
23C002374S1004812013male7.99452
24C002375S1004812013female7.99452
25C002376S1004932013female7.99452
26C002377S1004932013female7.99452
27C002378S1005472013male7.99452
28C002379S1005472013male7.99452
29C002380S1005472013male7.99452
30C002381S1005472013female7.99452
df1 = disallowmissing!(leftjoin(Score, Child; on=:Child))

525,126 rows × 8 columns

ChildTestscorezScoreSchoolCohortSexage
StringStringFloat64Float64StringStringStringFloat64
1C002352S20_r5.263161.7913S1000672013male7.99452
2C002352BPT3.7-0.0622317S1000672013male7.99452
3C002352SLJ125.0-0.0336567S1000672013male7.99452
4C002352Star_r2.471461.46874S1000672013male7.99452
5C002352Run1053.00.331058S1000672013male7.99452
6C002353S20_r5.01.15471S1000672013male7.99452
7C002353BPT4.10.498354S1000672013male7.99452
8C002353SLJ116.0-0.498822S1000672013male7.99452
9C002353Star_r1.76778-0.9773S1000672013male7.99452
10C002353Run1089.00.574056S1000672013male7.99452
11C002354S20_r4.545450.0551481S1000672013male7.99452
12C002354BPT3.90.218061S1000672013male7.99452
13C002354SLJ111.0-0.757248S1000672013male7.99452
14C002354Star_r1.98875-0.209186S1000672013male7.99452
15C002354Run864.0-0.944681S1000672013male7.99452
16C002355S20_r4.545450.0551481S1001222013female7.99452
17C002355BPT3.0-1.04326S1001222013female7.99452
18C002355SLJ114.0-0.602193S1001222013female7.99452
19C002355Star_r1.84464-0.71013S1001222013female7.99452
20C002355Run835.0-1.14043S1001222013female7.99452
21C002356S20_r4.34783-0.422921S1001462013male7.99452
22C002356BPT3.3-0.622817S1001462013male7.99452
23C002356SLJ118.0-0.395452S1001462013male7.99452
24C002356Star_r1.90682-0.493992S1001462013male7.99452
25C002356Run860.0-0.97168S1001462013male7.99452
26C002357S20_r4.34783-0.422921S1001462013male7.99452
27C002357BPT4.30.778646S1001462013male7.99452
28C002357SLJ130.00.224769S1001462013male7.99452
29C002357Star_r1.99655-0.182076S1001462013male7.99452
30C002357Run960.0-0.296686S1001462013male7.99452
Note

The call to disallowmissing! is because the join will create columns that allow for missing values but we know that we should not get missing values in the result. This call will fail if, for some reason, missing values were created.

Discovering patterns in the data

One of the motivations for creating the Child table was be able to bin the ages according to the age of each child, not the age of each Child-Test combination. Not all children have all 5 test results. We can check the number of results by grouping on :Child and evaluate the number of rows in each group.

nobsChild = combine(groupby(Score, :Child), nrow => :ntest)

108,295 rows × 2 columns

Childntest
StringInt64
1C0023525
2C0023535
3C0023545
4C0023555
5C0023565
6C0023575
7C0023585
8C0023594
9C0023605
10C0023614
11C0023625
12C0023635
13C0023645
14C0023655
15C0023665
16C0023675
17C0023685
18C0023695
19C0023705
20C0023714
21C0023725
22C0023735
23C0023745
24C0023755
25C0023765
26C0023775
27C0023785
28C0023795
29C0023805
30C0023815

Now create a table of the number of children with 1, 2, …, 5 test scores.

combine(groupby(nobsChild, :ntest), nrow)

5 rows × 2 columns

ntestnrow
Int64Int64
11462
22729
331739
448836
5596529

A natural question at this point is whether there is something about those students who have few observations. For example, are they from only a few schools?

One approach to examining properties like is to add the number of observations for each child to the :Child table. Later we can group the table according to this :ntest to look at properties of :Child by :ntest.

gdf = groupby(
  disallowmissing!(leftjoin(Child, nobsChild; on=:Child)), :ntest
)

GroupedDataFrame with 5 groups based on key: ntest

First Group (462 rows): ntest = 1

ChildSchoolCohortSexagentest
StringStringStringStringFloat64Int64
1C002452S1011752013male7.994521
2C002625S1033292013male7.994521
3C002754S1048142013female7.994521
4C003269S1022582012female7.997261
5C003599S1058432012female7.997261
6C003807S1007542011male8.01
7C003985S1029452011male8.01
8C004086S1042552011male8.01
9C004657S1014002014male8.038331
10C005036S1059092014male8.038331
11C005440S1010232019male8.052021
12C005523S1018252019female8.052021
13C005697S1036152019male8.052021
14C005759S1046322019female8.052021
15C005810S1049542019female8.052021
16C005835S1050532019male8.052021
17C005854S1054052019male8.052021
18C006550S1033292013male8.07941
19C006760S1051812013female8.07941
20C007031S1132442013male8.07941
21C007050S1001952012female8.082141
22C007305S1023502012male8.082141
23C007828S1114052012female8.082141
24C008698S1049172016female8.093091
25C008707S1022712016male8.095821
26C009596S1034212014female8.12321
27C009651S1037062014female8.12321
28C009879S1059092014female8.12321
29C010203S1026602016male8.125941
30C010204S1026602016male8.125941

Last Group (96529 rows): ntest = 5

ChildSchoolCohortSexagentest
StringStringStringStringFloat64Int64
1C002352S1000672013male7.994525
2C002353S1000672013male7.994525
3C002354S1000672013male7.994525
4C002355S1001222013female7.994525
5C002356S1001462013male7.994525
6C002357S1001462013male7.994525
7C002358S1001462013male7.994525
8C002360S1001952013female7.994525
9C002362S1002372013female7.994525
10C002363S1002372013female7.994525
11C002364S1002502013female7.994525
12C002365S1003042013male7.994525
13C002366S1003042013male7.994525
14C002367S1003162013female7.994525
15C002368S1003652013male7.994525
16C002369S1003652013male7.994525
17C002370S1003652013female7.994525
18C002372S1004322013male7.994525
19C002373S1004812013male7.994525
20C002374S1004812013male7.994525
21C002375S1004812013female7.994525
22C002376S1004932013female7.994525
23C002377S1004932013female7.994525
24C002378S1005472013male7.994525
25C002379S1005472013male7.994525
26C002380S1005472013male7.994525
27C002381S1005472013female7.994525
28C002382S1005472013female7.994525
29C002383S1005842013female7.994525
30C002384S1005962013male7.994525

Are the sexes represented more-or-less equally?

combine(groupby(first(gdf), :Sex), nrow => :nchild)

2 rows × 2 columns

Sexnchild
StringInt64
1male230
2female232
combine(groupby(last(gdf), :Sex), nrow => :nchild)

2 rows × 2 columns

Sexnchild
StringInt64
1male47552
2female48977

What about the distribution of ages?

"""
    ridgeplot!(ax::Axis, df::AbstractDataFrame, densvar::Symbol, group::Symbol; normalize=false)
    ridgeplot!(f::Figure, args...; pos=(1,1) kwargs...)
    ridgeplot(args...; kwargs...)
Create a "ridge plot".
A ridge plot is stacked plot of densities for a given variable (`densvar`) grouped by a different variable (`group`). Because densities can very widely in scale, it is sometimes useful to `normalize` the densities so that each density has a maximum of 1.
The non-mutating method creates a Figure before calling the method for Figure.
The method for Figure places the ridge plot in the grid position specified by `pos`, default is (1,1).
"""
function ridgeplot!(
  ax::Axis,
  df::AbstractDataFrame,
  densvar::Symbol,
  group::Symbol;
  normalize=false,
)
  # `normalize` makes it so that the max density is always 1
  # `normalize` works on the density not the area/mass
  gdf = groupby(df, group)
  dens = combine(gdf, densvar => kde => :kde)
  sort!(dens, group)
  spacing = normalize ? 1.0 : 0.9 * maximum(dens[!, :kde]) do val
    return maximum(val.density)
  end

  nticks = length(gdf)

  for (idx, row) in enumerate(eachrow(dens))
    dd = if normalize
      row.kde.density ./ maximum(row.kde.density)
    else
      row.kde.density
    end

    offset = idx * spacing

    lower = Node(Point2f.(row.kde.x, offset))
    upper = Node(Point2f.(row.kde.x, dd .+ offset))
    band!(ax, lower, upper; color=(:black, 0.3))
    lines!(ax, upper; color=(:black, 1.0))
  end

  ax.yticks[] = (
    1:spacing:(nticks * spacing), string.(dens[!, group])
  )
  ylims!(ax, 0, (nticks + 2) * spacing)
  ax.xlabel[] = string(densvar)
  ax.ylabel[] = string(group)

  return ax
end
function ridgeplot!(f::Figure, args...; pos=(1, 1), kwargs...)
  ridgeplot!(Axis(f[pos...]), args...; kwargs...)
  return f
end
"""
    ridgeplot(args...; kwargs...)
See [ridgeplot!](@ref).
"""
function ridgeplot(args...; kwargs...)
  return ridgeplot!(Figure(), args...; kwargs...)
end
ridgeplot(parent(gdf), :age, :ntest)
parent(gdf)

108,295 rows × 6 columns

ChildSchoolCohortSexagentest
StringStringStringStringFloat64Int64
1C002352S1000672013male7.994525
2C002353S1000672013male7.994525
3C002354S1000672013male7.994525
4C002355S1001222013female7.994525
5C002356S1001462013male7.994525
6C002357S1001462013male7.994525
7C002358S1001462013male7.994525
8C002359S1001832013female7.994524
9C002360S1001952013female7.994525
10C002361S1002132013male7.994524
11C002362S1002372013female7.994525
12C002363S1002372013female7.994525
13C002364S1002502013female7.994525
14C002365S1003042013male7.994525
15C002366S1003042013male7.994525
16C002367S1003162013female7.994525
17C002368S1003652013male7.994525
18C002369S1003652013male7.994525
19C002370S1003652013female7.994525
20C002371S1004322013female7.994524
21C002372S1004322013male7.994525
22C002373S1004812013male7.994525
23C002374S1004812013male7.994525
24C002375S1004812013female7.994525
25C002376S1004932013female7.994525
26C002377S1004932013female7.994525
27C002378S1005472013male7.994525
28C002379S1005472013male7.994525
29C002380S1005472013male7.994525
30C002381S1005472013female7.994525

Reading Arrow files in other languages

There are Arrow implementations for R (the arrow package) and for Python (pyarrow).

#| eval: false
import pyarrow.feather: read_table
read_table("./data/fggk21.arrow")
#| eval: false
library("arrow")
fggk21 <- read_feather("./data/fggk21.arrow")
nrow(fggk21)

References

Fühner, T., Granacher, U., Golle, K., & Kliegl, R. (2021). Age and sex effects in physical fitness components of 108,295 third graders including 515 primary schools and 9 cohorts. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-97000-4