The Arrow storage format provides a language-agnostic storage and memory specification for columnar data tables, which just means “something that looks like a data frame in R”. That is, an arrow table is an ordered, named collection of columns, all of the same length.
The columns can be of different types including numeric values, character strings, and factor-like representations - called DictEncoded.
An Arrow file can be read or written from R, Python, Julia and many other languages. Somewhat confusingly in R and Python the name feather, which refers to an earlier version of the storage format, is used in some function names like read_feather.
Internally, the SMLP2023 package uses Arrow to store all of its datasets.
2 The Emotikon data
The SMLP2021 repository contains a version of the data from Fühner et al. (2021) in notebooks/data/fggk21.arrow. After that file was created there were changes in the master RDS file on the osf.io site for the project. We will recreate the Arrow file here then split it into two separate tables, one with a row for each child in the study and one with a row for each test result.
The Arrow package for Julia does not export any function names, which means that the function to read an Arrow file must be called as Arrow.Table. It returns a column table, as described in the Tables package. This is like a read-only data frame, which can be easily converted to a full-fledged DataFrame if desired.
This arrangement allows for the Arrow package not to depend on the DataFrames package, which is a heavy-weight dependency, but still easily produce a DataFrame if warranted.
Load the packages to be used.
Code
usingAlgebraOfGraphicsusingArrowusingCairoMakieusingChainusingDataFrameMacrosusingDataFramesusingDownloadsusingKernelDensityusingRCall# access R from within JuliausingStatsBaseCairoMakie.activate!(; type="svg")usingAlgebraOfGraphics: density
3 Downloading and importing the RDS file
This is similar to some of the code shown by Julius Krumbiegel on Monday. In the data directory of the emotikon project on osf.io under Data, the url for the rds data file is found to be [https://osf.io/xawdb/]. Note that we want version 2 of this file.
One of the principles of relational database design is that information should not be repeated needlessly. Each row of df is determined by a combination of Child and Test, together producing a score, which can be converted to a zScore.
The other columns in the table, Cohort, School, age, and Sex, are properties of the Child.
Storing these values redundantly in the full table takes up space but, more importantly, allows for inconsistency. As it stands, a given Child could be recorded as being in one Cohort for the Run test and in another Cohort for the S20_r test and nothing about the table would detect this as being an error.
The approach used in relational databases is to store the information for score in one table that contains only Child, Test and score, store the information for the Child in another table including Cohort, School, age and Sex. These tables can then be combined to create the table to be used for analysis by joining the different tables together.
The maintainers of the DataFrames package have put in a lot of work over the past few years to make joins quite efficient in Julia. Thus the processing penalty of reassembling the big table from three smaller tables is minimal.
It is important to note that the main advantage of using smaller tables that are joined together to produce the analysis table is the fact that the information in the analysis table is consistent by design.
A careful examination of the file sizes versus that of ./data/fggk21.arrow will show that the separate tables combined take up more space than the original because of the compression. Compression algorithms are often more successful when applied to larger files.
Now read the Arrow tables in and reassemble the original table.
The call to disallowmissing! is because the join will create columns that allow for missing values but we know that we should not get missing values in the result. This call will fail if, for some reason, missing values were created.
6 Discovering patterns in the data
One of the motivations for creating the Child table was to be able to bin the ages according to the age of each child, not the age of each Child-Test combination. Not all children have all 5 test results. We can check the number of results by grouping on :Child and evaluate the number of rows in each group.
Now create a table of the number of children with 1, 2, …, 5 test scores.
combine(groupby(nobsChild, :ntest), nrow)
5×2 DataFrame
Row
ntest
nrow
Int64
Int64
1
1
462
2
2
729
3
3
1739
4
4
8836
5
5
96529
A natural question at this point is whether there is something about those students who have few observations. For example, are they from only a few schools?
One approach to examining properties like is to add the number of observations for each child to the :Child table. Later we can group the table according to this :ntest to look at properties of :Child by :ntest.
Fühner, T., Granacher, U., Golle, K., & Kliegl, R. (2021). Age and sex effects in physical fitness components of 108,295 third graders including 515 primary schools and 9 cohorts. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-97000-4