Hi,
I have created (from R) an Arrow dataset consisting of 86 Feather files.
Each file is 895 MB, with about 500 rows and 32,000 columns. The natural
structure of the complete data frame is 86 * 500 rows by 32,000 columns.
My aim is to load a chunk consisting of all rows and a subset of columns
(the two ID columns plus 100 other columns), do some manipulation and
modelling on that chunk, then move on to the next chunk and repeat.
Each row in the dataframe corresponds to a flattened image, with two ID
columns. Each feather file contains the set of images corresponding to a
single measure.
I want to run a series of collect(arrowdataset[, c("ID1", "ID2", "V1",
"V2")])
However, each such load seems very slow (10+ minutes), and I'm wondering
what I've done wrong. I've tested on hosts with SSDs.
I can see a possible saving by making ID1 part of the partitioning instead
of storing it with the data, but that sounds like a minor change.
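
In case it helps, this is roughly the repartitioning I mean (paths are
placeholders): rewrite the dataset so ID1 is encoded in the directory
structure rather than stored in each file.

library(arrow)

# Placeholder paths: read the existing dataset and write it back out
# partitioned by ID1
open_dataset("path/to/feather_dir", format = "feather") |>
  write_dataset("path/to/partitioned_dir",
                format = "feather",
                partitioning = "ID1")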
Any thoughts on what I've missed?