[R][Dataset] how to speed up creating FileSystemDatasetFactory from a large partitioned dataset?

Charlton Callender Tue, 22 Dec 2020 21:08:11 -0800

Hi

I am starting to use arrow in a workflow where I have a dataset partitioned by
a couple of variables (like location and year), which results in more than
100,000 parquet files.


I have been using
`arrow::open_dataset(sources = FILEPATH, unify_schemas = FALSE)`, but it takes
a couple of minutes to run. Almost all of that time is spent on this line,
which creates the FileSystemDatasetFactory:
https://github.com/apache/arrow/blob/master/r/R/dataset-factory.R#L135

In my use case I know all the partition file paths and I know the schema (and 
that it is consistent across partitions). Is there any way to use that 
information to more quickly create the Dataset object with a highly partitioned 
dataset?

I found this section in the Python docs about creating a dataset from a list
of file paths. Is this possible to do from R?
https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset

Thank you! I’ve been finding arrow/parquet really useful as an alternative to 
hdf5 and csv.
