Thanks for the report. The R bindings for the C++ methods that pyarrow uses in the docs you linked haven't been written yet; https://issues.apache.org/jira/browse/ARROW-9657 is the open issue for that. I agree that it would be good to support this from R.
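For reference, the manual-specification approach in those Python docs looks roughly like the sketch below. The schema fields, file paths, and partition expressions are made up to mirror your location/year example; `FileSystemDataset.from_paths` is the pyarrow entry point the linked section describes, and it skips filesystem discovery because you hand it the file list and schema directly.

```python
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

# The schema is supplied up front, so no file has to be opened to infer it.
# These fields are hypothetical, mirroring the location/year partitioning.
schema = pa.schema([
    ("location", pa.string()),
    ("year", pa.int32()),
    ("value", pa.float64()),
])

# The file paths you already know (hypothetical examples here).
paths = [
    "data/location=A/year=2019/part-0.parquet",
    "data/location=A/year=2020/part-0.parquet",
]

# Optionally attach a partition expression per file, so the partition
# columns are available without parsing them out of the paths.
partitions = [
    (ds.field("location") == "A") & (ds.field("year") == 2019),
    (ds.field("location") == "A") & (ds.field("year") == 2020),
]

# Build the Dataset directly from the known paths and schema,
# skipping directory discovery and schema inspection entirely.
dataset = ds.FileSystemDataset.from_paths(
    paths,
    schema=schema,
    format=ds.ParquetFileFormat(),
    filesystem=fs.LocalFileSystem(),
    partitions=partitions,
)
```

Once ARROW-9657 is resolved, something equivalent should be possible from R.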
A couple of minutes also seems a bit slow even for the case where you
don't provide the file paths, so that would be worth investigating as well.

Neal

On Tue, Dec 22, 2020 at 9:08 PM Charlton Callender <[email protected]> wrote:

> Hi,
>
> I am starting to use arrow in a workflow where I have a dataset
> partitioned by a couple of variables (like location and year) that leads
> to >100,000 parquet files.
>
> I have been using `arrow::open_dataset(sources = FILEPATH, unify_schemas
> = FALSE)` but found this takes a couple of minutes to run. I can see that
> almost all the time is spent on this line creating the
> FileSystemDatasetFactory:
> https://github.com/apache/arrow/blob/master/r/R/dataset-factory.R#L135
>
> In my use case I know all the partition file paths and I know the schema
> (and that it is consistent across partitions). Is there any way to use
> that information to create the Dataset object more quickly for a highly
> partitioned dataset?
>
> I found this section in the Python docs about creating a dataset from
> file paths; is this possible to do from R?
> https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset
>
> Thank you! I’ve been finding arrow/parquet really useful as an
> alternative to hdf5 and csv.
