[jira] [Created] (ARROW-7224) [C++][Datasets] Partition level filters should be able to provide filtering to file systems

Micah Kornfield (Jira) Wed, 20 Nov 2019 23:18:05 -0800

Micah Kornfield created ARROW-7224:
--------------------------------------

             Summary: [C++][Datasets] Partition level filters should be able to 
provide filtering to file systems
                 Key: ARROW-7224
                 URL: https://issues.apache.org/jira/browse/ARROW-7224
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, C++ - Dataset
            Reporter: Micah Kornfield



When a filter is provided for partitions, it should be possible in some cases 
to use it to optimize file system list calls.  This can greatly improve the 
speed of reading data from partitions because fewer directories/files need to 
be explored/expanded.  I've fallen behind on the dataset code, but I want to 
make sure this issue is tracked someplace.  This came up in the SO question 
linked below (feel free to correct my analysis if I missed the functionality 
someplace).
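
To illustrate the idea, here is a minimal Python sketch of pruning during 
discovery. It is not Arrow's API: the in-memory TREE, list_dir, discover, and 
the keep predicate are all hypothetical names, and the layout assumes 
Hive-style key=value directory names. The point is only that applying the 
partition filter while walking the tree avoids list calls on directories that 
cannot match.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "file system": directory path -> child names.
# Names ending in ".parquet" are files; everything else is a directory.
TREE: Dict[str, List[str]] = {
    "data": ["year=2018", "year=2019"],
    "data/year=2018": ["month=11", "month=12"],
    "data/year=2019": ["month=11", "month=12"],
    "data/year=2018/month=11": ["part-0.parquet"],
    "data/year=2018/month=12": ["part-0.parquet"],
    "data/year=2019/month=11": ["part-0.parquet"],
    "data/year=2019/month=12": ["part-0.parquet"],
}


def list_dir(path: str) -> List[str]:
    """Stand-in for one (potentially slow) file system list call."""
    return TREE[path]


def discover(
    root: str,
    keep: Callable[[str, str], bool] = lambda key, value: True,
) -> Tuple[List[str], int]:
    """Walk a Hive-style tree, skipping partition directories rejected by keep.

    Returns the sorted list of discovered files and the number of list calls.
    """
    files: List[str] = []
    list_calls = 0
    stack = [root]
    while stack:
        path = stack.pop()
        list_calls += 1
        for name in list_dir(path):
            child = f"{path}/{name}"
            if name.endswith(".parquet"):
                files.append(child)
                continue
            key, sep, value = name.partition("=")
            # Prune before descending: a rejected directory is never listed.
            if sep and not keep(key, value):
                continue
            stack.append(child)
    return sorted(files), list_calls


if __name__ == "__main__":
    print(discover("data"))
    print(discover("data", lambda key, value: key != "year" or value == "2019"))
```

With the year filter, the year=2018 subtree is never listed, so the number of 
list calls in this toy tree drops from 7 to 4. On an object store such as S3, 
where each list call is a network round trip, the saving would be expected to 
matter more.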

Reference: 
[https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
