[ 
https://issues.apache.org/jira/browse/ARROW-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8061.
-------------------------------------------
    Fix Version/s: 0.17.0
       Resolution: Fixed

Issue resolved by pull request 6670
[https://github.com/apache/arrow/pull/6670]

> [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support 
> row groups)
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-8061
>                 URL: https://issues.apache.org/jira/browse/ARROW-8061
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.17.0
>
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Specifically for Parquet (not sure if it will be relevant for other file 
> formats as well; for IPC/Feather, potentially the record batch), it would be 
> useful to target row groups instead of files as fragments.
> Quoting the original design documents: _"In datasets consisting of many 
> fragments, the dataset API must expose the granularity of fragments in a 
> public way to enable parallel processing, if desired."_
> And a comment from Wes on that: _"a single Parquet file can "export" one or 
> more fragments based on settings. The default might be to split fragments 
> based on row group"_
> Currently, the level on which fragments are defined (at least in the typical 
> partitioned parquet dataset) is "1 file == 1 fragment".
> Would it be possible or desirable to make this more fine grained, where you 
> could also opt to have a fragment per row group?   
> We could have a ParquetFragment that has this option, and a ParquetFileFormat 
> specific option to say what the granularity of a fragment is (file vs row 
> group)?
> cc [~fsaintjacques] [~bkietz]
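
The file-versus-row-group option proposed above can be illustrated with a small, library-free sketch. It is not the Arrow API and not the implementation merged in the pull request; `Granularity`, `Fragment`, and `make_fragments` are hypothetical names invented here, and the row-group count is passed in directly instead of being read from Parquet metadata.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Granularity(Enum):
    """Hypothetical option: what one fragment should cover."""
    FILE = "file"
    ROW_GROUP = "row_group"


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment: a file path plus an optional row-group selection."""
    path: str
    # None means "the whole file"; otherwise the row-group indices covered.
    row_groups: Optional[Tuple[int, ...]] = None


def make_fragments(path: str, num_row_groups: int,
                   granularity: Granularity) -> List[Fragment]:
    """Return the fragments that one file exposes at the requested granularity."""
    if granularity is Granularity.FILE:
        # Current behaviour described in the issue: 1 file == 1 fragment.
        return [Fragment(path)]
    # Proposed finer granularity: one fragment per row group.
    return [Fragment(path, (i,)) for i in range(num_row_groups)]


if __name__ == "__main__":
    for fragment in make_fragments("part-0.parquet", 3, Granularity.ROW_GROUP):
        print(fragment)
```

With row-group granularity, each fragment can be handed to a separate worker, which is the parallel-processing use case the quoted design document mentions.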



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
