[jira] [Assigned] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)

2020-03-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-8061:
--

Assignee: Neal Richardson

> [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support 
> row groups)
> -
>
> Key: ARROW-8061
> URL: https://issues.apache.org/jira/browse/ARROW-8061
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++ - Dataset
> Reporter: Joris Van den Bossche
> Assignee: Neal Richardson
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Specifically for Parquet (not sure whether this is relevant for other file 
> formats as well; for IPC/Feather, potentially the record batch), it would be 
> useful to target row groups instead of files as fragments.
> Quoting the original design documents: _"In datasets consisting of many 
> fragments, the dataset API must expose the granularity of fragments in a 
> public way to enable parallel processing, if desired."_
> And a comment from Wes on that: _"a single Parquet file can "export" one or 
> more fragments based on settings. The default might be to split fragments 
> based on row group"_
> Currently, the level on which fragments are defined (at least in the typical 
> partitioned parquet dataset) is "1 file == 1 fragment".
> Would it be possible or desirable to make this more fine-grained, so that you 
> could also opt to have one fragment per row group?
> We could have a ParquetFragment that has this option, and a 
> ParquetFileFormat-specific option to say what the granularity of a fragment 
> is (file vs. row group)?
> cc [~fsaintjacques] [~bkietz]
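
For illustration only, the granularity option proposed in the description could be sketched roughly as follows. Every name here (FragmentGranularity, ParquetFileInfo, FragmentSpec, MakeFragments) is hypothetical and is not part of the Arrow C++ API; the sketch assumes the number of row groups per file is already known from the Parquet footer and shows only how one file could export either one fragment or one fragment per row group.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names throughout; none of these are Arrow's actual API.
enum class FragmentGranularity { kFile, kRowGroup };

// Minimal stand-in for the metadata a Parquet footer would provide.
struct ParquetFileInfo {
  std::string path;
  int num_row_groups;
};

// A fragment described as a file plus the row groups it covers.
struct FragmentSpec {
  std::string path;
  std::vector<int> row_groups;
};

std::vector<FragmentSpec> MakeFragments(const std::vector<ParquetFileInfo>& files,
                                        FragmentGranularity granularity) {
  std::vector<FragmentSpec> fragments;
  for (const auto& file : files) {
    assert(file.num_row_groups >= 0);
    if (granularity == FragmentGranularity::kFile) {
      // Behaviour described in the issue: 1 file == 1 fragment,
      // spanning every row group of that file.
      FragmentSpec whole{file.path, {}};
      for (int i = 0; i < file.num_row_groups; ++i) whole.row_groups.push_back(i);
      fragments.push_back(whole);
    } else {
      // Proposed opt-in: one fragment per row group.
      for (int i = 0; i < file.num_row_groups; ++i) {
        fragments.push_back(FragmentSpec{file.path, {i}});
      }
    }
  }
  return fragments;
}
```

Under these assumptions, two files with 3 and 1 row groups would yield 2 fragments at file granularity and 4 at row-group granularity, which is the finer unit of parallelism the design document quote asks the API to expose.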



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)

2020-03-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-8061:
--

Assignee: Ben Kietzman  (was: Neal Richardson)

