[jira] [Commented] (ARROW-3244) [Python] Multi-file parquet loading without scan

2019-06-12 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862254#comment-16862254
 ] 

Wes McKinney commented on ARROW-3244:
-------------------------------------

It seems we still need a bit more scaffolding to be able to read a multi-file 
dataset given the _metadata file without having to read the individual file 
footers. I don't think this will get done for 0.14.0, but I intend to see this 
logic implemented in the "C++ Datasets" component, so likely in the next 
release cycle.
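
For reference, a rough sketch of the access pattern in question, assuming a 
_metadata file written by a framework such as Spark or Dask (the /data path is 
hypothetical): parse the consolidated footer once, then recover each file's 
row-group layout from it without opening any data-file footers.

{code}
import pyarrow.parquet as pq

# One file access parses the footer information for every data file
md = pq.read_metadata('/data/_metadata')  # hypothetical path

# Per-file row-group layout, without opening the data files themselves;
# file_path on a column chunk is relative to the dataset root
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(rg.column(0).file_path, rg.num_rows, rg.total_byte_size)
{code}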

> [Python] Multi-file parquet loading without scan
> ------------------------------------------------
>
> Key: ARROW-3244
> URL: https://issues.apache.org/jira/browse/ARROW-3244
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> A number of mechanisms are possible to avoid having to access and read the 
> parquet footers in a dataset consisting of many files. With a large number of 
> data files (perhaps split with directory partitioning) on remote storage, 
> reading every footer can be a significant overhead. This matters particularly 
> for Dask, which must have the metadata available in the client before setting 
> up computational graphs.
>  
> Here are some suggestions of what could be done.
>  
>  * some parquet-writing frameworks include a `_metadata` file, which 
> contains all the information from the footers of the various files. If this 
> file is present, all of that data can be read from one place, with a single 
> file access. (For a very large number of files, parsing the Thrift 
> information may by itself be a non-negligible overhead.)
>  * the schema (dtypes) can be found in a `_common_metadata` file, or taken 
> from any one of the data files; the schema could then be assumed (perhaps at 
> the user's option) to be the same for all of the files. However, information 
> about the directory partitioning would not be available. Although Dask may 
> infer that information from the filenames, it would be preferable to go 
> through the machinery of parquet-cpp and view the whole dataset as a single 
> object. Note that each file's footer would still need to be read to access 
> the data (for the byte offsets), but from Dask's point of view this would be 
> deferred to tasks running in parallel. A minimal sketch of both suggestions 
> appears after this list.
> (Please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] of features that would be 
> useful in fastparquet.)
>  
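> 
> A minimal sketch of the two suggestions above, assuming pyarrow's 
> `read_metadata` and `read_schema` helpers and a hypothetical `/data` dataset 
> root:
> {code}
> import pyarrow.parquet as pq
> 
> # Suggestion 1: a single file access yields every file's footer information
> md = pq.read_metadata('/data/_metadata')  # hypothetical path
> print(md.num_row_groups, md.num_rows)
> 
> # Suggestion 2: the dataset-wide schema (dtypes) from _common_metadata
> schema = pq.read_schema('/data/_common_metadata')
> print(schema)
> {code}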





[jira] [Commented] (ARROW-3244) [Python] Multi-file parquet loading without scan

2019-03-26 Thread Matthew Rocklin (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802274#comment-16802274
 ] 

Matthew Rocklin commented on ARROW-3244:
----------------------------------------

Is this resolved today by the {{ParquetDataset.metadata}} attribute?

{code}
pq.ParquetDataset(path).metadata
{code}
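
If {{ParquetDataset}} populates that attribute from the {{_metadata}} file 
when one is present, then something like the following (path hypothetical) 
would give the combined footer without scanning the individual files:

{code}
import pyarrow.parquet as pq

# Combined FileMetaData for the dataset, assuming it is parsed from
# /data/_metadata rather than from the individual file footers
md = pq.ParquetDataset('/data').metadata
print(md.num_row_groups, md.num_rows)
{code}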






[jira] [Commented] (ARROW-3244) [Python] Multi-file parquet loading without scan

2018-09-20 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622920#comment-16622920
 ] 

Wes McKinney commented on ARROW-3244:
-------------------------------------

The pyarrow perspective is essentially agnostic to the data access pattern, but 
we'd like to provide APIs so the user can do as they wish with the files. The 
basic pattern of a partitioned dataset read by a single node works fine now 
(that's the {{ParquetDataset}} object).
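
For reference, that single-node pattern looks roughly like this (the path and 
column names are placeholders):

{code}
import pyarrow.parquet as pq

# Discover and read a directory-partitioned dataset on a single node
dataset = pq.ParquetDataset('/data')      # placeholder dataset root
table = dataset.read(columns=['a', 'b'])  # placeholder column selection
{code}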

Let's come up with a concrete API ask and the desired semantics with regard to 
when precisely the underlying file system is to be accessed; if this is not 
available now, we can slate it for one of the upcoming releases.






[jira] [Commented] (ARROW-3244) [Python] Multi-file parquet loading without scan

2018-09-20 Thread Matthew Rocklin (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622917#comment-16622917
 ] 

Matthew Rocklin commented on ARROW-3244:
----------------------------------------

What happens today when someone reads a multi-file parquet dataset with dask 
dataframe? Do we read a single file to get the schema and then just build tasks 
for everything else, or do we need to read through each of the files to find 
out how many row groups are in each?

On the Arrow side, is this in scope? Is it already implemented? Are there 
mechanisms to construct the metadata files from within Arrow? If not, and if 
this is in scope, what is the right way / place to add this behavior?
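
A sketch of one possible mechanism for constructing a {{_metadata}} file from 
within Arrow, assuming pyarrow's {{FileMetaData}} helpers ({{set_file_path}}, 
{{append_row_groups}}, {{write_metadata_file}}); the paths are placeholders:

{code}
import pyarrow.parquet as pq

# Combine the footers of already-written files into a single _metadata
# footer (placeholder paths)
paths = ['/data/part.0.parquet', '/data/part.1.parquet']

combined = None
for p in paths:
    md = pq.read_metadata(p)
    md.set_file_path(p.rsplit('/', 1)[-1])  # store paths relative to the root
    if combined is None:
        combined = md
    else:
        combined.append_row_groups(md)

combined.write_metadata_file('/data/_metadata')
{code}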




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)