[jira] [Commented] (ARROW-8244) [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-04-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074114#comment-17074114
 ] 

Wes McKinney commented on ARROW-8244:
-

As long as there's a well-documented way to generate the _metadata file 
containing all the row group metadata and file paths in a single structure, and 
then construct a dataset from the _metadata file (avoiding having to parse the 
metadata from all the constituent files -- which is time consuming), that 
sounds good to me.
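
For concreteness, a rough sketch of that workflow (it assumes 
{{write_to_dataset}} populates the relative file paths in the collected 
FileMetaData, which is what this issue proposes):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

# Collect the FileMetaData of every file the dataset writer produces
collected = []
pq.write_to_dataset(
    table, "dataset_root",
    partition_cols=["part"],
    metadata_collector=collected,
)

# Merge all row group metadata into a single FileMetaData object and
# write it out as the dataset-level _metadata file
merged = collected[0]
for md in collected[1:]:
    merged.append_row_groups(md)
merged.write_metadata_file("dataset_root/_metadata")

# A reader can then construct the dataset from the single _metadata
# file instead of parsing the footer of every constituent file
dataset = pq.ParquetDataset("dataset_root")
{code}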

> [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" 
> metadata fields
> ---
>
> Key: ARROW-8244
> URL: https://issues.apache.org/jira/browse/ARROW-8244
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Rick Zamora
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: parquet, pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask has been 
> using the `write_to_dataset` API to write partitioned parquet datasets.  This 
> PR is switching to a (hopefully temporary) custom solution, because that API 
> makes it difficult to populate the "file_path" column-chunk metadata 
> fields that are returned within the optional `metadata_collector` kwarg.  
> Dask needs to set these fields correctly in order to generate a proper global 
> `"_metadata"` file.
>
> Possible solutions to this problem:
>  # Optionally populate the file-path fields within `write_to_dataset`
>  # Always populate the file-path fields within `write_to_dataset`
>  # Return the file paths for the data written within `write_to_dataset` (up 
> to the user to manually populate the file-path fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8244) [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-31 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071845#comment-17071845
 ] 

Joris Van den Bossche commented on ARROW-8244:
--

So to summarize the issue: the python {{write_to_dataset}} API provides an 
option to return the FileMetaData of the parquet files it has written through 
{{metadata_collector}} (the FileMetaData python objects are appended to the 
list passed to this {{metadata_collector}} option).

In practice, this feature is being used by dask to collect the FileMetaData of 
all parquet files in the partitioned dataset, and to concatenate them to create 
the {{_metadata}} file.

For this use case, though, the {{file_path}} field in the 
{{ColumnChunkMetaData}} of the {{FileMetaData}} object needs to be set to the 
relative path inside the partitioned dataset. But the problem is that dask 
doesn't have these paths (it only gets the list of FileMetaData objects, and 
the files are already written to disk by pyarrow). So we have the option to 
either return those paths in some way as well, or set those paths before 
returning the FileMetaData object in the {{write_to_dataset}} function (and to 
be clear, we would _only_ set the path in the FileMetaData being returned to 
the collector, and _not_ in the actual FileMetaData in the parquet data files 
that are being written).
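
To make that concrete, a minimal sketch of the fix-up using 
{{FileMetaData.set_file_path}} (the relative paths are hard-coded here purely 
for illustration -- in reality the caller cannot know the file names that 
{{write_to_dataset}} generates, which is the crux of this issue):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"part": ["a", "b"], "value": [1, 2]})
collected = []
pq.write_to_dataset(table, "dataset_root",
                    partition_cols=["part"],
                    metadata_collector=collected)

# Stand-in for information write_to_dataset does not currently expose;
# hard-coded purely for illustration, not the real generated names.
relative_paths = ["part=a/data0.parquet", "part=b/data0.parquet"]
for md, rel_path in zip(collected, relative_paths):
    # set_file_path updates file_path on every ColumnChunkMetaData in
    # this FileMetaData; only the in-memory object changes, the parquet
    # file already written to disk is untouched.
    md.set_file_path(rel_path)
{code}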

I would personally just change this to set the paths in pyarrow (and consider 
this a bug fix), as I think creating the {{_metadata}} file is probably the 
only use case for this (though that is a not-very-educated guess). And that 
way we don't need to complicate the API further with additional options to 
also set or return the paths (but this is certainly possible to do, if we 
don't want to change the current behaviour).

cc [~wesm] [~fsaintjacques]



[jira] [Commented] (ARROW-8244) [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields

2020-03-27 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068794#comment-17068794
 ] 

Joris Van den Bossche commented on ARROW-8244:
--

Thanks for opening the issue [~rjzamora]

Agreed this is a problem, and I think we should at least also return the path 
(so it can be fixed afterwards), or otherwise set it ourselves (optionally).

Regarding those different options: starting to also return the path together 
with the metadata is not really backwards compatible, so we would need to add 
an additional keyword like `path_collector` in addition to `metadata_collector`.

Whether we can simply always populate the file path might depend on whether 
there are other use cases for collecting this metadata (although I assume dask 
is the main user of this keyword).
A github search turned up dask, cudf and spatialpandas as users of the 
`metadata_collector` keyword. I assume `cudf` needs the same fix as dask. I 
haven't checked yet how it's used in spatialpandas.

I suppose optionally populating it is the safest, though I am doubtful that 
having it optional behind a new keyword is actually useful (it is unclear 
whether there are use cases for _not_ wanting the path populated).
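
For reference, a rough sketch of the manual workaround in the spirit of 
dask#6023: bypass {{write_to_dataset}}, pick the file names yourself, and 
attach the known relative path to metadata read back from each written file. 
The directory layout and file names below are made up for the example:

{code:python}
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": [1, 2, 3]})
root = "dataset_root"
rel_path = "part=a/part.0.parquet"  # illustrative name, chosen by the caller
os.makedirs(os.path.join(root, "part=a"), exist_ok=True)
pq.write_table(table, os.path.join(root, rel_path))

# Read the footer back and set the relative path on the in-memory copy
md = pq.read_metadata(os.path.join(root, rel_path))
md.set_file_path(rel_path)

# Accumulate one such FileMetaData per written file, then merge them
# into the dataset-level _metadata file
collected = [md]
merged = collected[0]
for other in collected[1:]:
    merged.append_row_groups(other)
merged.write_metadata_file(os.path.join(root, "_metadata"))
{code}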
