[jira] [Commented] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969407#comment-16969407
 ] 

François Blanchard commented on ARROW-7087:
---

I will

> [Python] Table Metadata disappear when we write a partitioned dataset
> -
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Reporter: François Blanchard
> Priority: Major
> Fix For: 1.0.0
>
>
> The method
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
> in *pyarrow/parquet.py* behaves unexpectedly.
> When we write a table that carries its own schema metadata, that metadata is
> replaced by pandas metadata. This happens only if *partition_cols* is set.
>
> To be more explicit, here is some example code:
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pq
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from columns, attaching custom schema metadata
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 
> 'columnB'], metadata={'data': 'test'})
> print(table.schema.metadata)
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pq.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print(load_table.schema.metadata)
> """
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
> >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
> >> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
> >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": 
> >> []}')])
> """{code}
>  
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969404#comment-16969404
 ] 

Wes McKinney commented on ARROW-7087:
-

I would guess this relates to the table splitting logic dropping the metadata. 
Please feel free to submit a PR to fix



