[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney reassigned ARROW-5666:
-----------------------------------

    Assignee: Joris Van den Bossche

> [Python] Underscores in partition (string) values are dropped when reading dataset
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-5666
>                 URL: https://issues.apache.org/jira/browse/ARROW-5666
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Julian de Ruiter
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset-parquet-read, parquet
>
> When reading a partitioned dataset in which the partition column contains
> string values with underscores, pyarrow seems to ignore the underscores
> in the resulting values.
> For example, if I write and then read a dataset as follows:
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> # Write one directory per distinct 'year_week' value, then read it back
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> table2 = pq.ParquetDataset('test').read()
> {code}
> The resulting 'year_week' column in table2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
> <Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
> [
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       0
>     ],
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       1
>     ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
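
One possible explanation for the int64 values 20192 and 20193 reported above, offered as an assumption and not confirmed anywhere in this message: since Python 3.6, the built-in int() accepts underscores as digit separators (PEP 515), so any type inference that tries int() on a partition directory value would read "2019_2" as 20192. The sketch below shows that behaviour in plain Python, along with a stricter check. The helper infer_partition_value is hypothetical and is not a pyarrow function.

```python
# Python's int() treats single underscores between digits as separators
# (PEP 515, Python 3.6+), so this parses without error.
loose = int("2019_2")  # 20192


def infer_partition_value(raw: str):
    """Hypothetical helper: convert to int only for plain ASCII digit strings.

    Values containing underscores (or any other non-digit character) are
    returned unchanged as strings.
    """
    # Allow one optional leading sign, then require digits only.
    digits = raw[1:] if raw.startswith(("+", "-")) else raw
    if digits.isascii() and digits.isdigit():
        return int(raw)
    return raw


if __name__ == "__main__":
    print(loose)
    print(repr(infer_partition_value("2019_2")))
    print(repr(infer_partition_value("2019")))
```

With a check like this, "2019_2" stays a string while "2019" still becomes an integer. Whether pyarrow infers partition types this way is not stated in the report.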