[ https://issues.apache.org/jira/browse/ARROW-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833719#comment-16833719 ]
Joris Van den Bossche edited comment on ARROW-5138 at 5/6/19 11:29 AM:
-----------------------------------------------------------------------

The issue here is that there is a mismatch between the pandas metadata (of the original full dataframe) and the row group:
- the full pandas DataFrame is converted to an arrow Table, which includes this part in the metadata about the index:
{code}"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}]
{code}
- the arrow Table is written to parquet as two RowGroups, but the pandas metadata is kept intact
- when reading a single row group, the pandas metadata still suggests a RangeIndex of length 4, while the row group is only of length 2. As a result, a default index is used (always starting at zero, for both row groups).

I am not sure this can be solved: you would need to start modifying the range start/stop values in the pandas metadata when splitting arrow tables that have such metadata, and similar issues will be encountered when e.g. slicing a Table. It seems to be a consequence of the choice to no longer serialize a RangeIndex.

[~fjetter] I think the best workaround for now would be to ensure your original data has an Int64Index instead of a RangeIndex, if you want to keep this working ({{df.index = pd.Int64Index(df.index)}}).

Given those issues, should we add an option to still include a RangeIndex in the actual schema?

> [Python/C++] Row group retrieval doesn't restore index properly
> ---------------------------------------------------------------
>
>                 Key: ARROW-5138
>                 URL: https://issues.apache.org/jira/browse/ARROW-5138
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.13.0
>            Reporter: Florian Jetter
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> When retrieving row groups, the index is no longer properly restored to its
> initial value and is set to a range index starting at zero no matter what.
> Version 0.12.1 restored an int64 index with the correct index values.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(pa.__version__)
> df = pd.DataFrame(
>     {"a": [1, 2, 3, 4]}
> )
> print("total DF")
> print(df.index)
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf, chunk_size=2)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> parquet_file = pq.ParquetFile(reader)
> rg = parquet_file.read_row_group(1)
> df_restored = rg.to_pandas()
> print("Row group")
> print(df_restored.index)
> {code}
> Previous behavior
> {code:python}
> 0.12.1
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> Int64Index([2, 3], dtype='int64')
> {code}
> Behavior now
> {code:python}
> 0.13.0
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> RangeIndex(start=0, stop=2, step=1)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
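
For illustration only: the comment notes that restoring the right index per row group would mean rewriting the range start/stop values in the pandas metadata whenever a table is split. A minimal sketch of that arithmetic follows, assuming the descriptor has the shape quoted in the comment. The helper name split_range_index is hypothetical and not part of pyarrow, and the sketch works on plain dicts rather than on real Table metadata.

```python
from typing import Dict, List


def split_range_index(descr: Dict, row_group_sizes: List[int]) -> List[Dict]:
    """Return one adjusted range descriptor per row group.

    `descr` mimics the "index_columns" entry quoted in the comment, e.g.
    {"kind": "range", "name": None, "start": 0, "stop": 4, "step": 1}.
    This is a hypothetical helper, not a pyarrow function.
    """
    if descr.get("kind") != "range":
        raise ValueError("expected a range index descriptor")
    start, stop, step = descr["start"], descr["stop"], descr["step"]
    total = len(range(start, stop, step))
    if sum(row_group_sizes) != total:
        raise ValueError("row group sizes do not add up to the index length")

    pieces = []
    offset = 0  # number of rows that belong to earlier row groups
    for size in row_group_sizes:
        piece_start = start + offset * step
        # Copy the descriptor and shift start/stop to cover only this group.
        pieces.append(dict(descr, start=piece_start, stop=piece_start + size * step))
        offset += size
    return pieces


full = {"kind": "range", "name": None, "start": 0, "stop": 4, "step": 1}
parts = split_range_index(full, [2, 2])
second = parts[1]
# Second row group covers range(2, 4), i.e. the index values [2, 3].
print(list(range(second["start"], second["stop"], second["step"])))
```

With the table from the report (four rows written as two row groups of two), the second descriptor describes the values 2 and 3, which matches the index that 0.12.1 returned. Doing the same for arbitrary slices of a Table is the harder part the comment alludes to.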