[jira] [Commented] (ARROW-5138) [Python/C++] Row group retrieval doesn't restore index properly
[ https://issues.apache.org/jira/browse/ARROW-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856508#comment-16856508 ]

Joris Van den Bossche commented on ARROW-5138:
----------------------------------------------

[~wesmckinn] I don't think that will solve this problem. The _original_ dataframe (when converted to an arrow Table) had a trivial RangeIndex (starting at 0, step of 1), so the optimization would have been correctly applied according to that logic. It is only when a Table is sliced or split (into row groups, with a single row group then read instead of the full table) that the RangeIndex metadata gets "out of date" and no longer matches the new (subsetted) arrow Table.

See also ARROW-5427 for a summary issue I made on this topic.

> [Python/C++] Row group retrieval doesn't restore index properly
> ----------------------------------------------------------------
>
>                 Key: ARROW-5138
>                 URL: https://issues.apache.org/jira/browse/ARROW-5138
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.13.0
>            Reporter: Florian Jetter
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> When retrieving row groups, the index is no longer properly restored to its
> initial value and is set to a range index starting at zero no matter what.
> Version 0.12.1 restored an int64 index with the correct index values.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(pa.__version__)
> df = pd.DataFrame(
>     {"a": [1, 2, 3, 4]}
> )
> print("total DF")
> print(df.index)
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf, chunk_size=2)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> parquet_file = pq.ParquetFile(reader)
> rg = parquet_file.read_row_group(1)
> df_restored = rg.to_pandas()
> print("Row group")
> print(df_restored.index)
> {code}
> Previous behavior
> {code:python}
> 0.12.1
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> Int64Index([2, 3], dtype='int64')
> {code}
> Behavior now
> {code:python}
> 0.13.0
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> RangeIndex(start=0, stop=2, step=1)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
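To make the point in the comment above concrete, here is a minimal sketch in plain Python (no pyarrow needed). It assumes the range descriptor has the kind/start/stop/step shape that the pandas metadata uses for a RangeIndex, and it uses built-in `range` objects as stand-ins for the index and the row groups of the reproducer. The variable names are illustrative only.

```python
# Illustration only: plain Python standing in for the pandas metadata and
# the row groups from the reproducer in this issue.

# Range descriptor stored when the full 4-row DataFrame was converted.
stored = {"kind": "range", "name": None, "start": 0, "stop": 4, "step": 1}

# The original index was trivial, so a "trivial only" rule still applies.
is_trivial = stored["start"] == 0 and stored["step"] == 1

full_index = range(stored["start"], stored["stop"], stored["step"])
chunk_size = 2  # as in pq.write_table(table, buf, chunk_size=2)

# Index values that row group 1 actually covers in the original frame.
expected_row_group_1 = list(full_index[chunk_size:2 * chunk_size])

# The stored range describes 4 values, but the row group has only 2 rows,
# so the descriptor is stale for the subset.
descriptor_fits = len(full_index) == len(expected_row_group_1)

# A default index over the row group's 2 rows, as reported for 0.13.0.
default_index = list(range(len(expected_row_group_1)))
```

The descriptor passes the triviality check yet no longer fits the subset, which is why restricting the optimization to trivial ranges would not, on its own, restore `[2, 3]` for the second row group.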
[jira] [Commented] (ARROW-5138) [Python/C++] Row group retrieval doesn't restore index properly
[ https://issues.apache.org/jira/browse/ARROW-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853418#comment-16853418 ]

Wes McKinney commented on ARROW-5138:
-------------------------------------

I think we should change the RangeIndex optimization so that it applies only to a trivial RangeIndex starting at 0 with step 1. Then this issue is resolved.
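The proposed rule could be sketched roughly as follows. This is a hypothetical illustration, not pyarrow's implementation: the function name and the "metadata"/"column" labels are made up for the example, and a built-in `range` stands in for a pandas RangeIndex.

```python
# Hypothetical sketch of the proposed rule; not pyarrow's actual code.

def index_storage_for(index_range):
    """Decide how a range-like index would be stored under the proposal.

    A built-in range stands in for a pandas RangeIndex here.
    """
    if index_range.start == 0 and index_range.step == 1:
        return "metadata"  # trivial: record only start/stop/step
    return "column"        # non-trivial: serialize the values as a column


trivial = index_storage_for(range(0, 4))
offset = index_storage_for(range(2, 4))
strided = index_storage_for(range(0, 8, 2))
```

Under this rule only the first case would be kept as metadata; the offset and strided ranges would be written out as ordinary index columns.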
[jira] [Commented] (ARROW-5138) [Python/C++] Row group retrieval doesn't restore index properly
[ https://issues.apache.org/jira/browse/ARROW-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833719#comment-16833719 ]

Joris Van den Bossche commented on ARROW-5138:
----------------------------------------------

The issue here is that there is a mismatch between the pandas metadata (of the original full dataframe) and the row group:

- the full pandas DataFrame is converted to an arrow Table, which includes this part in the metadata about the index:
{code}
"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}]
{code}
- the arrow Table is written to parquet as two RowGroups, but the pandas metadata is kept intact
- when reading a single row group, the pandas metadata still suggests a RangeIndex of length 4, while the row group is only of length 2. As a result, a default index is used (always starting at zero, for both row groups).

I am not sure this can be solved: you would need to start modifying the range start/stop values in the pandas metadata when splitting arrow tables that have such metadata, and similar issues will be encountered when e.g. slicing a Table. It seems to be the consequence of the choice to no longer serialize a RangeIndex.

[~fjetter] I think the best workaround for now would be to ensure your original data has an Int64Index instead of a RangeIndex, if you want to keep this working ({{df.index = pd.Int64Index(df.index)}}).

Given those issues, should we add an option to still include a RangeIndex in the actual schema?
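The metadata adjustment mentioned in the comment above (rewriting the range start/stop when a table is split or sliced) might look something like the sketch below. This is an assumption-laden illustration in plain Python: `slice_range_index_metadata` is a hypothetical helper, not a pyarrow function, and it operates on a plain dict shaped like the `index_columns` snippet quoted in the comment.

```python
import copy


# Hypothetical helper for illustration; pyarrow does not expose this function.
def slice_range_index_metadata(pandas_metadata, offset, length):
    """Return a copy of the metadata whose range descriptors cover only
    rows [offset, offset + length) of the original table."""
    updated = copy.deepcopy(pandas_metadata)
    for descr in updated["index_columns"]:
        # Only dict entries with kind "range" are rewritten; anything else
        # is left untouched.
        if isinstance(descr, dict) and descr.get("kind") == "range":
            original = range(descr["start"], descr["stop"], descr["step"])
            subset = original[offset:offset + length]
            descr["start"] = subset.start
            descr["stop"] = subset.stop
            descr["step"] = subset.step
    return updated


# Metadata of the full 4-row table, as quoted in the comment.
metadata = {
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 4, "step": 1}
    ]
}

# Second row group of two rows: rows 2 and 3 of the original table.
second_group = slice_range_index_metadata(metadata, offset=2, length=2)
```

With the descriptor rewritten to start at 2 and stop at 4, the subset's length matches the range again. Whether such bookkeeping is practical for every split and slice is exactly the open question raised in the comment.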
[jira] [Commented] (ARROW-5138) [Python/C++] Row group retrieval doesn't restore index properly
[ https://issues.apache.org/jira/browse/ARROW-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822879#comment-16822879 ]

Wes McKinney commented on ARROW-5138:
-------------------------------------

There seems to be a broken encapsulation here of pandas DataFrame serialization. Are you storing different DataFrame objects in different row groups?