[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1974: --- Assignee: Wes McKinney (was: Antoine Pitrou) > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. > {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1974: --- Assignee: Antoine Pitrou (was: Phillip Cloud) > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. > {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phillip Cloud reassigned ARROW-1974: Assignee: Phillip Cloud > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Minor > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. > {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)