[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393313#comment-16393313 ]

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173526240
File path: src/parquet/schema.h

@@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;

   NodePtr field(int i) const { return fields_[i]; }

+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned

Review comment:
Right, makes sense.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
> Reporter: Alexey Strokach
> Assignee: Antoine Pitrou
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.9.0
>
> I accidentally created a large number of Parquet files with two
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting
> the resulting tables to Pandas DataFrames or when saving the tables to
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet')  # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...), everything
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py`
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
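The failure mode in the report above can be sketched in pure Python (a hypothetical analogy, not pyarrow's actual internals): a name-to-index mapping built over duplicate column names keeps only one index per name, and that stored index can then disagree with the table's real layout.

```python
# Hypothetical sketch of the failure mode, not pyarrow's actual code.
names = ["x", "__index_level_0__", "y", "__index_level_0__"]

# A naive name -> index map silently collapses duplicates,
# keeping only the last index seen:
by_name = {name: i for i, name in enumerate(names)}
assert by_name["__index_level_0__"] == 3

# After one duplicate column is removed, the stored index is stale:
del names[3]
assert by_name["__index_level_0__"] >= len(names)
# In C++, dereferencing such an out-of-range index corrupts memory
# and surfaces as a segmentation fault rather than a Python-level error.
```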
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393031#comment-16393031 ]

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173486366
File path: src/parquet/schema.h

Review comment:
Yes, it was, it just wasn't necessarily the one expected by the caller according to its semantics.
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392841#comment-16392841 ]

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173441381
File path: src/parquet/schema.h

Review comment:
I meant that before your change, the index returned was always valid?
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392687#comment-16392687 ]

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173416072
File path: src/parquet/schema.h

Review comment:
As a side note, if @wesm wants to include this in the release, we can defer API improvements to a later PR.
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392653#comment-16392653 ]

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173404125
File path: src/parquet/schema.h

Review comment:
Why? We're still returning a valid index.
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391948#comment-16391948 ]

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173300635
File path: src/parquet/schema.h

Review comment:
> it could break compatibility

True, though IIUC wouldn't this potentially segfault if you tried to use the result to index into something?
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391924#comment-16391924 ]

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173294153
File path: src/parquet/schema.h

Review comment:
1) That's a good point. The fact that the container is unordered means it isn't guaranteed to retain insertion order, even for values which map to the same key (I would expect a straightforward implementation to maintain that order, though). I should probably remove the sentence above.

2) Because doing otherwise seems like it could break compatibility. Not sure how strongly you feel about it. The `std::string` overloads aren't used anymore in the parquet-cpp codebase, AFAICT.
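The "smallest index" semantics under discussion can be modeled in pure Python (a sketch of the idea, not the parquet-cpp `GroupNode::FieldIndex` implementation): record every index per name up front, so the lookup result does not depend on a hash container's unspecified iteration order.

```python
from collections import defaultdict

def build_field_index(field_names):
    # Map each name to every index at which it occurs (duplicates allowed).
    index = defaultdict(list)
    for i, name in enumerate(field_names):
        index[name].append(i)
    return index

def field_index(index, name):
    # Smallest index for `name`, or -1 if the name is absent --
    # independent of any hash container's iteration order.
    positions = index.get(name)
    return min(positions) if positions else -1

fields = ["a", "duplicate", "b", "duplicate"]
idx = build_field_index(fields)
assert field_index(idx, "duplicate") == 1  # smallest of {1, 3}
assert field_index(idx, "missing") == -1
```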
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391917#comment-16391917 ]

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173292620
File path: src/parquet/schema.h

Review comment:
Couple of questions:

* I see [this language regarding the iteration order](http://en.cppreference.com/w/cpp/container/unordered_multimap) of the values for a particular key in the multimap:

  > every group of elements whose keys compare equivalent (compare equal with key_eq() as the comparator) forms a contiguous subrange in the iteration order

  Does the "iteration order" here mean that the values are iterated over in the order in which they were inserted?

* Why did you choose to return the first one instead of returning `-1` (or maybe `-2`) for the `std::string` overload? Do we not want to provide a way to indicate that column indexes and column names are not 1:1 in the C++ API? Maybe that already exists.
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391449#comment-16391449 ]

pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371534052

Ok, the reason for the error is that a similar pattern needs fixing in `SchemaDescriptor`. Updating shortly.
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391418#comment-16391418 ]

pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371527191

Unfortunately this doesn't seem sufficient. If I add the following test, I get an error and a crash:

```diff
diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc b/src/parquet/arrow/arrow-reader-writer-test.cc
index 72e65d4..eb5a8ec 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -1669,6 +1669,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) {
   }
 }
 
+TEST(TestArrowReadWrite, TableWithDuplicateColumns) {
+  using ::arrow::ArrayFromVector;
+
+  auto f0 = field("duplicate", ::arrow::int8());
+  auto f1 = field("duplicate", ::arrow::int16());
+  auto schema = ::arrow::schema({f0, f1});
+
+  std::vector<int8_t> a0_values = {1, 2, 3};
+  std::vector<int16_t> a1_values = {14, 15, 16};
+
+  std::shared_ptr<Array> a0, a1;
+
+  ArrayFromVector<::arrow::Int8Type, int8_t>(a0_values, &a0);
+  ArrayFromVector<::arrow::Int16Type, int16_t>(a1_values, &a1);
+
+  auto table = Table::Make(schema,
+                           {std::make_shared<Column>(f0->name(), a0),
+                            std::make_shared<Column>(f1->name(), a1)});
+  CheckSimpleRoundtrip(table, table->num_rows());
+}
+
 TEST(TestArrowWrite, CheckChunkSize) {
   const int num_columns = 2;
   const int num_rows = 128;
```
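The C++ test above checks a write/read round-trip of a table with two same-named columns. The underlying requirement can be illustrated with a small pure-Python sketch (a hypothetical model, not the parquet-cpp API): a round-trip must match columns by position, because matching by name collapses duplicates.

```python
# Hypothetical sketch: why duplicate-named columns must be handled
# by position rather than by name during a round-trip.
columns = [("duplicate", [1, 2, 3]), ("duplicate", [14, 15, 16])]

# Name-based reassembly loses one of the two columns:
by_name = dict(columns)
assert len(by_name) == 1

def roundtrip(cols):
    # Pretend-serialize then deserialize, keyed by column position.
    serialized = [(i, name, vals) for i, (name, vals) in enumerate(cols)]
    return [(name, vals) for _, name, vals in sorted(serialized)]

# Position-based reassembly round-trips faithfully:
assert roundtrip(columns) == columns
```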
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391413#comment-16391413 ] ASF GitHub Bot commented on ARROW-1974: --- cpcloud commented on issue #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371525784
Thanks for doing this. Will review shortly
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391355#comment-16391355 ] Antoine Pitrou commented on ARROW-1974: --- With https://github.com/apache/parquet-cpp/pull/447, the {{to_pandas()}} call will fail with the following error:
{code:python}
  File "table.pxi", line 1059, in pyarrow.lib.Table.to_pandas
  File "/home/antoine/arrow/python/pyarrow/pandas_compat.py", line 611, in table_to_blockmanager
    columns = _flatten_single_level_multiindex(columns)
  File "/home/antoine/arrow/python/pyarrow/pandas_compat.py", line 673, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
{code}
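With the fix, the segfault becomes a clean `ValueError`, and the usual way past it is to make the column names unique before converting. A minimal sketch in plain Python (the helper name and the `_<n>` suffix scheme are invented for illustration, not a pyarrow API):

```python
# Hedged workaround sketch: rename duplicate columns before converting, since
# pandas_compat rejects a non-unique column index. The helper name and the
# "_<n>" suffix scheme are invented for illustration, not a pyarrow API.
def deduplicate_names(names):
    seen = {}
    out = []
    for name in names:
        n = seen.get(name, 0)
        out.append(name if n == 0 else "%s_%d" % (name, n))
        seen[name] = n + 1
    return out

# Two __index_level_0__ columns (as in this issue) become distinct names:
names = ["a", "__index_level_0__", "__index_level_0__"]
assert deduplicate_names(names) == ["a", "__index_level_0__", "__index_level_0___1"]
```

Dropping one duplicate with `table.remove_column(i)`, as in the issue description, remains the simpler fix when the duplicated column carries no extra information.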
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391353#comment-16391353 ] ASF GitHub Bot commented on ARROW-1974: --- pitrou opened a new pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391292#comment-16391292 ] Antoine Pitrou commented on ARROW-1974: --- The problem here is that {{FileReader::Impl::ReadTable}} creates a {{Table}} with a schema that has one more field than the number of physical columns. The underlying cause seems to be that {{ColumnIndicesToFieldIndices}} uses {{Group::FieldIndex}}, which looks up the field by name... Also, {{Group::Equals}} has somewhat surprising semantics (why doesn't {{GroupNode::FieldIndex(const Node& node)}} simply look up the node by pointer equality?).
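The contrast Antoine draws can be sketched in plain Python (a hypothetical model, not parquet-cpp code): name-based lookup maps every duplicate to the smallest matching index, so two physical columns resolve to the same field, while identity (pointer-equality) lookup keeps duplicates distinct.

```python
# Hypothetical model (not parquet-cpp code) of the two lookup strategies the
# comment contrasts. Name-based lookup maps every duplicate to the smallest
# matching index; identity (pointer-equality) lookup keeps duplicates distinct.
class Node:
    def __init__(self, name):
        self.name = name

fields = [Node("duplicate"), Node("duplicate")]

def index_by_name(fields, name):
    for i, f in enumerate(fields):
        if f.name == name:
            return i
    return -1  # negative value when not found

def index_by_identity(fields, node):
    for i, f in enumerate(fields):
        if f is node:  # pointer equality, as the comment suggests
            return i
    return -1

# Both duplicates collapse to index 0 under name lookup, but stay distinct
# under identity lookup -- the distinction the reader needs.
assert [index_by_name(fields, f.name) for f in fields] == [0, 0]
assert [index_by_identity(fields, f) for f in fields] == [0, 1]
```

This is exactly how the reader ends up with a schema of two fields but only one physical column: both column indices map to field 0, and the second field is never paired with data.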