[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393313#comment-16393313
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173526240
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   Right, makes sense.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet')  # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...), everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393031#comment-16393031
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173486366
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   Yes, it was, it just wasn't necessarily the one expected by the caller 
according to its semantics.


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392841#comment-16392841
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173441381
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   I meant that before your change, the index returned was always valid?


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392687#comment-16392687
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173416072
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   As a side note, if @wesm wants to include this in the release, we can defer 
API improvements to a later PR.


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392653#comment-16392653
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173404125
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   Why? We're still returning a valid index.


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391948#comment-16391948
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173300635
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   > it could break compatibility
   
   True, though IIUC wouldn't this potentially segfault if you tried to use the 
result to index into something?


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391924#comment-16391924
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173294153
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   1) That's a good point. The fact that the container is unordered means it 
isn't guaranteed to retain insertion order, even for values which map to the 
same key (I would expect a straightforward implementation to maintain that 
order, though). I should probably remove the sentence above.
   
   2) Because doing otherwise seems like it could break compatibility. Not sure 
how strongly you feel about it. The `std::string` overloads aren't used anymore 
in the parquet-cpp codebase, AFAICT.
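   
   A minimal sketch of a smallest-index lookup that does not depend on the 
multimap's ordering of equal keys: scan the `equal_range` result and take the 
minimum. The names below (`field_name_to_idx_`, the wrapper class) are 
illustrative assumptions, not the actual parquet-cpp implementation.

```cpp
// Sketch: return the smallest index for a name, or -1 if not found.
// Correct regardless of how the unordered_multimap orders equal keys.
#include <algorithm>
#include <string>
#include <unordered_map>

class GroupNodeSketch {
 public:
  int FieldIndex(const std::string& name) const {
    auto range = field_name_to_idx_.equal_range(name);
    if (range.first == range.second) {
      return -1;  // no field with this name
    }
    int min_idx = range.first->second;
    for (auto it = range.first; it != range.second; ++it) {
      min_idx = std::min(min_idx, it->second);
    }
    return min_idx;
  }

 private:
  std::unordered_multimap<std::string, int> field_name_to_idx_;
};
```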


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391917#comment-16391917
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173292620
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   Couple of questions:
   
   * I see [this language regarding the iteration 
order](http://en.cppreference.com/w/cpp/container/unordered_multimap) of the 
values for a particular key in the multimap:
   
   > every group of elements whose keys compare equivalent (compare equal with 
key_eq() as the comparator) forms a contiguous subrange in the iteration order
   
   Does the `iteration order` here mean that the values are iterated over in 
the order in which they were inserted?
   
   * Why did you choose to return the first one instead of returning `-1` (or 
maybe `-2`) for the `std::string` overload? Do we not want to provide a way to 
indicate that column indexes and column names are not 1:1 in the C++ API? Maybe 
that already exists.
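   
   For reference, a small self-contained demo of the guarantee quoted above: 
equal keys form one contiguous subrange reachable via `equal_range`, but the 
standard does not promise that this subrange preserves insertion order. 
Hypothetical example, not parquet-cpp code.

```cpp
// Demo: equal keys in an unordered_multimap are contiguous in iteration
// order, but their relative order is not guaranteed to match insertion.
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
  std::unordered_multimap<std::string, int> m;
  m.emplace("duplicate", 0);
  m.emplace("other", 1);
  m.emplace("duplicate", 2);  // same key, inserted later

  auto range = m.equal_range("duplicate");
  for (auto it = range.first; it != range.second; ++it) {
    // Prints 0 and 2 (in some implementation-defined order).
    std::cout << it->first << " -> " << it->second << "\n";
  }
  return 0;
}
```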


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391449#comment-16391449
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371534052
 
 
   Ok, the reason for the error is that a similar pattern needs fixing in 
`SchemaDescriptor`. Updating shortly.


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391418#comment-16391418
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371527191
 
 
   Unfortunately this doesn't seem sufficient. If I add the following test, I 
get an error and a crash:
```diff
diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc b/src/parquet/arrow/arrow-reader-writer-test.cc
index 72e65d4..eb5a8ec 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -1669,6 +1669,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) {
   }
 }
 
+TEST(TestArrowReadWrite, TableWithDuplicateColumns) {
+  using ::arrow::ArrayFromVector;
+
+  auto f0 = field("duplicate", ::arrow::int8());
+  auto f1 = field("duplicate", ::arrow::int16());
+  auto schema = ::arrow::schema({f0, f1});
+
+  std::vector<int8_t> a0_values = {1, 2, 3};
+  std::vector<int16_t> a1_values = {14, 15, 16};
+
+  std::shared_ptr<Array> a0, a1;
+
+  ArrayFromVector<::arrow::Int8Type, int8_t>(a0_values, &a0);
+  ArrayFromVector<::arrow::Int16Type, int16_t>(a1_values, &a1);
+
+  auto table = Table::Make(schema,
+                           {std::make_shared<Column>(f0->name(), a0),
+                            std::make_shared<Column>(f1->name(), a1)});
+  CheckSimpleRoundtrip(table, table->num_rows());
+}
+
 TEST(TestArrowWrite, CheckChunkSize) {
   const int num_columns = 2;
   const int num_rows = 128;
```


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391413#comment-16391413
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on issue #447: ARROW-1974: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371525784
 
 
   Thanks for doing this. Will review shortly.


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391355#comment-16391355
 ] 

Antoine Pitrou commented on ARROW-1974:
---

With https://github.com/apache/parquet-cpp/pull/447, the {{to_pandas()}} call 
will fail with the following error:
{code:python}
  File "table.pxi", line 1059, in pyarrow.lib.Table.to_pandas
  File "/home/antoine/arrow/python/pyarrow/pandas_compat.py", line 611, in 
table_to_blockmanager
columns = _flatten_single_level_multiindex(columns)
  File "/home/antoine/arrow/python/pyarrow/pandas_compat.py", line 673, in 
_flatten_single_level_multiindex
raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
{code}


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391353#comment-16391353
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou opened a new pull request #447: ARROW-1974: Fix creating Arrow table 
with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447
 
 
   


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391292#comment-16391292
 ] 

Antoine Pitrou commented on ARROW-1974:
---

The problem here is that {{FileReader::Impl::ReadTable}} creates a {{Table}} 
with a schema that has one more field than the number of physical columns. The 
underlying cause seems to be that {{ColumnIndicesToFieldIndices}} uses 
{{Group::FieldIndex}}, which looks up the field by name... Also, 
{{Group::Equals}} has somewhat surprising semantics (why doesn't 
{{GroupNode::FieldIndex(const Node& node)}} simply look up the node by pointer 
equality?).
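
A minimal sketch of the pointer-identity lookup suggested above, with assumed 
standalone types ({{fields_}} matches the member seen in the review diff; the 
rest is hypothetical, not the actual parquet-cpp implementation):
{code:cpp}
// Sketch: resolve a field to its index by node identity rather than by
// name, so duplicate column names cannot alias to the wrong index.
#include <memory>
#include <string>
#include <vector>

struct Node {
  std::string name;
};
using NodePtr = std::shared_ptr<Node>;

struct GroupNodeSketch {
  std::vector<NodePtr> fields_;

  int FieldIndex(const Node& node) const {
    for (size_t i = 0; i < fields_.size(); ++i) {
      if (fields_[i].get() == &node) {
        return static_cast<int>(i);  // matched by pointer equality
      }
    }
    return -1;  // node is not a child of this group
  }
};
{code}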
