[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410584#comment-16410584
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

wesm commented on issue #447: PARQUET-1245: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-375495170
 
 
   @cpcloud @pitrou we may want to bump parquet-cpp in conda-forge to pick up 
this patch at some point


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Segfault when writing Arrow table with duplicate columns
> --
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> import pyarrow.parquet as pq
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet')  # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...), everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.
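
The workaround above hard-codes the duplicate column's index. As a sketch (not part of the original report), the duplicate positions can be located generically from the list of column names; the helper below is hypothetical and only assumes a plain Python list of names, such as the one pyarrow's `table.schema.names` yields:

```python
from collections import Counter

def duplicate_indices(names):
    """Return the indices of every column name that occurs more than once."""
    counts = Counter(names)
    return [i for i, name in enumerate(names) if counts[name] > 1]

# Example with a duplicated index column, as in the report:
names = ['a', '__index_level_0__', 'b', '__index_level_0__']
print(duplicate_indices(names))  # [1, 3]
```

On unpatched builds, dropping all but the first of the reported indices before calling `to_pandas()` or `write_table()` sidesteps the crash.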



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410582#comment-16410582
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

wesm closed pull request #447: PARQUET-1245: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc 
b/src/parquet/arrow/arrow-reader-writer-test.cc
index 72e65d47..f06f4a87 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -1669,6 +1669,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) {
   }
 }
 
+TEST(TestArrowReadWrite, TableWithDuplicateColumns) {
+  // See ARROW-1974
+  using ::arrow::ArrayFromVector;
+
+  auto f0 = field("duplicate", ::arrow::int8());
+  auto f1 = field("duplicate", ::arrow::int16());
+  auto schema = ::arrow::schema({f0, f1});
+
+  std::vector<int8_t> a0_values = {1, 2, 3};
+  std::vector<int16_t> a1_values = {14, 15, 16};
+
+  std::shared_ptr<Array> a0, a1;
+
+  ArrayFromVector<::arrow::Int8Type, int8_t>(a0_values, &a0);
+  ArrayFromVector<::arrow::Int16Type, int16_t>(a1_values, &a1);
+
+  auto table = Table::Make(schema, {std::make_shared<Column>(f0->name(), a0),
+                                    std::make_shared<Column>(f1->name(), a1)});
+  CheckSimpleRoundtrip(table, table->num_rows());
+}
+
 TEST(TestArrowWrite, CheckChunkSize) {
   const int num_columns = 2;
   const int num_rows = 128;
diff --git a/src/parquet/arrow/arrow-schema-test.cc 
b/src/parquet/arrow/arrow-schema-test.cc
index d502d243..da6af528 100644
--- a/src/parquet/arrow/arrow-schema-test.cc
+++ b/src/parquet/arrow/arrow-schema-test.cc
@@ -165,6 +165,31 @@ TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) {
   CheckFlatSchema(arrow_schema);
 }
 
+TEST_F(TestConvertParquetSchema, DuplicateFieldNames) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+
+  parquet_fields.push_back(
+  PrimitiveNode::Make("xxx", Repetition::REQUIRED, ParquetType::BOOLEAN));
+  auto arrow_field1 = std::make_shared<Field>("xxx", BOOL, false);
+
+  parquet_fields.push_back(
+  PrimitiveNode::Make("xxx", Repetition::REQUIRED, ParquetType::INT32));
+  auto arrow_field2 = std::make_shared<Field>("xxx", INT32, false);
+
+  ASSERT_OK(ConvertSchema(parquet_fields));
+  arrow_fields = {arrow_field1, arrow_field2};
+  CheckFlatSchema(std::make_shared<::arrow::Schema>(arrow_fields));
+
+  ASSERT_OK(ConvertSchema(parquet_fields, std::vector<int>({0, 1})));
+  arrow_fields = {arrow_field1, arrow_field2};
+  CheckFlatSchema(std::make_shared<::arrow::Schema>(arrow_fields));
+
+  ASSERT_OK(ConvertSchema(parquet_fields, std::vector<int>({1, 0})));
+  arrow_fields = {arrow_field2, arrow_field1};
+  CheckFlatSchema(std::make_shared<::arrow::Schema>(arrow_fields));
+}
+
 TEST_F(TestConvertParquetSchema, ParquetKeyValueMetadata) {
   std::vector<NodePtr> parquet_fields;
   std::vector<std::shared_ptr<Field>> arrow_fields;
diff --git a/src/parquet/arrow/reader.cc b/src/parquet/arrow/reader.cc
index bd68ec32..78c3225a 100644
--- a/src/parquet/arrow/reader.cc
+++ b/src/parquet/arrow/reader.cc
@@ -443,7 +443,7 @@ Status FileReader::Impl::ReadRowGroup(int row_group_index,
 }
 
 Status FileReader::Impl::ReadTable(const std::vector<int>& indices,
-                                   std::shared_ptr<Table>* table) {
+                                   std::shared_ptr<Table>* out) {
   std::shared_ptr<::arrow::Schema> schema;
   RETURN_NOT_OK(GetSchema(indices, &schema));
 
@@ -473,7 +473,9 @@ Status FileReader::Impl::ReadTable(const std::vector<int>& indices,
     RETURN_NOT_OK(ParallelFor(nthreads, num_fields, ReadColumnFunc));
   }
 
-  *table = Table::Make(schema, columns);
+  std::shared_ptr<Table> table = Table::Make(schema, columns);
+  RETURN_NOT_OK(table->Validate());
+  *out = table;
   return Status::OK();
 }
 
diff --git a/src/parquet/schema-test.cc b/src/parquet/schema-test.cc
index c8cce9fa..ec9aff42 100644
--- a/src/parquet/schema-test.cc
+++ b/src/parquet/schema-test.cc
@@ -292,6 +292,17 @@ class TestGroupNode : public ::testing::Test {
 
     return fields;
   }
+
+  NodeVector Fields2() {
+    // Fields with a duplicate name
+    NodeVector fields;
+
+    fields.push_back(Int32("duplicate", Repetition::REQUIRED));
+    fields.push_back(Int64("unique"));
+    fields.push_back(Double("duplicate"));
+
+    return fields;
+  }
 };
 
 TEST_F(TestGroupNode, Attrs) {
@@ -346,14 +357,23 @@ TEST_F(TestGroupNode, FieldIndex) {
   GroupNode group("group", Repetition::REQUIRED, fields);
   for (size_t i = 0; i < fields.size(); i++) {
     auto field = group.field(static_cast<int>(i));
-    ASSERT_EQ(i, group.FieldIndex(*field.get()));
+
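
The substance of the reader.cc change above is a validate-before-return pattern: assemble the table, check its invariants, and only then hand it to the caller, so a malformed file surfaces as an error instead of a later crash. A minimal language-neutral sketch of the same pattern in Python (the names here are illustrative, not pyarrow or parquet-cpp API):

```python
class ValidationError(Exception):
    """Raised when an assembled table violates its structural invariants."""
    pass

def make_table(schema_names, columns):
    """Assemble a table-like dict and validate it before returning."""
    # Invariant 1: one column per schema field.
    if len(schema_names) != len(columns):
        raise ValidationError("column count does not match schema")
    # Invariant 2: all columns have the same length.
    if len({len(c) for c in columns}) > 1:
        raise ValidationError("columns have differing lengths")
    return {"names": schema_names, "columns": columns}

# Duplicate names are deliberately allowed, matching the fix: the patch
# makes duplicates work rather than rejecting them, while Validate()
# catches genuine structural mismatches.
table = make_table(["duplicate", "duplicate"], [[1, 2, 3], [14, 15, 16]])
print(len(table["columns"]))  # 2
```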

[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408097#comment-16408097
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

pitrou commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r176127016
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& 
node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   Ok, I've turned the argument back into a const-reference.




[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408065#comment-16408065
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

wesm commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r176119221
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& 
node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   Sorry for the delay on this. I think we should consistently use 
const references (unless null is a valid input), comparing pointer equality 
with `&` where necessary. See the commentary in 
https://google.github.io/styleguide/cppguide.html#Reference_Arguments




[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397186#comment-16397186
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

xhochy commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r174192351
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& 
node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   @wesm What is your opinion on this?




[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397136#comment-16397136
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

pitrou commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r174183059
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& 
node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   By the way, other methods such as `Node::Equals` take a node pointer, even 
though passing a null pointer isn't supported. Should I still convert back to a 
reference?




[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395748#comment-16395748
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

pitrou commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173916004
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& 
node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   Hmm... I didn't know this guideline (though out parameters are passed as 
pointers, apparently?).
   The search is done using pointer equality, so I thought passing a pointer 
would be more explicit (and perhaps less error-prone, in case the compiler 
makes a temporary copy).




[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395742#comment-16395742
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

xhochy commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173914725
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& 
node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   Why did this change from reference to pointer? We use references everywhere 
where the passed object cannot be null.




[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395701#comment-16395701
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

wesm commented on issue #447: PARQUET-1245: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-372422163
 
 
   Moved the JIRA from Arrow to Parquet

