[jira] [Assigned] (ARROW-7376) [C++] parquet NaN/null double statistics can result in endless loop
[ https://issues.apache.org/jira/browse/ARROW-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson reassigned ARROW-7376:
--
    Assignee: Neal Richardson  (was: Francois Saint-Jacques)

> [C++] parquet NaN/null double statistics can result in endless loop
> ---
>
>                 Key: ARROW-7376
>                 URL: https://issues.apache.org/jira/browse/ARROW-7376
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.15.1
>            Reporter: Pierre Belzile
>            Assignee: Neal Richardson
>            Priority: Critical
>              Labels: parquet, pull-request-available
>             Fix For: 0.16.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> There is a bug in the doubles column statistics computation when writing to
> parquet an array with only NaNs and nulls. It loops endlessly if the last
> cell of a write group is a Null. The line in error is
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633]
> which checks for NaN but not for Null. Code then falls through and loops
> endlessly and causes the program to appear frozen.
> This code snippet reproduces the problem:
> {noformat}
> TEST(parquet, nans) {
>   /* Create a small parquet structure */
>   std::vector<std::shared_ptr<::arrow::Field>> fields;
>   fields.push_back(::arrow::field("doubles", ::arrow::float64()));
>   std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));
>
>   /* One NaN followed by one null: the last cell of the write group is null */
>   std::unique_ptr<::arrow::RecordBatchBuilder> builder;
>   ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(), &builder);
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(std::numeric_limits<double>::quiet_NaN());
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
>
>   std::shared_ptr<::arrow::RecordBatch> batch;
>   builder->Flush(&batch);
>   arrow::PrettyPrint(*batch, 0, &std::cout);
>
>   std::shared_ptr<arrow::Table> table;
>   arrow::Table::FromRecordBatches({batch}, &table);
>
>   /* Attempt to write */
>   std::shared_ptr<::arrow::io::FileOutputStream> os;
>   arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
>   parquet::WriterProperties::Builder writer_props_bld;
>   // writer_props_bld.disable_statistics("doubles");
>   std::shared_ptr<parquet::WriterProperties> writer_props = writer_props_bld.build();
>   std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
>       parquet::ArrowWriterProperties::Builder().store_schema()->build();
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   parquet::arrow::FileWriter::Open(
>       *table->schema(), arrow::default_memory_pool(), os,
>       writer_props, arrow_props, &writer);
>   writer->WriteTable(*table, 1024);
> }{noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
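
The failure mode described in the report (a scan that tests for NaN but not for null, and so never reaches a stopping condition) can be illustrated with a small standalone sketch. This is not the code in statistics.cc and not the fix that was merged: the function name `MinMaxSkippingNaNsAndNulls` and the use of `std::vector<bool>` as a stand-in for a validity bitmap are assumptions made purely for illustration. The point it tries to show is that the index advances for every slot, null or NaN, so an input containing only NaNs and nulls terminates and simply yields no statistics.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical helper (not Arrow's API): returns (min, max) over the entries
// that are both valid (non-null) and not NaN, or std::nullopt if there are none.
// `valid[i] == false` plays the role of a null slot in a validity bitmap.
inline std::optional<std::pair<double, double>> MinMaxSkippingNaNsAndNulls(
    const std::vector<double>& values, const std::vector<bool>& valid) {
  assert(values.size() == valid.size());
  const std::size_t n = values.size();

  // Find the first usable value. The index is advanced on every iteration,
  // whether the slot is null or NaN, so the scan is bounded by n.
  std::size_t i = 0;
  while (i < n && (!valid[i] || std::isnan(values[i]))) {
    ++i;
  }
  if (i == n) {
    return std::nullopt;  // only NaNs and/or nulls: no min/max to report
  }

  double min = values[i];
  double max = values[i];
  for (++i; i < n; ++i) {
    if (!valid[i] || std::isnan(values[i])) continue;  // skip nulls and NaNs
    if (values[i] < min) min = values[i];
    if (values[i] > max) max = values[i];
  }
  return std::make_pair(min, max);
}
```

With the input from the reproduction above (one NaN followed by one null), this sketch returns an empty optional instead of spinning; a caller would then skip writing min/max statistics for that column chunk.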
[jira] [Assigned] (ARROW-7376) [C++] parquet NaN/null double statistics can result in endless loop
[ https://issues.apache.org/jira/browse/ARROW-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson reassigned ARROW-7376:
--
    Assignee: Francois Saint-Jacques  (was: Neal Richardson)

> [C++] parquet NaN/null double statistics can result in endless loop
> ---
>
>                 Key: ARROW-7376
>                 URL: https://issues.apache.org/jira/browse/ARROW-7376
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.15.1
>            Reporter: Pierre Belzile
>            Assignee: Francois Saint-Jacques
>            Priority: Critical
>              Labels: parquet, pull-request-available
>             Fix For: 0.16.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
[jira] [Assigned] (ARROW-7376) [C++] parquet NaN/null double statistics can result in endless loop
[ https://issues.apache.org/jira/browse/ARROW-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques reassigned ARROW-7376:
-
    Assignee: Francois Saint-Jacques

> [C++] parquet NaN/null double statistics can result in endless loop
> ---
>
>                 Key: ARROW-7376
>                 URL: https://issues.apache.org/jira/browse/ARROW-7376
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.15.1
>            Reporter: Pierre Belzile
>            Assignee: Francois Saint-Jacques
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.16.0