[jira] [Created] (ARROW-9969) [C++] RecordBatchBuilder yields invalid result with dictionary fields
Pierre Belzile created ARROW-9969:
-------------------------------------

             Summary: [C++] RecordBatchBuilder yields invalid result with dictionary fields
                 Key: ARROW-9969
                 URL: https://issues.apache.org/jira/browse/ARROW-9969
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 1.0.1
            Reporter: Pierre Belzile

The record batch builder takes a schema as input and uses that schema when creating the record batch. However, when one or more fields are dictionaries, the actual data type is not known until the dictionary builder flushes, and it often does not match the initial schema. The builder needs to update the schema to the data type actually generated. This problem is easily reproduced by providing a schema with a field of type dictionary(int16(), utf8()) and adding a single row: the flushed column has a data type of dictionary(int8(), utf8()).
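For reference, a minimal reproduction sketch, assuming the Arrow 1.x C++ API; the function name Repro is illustrative, and it is an assumption that RecordBatchBuilder creates an arrow::StringDictionaryBuilder for this field:

{code:java}
#include <arrow/api.h>
#include <arrow/util/logging.h>

void Repro() {
  // Ask for a dictionary column with int16 indices.
  auto schema = arrow::schema(
      {arrow::field("s", arrow::dictionary(arrow::int16(), arrow::utf8()))});
  std::unique_ptr<arrow::RecordBatchBuilder> builder;
  ARROW_CHECK_OK(arrow::RecordBatchBuilder::Make(
      schema, arrow::default_memory_pool(), &builder));

  // Append a single row and flush.
  ARROW_CHECK_OK(
      builder->GetFieldAs<arrow::StringDictionaryBuilder>(0)->Append("a"));
  std::shared_ptr<arrow::RecordBatch> batch;
  ARROW_CHECK_OK(builder->Flush(&batch));

  // The schema still says dictionary(int16(), utf8()), but the column that
  // was actually built reports dictionary(int8(), utf8()).
  ARROW_LOG(INFO) << batch->schema()->field(0)->type()->ToString();
  ARROW_LOG(INFO) << batch->column(0)->type()->ToString();
}
{code}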
[jira] [Created] (ARROW-9863) [C++] [PARQUET] Optimize meta data recovery of ApplicationVersion
Pierre Belzile created ARROW-9863:
-------------------------------------

             Summary: [C++] [PARQUET] Optimize meta data recovery of ApplicationVersion
                 Key: ARROW-9863
                 URL: https://issues.apache.org/jira/browse/ARROW-9863
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
    Affects Versions: 1.0.0
            Reporter: Pierre Belzile

The class contains two large regexes which are compiled in the ApplicationVersion::ApplicationVersion(const std::string&) constructor. This is the constructor that is used when reading files. I stopped a server in gdb that had been processing several files at once, and 4 threads out of 8 were building those regexes!
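One way to avoid the repeated compilation, sketched under the assumption that the patterns can be hoisted into function-local statics; the pattern and names below are illustrative, not the actual metadata.cc code:

{code:java}
#include <regex>
#include <string>

// Compile once per process: initialization of a function-local static is
// thread-safe in C++11, so concurrent readers no longer each pay the cost.
const std::regex& ApplicationVersionRegex() {
  // Illustrative pattern only; the real regexes in metadata.cc are larger.
  static const std::regex kRegex(R"(^(\S+) version (\d+)\.(\d+)\.(\d+).*$)");
  return kRegex;
}

bool ParseCreatedBy(const std::string& created_by, std::smatch* out) {
  return std::regex_match(created_by, *out, ApplicationVersionRegex());
}
{code}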
[jira] [Created] (ARROW-9660) C++ IPC - dictionaries in maps
Pierre Belzile created ARROW-9660:
-------------------------------------

             Summary: C++ IPC - dictionaries in maps
                 Key: ARROW-9660
                 URL: https://issues.apache.org/jira/browse/ARROW-9660
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 1.0.0
            Reporter: Pierre Belzile

I created the following record batch, which has a single column whose type is a map with dictionary-encoded string keys and items:

{code:java}
arrow::MapBuilder map_builder(
    arrow::default_memory_pool(),
    std::make_shared<arrow::StringDictionaryBuilder>(),
    std::make_shared<arrow::StringDictionaryBuilder>());
auto key_builder =
    dynamic_cast<arrow::StringDictionaryBuilder*>(map_builder.key_builder());
auto item_builder =
    dynamic_cast<arrow::StringDictionaryBuilder*>(map_builder.item_builder());

// Add a first row with k=v for i 0..14
ASSERT_OK(map_builder.Append());
for (int i = 0; i < 15; ++i) {
  ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
  ASSERT_OK(item_builder->Append("v" + std::to_string(i)));
}

// Add a second row with k=w for i 0..14
ASSERT_OK(map_builder.Append());
for (int i = 0; i < 15; ++i) {
  ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
  ASSERT_OK(item_builder->Append("w" + std::to_string(i)));
}

std::shared_ptr<arrow::Array> array;
ASSERT_OK(map_builder.Finish(&array));
std::shared_ptr<arrow::Schema> schema =
    arrow::schema({arrow::field("s", array->type())});
std::shared_ptr<arrow::RecordBatch> batch =
    arrow::RecordBatch::Make(schema, array->length(), {array});
{code}

When one attempts to send this in a round-trip IPC:
# On IpcFormatWriter::Start(): the memo records one entry in field_to_id_ and id_to_type_ where the dict id = 0.
# On IpcFormatWriter::CollectDictionaries(): the memo records a new entry in field_to_id_ and id_to_type_ with id = 1 and also records it in id_to_dictionary_. At this point we have 2 entries, with the entry id = 0 having no associated dict.
# On IpcFormatWriter::WriteDictionaries(): it writes the dict with entry id = 1.

When reading:
# GetSchema eventually gets to the nested dictionary in FieldFromFlatBuffer.
# The recovered dict id is 0.
# This adds to the memo the field_to_id_ and id_to_type_ entries with id = 0.
# My round-trip code calls "ReadAll".
# RecordBatchStreamReaderImpl::ReadNext attempts to load the initial dicts.
# It recovers id = 1.
# The process aborts because id = 1 is not in the memo: dictionary_memo->GetDictionaryType(id, &type)

A similar example with a dict inside a "struct" worked fine and only used dict id = 0. So it looks like something goes wrong when gathering the schema for the map. Unless I did not construct the map correctly?
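For context, the round trip itself looks roughly like this; a sketch assuming the `batch` built above and the Arrow 1.0 stream IPC API, with RoundTrip as an illustrative name:

{code:java}
#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/reader.h>
#include <arrow/ipc/writer.h>

arrow::Status RoundTrip(const std::shared_ptr<arrow::RecordBatch>& batch) {
  // Write the batch to an in-memory stream.
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeStreamWriter(sink, batch->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  ARROW_RETURN_NOT_OK(writer->Close());
  ARROW_ASSIGN_OR_RAISE(auto buffer, sink->Finish());

  // Read it back; ReadAll is where the abort described above happens.
  auto source = std::make_shared<arrow::io::BufferReader>(buffer);
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchStreamReader::Open(source));
  std::shared_ptr<arrow::Table> table;
  return reader->ReadAll(&table);
}
{code}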
[jira] [Created] (ARROW-8657) Distinguish parquet version 2 logical type vs DataPageV2
Pierre Belzile created ARROW-8657:
-------------------------------------

             Summary: Distinguish parquet version 2 logical type vs DataPageV2
                 Key: ARROW-8657
                 URL: https://issues.apache.org/jira/browse/ARROW-8657
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.17.0
            Reporter: Pierre Belzile

With the recent release of 0.17, the ParquetVersion is used to define both the logical-type interpretation of fields and the selection of the DataPage format. As a result, all parquet files that were created with ParquetVersion::V2 to get features such as unsigned int32s, timestamps with nanosecond resolution, etc. are now unreadable. That's TBs of data in my case. Those two concerns should be separated. Given that DataPageV2 pages were not written prior to 0.17, and in order to allow reading existing files, the existing version property should continue to operate as it did in 0.16 and inform only the logical type mapping. Some consideration should be given to issuing a release 0.17.1.
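To make the coupling concrete, here is roughly how such files were produced; a sketch against the parquet-cpp writer-properties API, with MakeV2Props as an illustrative name:

{code:java}
#include <parquet/properties.h>

// Before 0.17 this single knob only selected the logical-type mapping
// (e.g. unsigned int32s, nanosecond timestamps); in 0.17 the same knob
// additionally switches the writer to DataPageV2 pages.
std::shared_ptr<parquet::WriterProperties> MakeV2Props() {
  return parquet::WriterProperties::Builder()
      .version(parquet::ParquetVersion::PARQUET_2_0)
      ->build();
}
{code}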
[jira] [Commented] (ARROW-8129) [C++][Compute] Refine compare sorting kernel
[ https://issues.apache.org/jira/browse/ARROW-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060566#comment-17060566 ]

Pierre Belzile commented on ARROW-8129:
---------------------------------------

Yibo, given that you are looking at this, I wonder if the following wouldn't be faster for CompareValues:

{code:java}
auto values_begin = values.raw_values();  // this also exists everywhere
std::stable_sort(indices_begin, nulls_begin,
                 [values_begin](uint64_t left, uint64_t right) {
                   return *(values_begin + left) < *(values_begin + right);
                 });
{code}

It avoids dereferencing the shared_ptr data_ and adding the offset twice for each comparison.

> [C++][Compute] Refine compare sorting kernel
> --------------------------------------------
>
>                 Key: ARROW-8129
>                 URL: https://issues.apache.org/jira/browse/ARROW-8129
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yibo Cai
>            Assignee: Yibo Cai
>            Priority: Major
>
> The sorting kernel implements two comparison functions:
> [CompareValues|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L67]
> uses array.Value() for numeric data and
> [CompareViews|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L72]
> uses array.GetView() for non-numeric ones. It can be simplified by using
> GetView() only, as all data types support GetView().
> To my surprise, a benchmark shows about 40% performance improvement after the
> change.
> After some digging, I found that in the current code the [comparison
> callback|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L94]
> is not inlined (check the disassembled code), so it leads to a function call.
> That's very bad for this hot loop. Using only GetView() fixes this issue;
> the code inlines okay.
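To illustrate the inlining point from the quoted description, here is a self-contained sketch, not the kernel's actual source: routing the comparison through a std::function forces an indirect call per comparison in the hot loop, while handing the lambda directly to std::stable_sort lets the compiler inline it.

{code:java}
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

void SortIndirect(std::vector<uint64_t>* idx, const double* values) {
  // Type-erased comparator: each comparison goes through an indirect call.
  std::function<bool(uint64_t, uint64_t)> cmp =
      [values](uint64_t l, uint64_t r) { return values[l] < values[r]; };
  std::stable_sort(idx->begin(), idx->end(), cmp);
}

void SortDirect(std::vector<uint64_t>* idx, const double* values) {
  // Lambda passed directly: the comparator inlines into the sort.
  std::stable_sort(
      idx->begin(), idx->end(),
      [values](uint64_t l, uint64_t r) { return values[l] < values[r]; });
}
{code}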
[jira] [Updated] (ARROW-8006) unsafe arrow dictionary recovered from parquet
[ https://issues.apache.org/jira/browse/ARROW-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Belzile updated ARROW-8006:
----------------------------------
    Description:

When an arrow dictionary of values=strings and indices=intx is written to parquet and recovered, the indices that correspond to null positions are not written. This causes two problems:
* when transposing the dictionary, the code encounters indices that are out of bounds of the existing dictionary. This does cause crashes.
* a potential security risk, because it's unclear whether bytes can be read back inadvertently.

I traced using GDB and found that:
# My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the valid bit is unset, that function increments "out" but does not set it. I think it should write a 0. [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396]
# The recovered "out" array is written to the dictionary builder using AppendIndices, which moves the memory as a bulk move without checking for nulls. Hence we end up with the indices buffer holding the "out" from above. [https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670] When transpose runs on this ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406]), it may attempt to access memory out of bounds.

While it would be possible to fix "transpose" and other functions that process dictionary indices (e.g. compare for sorting), it seems safer to initialize to 0. That is also the default behavior of the arrow dict builder when appending one or more nulls.

Incidentally, the code recovers the dict with int32 indices instead of the original int8, but I guess that is covered by another activity.
[jira] [Created] (ARROW-8006) unsafe arrow dictionary recovered from parquet
Pierre Belzile created ARROW-8006:
-------------------------------------

             Summary: unsafe arrow dictionary recovered from parquet
                 Key: ARROW-8006
                 URL: https://issues.apache.org/jira/browse/ARROW-8006
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.15.1
            Reporter: Pierre Belzile

When an arrow dictionary of values=strings and indices=intx is written to parquet and recovered, the indices that correspond to null positions are not written. This causes two problems:
* when transposing the dictionary, the code encounters indices that are out of bounds of the existing dictionary. This does cause crashes.
* a potential security risk, because it's unclear whether bytes can be read back inadvertently.

I traced using GDB and found that:
# My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the valid bit is unset, that function increments "out" but does not set it. I think it should write a 0. [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396]
# The recovered "out" array is written to the dictionary builder using AppendIndices, which moves the memory as a bulk move without checking for nulls. Hence we end up with the indices buffer holding the "out" from above. [https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670] When transpose runs on this ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406]), it may attempt to access memory out of bounds.

While it would be possible to fix "transpose" and other functions that process dictionary indices (e.g. compare for sorting), it seems safer to initialize to 0. That is also the default behavior of the arrow dict builder when appending one or more nulls.

Incidentally, the code recovers the dict with int32 indices instead of the original int8, but I guess that is covered by another activity.
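A sketch of the behaviour proposed in step 1 above; this shows only the shape of the fix, not the actual rle_encoding.h code, and the function name is illustrative:

{code:java}
#include <cstdint>

// Spaced decode: copy decoded values into valid slots and, per the proposed
// fix, zero-fill null slots instead of skipping them, so null positions never
// expose stale buffer contents to later consumers such as transpose.
template <typename T>
void DecodeSpacedSketch(const T* decoded, const uint8_t* valid_bits,
                        int64_t num_values, T* out) {
  int64_t decoded_pos = 0;
  for (int64_t i = 0; i < num_values; ++i) {
    const bool is_valid = (valid_bits[i / 8] >> (i % 8)) & 1;
    // The reported behaviour advances past null slots without writing them.
    out[i] = is_valid ? decoded[decoded_pos++] : T(0);
  }
}
{code}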
[jira] [Created] (ARROW-7376) parquet NaN/null double statistics can result in endless loop
Pierre Belzile created ARROW-7376:
-------------------------------------

             Summary: parquet NaN/null double statistics can result in endless loop
                 Key: ARROW-7376
                 URL: https://issues.apache.org/jira/browse/ARROW-7376
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.15.1
            Reporter: Pierre Belzile

There is a bug in the doubles column statistics computation when writing to parquet an array containing only NaNs and nulls: it loops endlessly if the last cell of a write group is a null. The line in error is [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633], which checks for NaN but not for null. The code then falls through, loops endlessly, and causes the program to appear frozen. This code snippet reproduces the problem:

{noformat}
TEST(parquet, nans) {
  /* Create a small parquet structure */
  std::vector<std::shared_ptr<::arrow::Field>> fields;
  fields.push_back(::arrow::field("doubles", ::arrow::float64()));
  std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));

  std::unique_ptr<::arrow::RecordBatchBuilder> builder;
  ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(), &builder);
  builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(std::numeric_limits<double>::quiet_NaN());
  builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
  std::shared_ptr<::arrow::RecordBatch> batch;
  builder->Flush(&batch);
  arrow::PrettyPrint(*batch, 0, &std::cout);
  std::shared_ptr<arrow::Table> table;
  arrow::Table::FromRecordBatches({batch}, &table);

  /* Attempt to write */
  std::shared_ptr<::arrow::io::FileOutputStream> os;
  arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
  parquet::WriterProperties::Builder writer_props_bld;
  // writer_props_bld.disable_statistics("doubles");
  std::shared_ptr<parquet::WriterProperties> writer_props = writer_props_bld.build();
  std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  std::unique_ptr<parquet::arrow::FileWriter> writer;
  parquet::arrow::FileWriter::Open(*table->schema(), arrow::default_memory_pool(), os,
                                   writer_props, arrow_props, &writer);
  writer->WriteTable(*table, 1024);
}
{noformat}