[jira] [Created] (ARROW-9969) [C++] RecordBatchBuilder yields invalid result with dictionary fields

2020-09-10 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-9969:
-

 Summary: [C++] RecordBatchBuilder yields invalid result with 
dictionary fields
 Key: ARROW-9969
 URL: https://issues.apache.org/jira/browse/ARROW-9969
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.1
Reporter: Pierre Belzile


The record batch builder takes a schema as input and uses that schema when 
creating the record batch.

However, when one or more fields are dictionaries, the actual index type is 
unknown until the dictionary builder flushes, and it often does not match the 
initial schema. The builder needs to update the schema to reflect the data type 
actually generated.

This problem is easily reproduced by providing a schema with a field of type 
dictionary(int16(), utf8()) and adding a single row. This yields a data type of 
dictionary(int8(), utf8()).
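
For reference, a minimal reproduction sketch, assuming the Arrow 1.0 C++ API; 
the concrete builder type fetched with GetFieldAs is my guess at what 
RecordBatchBuilder creates for a dictionary field:
{code:java}
#include <arrow/api.h>

arrow::Status Repro() {
  // The schema promises dictionary(int16(), utf8())...
  auto schema = arrow::schema(
      {arrow::field("s", arrow::dictionary(arrow::int16(), arrow::utf8()))});
  std::unique_ptr<arrow::RecordBatchBuilder> builder;
  ARROW_RETURN_NOT_OK(arrow::RecordBatchBuilder::Make(
      schema, arrow::default_memory_pool(), &builder));
  ARROW_RETURN_NOT_OK(
      builder->GetFieldAs<arrow::DictionaryBuilder<arrow::StringType>>(0)
          ->Append("x"));
  std::shared_ptr<arrow::RecordBatch> batch;
  ARROW_RETURN_NOT_OK(builder->Flush(&batch));
  // ...but the flushed column is dictionary(int8(), utf8()), so the batch
  // no longer matches its own schema.
  return batch->Validate();
}
{code}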



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9863) [C++] [PARQUET] Optimize meta data recovery of ApplicationVersion

2020-08-26 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-9863:
-

 Summary: [C++] [PARQUET] Optimize meta data recovery of 
ApplicationVersion
 Key: ARROW-9863
 URL: https://issues.apache.org/jira/browse/ARROW-9863
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.0
Reporter: Pierre Belzile


The class contains two large regexes which are compiled in the 
ApplicationVersion::ApplicationVersion(const std::string&) constructor. This is 
the constructor used when reading files.

I stopped a server in gdb that had been processing several files at once, and 4 
threads out of 8 were building those regexes!
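
A minimal sketch of the suggested fix: compile the regexes once per process 
rather than on every construction. The class, member name and pattern below are 
illustrative, not the actual parquet source:
{code:java}
#include <regex>
#include <string>

struct ApplicationVersionSketch {
  explicit ApplicationVersionSketch(const std::string& created_by) {
    // Function-local statics are initialized exactly once, thread-safely
    // (C++11), so concurrent file reads no longer each pay the regex
    // compilation cost.
    static const std::regex app_regex(
        R"((\w+) version (\d+)\.(\d+)\.(\d+).*)");
    std::smatch match;
    if (std::regex_match(created_by, match, app_regex)) {
      application_ = match[1];
    }
  }
  std::string application_;
};
{code}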



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9660) C++ IPC - dictionaries in maps

2020-08-05 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-9660:
-

 Summary: C++ IPC - dictionaries in maps
 Key: ARROW-9660
 URL: https://issues.apache.org/jira/browse/ARROW-9660
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.0
Reporter: Pierre Belzile


I created the following record batch, which has a single column with a type of 
map(dict, dict), where dict is a string dictionary:
{code:java}
arrow::MapBuilder map_builder(arrow::default_memory_pool(),
    std::make_shared<arrow::StringDictionaryBuilder>(),
    std::make_shared<arrow::StringDictionaryBuilder>());
auto key_builder =
    dynamic_cast<arrow::StringDictionaryBuilder*>(map_builder.key_builder());
auto item_builder =
    dynamic_cast<arrow::StringDictionaryBuilder*>(map_builder.item_builder());

// Add a first row with k=v for i 0..14;
ASSERT_OK(map_builder.Append());
for (int i = 0; i < 15; ++i) {
  ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
  ASSERT_OK(item_builder->Append("v" + std::to_string(i)));
}
// Add a second row with k=w for i 0..14;
ASSERT_OK(map_builder.Append());
for (int i = 0; i < 15; ++i) {
  ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
  ASSERT_OK(item_builder->Append("w" + std::to_string(i)));
}
std::shared_ptr<arrow::Array> array;
ASSERT_OK(map_builder.Finish(&array));
std::shared_ptr<arrow::Schema> schema =
    arrow::schema({arrow::field("s", array->type())});
std::shared_ptr<arrow::RecordBatch> batch =
    arrow::RecordBatch::Make(schema, array->length(), {array});
{code}
When one attempts to send this in a round-trip IPC:
 # On IpcFormatWriter::Start(): the memo records one entry in field_to_id_ and 
id_to_type_ with dict id = 0.
 # On IpcFormatWriter::CollectDictionaries(): the memo records a new entry in 
field_to_id_ and id_to_type_ with id = 1 and also records it in 
id_to_dictionary_. At this point we have 2 entries, with the entry id = 0 
having no associated dict.
 # On IpcFormatWriter::WriteDictionaries(): it writes the dict with entry = 1.

When reading:
 # GetSchema eventually gets to the nested dictionary in FieldFromFlatBuffer.
 # The recovered dict id is 0.
 # This adds field_to_id_ and id_to_type_ entries to the memo with id = 0.
 # My round-trip code calls "ReadAll".
 # RecordBatchStreamReaderImpl::ReadNext attempts to load the initial dicts.
 # It recovers id = 1.
 # The process aborts because id = 1 is not in the memo: 
dictionary_memo->GetDictionaryType(id, &type)

A similar example with a dict inside a "struct" worked fine and used only dict 
id = 0. So it looks like something goes wrong when gathering the schema for the 
map. Unless I did not construct the map correctly?
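
For reference, a minimal sketch of the round trip I am running, assuming the 
schema and batch built above (ASSERT_OK_AND_ASSIGN is the Arrow gtest helper; 
APIs as of 1.0):
{code:java}
// Write the batch to an in-memory IPC stream, then read it back.
ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
ASSERT_OK_AND_ASSIGN(auto writer,
                     arrow::ipc::NewStreamWriter(sink.get(), schema));
ASSERT_OK(writer->WriteRecordBatch(*batch));
ASSERT_OK(writer->Close());
ASSERT_OK_AND_ASSIGN(auto buffer, sink->Finish());

auto source = std::make_shared<arrow::io::BufferReader>(buffer);
ASSERT_OK_AND_ASSIGN(auto reader,
                     arrow::ipc::RecordBatchStreamReader::Open(source));
std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
ASSERT_OK(reader->ReadAll(&batches));  // aborts: dict id = 1 not in the memo
{code}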

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8657) Distinguish parquet version 2 logical type vs DataPageV2

2020-04-30 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-8657:
-

 Summary: Distinguish parquet version 2 logical type vs DataPageV2
 Key: ARROW-8657
 URL: https://issues.apache.org/jira/browse/ARROW-8657
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.17.0
Reporter: Pierre Belzile


With the recent release of 0.17, ParquetVersion is used both to define the 
logical-type interpretation of fields and to select the DataPage format.

As a result, all parquet files that were created with ParquetVersion::V2 to get 
features such as unsigned int32s and timestamps with nanosecond resolution are 
now unreadable. That's TBs of data in my case.

Those two concerns should be separated. Given that DataPageV2 pages were not 
written prior to 0.17, and in order to allow reading existing files, the 
existing version property should continue to operate as it did in 0.16 and 
inform only the logical type mapping.
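
A sketch of the proposed separation; version() is the existing writer property, 
while data_page_version() and ParquetDataPageVersion are hypothetical here, 
named only to illustrate the split:
{code:java}
#include <parquet/properties.h>

// version() keeps selecting the logical-type mapping, as in 0.16;
// data_page_version() (hypothetical) would select the page format alone.
parquet::WriterProperties::Builder builder;
builder.version(parquet::ParquetVersion::PARQUET_2_0)
    ->data_page_version(parquet::ParquetDataPageVersion::V1);
std::shared_ptr<parquet::WriterProperties> props = builder.build();
{code}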

Some consideration should be given to issuing a release 0.17.1.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8129) [C++][Compute] Refine compare sorting kernel

2020-03-16 Thread Pierre Belzile (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8129?focusedCommentId=17060566&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17060566
 ] 

Pierre Belzile commented on ARROW-8129:
---

Yibo,

Given that you are looking at this, I wonder if the following wouldn't be 
faster for CompareValues:
{code:java}
auto values_begin = values.raw_values();  // this also exists everywhere
std::stable_sort(indices_begin, nulls_begin,
                 [values_begin](uint64_t left, uint64_t right) {
                   return *(values_begin + left) < *(values_begin + right);
                 });
{code}
It avoids dereferencing the shared_ptr data_ and adding the offset twice for 
each comparison.
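
For comparison, a self-contained illustration of the pattern (types simplified 
to double, names mine):
{code:java}
#include <algorithm>
#include <cstdint>
#include <vector>

// Capture the raw data pointer once, outside the comparator, so the hot
// loop does plain pointer arithmetic with no shared_ptr dereference.
void SortIndices(const double* values, std::vector<uint64_t>* indices) {
  std::stable_sort(indices->begin(), indices->end(),
                   [values](uint64_t left, uint64_t right) {
                     return values[left] < values[right];
                   });
}
{code}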

> [C++][Compute] Refine compare sorting kernel
> 
>
> Key: ARROW-8129
> URL: https://issues.apache.org/jira/browse/ARROW-8129
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>
> Sorting kernel implements two comparison functions: 
> [CompareValues|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L67]
>  uses array.Value() for numeric data and 
> [CompareViews|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L72]
>  uses array.GetView() for non-numeric ones. It can be simplified by using 
> GetView() only, as all data types support GetView().
> To my surprise, benchmarks show about a 40% performance improvement after the 
> change.
> After some digging, I found that in the current code the [comparison 
> callback|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L94]
>  is not inlined (check the disassembled code), so it leads to a function 
> call. That's very bad for this hot loop. Using only GetView() fixes the 
> issue; the code inlines okay.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8006) unsafe arrow dictionary recovered from parquet

2020-03-04 Thread Pierre Belzile (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Belzile updated ARROW-8006:
--
Description: 
When an arrow dictionary of values=strings and indices=intx is written to 
parquet and recovered, the indices that correspond to null positions are not 
written. This causes two problems:
 * when transposing the dictionary, the code encounters indices that are out of 
bounds for the existing dictionary. This does cause crashes.
 * a potential security risk, because it's unclear whether bytes can be read 
back inadvertently.

I traced using GDB and found that:
 # My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the 
valid bit is unset, that function increments "out" but does not set it. I think 
it should write a 0. 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396]
 # The recovered "out" array is written to the dictionary builder using 
AppendIndices, which moves the memory as a bulk move without checking for 
nulls. Hence we end up with the indices buffer holding the "out" from above. 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670] 
When transpose runs on this 
([https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406]), 
it may attempt to access memory out of bounds.

While it would be possible to fix "transpose" and the other functions that 
process dictionary indices (e.g. compare for sorting), it seems safer to 
initialize to 0. That is also the default behavior of the arrow dict builder 
when appending one or more nulls.
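
A self-contained sketch of the zero-fill behavior proposed above (illustrative 
only, not the real RleDecoder::GetBatchSpaced):
{code:java}
#include <cstddef>
#include <vector>

// For null slots, emit 0 instead of leaving the output uninitialized, so a
// later transpose over the indices can never read garbage values.
template <typename Index>
void DecodeSpaced(const std::vector<Index>& decoded,  // non-null values only
                  const std::vector<bool>& valid,     // validity bitmap
                  std::vector<Index>* out) {
  std::size_t next = 0;
  out->reserve(valid.size());
  for (bool is_valid : valid) {
    out->push_back(is_valid ? decoded[next++] : Index{0});
  }
}
{code}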

Incidentally, the code recovers the dict with int32 indices instead of the 
original int8, but I guess that is covered by another activity.

 


> unsafe arrow dictionary recovered from parquet
> --
>
> Key: ARROW-8006
> URL: https://issues.apache.org/jira/browse/ARROW-8006
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Pierre Belzile
>Priority: Major
>

[jira] [Created] (ARROW-8006) unsafe arrow dictionary recovered from parquet

2020-03-04 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-8006:
-

 Summary: unsafe arrow dictionary recovered from parquet
 Key: ARROW-8006
 URL: https://issues.apache.org/jira/browse/ARROW-8006
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.15.1
Reporter: Pierre Belzile


When an arrow dictionary of values=strings and indices=intx is written to 
parquet and recovered, the indices that correspond to null positions are not 
written. This causes two problems:
 * when transposing the dictionary, the code encounters indices that are out of 
bounds for the existing dictionary. This does cause crashes.
 * a potential security risk, because it's unclear whether bytes can be read 
back inadvertently.

I traced using GDB and found that:
 # My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the 
valid bit is unset, that function increments "out" but does not set it. I think 
it should write a 0. 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396]
 # The recovered "out" array is written to the dictionary builder using 
AppendIndices, which moves the memory as a bulk move without checking for 
nulls. Hence we end up with the indices buffer holding the "out" from above. 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670] 
When transpose runs on this 
([https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406]), 
it may attempt to access memory out of bounds.

While it would be possible to fix "transpose" and the other functions that 
process dictionary indices (e.g. compare for sorting), it seems safer to 
initialize to 0. That is also the default behavior of the arrow dict builder 
when appending one or more nulls.

Incidentally, the code recovers the dict with int32 indices instead of the 
original int8, but I guess that is covered by another activity.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7376) parquet NaN/null double statistics can result in endless loop

2019-12-11 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-7376:
-

 Summary: parquet NaN/null double statistics can result in endless 
loop
 Key: ARROW-7376
 URL: https://issues.apache.org/jira/browse/ARROW-7376
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.15.1
Reporter: Pierre Belzile


There is a bug in the double-column statistics computation when writing an 
array containing only NaNs and nulls to parquet. It loops endlessly if the last 
cell of a write group is a null. The line in error is 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633], 
which checks for NaN but not for null. The code then falls through, loops 
endlessly, and causes the program to appear frozen.
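
A hedged sketch of what a fixed scan could look like; the function and variable 
names are illustrative, not the actual statistics.cc code:
{code:java}
#include <cmath>
#include <cstdint>

// Illustrative only: find the first comparable (non-null, non-NaN) value,
// bounding the scan so an all-NaN/null group terminates instead of spinning.
// GetBit stands in for arrow::BitUtil::GetBit.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

int64_t FirstComparable(const double* values, const uint8_t* valid_bits,
                        int64_t length) {
  for (int64_t i = 0; i < length; ++i) {
    if (GetBit(valid_bits, i) && !std::isnan(values[i])) return i;
  }
  return -1;  // all NaN/null: caller should leave min/max statistics unset
}
{code}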

This code snippet reproduces the hang:
{noformat}
TEST(parquet, nans) {
  /* Create a small parquet structure */
  std::vector<std::shared_ptr<::arrow::Field>> fields;
  fields.push_back(::arrow::field("doubles", ::arrow::float64()));
  std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));

  std::unique_ptr<::arrow::RecordBatchBuilder> builder;
  ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(),
                                    &builder);
  builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(
      std::numeric_limits<double>::quiet_NaN());
  builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();

  std::shared_ptr<::arrow::RecordBatch> batch;
  builder->Flush(&batch);
  arrow::PrettyPrint(*batch, 0, &std::cout);

  std::shared_ptr<::arrow::Table> table;
  arrow::Table::FromRecordBatches({batch}, &table);

  /* Attempt to write */
  std::shared_ptr<::arrow::io::FileOutputStream> os;
  arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
  parquet::WriterProperties::Builder writer_props_bld;
  // writer_props_bld.disable_statistics("doubles");
  std::shared_ptr<parquet::WriterProperties> writer_props =
      writer_props_bld.build();
  std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  std::unique_ptr<parquet::arrow::FileWriter> writer;
  parquet::arrow::FileWriter::Open(*table->schema(),
                                   arrow::default_memory_pool(), os,
                                   writer_props, arrow_props, &writer);
  writer->WriteTable(*table, 1024);  // never returns
}{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)