[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID

2020-05-09 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103559#comment-17103559
 ] 

David Li commented on ARROW-8749:
-

Ah, and it's because DictionaryMemo uses field _addresses_ to determine whether 
a field has been seen before. So to DictionaryMemo, these are completely new 
dictionaries, but the reader won't have made the same confusion...

> [C++] IpcFormatWriter writes dictionary batches with wrong ID
> -
>
> Key: ARROW-8749
> URL: https://issues.apache.org/jira/browse/ARROW-8749
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0, 0.17.0
>Reporter: David Li
>Priority: Major
> Fix For: 1.0.0
>
>
> IpcFormatWriter assigns dictionary IDs once when it writes the schema 
> message. Then, when it writes dictionary batches, it assigns dictionary IDs 
> again because it re-collects dictionaries from the given batch. So for 
> example, if you have 5 dictionaries, the first dictionary will end up with ID 
> 0 but be written with ID 5.
> For example, this will fail with "'_error_or_value11.status()' failed with 
> Key error: No record of dictionary type with id 9"
> {code:cpp}
> TEST_F(TestMetadata, DoPutDictionaries) {
>   ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
>   std::shared_ptr schema = ExampleDictSchema();
>   BatchVector expected_batches;
>   ASSERT_OK(ExampleDictBatches(_batches));
>   ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), 
> schema));
>   for (auto& batch : expected_batches) {
> ASSERT_OK(writer->WriteRecordBatch(*batch));
>   }
>   ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
>   arrow::io::BufferReader source(buf);
>   ASSERT_OK_AND_ASSIGN(auto reader, 
> arrow::ipc::RecordBatchStreamReader::Open());
>   AssertSchemaEqual(schema, reader->schema());
>   for (auto& batch : expected_batches) {
> ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
> AssertBatchesEqual(*actual, *batch);
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID

2020-05-09 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103558#comment-17103558
 ] 

David Li commented on ARROW-8749:
-

So the important thing here is that in this test, the schema comes from a 
different (but identical value/schema-wise) batch.

{code:cpp}
  ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
  BatchVector expected_batches;
  std::shared_ptr batch;
  ASSERT_OK(MakeDictionary());
  std::shared_ptr schema = batch->schema();
  // Uncomment this and the test will fail
  // ASSERT_OK(MakeDictionary());
  expected_batches.push_back(batch);
  ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), 
schema));
  for (auto& batch : expected_batches) {
ASSERT_OK(writer->WriteRecordBatch(*batch));
  }
  ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
  arrow::io::BufferReader source(buf);
  ASSERT_OK_AND_ASSIGN(auto reader, 
arrow::ipc::RecordBatchStreamReader::Open());
  AssertSchemaEqual(schema, reader->schema());
  for (auto& batch : expected_batches) {
ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
AssertBatchesEqual(*actual, *batch);
  }
{code}

> [C++] IpcFormatWriter writes dictionary batches with wrong ID
> -
>
> Key: ARROW-8749
> URL: https://issues.apache.org/jira/browse/ARROW-8749
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0, 0.17.0
>Reporter: David Li
>Priority: Major
> Fix For: 1.0.0
>
>
> IpcFormatWriter assigns dictionary IDs once when it writes the schema 
> message. Then, when it writes dictionary batches, it assigns dictionary IDs 
> again because it re-collects dictionaries from the given batch. So for 
> example, if you have 5 dictionaries, the first dictionary will end up with ID 
> 0 but be written with ID 5.
> For example, this will fail with "'_error_or_value11.status()' failed with 
> Key error: No record of dictionary type with id 9"
> {code:cpp}
> TEST_F(TestMetadata, DoPutDictionaries) {
>   ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
>   std::shared_ptr schema = ExampleDictSchema();
>   BatchVector expected_batches;
>   ASSERT_OK(ExampleDictBatches(_batches));
>   ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), 
> schema));
>   for (auto& batch : expected_batches) {
> ASSERT_OK(writer->WriteRecordBatch(*batch));
>   }
>   ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
>   arrow::io::BufferReader source(buf);
>   ASSERT_OK_AND_ASSIGN(auto reader, 
> arrow::ipc::RecordBatchStreamReader::Open());
>   AssertSchemaEqual(schema, reader->schema());
>   for (auto& batch : expected_batches) {
> ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
> AssertBatchesEqual(*actual, *batch);
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID

2020-05-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103484#comment-17103484
 ] 

Wes McKinney commented on ARROW-8749:
-

I'd guess the bug has been present for a while, then

> [C++] IpcFormatWriter writes dictionary batches with wrong ID
> -
>
> Key: ARROW-8749
> URL: https://issues.apache.org/jira/browse/ARROW-8749
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0, 0.17.0
>Reporter: David Li
>Priority: Major
> Fix For: 1.0.0, 0.17.1
>
>
> IpcFormatWriter assigns dictionary IDs once when it writes the schema 
> message. Then, when it writes dictionary batches, it assigns dictionary IDs 
> again because it re-collects dictionaries from the given batch. So for 
> example, if you have 5 dictionaries, the first dictionary will end up with ID 
> 0 but be written with ID 5.
> For example, this will fail with "'_error_or_value11.status()' failed with 
> Key error: No record of dictionary type with id 9"
> {code:cpp}
> TEST_F(TestMetadata, DoPutDictionaries) {
>   ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
>   std::shared_ptr schema = ExampleDictSchema();
>   BatchVector expected_batches;
>   ASSERT_OK(ExampleDictBatches(_batches));
>   ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), 
> schema));
>   for (auto& batch : expected_batches) {
> ASSERT_OK(writer->WriteRecordBatch(*batch));
>   }
>   ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
>   arrow::io::BufferReader source(buf);
>   ASSERT_OK_AND_ASSIGN(auto reader, 
> arrow::ipc::RecordBatchStreamReader::Open());
>   AssertSchemaEqual(schema, reader->schema());
>   for (auto& batch : expected_batches) {
> ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
> AssertBatchesEqual(*actual, *batch);
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID

2020-05-09 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103483#comment-17103483
 ] 

David Li commented on ARROW-8749:
-

It seems to affect 0.16.0 as well. I'd have to do some work to get 0.15.1 to 
build locally.

> [C++] IpcFormatWriter writes dictionary batches with wrong ID
> -
>
> Key: ARROW-8749
> URL: https://issues.apache.org/jira/browse/ARROW-8749
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0, 0.17.0
>Reporter: David Li
>Priority: Major
> Fix For: 1.0.0, 0.17.1
>
>
> IpcFormatWriter assigns dictionary IDs once when it writes the schema 
> message. Then, when it writes dictionary batches, it assigns dictionary IDs 
> again because it re-collects dictionaries from the given batch. So for 
> example, if you have 5 dictionaries, the first dictionary will end up with ID 
> 0 but be written with ID 5.
> For example, this will fail with "'_error_or_value11.status()' failed with 
> Key error: No record of dictionary type with id 9"
> {code:cpp}
> TEST_F(TestMetadata, DoPutDictionaries) {
>   ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
>   std::shared_ptr schema = ExampleDictSchema();
>   BatchVector expected_batches;
>   ASSERT_OK(ExampleDictBatches(_batches));
>   ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), 
> schema));
>   for (auto& batch : expected_batches) {
> ASSERT_OK(writer->WriteRecordBatch(*batch));
>   }
>   ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
>   arrow::io::BufferReader source(buf);
>   ASSERT_OK_AND_ASSIGN(auto reader, 
> arrow::ipc::RecordBatchStreamReader::Open());
>   AssertSchemaEqual(schema, reader->schema());
>   for (auto& batch : expected_batches) {
> ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
> AssertBatchesEqual(*actual, *batch);
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID

2020-05-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103475#comment-17103475
 ] 

Wes McKinney commented on ARROW-8749:
-

Definitely not good. This may be an artifact of the fact that the way that we 
attribute dictionary id's to fields is not very robust. Is this a regression (I 
marked for 0.17.1 in case it is)?

> [C++] IpcFormatWriter writes dictionary batches with wrong ID
> -
>
> Key: ARROW-8749
> URL: https://issues.apache.org/jira/browse/ARROW-8749
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: David Li
>Priority: Major
> Fix For: 1.0.0, 0.17.1
>
>
> IpcFormatWriter assigns dictionary IDs once when it writes the schema 
> message. Then, when it writes dictionary batches, it assigns dictionary IDs 
> again because it re-collects dictionaries from the given batch. So for 
> example, if you have 5 dictionaries, the first dictionary will end up with ID 
> 0 but be written with ID 5.
> For example, this will fail with "'_error_or_value11.status()' failed with 
> Key error: No record of dictionary type with id 9"
> {code:cpp}
> TEST_F(TestMetadata, DoPutDictionaries) {
>   ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
>   std::shared_ptr schema = ExampleDictSchema();
>   BatchVector expected_batches;
>   ASSERT_OK(ExampleDictBatches(_batches));
>   ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), 
> schema));
>   for (auto& batch : expected_batches) {
> ASSERT_OK(writer->WriteRecordBatch(*batch));
>   }
>   ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
>   arrow::io::BufferReader source(buf);
>   ASSERT_OK_AND_ASSIGN(auto reader, 
> arrow::ipc::RecordBatchStreamReader::Open());
>   AssertSchemaEqual(schema, reader->schema());
>   for (auto& batch : expected_batches) {
> ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
> AssertBatchesEqual(*actual, *batch);
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)