[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID
[ https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103559#comment-17103559 ] David Li commented on ARROW-8749: - Ah, and it's because DictionaryMemo uses field _addresses_ to determine whether a field has been seen before. So to DictionaryMemo, these are completely new dictionaries, but the reader won't have made the same confusion... > [C++] IpcFormatWriter writes dictionary batches with wrong ID > - > > Key: ARROW-8749 > URL: https://issues.apache.org/jira/browse/ARROW-8749 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0, 0.17.0 >Reporter: David Li >Priority: Major > Fix For: 1.0.0 > > > IpcFormatWriter assigns dictionary IDs once when it writes the schema > message. Then, when it writes dictionary batches, it assigns dictionary IDs > again because it re-collects dictionaries from the given batch. So for > example, if you have 5 dictionaries, the first dictionary will end up with ID > 0 but be written with ID 5. > For example, this will fail with "'_error_or_value11.status()' failed with > Key error: No record of dictionary type with id 9" > {code:cpp} > TEST_F(TestMetadata, DoPutDictionaries) { > ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create()); > std::shared_ptr schema = ExampleDictSchema(); > BatchVector expected_batches; > ASSERT_OK(ExampleDictBatches(_batches)); > ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), > schema)); > for (auto& batch : expected_batches) { > ASSERT_OK(writer->WriteRecordBatch(*batch)); > } > ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish()); > arrow::io::BufferReader source(buf); > ASSERT_OK_AND_ASSIGN(auto reader, > arrow::ipc::RecordBatchStreamReader::Open()); > AssertSchemaEqual(schema, reader->schema()); > for (auto& batch : expected_batches) { > ASSERT_OK_AND_ASSIGN(auto actual, reader->Next()); > AssertBatchesEqual(*actual, *batch); > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID
[ https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103558#comment-17103558 ] David Li commented on ARROW-8749: - So the important thing here is that in this test, the schema comes from a different (but identical value/schema-wise) batch. {code:cpp} ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create()); BatchVector expected_batches; std::shared_ptr batch; ASSERT_OK(MakeDictionary()); std::shared_ptr schema = batch->schema(); // Uncomment this and the test will fail // ASSERT_OK(MakeDictionary()); expected_batches.push_back(batch); ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), schema)); for (auto& batch : expected_batches) { ASSERT_OK(writer->WriteRecordBatch(*batch)); } ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish()); arrow::io::BufferReader source(buf); ASSERT_OK_AND_ASSIGN(auto reader, arrow::ipc::RecordBatchStreamReader::Open()); AssertSchemaEqual(schema, reader->schema()); for (auto& batch : expected_batches) { ASSERT_OK_AND_ASSIGN(auto actual, reader->Next()); AssertBatchesEqual(*actual, *batch); } {code} > [C++] IpcFormatWriter writes dictionary batches with wrong ID > - > > Key: ARROW-8749 > URL: https://issues.apache.org/jira/browse/ARROW-8749 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0, 0.17.0 >Reporter: David Li >Priority: Major > Fix For: 1.0.0 > > > IpcFormatWriter assigns dictionary IDs once when it writes the schema > message. Then, when it writes dictionary batches, it assigns dictionary IDs > again because it re-collects dictionaries from the given batch. So for > example, if you have 5 dictionaries, the first dictionary will end up with ID > 0 but be written with ID 5. > For example, this will fail with "'_error_or_value11.status()' failed with > Key error: No record of dictionary type with id 9" > {code:cpp} > TEST_F(TestMetadata, DoPutDictionaries) { > ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create()); > std::shared_ptr schema = ExampleDictSchema(); > BatchVector expected_batches; > ASSERT_OK(ExampleDictBatches(_batches)); > ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), > schema)); > for (auto& batch : expected_batches) { > ASSERT_OK(writer->WriteRecordBatch(*batch)); > } > ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish()); > arrow::io::BufferReader source(buf); > ASSERT_OK_AND_ASSIGN(auto reader, > arrow::ipc::RecordBatchStreamReader::Open()); > AssertSchemaEqual(schema, reader->schema()); > for (auto& batch : expected_batches) { > ASSERT_OK_AND_ASSIGN(auto actual, reader->Next()); > AssertBatchesEqual(*actual, *batch); > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID
[ https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103484#comment-17103484 ] Wes McKinney commented on ARROW-8749: - I'd guess the bug has been present for a while, then > [C++] IpcFormatWriter writes dictionary batches with wrong ID > - > > Key: ARROW-8749 > URL: https://issues.apache.org/jira/browse/ARROW-8749 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0, 0.17.0 >Reporter: David Li >Priority: Major > Fix For: 1.0.0, 0.17.1 > > > IpcFormatWriter assigns dictionary IDs once when it writes the schema > message. Then, when it writes dictionary batches, it assigns dictionary IDs > again because it re-collects dictionaries from the given batch. So for > example, if you have 5 dictionaries, the first dictionary will end up with ID > 0 but be written with ID 5. > For example, this will fail with "'_error_or_value11.status()' failed with > Key error: No record of dictionary type with id 9" > {code:cpp} > TEST_F(TestMetadata, DoPutDictionaries) { > ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create()); > std::shared_ptr schema = ExampleDictSchema(); > BatchVector expected_batches; > ASSERT_OK(ExampleDictBatches(_batches)); > ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), > schema)); > for (auto& batch : expected_batches) { > ASSERT_OK(writer->WriteRecordBatch(*batch)); > } > ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish()); > arrow::io::BufferReader source(buf); > ASSERT_OK_AND_ASSIGN(auto reader, > arrow::ipc::RecordBatchStreamReader::Open()); > AssertSchemaEqual(schema, reader->schema()); > for (auto& batch : expected_batches) { > ASSERT_OK_AND_ASSIGN(auto actual, reader->Next()); > AssertBatchesEqual(*actual, *batch); > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID
[ https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103483#comment-17103483 ] David Li commented on ARROW-8749: - It seems to affect 0.16.0 as well. I'd have to do some work to get 0.15.1 to build locally. > [C++] IpcFormatWriter writes dictionary batches with wrong ID > - > > Key: ARROW-8749 > URL: https://issues.apache.org/jira/browse/ARROW-8749 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.16.0, 0.17.0 >Reporter: David Li >Priority: Major > Fix For: 1.0.0, 0.17.1 > > > IpcFormatWriter assigns dictionary IDs once when it writes the schema > message. Then, when it writes dictionary batches, it assigns dictionary IDs > again because it re-collects dictionaries from the given batch. So for > example, if you have 5 dictionaries, the first dictionary will end up with ID > 0 but be written with ID 5. > For example, this will fail with "'_error_or_value11.status()' failed with > Key error: No record of dictionary type with id 9" > {code:cpp} > TEST_F(TestMetadata, DoPutDictionaries) { > ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create()); > std::shared_ptr schema = ExampleDictSchema(); > BatchVector expected_batches; > ASSERT_OK(ExampleDictBatches(_batches)); > ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), > schema)); > for (auto& batch : expected_batches) { > ASSERT_OK(writer->WriteRecordBatch(*batch)); > } > ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish()); > arrow::io::BufferReader source(buf); > ASSERT_OK_AND_ASSIGN(auto reader, > arrow::ipc::RecordBatchStreamReader::Open()); > AssertSchemaEqual(schema, reader->schema()); > for (auto& batch : expected_batches) { > ASSERT_OK_AND_ASSIGN(auto actual, reader->Next()); > AssertBatchesEqual(*actual, *batch); > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8749) [C++] IpcFormatWriter writes dictionary batches with wrong ID
[ https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103475#comment-17103475 ] Wes McKinney commented on ARROW-8749: - Definitely not good. This may be an artifact of the fact that the way that we attribute dictionary id's to fields is not very robust. Is this a regression (I marked for 0.17.1 in case it is)? > [C++] IpcFormatWriter writes dictionary batches with wrong ID > - > > Key: ARROW-8749 > URL: https://issues.apache.org/jira/browse/ARROW-8749 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.17.0 >Reporter: David Li >Priority: Major > Fix For: 1.0.0, 0.17.1 > > > IpcFormatWriter assigns dictionary IDs once when it writes the schema > message. Then, when it writes dictionary batches, it assigns dictionary IDs > again because it re-collects dictionaries from the given batch. So for > example, if you have 5 dictionaries, the first dictionary will end up with ID > 0 but be written with ID 5. > For example, this will fail with "'_error_or_value11.status()' failed with > Key error: No record of dictionary type with id 9" > {code:cpp} > TEST_F(TestMetadata, DoPutDictionaries) { > ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create()); > std::shared_ptr schema = ExampleDictSchema(); > BatchVector expected_batches; > ASSERT_OK(ExampleDictBatches(_batches)); > ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), > schema)); > for (auto& batch : expected_batches) { > ASSERT_OK(writer->WriteRecordBatch(*batch)); > } > ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish()); > arrow::io::BufferReader source(buf); > ASSERT_OK_AND_ASSIGN(auto reader, > arrow::ipc::RecordBatchStreamReader::Open()); > AssertSchemaEqual(schema, reader->schema()); > for (auto& batch : expected_batches) { > ASSERT_OK_AND_ASSIGN(auto actual, reader->Next()); > AssertBatchesEqual(*actual, *batch); > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)