[jira] [Commented] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-13 Thread Wes McKinney (Jira)


[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950648#comment-16950648 ]

Wes McKinney commented on ARROW-6861:
-

I started looking at this

> [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: 
> Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> --
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
> Attachments: fix-dict-builder-capacity.diff, 
> parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
> {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow (this=0x56612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 '\377' ..., valid_bits_offset=748544, builder=0x56616330) at /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
> {{#2 0x0046d703 in parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced (this=0x56616260, values_to_read=67339, null_count=0) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
> {{#3 0x004a13f8 in parquet::internal::TypedRecordReader<...>::ReadRecordData (this=0x56616260, num_records=67339) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
> {{#4 0x00493876 in parquet::internal::TypedRecordReader<...>::ReadRecords (this=0x56616260, num_records=815883) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
> {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
> {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
> {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
> {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionary=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On the first {{AppendIndices()}} 
> call, the buffer capacity equals the number of values. On the second call, 
> that's no longer the case: the buffer grows via 
> {{BufferBuilder::GrowByFactor}}, so its capacity becomes 987136.
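As a quick check on the numbers above (not part of the original report): the two appends total 493568 + 254976 = 748544 indices, matching {{valid_bits_offset=748544}} in frame #1, and a simple doubling growth rule takes the capacity from 493568 to 987136. A minimal sketch of that arithmetic, assuming plain doubling rather than the exact {{BufferBuilder::GrowByFactor}} implementation:

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <iostream>

// Hypothetical doubling growth rule (an assumption, not the real
// BufferBuilder::GrowByFactor code): double the capacity until it
// covers the requested size.
int64_t GrowByDoubling(int64_t current_capacity, int64_t requested) {
  int64_t cap = std::max<int64_t>(current_capacity, 1);
  while (cap < requested) cap *= 2;
  return cap;
}

int main() {
  const int64_t first = 493568;
  const int64_t second = 254976;
  std::cout << first + second << "\n";                          // 748544
  std::cout << GrowByDoubling(first, first + second) << "\n";   // 987136
}
{code}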
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}, which calls {{builder->Reserve(num_values)}}. 
> {{builder->Reserve(num_values)}} tries to grow the capacity from 0 (its 
> stale, cached value) to {{length_ + num_values}} (815883). Since 
> {{indices_builder_.capacity_}} is already 987136, that is a downsize, which 
> throws the exception above.
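To make the mechanism concrete, here is a self-contained toy model of the failure (hypothetical {{Toy*}} names, not the real Arrow classes): the outer builder's cached capacity stays at 0 while the inner builder's capacity grows, so a later Reserve computes a resize target below the inner buffer's real capacity and the resize is rejected.

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Toy inner builder: owns the actual buffer capacity.
struct ToyIndicesBuilder {
  int64_t length = 0;
  int64_t capacity = 0;

  void Resize(int64_t new_capacity) {
    // Same invariant as the error message: never shrink below the current capacity.
    if (new_capacity < capacity) {
      throw std::runtime_error("Invalid: Resize cannot downsize");
    }
    capacity = new_capacity;
  }

  void AppendIndices(int64_t n) {
    if (length + n > capacity) {
      Resize(std::max<int64_t>(length + n, capacity * 2));  // grow by factor
    }
    length += n;
  }
};

// Toy outer builder: forwards appends to the inner builder but keeps its own
// cached capacity_, which AppendIndices never updates (the desynchronization
// described in the report).
struct ToyDictionaryBuilder {
  ToyIndicesBuilder indices_builder_;
  int64_t capacity_ = 0;  // stays 0 even after AppendIndices()

  void AppendIndices(int64_t n) { indices_builder_.AppendIndices(n); }
  int64_t length() const { return indices_builder_.length; }

  void Reserve(int64_t additional) {
    int64_t target = length() + additional;  // 748544 + 67339 = 815883
    if (target > capacity_) {                // 815883 > 0 (stale), so "grow"
      indices_builder_.Resize(target);       // 815883 < 987136: downsize, throws
      capacity_ = target;
    }
  }
};

int main() {
  ToyDictionaryBuilder builder;
  builder.AppendIndices(493568);  // inner capacity becomes 493568
  builder.AppendIndices(254976);  // inner capacity grows to 987136
  try {
    builder.Reserve(67339);
  } catch (const std::exception& e) {
    std::cout << e.what() << "\n";  // "Invalid: Resize cannot downsize"
  }
}
{code}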
> The only workaround I've found is to set {{read_dictionary=false}}.
> This affects Python, too.
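For reference, a sketch of what that workaround could look like from C++, under the assumption that {{parquet::ArrowReaderProperties::set_read_dictionary}} and the {{parquet::arrow::FileReader::Make}} overload taking reader properties behave as in the 0.15 headers; column index 7 is taken from the trace above:

{code:cpp}
#include <memory>
#include <string>

#include "arrow/api.h"
#include "parquet/arrow/reader.h"
#include "parquet/file_reader.h"
#include "parquet/properties.h"

// Read the whole file with dictionary decoding disabled for column 7 (the
// column that fails in the stack trace above).
arrow::Status ReadWithoutDictionary(const std::string& path,
                                    std::shared_ptr<arrow::Table>* out) {
  std::unique_ptr<parquet::ParquetFileReader> pq_reader =
      parquet::ParquetFileReader::OpenFile(path);

  parquet::ArrowReaderProperties props;
  props.set_read_dictionary(7, false);  // i.e. read_dictionary=false for this column

  std::unique_ptr<parquet::arrow::FileReader> reader;
  RETURN_NOT_OK(parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(), std::move(pq_reader), props, &reader));
  return reader->ReadTable(out);
}
{code}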
> I've attached a patch that fixes the issue for my file. I don't know how to 
> produce a reduced test case, though, so I haven't contributed unit tests. I'm also 
> not certain how FinishInternal is meant to work, so this definitely needs 
> expert review.

[jira] [Commented] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Wes McKinney (Jira)


[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949809#comment-16949809 ]

Wes McKinney commented on ARROW-6861:
-

Seems like a good candidate for 0.15.1. Marked as such


[jira] [Commented] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Wes McKinney (Jira)


[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949801#comment-16949801 ]

Wes McKinney commented on ARROW-6861:
-

Thanks. This should be enough information to help write a unit test to 
reproduce the issue. [~bkietz] are you interested in taking a look?
