Something must have changed with the bzip2 codec in later versions of
Hadoop. When I get time I'll investigate which version actually breaks it
and see what changed.

On Thu, Sep 5, 2019 at 11:40 AM Lukasz Cwik <[email protected]> wrote:

> Sorry for the poor experience and thanks for sharing a solution with
> others.
>
> On Thu, Sep 5, 2019 at 6:34 AM Shannon Duncan <[email protected]>
> wrote:
>
>> FYI, this was due to the Hadoop version: 3.2.0 was throwing this error, but
>> after rolling back to the version in Google's pom.xml (2.7.4) it is working fine now.
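>>
>> In case anyone else hits this, pinning Hadoop back in the pom looks like
>> the snippet below (shown for hadoop-common as an example; adjust to
>> whichever Hadoop artifacts your build actually pulls in):
>>
>> <dependency>
>>   <groupId>org.apache.hadoop</groupId>
>>   <artifactId>hadoop-common</artifactId>
>>   <version>2.7.4</version>
>> </dependency>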
>>
>> Kind of annoying, because I wasted several hours jumping through hoops trying
>> to get 3.2.0 working :(
>>
>> On Wed, Sep 4, 2019 at 5:09 PM Shannon Duncan <[email protected]>
>> wrote:
>>
>>> I have successfully been using the sequence file source located here:
>>>
>>>
>>> https://github.com/googleapis/java-bigtable-hbase/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>>
>>> However, we recently started doing block-level compression with bzip2 on
>>> the SequenceFiles. This is supported out of the box on the Hadoop side of
>>> things.
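>>>
>>> For reference, the writing side looks roughly like the sketch below (the
>>> path, keys, and values are made-up placeholders):
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.SequenceFile;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.io.compress.BZip2Codec;
>>>
>>> public class WriteBzip2SequenceFile {
>>>   public static void main(String[] args) throws Exception {
>>>     Configuration conf = new Configuration();
>>>     // BLOCK compression buffers many records and compresses them
>>>     // together; RECORD compression compresses each value on its own.
>>>     try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
>>>         SequenceFile.Writer.file(new Path("/tmp/example.seq")), // placeholder
>>>         SequenceFile.Writer.keyClass(Text.class),
>>>         SequenceFile.Writer.valueClass(Text.class),
>>>         SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK,
>>>             new BZip2Codec()))) {
>>>       writer.append(new Text("key1"), new Text("value1"));
>>>       writer.append(new Text("key2"), new Text("value2"));
>>>     }
>>>   }
>>> }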
>>>
>>> However, when reading the files back in, most records parse out just
>>> fine, but a handful of records throw:
>>>
>>> ####
>>> Exception in thread "main" java.lang.IndexOutOfBoundsException:
>>> offs(1368) + len(1369) > dest.length(1467).
>>> at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
>>> ####
>>>
>>> I've gone in circles looking at this. It seems that the last record
>>> being read from the SequenceFile in each thread hits this on the
>>> value retrieval (the key reads just fine, but the value throws this error).
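>>>
>>> To rule out the Beam wrapper, a minimal standalone Hadoop reader along
>>> these lines should show whether the exception reproduces outside of Beam
>>> (the path is a made-up placeholder):
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.SequenceFile;
>>> import org.apache.hadoop.io.Text;
>>>
>>> public class ReadBzip2SequenceFile {
>>>   public static void main(String[] args) throws Exception {
>>>     Configuration conf = new Configuration();
>>>     try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
>>>         SequenceFile.Reader.file(new Path("/tmp/example.seq")))) { // placeholder
>>>       Text key = new Text();
>>>       Text value = new Text();
>>>       // next(key, value) deserializes both; in our case the key comes
>>>       // back fine and the error surfaces while reading the value.
>>>       while (reader.next(key, value)) {
>>>         System.out.println(key + "\t" + value);
>>>       }
>>>     }
>>>   }
>>> }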
>>>
>>> Any clues as to what this could be?
>>>
>>> The file is KV<Text, Text>; its raw header (the SEQ magic, key class,
>>> value class, and compression codec) reads
>>> "SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text(org.apache.hadoop.io.compress.BZip2Codec"
>>>
>>> Any help is appreciated!
>>>
>>> - Shannon
>>>
>>
