I have successfully been using the sequence file source located here:

https://github.com/googleapis/java-bigtable-hbase/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java

Recently, however, we started using block-level bzip2 compression on the
SequenceFile. This is supported out of the box on the Hadoop side of things.

When reading the files back in, most records parse just fine, but a handful
of records throw:

####
Exception in thread "main" java.lang.IndexOutOfBoundsException: offs(1368) + len(1369) > dest.length(1467).
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
####

I've gone in circles looking at this. It seems that the last record read
from the SequenceFile in each thread hits this on value retrieval (the key
is read just fine, but reading the value throws this error).

Any clues as to what this could be?

The file is KV<Text, Text>, i.e. its header reads:
"SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text(org.apache.hadoop.io.compress.BZip2Codec"
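For reference, that header string can be decoded by hand to confirm what the writer actually recorded. Below is a minimal sketch, not the Hadoop reader itself: it assumes the current SequenceFile format version (6), assumes every class name is shorter than 128 characters so that its length prefix is a single byte, and stops before the metadata and sync marker. The class name `SequenceFileHeaderPeek` and the helper `sampleHeaderBytes()` are made up for illustration; the sample bytes mimic the header quoted above, with both compression flags set to true as an assumption matching block compression. The "(" in front of the codec name is simply that length byte (0x28 = 40 characters).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SequenceFileHeaderPeek {

    /** The leading header fields, in the order they appear in the file. */
    public static final class Header {
        public final int version;
        public final String keyClass;
        public final String valueClass;
        public final boolean compressed;
        public final boolean blockCompressed;
        public final String codecClass; // null when the file is not compressed

        Header(int version, String keyClass, String valueClass,
               boolean compressed, boolean blockCompressed, String codecClass) {
            this.version = version;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
            this.compressed = compressed;
            this.blockCompressed = blockCompressed;
            this.codecClass = codecClass;
        }
    }

    // Reads a length-prefixed UTF-8 string.
    // Assumption: the length fits in one byte (0..127); longer names use a
    // multi-byte prefix that this sketch does not handle.
    private static String readShortString(ByteBuffer in) {
        int len = in.get();
        if (len < 0) {
            throw new IllegalArgumentException("multi-byte length prefix not handled");
        }
        byte[] buf = new byte[len];
        in.get(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    /** Parses the first bytes of a SequenceFile (version 6 only). */
    public static Header parse(byte[] headerBytes) {
        ByteBuffer in = ByteBuffer.wrap(headerBytes);
        if (in.get() != 'S' || in.get() != 'E' || in.get() != 'Q') {
            throw new IllegalArgumentException("missing SEQ magic");
        }
        int version = in.get();
        if (version != 6) {
            throw new IllegalArgumentException("only version 6 is handled, got " + version);
        }
        String keyClass = readShortString(in);
        String valueClass = readShortString(in);
        boolean compressed = in.get() != 0;
        boolean blockCompressed = in.get() != 0;
        // The codec class name is only present when compression is on.
        String codecClass = compressed ? readShortString(in) : null;
        return new Header(version, keyClass, valueClass,
                compressed, blockCompressed, codecClass);
    }

    private static void putShortString(ByteBuffer out, String s) {
        out.put((byte) s.length());
        out.put(s.getBytes(StandardCharsets.US_ASCII));
    }

    /** Hypothetical sample: a header shaped like the one quoted above. */
    public static byte[] sampleHeaderBytes() {
        String text = "org.apache.hadoop.io.Text";
        String codec = "org.apache.hadoop.io.compress.BZip2Codec";
        ByteBuffer out = ByteBuffer.allocate(
                4 + 2 * (1 + text.length()) + 2 + 1 + codec.length());
        out.put(new byte[] {'S', 'E', 'Q', 6});
        putShortString(out, text);   // key class
        putShortString(out, text);   // value class
        out.put((byte) 1);           // compressed (assumed)
        out.put((byte) 1);           // block compressed (assumed)
        putShortString(out, codec);
        return out.array();
    }

    public static void main(String[] args) {
        Header h = parse(sampleHeaderBytes());
        System.out.println("version=" + h.version);
        System.out.println("key=" + h.keyClass + " value=" + h.valueClass);
        System.out.println("compressed=" + h.compressed
                + " blockCompressed=" + h.blockCompressed);
        System.out.println("codec=" + h.codecClass);
    }
}
```

To run this against a real file, the first few hundred bytes of the file can be passed to `parse` in place of `sampleHeaderBytes()`. This only checks what the header declares; it does not exercise the bzip2 decompression path where the exception is thrown.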

Any help is appreciated!

- Shannon
