Hi Keith,

Were you able to resolve this, or is this still an issue?
Thanks,
Shreepadma

On Tue, May 28, 2013 at 6:02 AM, Keith Wright <kwri...@nanigans.com> wrote:

> Hi all,
>
> This is my first post to the Hive mailing list and I was hoping to get
> some help with the exception I am getting below. I am using CDH4.2 (Hive
> 0.10.0) to query Snappy-compressed SequenceFiles that are built using
> Flume (the relevant portion of the Flume conf is below as well). Note
> that I'm using SequenceFiles because they were needed for Impala
> integration. Has anyone seen this error before? A few additional points
> to help diagnose:
>
> 1. Queries seem to be able to process some mappers without issues. In
>    fact, I can do a simple SELECT * FROM <table> LIMIT 10 without issue.
>    However, if I make the limit high enough it eventually fails,
>    presumably once it needs to read a file that has this issue.
> 2. The same query runs in Impala without errors but appears to "skip"
>    some data. I can confirm via a custom map/reduce job that the missing
>    data is present.
> 3. I am able to write a map/reduce job that reads through all of the
>    same data without issue and have been unable to identify any data
>    corruption.
> 4. This is a partitioned table, and queries fail that touch ANY of the
>    partitions (and there are hundreds), so this does not appear to be a
>    sporadic data-integrity problem (table definition below).
> 5. We are using '\001' as our field separator. We also capture other
>    data as Snappy-compressed SequenceFiles, but with '|' as the
>    delimiter, and we have no issues querying that data, although it
>    comes through a different Flume source.
>
> My next step for debugging was to disable Snappy compression and see if
> I could query the data; if not, switch from SequenceFile to plain text.
>
> I appreciate the help!!!
>
> CREATE EXTERNAL TABLE ORGANIC_EVENTS (
>   event_id BIGINT,
>   app_id INT,
>   user_id BIGINT,
>   type STRING,
>   name STRING,
>   value STRING,
>   extra STRING,
>   ip_address STRING,
>   user_agent STRING,
>   referrer STRING,
>   event_time BIGINT,
>   install_flag TINYINT,
>   first_for_user TINYINT,
>   cookie STRING)
> PARTITIONED BY (year INT, month INT, day INT, hour INT)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '\001'
>   COLLECTION ITEMS TERMINATED BY '\002'
>   MAP KEYS TERMINATED BY '\003'
> STORED AS SEQUENCEFILE
> LOCATION '/events/organic';
>
> agent.sinks.exhaustHDFSSink3.type = HDFS
> agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
> agent.sinks.exhaustHDFSSink3.hdfs.path = hdfs://lxscdh001.nanigans.com:8020%{path}
> agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
> agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
> agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
> agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
> agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
> agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
> agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
> agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
> agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
> agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
> agent.sinks.exhaustHDFSSink3.hdfs.writeFormat = Text
>
> 2013-05-28 12:29:00,919 WARN org.apache.hadoop.mapred.Child: Error running child
> java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:330)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:246)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:216)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:201)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
> Caused by: java.io.IOException: java.lang.IndexOutOfBoundsException
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:328)
>     ... 11 more
> Caused by: java.lang.IndexOutOfBoundsException
>     at java.io.DataInputStream.readFully(DataInputStream.java:175)
>     at org.apache.hadoop.io.Text.readFields(Text.java:284)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2180)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2164)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
>     ... 15 more
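
For readers hitting the same trace: the innermost frames (Text.readFields calling DataInputStream.readFully) suggest a corrupt or misread value-length prefix, i.e. a record-framing or decompression problem rather than bad field contents, which would also be consistent with Impala silently skipping the same records. A standalone reader along the lines of the custom map/reduce job Keith mentions in point 3 can isolate the first unreadable record. The sketch below is not code from the thread; it assumes LongWritable keys and Text values (what the Flume HDFS sink emits with hdfs.writeFormat = Text), that the Snappy codec is on the classpath, and the class name SeqFileScan is made up.

// SeqFileScan.java -- minimal standalone scanner for one of the table's
// SequenceFiles. Hypothetical sketch; the key/value types are an assumption
// based on the hdfs.writeFormat = Text setting in the Flume conf above.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileScan {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // one file under /events/organic/...
        FileSystem fs = path.getFileSystem(conf);

        LongWritable key = new LongWritable(); // Flume's event timestamp
        Text value = new Text();               // the delimited event body
        long records = 0;

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // next() deserializes both key and value; this is the same path
            // that throws IndexOutOfBoundsException in the Hive stack trace.
            while (reader.next(key, value)) {
                records++;
            }
            System.out.println(path + ": read " + records + " records cleanly");
        } catch (Exception e) {
            // Reporting the last good record narrows down where the stored
            // length prefix and the actual bytes stop agreeing.
            System.err.println(path + ": failed after record " + records + ": " + e);
        } finally {
            reader.close();
        }
    }
}

Running this (e.g. via hadoop jar) against files from a failing partition separates the two suspects: if every file scans cleanly this way while the Hive query still dies, the problem more likely lies in how the combined splits are read (note the CombineFileRecordReader frames in the trace) than in the files themselves.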