Some ideas: a String is encoded as a long length (a zig-zag varint), followed by that number of UTF-8 bytes. An empty string is therefore encoded as the number 0L -- which is a single byte, 0x00. It appears that the comparator is trying to skip a string or long, but it has hit the end of the byte[].
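To make that concrete, here is a minimal sketch of the encoding described above (hypothetical helper names, not Avro's actual classes): longs are zig-zag encoded into variable-length 7-bit chunks, and a string is its length as a long followed by the UTF-8 bytes, so the empty string comes out as the single byte 0x00:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AvroStringEncodingSketch {

    // Write a long as a zig-zag varint, as Avro's binary encoding does.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63); // zig-zag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits plus continuation bit
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // A string is its UTF-8 byte length as a long, followed by those bytes.
    static byte[] encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] len = encodeLong(utf8.length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(len, 0, len.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The empty string is exactly one byte: 0x00.
        System.out.println(Arrays.toString(encodeString("")));   // [0]
        System.out.println(Arrays.toString(encodeString("hi"))); // [4, 104, 105]
    }
}
```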
So either it is expecting a long or string to skip, and there is nothing there. Perhaps the empty String was not encoded as an empty string, but skipped entirely; or perhaps a long count or other number was left out (what is the Schema being compared?). WordCount is often key = word, val = count, so it would need to read the string word and skip the long count. If either of these is left out and not written, I would expect the sort of error below.

I hope that helps,
-Scott

On 9/1/11 5:42 AM, "Friso van Vollenhoven" <[email protected]> wrote:

>Hi All,
>
>I am working on a modified version of the Avro MapReduce support to make
>it play nice with the new Hadoop API (0.20.2). Most of the code is
>borrowed from the Avro mapred package, but I decided not to fully
>abstract away the Mapper and Reducer classes (like Avro does now using
>HadoopMapper and HadoopReducer classes). All else is much the same as the
>mapred implementation.
>
>When testing, I ran into an issue when emitting empty strings (empty
>Utf8) from the mapper as the key.
>I get the following:
>
>org.apache.avro.AvroRuntimeException: java.io.EOFException
>    at org.apache.avro.io.BinaryData.compare(BinaryData.java:74)
>    at org.apache.avro.io.BinaryData.compare(BinaryData.java:60)
>    at org.apache.avro.mapreduce.AvroKeyComparator.compare(AvroKeyComparator.java:45) <== this is my own code
>    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:120)
>    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
>    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)
>Caused by: java.io.EOFException
>    at org.apache.avro.io.BinaryDecoder.readLong(BinaryDecoder.java:182)
>    at org.apache.avro.generic.GenericDatumReader.skip(GenericDatumReader.java:389)
>    at org.apache.avro.io.BinaryData.compare(BinaryData.java:86)
>    at org.apache.avro.io.BinaryData.compare(BinaryData.java:72)
>    ... 8 more
>
>The root cause stack trace is as follows (taken from the debugger, with a
>breakpoint on the throw new EOFException(); line):
>
>Thread [Thread-11] (Suspended (breakpoint at line 182 in BinaryDecoder))
>    BinaryDecoder.readLong() line: 182
>    GenericDatumReader<D>.skip(Schema, Decoder) line: 389
>    BinaryData.compare(BinaryData$Decoders, Schema) line: 86
>    BinaryData.compare(byte[], int, int, byte[], int, int, Schema) line: 72
>    BinaryData.compare(byte[], int, byte[], int, Schema) line: 60
>    AvroKeyComparator<T>.compare(byte[], int, int, byte[], int, int) line: 45
>    Reducer$Context(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).nextKeyValue() line: 120
>    Reducer$Context(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).nextKey() line: 92
>    AvroMapReduceTest$WordCountingAvroReducer(Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).run(Reducer<KEYIN,VALUEIN,KEYOUT,Context>) line: 175
>    ReduceTask.runNewReducer(JobConf, TaskUmbilicalProtocol, TaskReporter, RawKeyValueIterator, RawComparator<INKEY>, Class<INKEY>, Class<INVALUE>) line: 572
>    ReduceTask.run(JobConf, TaskUmbilicalProtocol) line: 414
>    LocalJobRunner$Job.run() line: 256
>
>I went through the decoding code to see where this comes from, but I
>can't immediately spot where it goes wrong. My guess is that the actual
>problem occurs earlier during execution, where pos is possibly advanced
>too often.
>
>Has anyone experienced this? I can live without emitting empty keys from
>MR jobs, but I ran into this while implementing a word count job on a text
>file with empty lines (counting those could be a valid use case). I am
>using Avro 1.5.2.
>
>Thanks for any clues.
>
>
>Cheers,
>Friso
>
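Scott's hypothesis above -- the comparator reads the string key and then tries to skip a long that was never written -- can be reproduced in miniature. The following is a sketch with hypothetical helper names that mimics BinaryDecoder's varint reading and its EOF check; it is not Avro's actual code:

```java
import java.io.EOFException;

public class SkipPastEndSketch {

    // Read a zig-zag varint long from buf at pos[0], advancing pos[0].
    // Throws EOFException when the buffer runs out, like BinaryDecoder.readLong.
    static long readLong(byte[] buf, int[] pos) throws EOFException {
        long z = 0;
        int shift = 0;
        int b;
        do {
            if (pos[0] >= buf.length) throw new EOFException();
            b = buf[pos[0]++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1); // undo zig-zag
    }

    public static void main(String[] args) throws EOFException {
        // A (word, count) record: word = "" (one byte, 0x00), count = 3 (zigzag -> 0x06).
        byte[] complete = {0x00, 0x06};
        int[] pos = {0};
        long len = readLong(complete, pos);          // string length: 0, no bytes to skip
        pos[0] += (int) len;
        System.out.println(readLong(complete, pos)); // the count: 3

        // Truncated record: the writer emitted the empty string but never wrote the count.
        byte[] truncated = {0x00};
        pos[0] = 0;
        readLong(truncated, pos);                    // string length: 0
        try {
            readLong(truncated, pos);                // skip the count -> off the end
        } catch (EOFException e) {
            System.out.println("EOFException while skipping the long, as in the trace above");
        }
    }
}
```

If the writer omits the count for empty keys, the comparator's skip of the second field walks past the end of the serialized key bytes, which matches the reported EOFException inside GenericDatumReader.skip.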
