I've filed a Jira and posted a patch: https://issues.apache.org/jira/browse/AVRO-944
Can you please tell me whether this patch fixes things for you?

Thanks,
Doug

On 10/19/2011 06:20 PM, Elliott Clark wrote:
> When running a map reduce job using avro mapred, we're having some issues
> with combiners.
>
> When running over a small data set, map-side combiners run and report
> that they combined records. When running over a larger data set,
> combiners run and report that they combined 1.4 billion records into
> 6 million; however, the reduce phase fails with:
>
> 2011-10-19 21:37:34,777 WARN org.apache.hadoop.mapred.ReduceTask:
> attempt_201109220009_0156_r_000000_0 Merge of the inmemory files threw an
> exception: java.io.IOException: Intermediate merge failed
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2714)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2639)
> Caused by: org.apache.avro.AvroRuntimeException: No field named rowKey in: null
>         at org.apache.avro.reflect.ReflectData.findField(ReflectData.java:194)
>         at org.apache.avro.reflect.ReflectData.getField(ReflectData.java:179)
>         at org.apache.avro.reflect.ReflectData.getField(ReflectData.java:96)
>         at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:102)
>         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:65)
>         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:102)
>         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:57)
>         at org.apache.avro.mapred.AvroSerialization$AvroWrapperSerializer.serialize(AvroSerialization.java:131)
>         at org.apache.avro.mapred.AvroSerialization$AvroWrapperSerializer.serialize(AvroSerialization.java:114)
>         at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
>         at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1025)
>         at org.apache.avro.mapred.HadoopCombiner$PairCollector.collect(HadoopCombiner.java:52)
>         at org.apache.avro.mapred.HadoopCombiner$PairCollector.collect(HadoopCombiner.java:40)
>         at com.ngmoco.ngpipes.sourcing.NgBucketingEventCountingCombiner.reduce(NgBucketingEventCountingCombiner.java:63)
>         at com.ngmoco.ngpipes.sourcing.NgBucketingEventCountingCombiner.reduce(NgBucketingEventCountingCombiner.java:17)
>         at org.apache.avro.mapred.HadoopReducerBase.reduce(HadoopReducerBase.java:61)
>         at org.apache.avro.mapred.HadoopReducerBase.reduce(HadoopReducerBase.java:30)
>         at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1296)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2701)
>         ... 1 more
>
> rowKey is only present in our output schema. Looking at the code, it
> appears the combiner is using the wrong collector.
>
> Commenting out the combiner makes everything work, and running over a
> smaller dataset also works. Basically, anything that keeps the code from
> https://issues.apache.org/jira/browse/HADOOP-3226 from running means the
> job works.
>
> Any ideas on how to fix this? The above patch to Hadoop was committed to
> trunk without any additional tests, so I'm not sure how to reproduce this
> at a small, non-distributed scale for a unit test.
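The diagnosis in the report (the combiner serializing its output against the final output schema, which has a rowKey field, instead of the map-output schema it actually emits) can be illustrated with a toy sketch. This is plain Java, not Avro's real classes; the field names and the `write` helper are hypothetical stand-ins borrowed from the report:

```java
import java.util.*;

// Toy model of the failure mode: a "writer" checks that every field
// declared in its schema is present in the record being serialized,
// the way Avro's ReflectData.findField fails when a field is missing.
public class SchemaMismatchDemo {

    // Hypothetical stand-in for serializing a record against a schema.
    public static void write(List<String> schemaFields, Map<String, Object> record) {
        for (String f : schemaFields) {
            if (!record.containsKey(f)) {
                throw new RuntimeException("No field named " + f + " in: " + record.keySet());
            }
        }
    }

    public static void main(String[] args) {
        // Map-output schema: what the combiner receives and must re-emit.
        List<String> mapOutputSchema = Arrays.asList("eventName", "count");
        // Final output schema: only the reducer produces records with rowKey.
        List<String> outputSchema = Arrays.asList("rowKey", "count");

        Map<String, Object> combined = new HashMap<>();
        combined.put("eventName", "login");
        combined.put("count", 6_000_000L);

        // Correct path: combiner output serialized with the map-output schema.
        write(mapOutputSchema, combined);
        System.out.println("map-output schema: ok");

        // Buggy path: a collector wired to the final output schema fails,
        // because the combined record has no rowKey field.
        try {
            write(outputSchema, combined);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // "No field named rowKey in: ..."
        }
    }
}
```

The sketch only models why the combiner must use a collector bound to the map-output schema; the actual fix is in the AVRO-944 patch above.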
