I'm afraid I don't have an example -- the code I have is very entangled with our internal stuff; it would take a while to extract the general-purpose parts.
I do mean <AvroWrapper<GenericRecord>, NullWritable> as input for mappers, since those are the types produced by AvroInputFormat:
http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java?view=markup

The reducer input types are just your mapper output types, so you can choose those yourself (any Hadoop writables).

Martin

On 25 April 2013 08:26, Sripad Sriram <[email protected]> wrote:

> Thanks! Martin, would you happen to have a gist of an example? Did you
> mean the reducer input is NullWritable?
>
> On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <[email protected]>
> wrote:
>
> Oh, sorry, you're right. I was too hasty.
>
> One approach that I've used for joining Avro inputs is to use regular
> Hadoop mappers and reducers (instead of AvroMapper/AvroReducer) with
> MultipleInputs and AvroInputFormat. Your mapper input key type is then
> AvroWrapper<GenericRecord>, and mapper input value type is NullWritable.
> This approach uses Hadoop sequence files (rather than Avro files) between
> mappers and reducers, so you have to take care of serializing mapper output
> and unserializing reducer input yourself. It works, but you have to write
> quite a bit of annoying boilerplate code.
>
> I'd also be interested if anyone has a better solution. Perhaps we just
> need to create the AvroMultipleInputs that I thought existed, but doesn't :)
>
> Martin
>
>
> On 24 April 2013 12:02, Sripad Sriram <[email protected]> wrote:
>
>> Hey Martin,
>>
>> I think those classes refer to outputting to multiple files rather than
>> reading from multiple files, which is what's needed for a reduce-side join.
>>
>> thanks,
>> Sripad
>>
>>
>> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann
>> <[email protected]> wrote:
>>
>>> Hey Sripad,
>>>
>>> Take a look at AvroMultipleInputs.
>>>
>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred
>>> version)
>>>
>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce
>>> version)
>>>
>>> Martin
>>>
>>>
>>> On 23 April 2013 17:01, Sripad Sriram <[email protected]> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> Aware that I can use Pig, Hive, etc to join avro files together, but I
>>>> have several use cases where I need to perform a reduce-side join on two
>>>> avro files. MultipleInputs doesn't seem to like AvroInputFormat - any
>>>> thoughts?
>>>>
>>>> thanks!
>>>> Sripad
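
Since the thread has no code in it, here is a rough in-memory sketch of the reduce-side join idea it discusses: tag each record with the input it came from, group by join key, and pair up the two sides in the reducer. It is plain Java with no Hadoop or Avro dependencies, so it is not the job setup described above. The class name, method names, tag strings, and sample records are all hypothetical. In a real job along those lines, the map step would be a Hadoop mapper that receives AvroWrapper<GenericRecord> keys and NullWritable values, and the tagged values would be Hadoop writables rather than strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    // Hypothetical tags recording which input a value came from.
    static final String LEFT = "L";
    static final String RIGHT = "R";

    // "Map" step: emit (joinKey, tag + TAB + payload) into a map that
    // stands in for the shuffle between mappers and reducers.
    static void map(String tag, String joinKey, String payload,
                    Map<String, List<String>> shuffle) {
        shuffle.computeIfAbsent(joinKey, k -> new ArrayList<>())
               .add(tag + "\t" + payload);
    }

    // "Reduce" step: split the values for one key by tag, then emit every
    // left/right pair. Keys present on only one side produce no output,
    // so this is an inner join.
    static List<String> reduce(String joinKey, List<String> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String value : values) {
            String[] parts = value.split("\t", 2);
            if (LEFT.equals(parts[0])) {
                left.add(parts[1]);
            } else {
                right.add(parts[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(joinKey + "," + l + "," + r);
            }
        }
        return out;
    }

    // Runs both "mappers", then the "reducer" once per key in sorted order.
    // Each record is a two-element array: {joinKey, payload}.
    public static List<String> join(String[][] leftRecords, String[][] rightRecords) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String[] rec : leftRecords) {
            map(LEFT, rec[0], rec[1], shuffle);
        }
        for (String[] rec : rightRecords) {
            map(RIGHT, rec[0], rec[1], shuffle);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] users = {{"1", "alice"}, {"2", "bob"}};
        String[][] orders = {{"1", "book"}, {"1", "pen"}, {"3", "lamp"}};
        // Prints "1,alice,book" then "1,alice,pen".
        for (String line : join(users, orders)) {
            System.out.println(line);
        }
    }
}
```

The tag prefix is the part that corresponds to the serializing and unserializing boilerplate mentioned in the thread: the reducer sees one mixed stream of values per key and needs some way to tell which input each value came from.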
