Sripad, have you considered simply using a union of the two schemas as the input schema?
Schema.createUnion(Lists.newArrayList(schema1, schema2));

In the mapper you have to check the record type / schema name / SpecificRecord
instance to extract your join key, but otherwise it's really straightforward.

Johannes

On Thu, Apr 25, 2013 at 6:05 PM, Martin Kleppmann <[email protected]> wrote:

> I'm afraid I don't have an example -- the code I have is very entangled
> with our internal stuff; it would take a while to extract the
> general-purpose parts.
>
> I do mean <AvroWrapper<GenericRecord>, NullWritable> as input for the
> mappers, since those are the types produced by AvroInputFormat:
> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java?view=markup
>
> The reducer input types are just your mapper output types, so you can
> choose those yourself (any Hadoop writables).
>
> Martin
>
>
> On 25 April 2013 08:26, Sripad Sriram <[email protected]> wrote:
>
>> Thanks! Martin, would you happen to have a gist of an example? Did you
>> mean the reducer input is NullWritable?
>>
>> On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <[email protected]> wrote:
>>
>> Oh, sorry, you're right. I was too hasty.
>>
>> One approach that I've used for joining Avro inputs is to use regular
>> Hadoop mappers and reducers (instead of AvroMapper/AvroReducer) with
>> MultipleInputs and AvroInputFormat. Your mapper input key type is then
>> AvroWrapper<GenericRecord>, and the mapper input value type is NullWritable.
>> This approach uses Hadoop sequence files (rather than Avro files) between
>> the mappers and reducers, so you have to take care of serializing the
>> mapper output and deserializing the reducer input yourself. It works, but
>> you have to write quite a bit of annoying boilerplate code.
>>
>> I'd also be interested if anyone has a better solution. Perhaps we just
>> need to create the AvroMultipleInputs that I thought existed, but doesn't :)
>>
>> Martin
>>
>>
>> On 24 April 2013 12:02, Sripad Sriram <[email protected]> wrote:
>>
>>> Hey Martin,
>>>
>>> I think those classes refer to outputting to multiple files rather than
>>> reading from multiple files, which is what's needed for a reduce-side join.
>>>
>>> thanks,
>>> Sripad
>>>
>>>
>>> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann <[email protected]> wrote:
>>>
>>>> Hey Sripad,
>>>>
>>>> Take a look at AvroMultipleInputs.
>>>>
>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred version)
>>>>
>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce version)
>>>>
>>>> Martin
>>>>
>>>>
>>>> On 23 April 2013 17:01, Sripad Sriram <[email protected]> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> I'm aware that I can use Pig, Hive, etc. to join Avro files together,
>>>>> but I have several use cases where I need to perform a reduce-side join
>>>>> on two Avro files. MultipleInputs doesn't seem to like AvroInputFormat --
>>>>> any thoughts?
>>>>>
>>>>> thanks!
>>>>> Sripad
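
To make Johannes's union-schema suggestion concrete, here is a minimal sketch
against Avro 1.7.x and the old mapred API. The schema name (com.example.User),
the field names (user_id, uid), the input paths, and the class names are all
hypothetical placeholders rather than anything from the thread; adapt them to
your own records.

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroCollector;
    import org.apache.avro.mapred.AvroJob;
    import org.apache.avro.mapred.AvroMapper;
    import org.apache.avro.mapred.Pair;
    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Reporter;

    public class UnionJoin {

      // The mapper sees records from both inputs; it tells them apart by
      // schema name (or by SpecificRecord instance) and emits (join key, record).
      public static class JoinMapper
          extends AvroMapper<GenericRecord, Pair<Utf8, GenericRecord>> {
        @Override
        public void map(GenericRecord record,
                        AvroCollector<Pair<Utf8, GenericRecord>> collector,
                        Reporter reporter) throws IOException {
          String schemaName = record.getSchema().getFullName();
          Object joinKey = schemaName.equals("com.example.User")  // placeholder schema name
              ? record.get("user_id")                             // key field of dataset 1
              : record.get("uid");                                // key field of dataset 2
          collector.collect(
              new Pair<Utf8, GenericRecord>(new Utf8(joinKey.toString()), record));
        }
      }

      // userSchema and eventSchema are the two input schemas, however you load them.
      public static void configure(JobConf conf, Schema userSchema, Schema eventSchema) {
        Schema union = Schema.createUnion(Arrays.asList(userSchema, eventSchema));
        AvroJob.setInputSchema(conf, union);
        AvroJob.setMapOutputSchema(conf,
            Pair.getPairSchema(Schema.create(Schema.Type.STRING), union));
        AvroJob.setMapperClass(conf, JoinMapper.class);
        FileInputFormat.addInputPath(conf, new Path("/data/users"));   // placeholder paths
        FileInputFormat.addInputPath(conf, new Path("/data/events"));
        // ...then set an AvroReducer and output schema as usual.
      }
    }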
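
For comparison, here is a sketch of the MultipleInputs + AvroInputFormat
approach Martin describes: each input gets a plain Hadoop mapper, and the
record is re-encoded as Avro binary in a BytesWritable for the shuffle, which
is the boilerplate he mentions. Again, the field names, paths, and class names
are assumptions for illustration only; the reducer (not shown) decodes each
BytesWritable with a GenericDatumReader.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.mapred.AvroInputFormat;
    import org.apache.avro.mapred.AvroWrapper;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class MultipleInputsJoin {

      // One plain Hadoop mapper per input. AvroInputFormat delivers
      // <AvroWrapper<GenericRecord>, NullWritable>; we re-encode each record
      // as Avro binary inside a BytesWritable so it can cross the shuffle.
      // Note: in a real job you would also tag each value with its source
      // (e.g. a one-byte prefix) so the reducer knows which schema to decode with.
      public abstract static class AvroJoinMapper extends MapReduceBase
          implements Mapper<AvroWrapper<GenericRecord>, NullWritable, Text, BytesWritable> {

        protected abstract String joinKey(GenericRecord record);

        public void map(AvroWrapper<GenericRecord> wrapper, NullWritable ignore,
                        OutputCollector<Text, BytesWritable> out, Reporter reporter)
            throws IOException {
          GenericRecord record = wrapper.datum();
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
          new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
          encoder.flush();
          out.collect(new Text(joinKey(record)), new BytesWritable(bytes.toByteArray()));
        }
      }

      public static class UserMapper extends AvroJoinMapper {
        protected String joinKey(GenericRecord r) { return r.get("user_id").toString(); }  // placeholder
      }

      public static class EventMapper extends AvroJoinMapper {
        protected String joinKey(GenericRecord r) { return r.get("uid").toString(); }      // placeholder
      }

      public static void configure(JobConf conf) {
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(BytesWritable.class);
        MultipleInputs.addInputPath(conf, new Path("/data/users"),    // placeholder paths
            AvroInputFormat.class, UserMapper.class);
        MultipleInputs.addInputPath(conf, new Path("/data/events"),
            AvroInputFormat.class, EventMapper.class);
        // The reducer receives <Text, Iterator<BytesWritable>> and performs the join
        // after decoding each value back into a GenericRecord.
      }
    }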
