That makes a lot of sense - thanks, and I'll give it a shot!
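Stripped of the Hadoop plumbing, the reduce-side join Martin describes in the thread below comes down to: the mapper tags each record with which input it came from, the shuffle groups tagged values by join key, and the reducer pairs up the two sides. A minimal plain-Java sketch of that reducer logic, using simple strings as stand-ins for the serialized records (the `Tagged` class, the `"left"`/`"right"` tags, and all payload values are illustrative, not from the thread; in the real job the tagged value would be a custom Hadoop Writable):

```java
import java.util.ArrayList;
import java.util.List;

public class ReduceSideJoinSketch {

    // Stand-in for a mapper output value: the record's serialized form plus a
    // tag saying which of the two input files it came from.
    public static class Tagged {
        public final String source;
        public final String payload;
        public Tagged(String source, String payload) {
            this.source = source;
            this.payload = payload;
        }
    }

    // Reducer logic for a single join key: split the tagged values by source,
    // then emit every left/right pairing (an inner join on that key).
    public static List<String> joinOneKey(List<Tagged> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Tagged t : values) {
            if (t.source.equals("left")) {
                left.add(t.payload);
            } else {
                right.add(t.payload);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                joined.add(l + "|" + r);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Simulated shuffle output for one join key: one record from the left
        // file, two from the right file.
        List<Tagged> grouped = List.of(
                new Tagged("left", "user:alice"),
                new Tagged("right", "click:home"),
                new Tagged("right", "click:search"));
        System.out.println(joinOneKey(grouped));
        // prints [user:alice|click:home, user:alice|click:search]
    }
}
```

The "annoying boilerplate" Martin mentions is everything around this: serializing the Avro record into the tagged Writable in the mapper and deserializing it again in the reducer.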
On Fri, Apr 26, 2013 at 1:41 PM, Johannes Schulte <[email protected]> wrote:

> Sripad,
>
> have you considered simply using a union of the two schemas as the input
> schema?
>
> Schema.createUnion(Lists.newArrayList(schema1, schema2));
>
> In the mapper you have to check for the record type / schema name /
> SpecificRecord instance to extract your join key, but otherwise it's really
> straightforward.
>
> Johannes
>
>
> On Thu, Apr 25, 2013 at 6:05 PM, Martin Kleppmann <[email protected]> wrote:
>
>> I'm afraid I don't have an example -- the code I have is very entangled
>> with our internal stuff; it would take a while to extract the
>> general-purpose parts.
>>
>> I do mean <AvroWrapper<GenericRecord>, NullWritable> as input for
>> mappers, since those are the types produced by AvroInputFormat:
>> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java?view=markup
>>
>> The reducer input types are just your mapper output types, so you can
>> choose those yourself (any Hadoop writables).
>>
>> Martin
>>
>>
>> On 25 April 2013 08:26, Sripad Sriram <[email protected]> wrote:
>>
>>> Thanks! Martin, would you happen to have a gist of an example? Did you
>>> mean the reducer input is NullWritable?
>>>
>>> On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <[email protected]> wrote:
>>>
>>> Oh, sorry, you're right. I was too hasty.
>>>
>>> One approach that I've used for joining Avro inputs is to use regular
>>> Hadoop mappers and reducers (instead of AvroMapper/AvroReducer) with
>>> MultipleInputs and AvroInputFormat. Your mapper input key type is then
>>> AvroWrapper<GenericRecord>, and mapper input value type is NullWritable.
>>> This approach uses Hadoop sequence files (rather than Avro files) between
>>> mappers and reducers, so you have to take care of serializing mapper output
>>> and deserializing reducer input yourself. It works, but you have to write
>>> quite a bit of annoying boilerplate code.
>>>
>>> I'd also be interested if anyone has a better solution. Perhaps we just
>>> need to create the AvroMultipleInputs that I thought existed, but doesn't :)
>>>
>>> Martin
>>>
>>>
>>> On 24 April 2013 12:02, Sripad Sriram <[email protected]> wrote:
>>>
>>>> Hey Martin,
>>>>
>>>> I think those classes refer to outputting to multiple files rather than
>>>> reading from multiple files, which is what's needed for a reduce-side join.
>>>>
>>>> thanks,
>>>> Sripad
>>>>
>>>>
>>>> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann <[email protected]> wrote:
>>>>
>>>>> Hey Sripad,
>>>>>
>>>>> Take a look at AvroMultipleInputs.
>>>>>
>>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred version)
>>>>>
>>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce version)
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> On 23 April 2013 17:01, Sripad Sriram <[email protected]> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> Aware that I can use Pig, Hive, etc. to join Avro files together, but
>>>>>> I have several use cases where I need to perform a reduce-side join on two
>>>>>> Avro files. MultipleInputs doesn't seem to like AvroInputFormat -- any
>>>>>> thoughts?
>>>>>>
>>>>>> thanks!
>>>>>> Sripad
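The record-type dispatch Johannes mentions above is the only per-record work his union-schema approach adds to the mapper: look at the incoming record's schema name and read the join key from the matching field. A sketch of that branching, using a plain stand-in class instead of Avro's GenericRecord so it is self-contained (the `User`/`Click` schema names and the field names are made up for illustration; with real Avro you would call `record.getSchema().getName()` and `record.get(fieldName)`):

```java
import java.util.Map;

public class UnionJoinKeySketch {

    // Minimal stand-in for an Avro GenericRecord: a schema name plus fields.
    public static class FakeRecord {
        public final String schemaName;
        public final Map<String, Object> fields;
        public FakeRecord(String schemaName, Map<String, Object> fields) {
            this.schemaName = schemaName;
            this.fields = fields;
        }
    }

    // Mapper-side dispatch: a union input schema hands the mapper either
    // record type, so branch on the schema name to find the join key.
    public static Object extractJoinKey(FakeRecord record) {
        switch (record.schemaName) {
            case "User":
                return record.fields.get("userId");
            case "Click":
                return record.fields.get("clickedBy");
            default:
                throw new IllegalArgumentException(
                        "Unexpected schema: " + record.schemaName);
        }
    }

    public static void main(String[] args) {
        FakeRecord user = new FakeRecord("User", Map.of("userId", "u42"));
        FakeRecord click = new FakeRecord("Click", Map.of("clickedBy", "u42"));
        System.out.println(extractJoinKey(user));   // u42
        System.out.println(extractJoinKey(click));  // u42
    }
}
```

The schema-name string is enough to distinguish the two inputs because the union members must have distinct full names; with specific records you could equally use an `instanceof` check, as Johannes notes.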
