Suggestions when using Pair.getPairSchema for Reduce-Side Joins in MR2

Jacob Metcalf Wed, 27 Jun 2012 15:10:30 -0700

I spent an hour or so today debugging some map reduce jobs I had developed 
in Avro 1.7 and Map Reduce 2 and thought it might be constructive to share. I 
needed to do a reduce-side join, for which you need a composite key. The key 
consists of the key you are actually grouping by and an integer which is just 
used for sorting (the technique is described in many places, but there is a nice 
picture on page 24 of 
http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf). 
For this I thought it would be ideal to use Avro's Pair class, which has a handy 
function for creating its own schema, so I could configure the shuffle something 
like this:

    Schema joinKeySchema = Pair.getPairSchema(
        Schema.create(Schema.Type.STRING),
        Schema.create(Schema.Type.INT));
    AvroJob.setMapOutputKeySchema(job, joinKeySchema);

I then planned to use the standard AvroKeyComparator for 
sorting and a specialised comparator for grouping/partitioning which would 
ignore the integer part. However, it did not work: the sort on the integer did 
not appear to take place, and my map output would arrive in the wrong order at 
the reducer. I finally tracked the issue down to the fact that the pair schema 
by default ignores the second part of the pair:

    private static Schema makePairSchema(Schema key, Schema value) {
        Schema pair = Schema.createRecord(PAIR, null, null, false);
        List<Field> fields = new ArrayList<Field>();
        fields.add(new Field(KEY, key, "", null));
        fields.add(new Field(VALUE, value, "", null, Field.Order.IGNORE));
        pair.setFields(fields);
        return pair;
    }

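
To illustrate why the ignored second field matters, here is a minimal plain-Java sketch with no Avro or Hadoop involved. The JoinKey class and the SORT and GROUP comparators are hypothetical names made up for the illustration. SORT orders by key and then by the integer, which is what the shuffle needs. GROUP looks at the key only, which is effectively what a schema with Field.Order.IGNORE on the second field gives you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CompositeKeySketch {
    // Hypothetical composite key: the join key plus an integer used only for sorting.
    static final class JoinKey {
        final String key;
        final int tag;
        JoinKey(String key, int tag) { this.key = key; this.tag = tag; }
        @Override public String toString() { return key + ":" + tag; }
    }

    // Ordering wanted for the sort phase: by key, then by tag.
    static final Comparator<JoinKey> SORT =
        Comparator.comparing((JoinKey k) -> k.key).thenComparingInt(k -> k.tag);

    // Ordering wanted for grouping/partitioning: key only, tag ignored.
    static final Comparator<JoinKey> GROUP =
        Comparator.comparing((JoinKey k) -> k.key);

    // Returns a sorted copy so the input list is left untouched.
    static List<JoinKey> sorted(List<JoinKey> in, Comparator<JoinKey> c) {
        List<JoinKey> out = new ArrayList<>(in);
        out.sort(c);
        return out;
    }

    static List<JoinKey> sampleKeys() {
        return Arrays.asList(new JoinKey("b", 1), new JoinKey("a", 1), new JoinKey("a", 0));
    }

    public static void main(String[] args) {
        // Full ordering: within key "a", tag 0 comes before tag 1.
        System.out.println(sorted(sampleKeys(), SORT));   // [a:0, a:1, b:1]
        // Key-only ordering: tags stay in input order (List.sort is stable),
        // so a:1 still precedes a:0.
        System.out.println(sorted(sampleKeys(), GROUP));  // [a:1, a:0, b:1]
    }
}
```

If the key-only ordering is used for the sort as well as the grouping, nothing puts the records for one key into tag order, which matches the symptom described above.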
In the end it was easy enough to work around by creating my own pair schema. I 
am not an expert, but I suspect there is a very valid application for this 
ignore in MR1. As a suggestion, it may help going forward if a second version 
of getPairSchema with a boolean to toggle the ignore were introduced, to make 
the semantics clearer.
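
For anyone hitting the same problem, one way such a workaround might look is sketched below. It is an assumption about the general shape, not the exact schema I used: the record name JoinKey and the field names key and tag are hypothetical, and the schema is parsed from JSON so that both fields keep the default ascending order. GenericData.compare is used only to check that the integer now takes part in the comparison.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class JoinKeySchemaSketch {
    // Hypothetical record and field names. No field is marked "ignore",
    // so both take part in sorting ("ascending" is also the default).
    static final String JOIN_KEY_JSON =
        "{\"type\":\"record\",\"name\":\"JoinKey\",\"fields\":["
      + "{\"name\":\"key\",\"type\":\"string\"},"
      + "{\"name\":\"tag\",\"type\":\"int\",\"order\":\"ascending\"}]}";

    static Schema joinKeySchema() {
        return new Schema.Parser().parse(JOIN_KEY_JSON);
    }

    // Builds one composite key as a generic record.
    static GenericRecord joinKey(Schema schema, String key, int tag) {
        GenericRecord r = new GenericData.Record(schema);
        r.put("key", key);
        r.put("tag", tag);
        return r;
    }

    // Negative when a sorts before b under the schema's field orders.
    static int compare(Schema schema, GenericRecord a, GenericRecord b) {
        return GenericData.get().compare(a, b, schema);
    }

    public static void main(String[] args) {
        Schema s = joinKeySchema();
        // Same key, different tag: the tag now decides the order.
        System.out.println(compare(s, joinKey(s, "a", 0), joinKey(s, "a", 1)) < 0);
    }
}
```

A schema built this way would then be passed to AvroJob.setMapOutputKeySchema in place of the one from Pair.getPairSchema, with the grouping/partitioning comparator still looking only at the key field.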
Jacob