RE: Secondary sort in hadoop with avro

Jacob Metcalf Tue, 11 Sep 2012 15:09:51 -0700

Frank,
I have spent a bit of time doing this recently, but with MR2 and CDH4, which may 
not be appropriate to your use case. However, assuming some similarities, I 
suspect your problem is that you also need to override compare(byte[] b1, int 
s1, int l1, byte[] b2, int s2, int l2) on AvroKeyComparator.
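The advantage of Avro is that Hadoop does not need to deserialize keys in order 
to sort them in the shuffle. This method, declared by RawComparator, lets 
Hadoop compare the serialized bytes directly, which is presumably why the 
object-level compare you overrode is never invoked.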
Whilst this seems a bit daunting, my trick for doing this in MR2 is to leverage 
Avro's excellent support for projections (subsets of schemas). For example, 
let's say you want to "group" by attribute A but then "sort" by attribute B. In 
this case I would use a composite key with schema {A, B} and the out-of-the-box 
AvroKeyComparator as the sort comparator. Then I would implement my own 
grouping comparator, which uses a schema of just {A} and then calls 
BinaryData.compare to compare the raw bytes:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
Note that A has to be the leading field of the key schema for this to work, 
since the binary comparison reads the bytes in field order.
I assume you can do something similar in MR1.
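
To make the group-by-A, sort-by-B idea concrete, here is a minimal, library-free 
sketch. It does not use Avro or Hadoop: the class SecondarySortSketch, the 
fixed-width {A, B} encoding and the shuffle helper are all hypothetical and 
only simulate what the framework does. In a real job the two comparators would 
be RawComparator implementations, and BinaryData would do the field-by-field 
comparison against the {A, B} and {A} schemas.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Library-free sketch: keys are serialized as two 4-byte big-endian ints {A, B}.
// This fixed-width layout is an assumption for illustration, not Avro's encoding.
final class SecondarySortSketch {

    // Serialize the composite key {A, B}.
    static byte[] encode(int a, int b) {
        return ByteBuffer.allocate(8).putInt(a).putInt(b).array();
    }

    // Read one int field straight out of the buffer, without building a key object.
    static int field(byte[] buf, int start, int index) {
        return ByteBuffer.wrap(buf).getInt(start + 4 * index);
    }

    // Sort comparator: compares the full key, A first and then B.
    // The lengths l1/l2 are unused here; they only mirror the raw-compare signature.
    static int compareFull(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int byA = compareGroup(b1, s1, l1, b2, s2, l2);
        return byA != 0 ? byA : Integer.compare(field(b1, s1, 1), field(b2, s2, 1));
    }

    // Grouping comparator: reads only the leading field A, like a {A} projection.
    static int compareGroup(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(field(b1, s1, 0), field(b2, s2, 0));
    }

    // Mimic the shuffle: sort by the full key, then start a new group whenever
    // the grouping comparator reports a difference between neighbouring keys.
    // Each group collects the B values in the order the reducer would see them.
    static List<List<Integer>> shuffle(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort((x, y) -> compareFull(x, 0, x.length, y, 0, y.length));
        List<List<Integer>> groups = new ArrayList<>();
        byte[] previous = null;
        for (byte[] key : sorted) {
            if (previous == null
                    || compareGroup(previous, 0, previous.length, key, 0, key.length) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(field(key, 0, 1));
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<byte[]> keys = new ArrayList<>();
        keys.add(encode(2, 9));
        keys.add(encode(1, 7));
        keys.add(encode(2, 3));
        keys.add(encode(1, 4));
        System.out.println(shuffle(keys)); // prints [[4, 7], [3, 9]]
    }
}
```

Each inner list is one reduce group (one value of A), and within it the B 
values arrive already sorted, which is the whole point of the secondary sort.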
Regards
Jacob


> Subject: Secondary sort in hadoop with avro
> From: [email protected]
> Date: Tue, 11 Sep 2012 17:36:06 +0200
> To: [email protected]
> 
> I need to implement secondary sort within an Avro-based MR sequence. However, 
> I find little to no documentation or examples online.
> I would like to implement this by overriding the 'int compare(AvroWrapper<T> 
> x, AvroWrapper<T> y)' method, but I fail to have it invoked.
> Does anybody have experience implementing secondary sort on deserialised Avro 
> objects?
> 
> Any help, advice or pointers will be very much appreciated!
