Very interesting concept you mention there: Avro projections! This does sound like a clever way to leverage Avro's ability to compare without deserialisation, which will obviously be beneficial. As with a lot of Avro-related Hadoop topics, I am not able to find a clear example, but based on what I did manage to find I would like your feedback on my questions:
Does an Avro projection involve defining a secondary schema that describes only the desired subset of fields? Does this imply that when I define my own AvroKeyComparator<A>, the byte arrays will contain only the data for set A? How should the BinaryData compare function be used differently from the base implementation in AvroKeyComparator?

Secondly, I have tried to implement a custom AvroKeyComparator, specifically the compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method. I am woefully unaware of how exactly to do this and cannot find many examples on the topic. Could you perhaps write me a small sample of pseudocode, or point me to some documentation to get me on my way?

2012/9/12 Jacob Metcalf <[email protected]>

> Frank
>
> I have spent a bit of time doing this recently but with MR2 and CDH4 which
> may not be appropriate to your use case. However assuming some
> similarities, I suspect your problem is that you also need to override
> compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) on
> AvroKeyComparator.
>
> The advantage to Avro is that Hadoop does not need to deserialize to sort
> in the shuffle. This function in RawComparator allows Hadoop to quickly
> compare the bytes directly.
>
> Whilst this seems a bit daunting my trick to doing this in MR2 is to
> leverage Avro's excellent support for projections - subsets of schemas. For
> example let's say you want to "group" by attribute A but then "sort" by
> attribute B. In this case I would use a composite key with schema {A, B}
> and the out of the box AvroKeyComparator as the sort comparator. Then I
> would implement my own grouping comparator which uses a schema of just {A}
> then uses the BinaryData function to compare:
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
>
> I assume you can do something similar in MR1.
>
> Regards
>
> Jacob
>
> > Subject: Secondary sort in hadoop with avro
> > From: [email protected]
> > Date: Tue, 11 Sep 2012 17:36:06 +0200
> > To: [email protected]
> >
> > I need to implement secondary sort within an avro based MR sequence. I
> > however find little to no documentation or examples online.
> > I would like to implement this by overriding the 'int
> > compare(AvroWrapper<T> x, AvroWrapper<T> y)' method but I fail to have it
> > invoked.
> > Does anybody have experience implementing secondary sort on deserialised
> > avro objects ?
> >
> > Some help, advice or pointers will be very much appreciated !

--
Mvrgr. Frank
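
For reference, the grouping comparison Jacob describes can be sketched without any Avro dependency. This is only a minimal illustration under stated assumptions, not AvroKeyComparator's actual code. It assumes a hypothetical composite key record {A: long, B: long}, and it relies on Avro's binary encoding writing record fields back to back in schema order, with each long stored as a zig-zag varint. The names GroupingComparatorSketch, encodeKey, readLong and compareGroup are made up for the example. In a real job, a call to Avro's BinaryData.compare(b1, s1, b2, s2, schema) with the {A}-only schema would presumably take the place of the hand-written decoding.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: a raw-bytes grouping comparison on the first field
// of an Avro-encoded key {A: long, B: long}. Not part of Avro or Hadoop.
final class GroupingComparatorSketch {

    // Write a long the way Avro's binary encoding does: zig-zag, then varint.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Serialize the assumed composite key: field A first, then field B.
    static byte[] encodeKey(long a, long b) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, a);
        writeLong(out, b);
        return out.toByteArray();
    }

    // Decode a single zig-zag varint long starting at offset s.
    static long readLong(byte[] buf, int s) {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = buf[s++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    // Same shape as RawComparator.compare: only field A is decoded, so the
    // bytes of field B that follow it are never looked at.
    static int compareGroup(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Long.compare(readLong(b1, s1), readLong(b2, s2));
    }

    public static void main(String[] args) {
        byte[] k1 = encodeKey(7, 1);
        byte[] k2 = encodeKey(7, 99);
        byte[] k3 = encodeKey(-3, 5);
        // Same A, different B: the two keys land in the same group.
        System.out.println(compareGroup(k1, 0, k1.length, k2, 0, k2.length));
        // Different A: the keys are ordered by A alone.
        System.out.println(compareGroup(k3, 0, k3.length, k1, 0, k1.length));
    }
}
```

Because only the leading field is decoded, keys with equal A fall into the same group regardless of B. As far as I can tell, this works only when the projected fields form a prefix of the full key schema, which is why A comes first in {A, B}.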
