I suspect the best way would be to work out how to apply the techniques to MR1.
However for MR2 support look at AVRO-593 and odiago-avro on github. Garret Wu has written a series of extensions which support use of Avro in the shuffle. These have been integrated into Avro as of 1.7.

Jacob

-----Original Message-----
From: Frank Kootte
Sent: 12 Sep 2012 14:42:29 GMT
To: [email protected]
Subject: Re: Secondary sort in hadoop with avro

I would like to use MR2 in conjunction with avro but cannot find much documentation on the topic. Do you have any pointers in that direction ?
AVRO 1.7.1 does not have any AvroReducer / Mapper in the mapreduce package. I didn't look into it enough to see whether compatibility with the v2 API is now handled transparently under the hood.

In short I am having tremendous trouble finding documentation on the topic. Hopefully you guys are able to help me along.

2012/9/12 Frank Kootte <[email protected]>

> Very interesting concept you mention there - avro projections !
> This sounds indeed like a clever way to leverage the avro capability of
> comparison without deserialisation, which will obviously be beneficial.
> Now as with a lot of avro related hadoop topics I am not able to find a
> clear example, but from what I did manage to find I would like to get your
> feedback on my questions -
>
> Does avro projection involve defining a secondary schema describing only
> the desired subset of fields ?
> Does this then imply that when I define my own AvroKeyComparator<A> the
> byte arrays will only contain the data for set A ?
> How should the BinaryCompare be used differently from the base impl
> in AvroKeyComparator ?
>
> Secondly I've tried to implement a custom AvroKeyComparator and
> specifically the - compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int
> l2) - method.
> I am woefully unaware of how exactly to do this and cannot find a lot of
> examples on the topic.
>
> Could you write me a small sample of pseudo code perhaps ?
> Or point me to some documentation to get me on my way ?
>
>
> 2012/9/12 Jacob Metcalf <[email protected]>
>
>> Frank
>>
>> I have spent a bit of time doing this recently but with MR2 and CDH4
>> which may not be appropriate to your use case. However assuming some
>> similarities, I suspect your problem is that you also need to override
>> compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) on
>> AvroKeyComparator.
>>
>> The advantage of Avro is that Hadoop does not need to deserialize to sort
>> in the shuffle. This function in RawComparator allows Hadoop to quickly
>> compare the bytes directly.
>>
>> Whilst this seems a bit daunting my trick to doing this in MR2 is to
>> leverage Avro's excellent support for projections - subsets of schemas. For
>> example let's say you want to "group" by attribute A but then "sort" by
>> attribute B. In this case I would use a composite key with schema {A, B}
>> and the out of the box AvroKeyComparator as the sort comparator. Then I
>> would implement my own grouping comparator which uses a schema of just {A}
>> then uses the BinaryData function to compare:
>>
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
>>
>> I assume you can do something similar in MR1.
>>
>> Regards
>>
>> Jacob
>>
>> > Subject: Secondary sort in hadoop with avro
>> > From: [email protected]
>> > Date: Tue, 11 Sep 2012 17:36:06 +0200
>> > To: [email protected]
>> >
>> > I need to implement secondary sort within an avro based MR sequence. I
>> > however find little to no documentation or examples online.
>> > I would like to implement this by overriding the 'int
>> > compare(AvroWrapper<T> x, AvroWrapper<T> y)' method but I fail to have it
>> > invoked.
>> > Does anybody have experience implementing secondary sort on
>> > deserialised avro objects ?
>> >
>> > Some help, advice or pointers will be very much appreciated !
>> >
>
>
> --
> Mvrgr. Frank
>

--
Mvrgr. Frank
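
For anyone following the thread, the group-by-A, sort-by-B idea described above can be sketched without any Hadoop or Avro dependency. This is only an illustration under stated assumptions: the class SecondarySortSketch, its methods, and the fixed-width encoding (two 4-byte big-endian ints) are all hypothetical stand-ins, and this is not Avro's binary encoding. In a real job the grouping comparison would be delegated to Avro's BinaryData with the {A} schema, as Jacob describes. The sketch only shows how a raw comparator over {A, B} and a second raw comparator over {A} alone work together.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SecondarySortSketch {

    // Hypothetical key encoding: field A then field B, each a 4-byte big-endian int.
    // (Assumption for illustration only; Avro's real encoding differs.)
    static byte[] encode(int a, int b) {
        return ByteBuffer.allocate(8).putInt(a).putInt(b).array();
    }

    static int readInt(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf, offset, 4).getInt();
    }

    // Sort comparator: compares the full composite key {A, B} on the raw bytes.
    static int compareSort(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int c = Integer.compare(readInt(b1, s1), readInt(b2, s2));
        if (c != 0) {
            return c;
        }
        return Integer.compare(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
    }

    // Grouping comparator: looks only at the leading field A (the "projection")
    // and ignores the remaining bytes of the key.
    static int compareGroup(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    // Simulates the shuffle: sort all keys with the sort comparator, then start a
    // new reduce group whenever the grouping comparator reports a difference.
    // Each group collects the B values in the order the reducer would see them.
    static List<List<Integer>> shuffle(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort((x, y) -> compareSort(x, 0, x.length, y, 0, y.length));

        List<List<Integer>> groups = new ArrayList<>();
        byte[] prev = null;
        for (byte[] key : sorted) {
            if (prev == null
                    || compareGroup(prev, 0, prev.length, key, 0, key.length) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(readInt(key, 4));
            prev = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(
                encode(2, 9), encode(1, 5), encode(2, 3), encode(1, 1));
        System.out.println(shuffle(keys)); // prints [[1, 5], [3, 9]]
    }
}
```

Each inner list corresponds to one reduce call: keys sharing the same A arrive together, and within the group the B values are already in ascending order because the sort comparator looked at both fields.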
