Re: Secondary sort in hadoop with avro

Jacob Metcalf Thu, 13 Sep 2012 02:25:22 -0700

I suspect the best way would be to work out how to apply the techniques to MR1.


However for MR2 support look at AVRO-593 and odiago-avro on github. Garret Wu 
has written a series of extensions which support use of Avro in the shuffle. 
These have been integrated into Avro as of 1.7.

Jacob

-----Original Message-----

From: Frank Kootte
Sent: 12 Sep 2012 14:42:29 GMT
To: [email protected]
Subject: Re: Secondary sort in hadoop with avro

I would like to use MR2 in conjunction with Avro but cannot find much
documentation on the topic. Do you have any pointers in that direction?
Avro 1.7.1 does not have any AvroReducer / AvroMapper in the mapreduce package.
I didn't look into it enough to see whether compatibility with the v2 API
is perhaps handled transparently under the hood now.
In short, I am having tremendous trouble finding documentation on the topic.
Hopefully you guys are able to help me along.


2012/9/12 Frank Kootte <[email protected]>

> Very interesting concept you mention there - Avro projections!
> This does sound like a clever way to leverage Avro's ability to compare
> without deserialisation, which will obviously be beneficial.
> As with a lot of Avro-related Hadoop topics I am not able to find a
> clear example, but based on what I did manage to find I would like to get
> your feedback on my questions -
>
> Does Avro projection involve defining a secondary schema describing only
> the desired subset of fields?
> Does this then imply that when I define my own AvroKeyComparator<A> the
> byte arrays will only contain the data for set A?
> How should the BinaryCompare be used differently from the base impl
> in AvroKeyComparator?
>
> Secondly, I've tried to implement a custom AvroKeyComparator, specifically
> the - compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) -
> method.
> I am woefully unaware of how exactly to do this and cannot find many
> examples on the topic.
>
> Could you write me a small sample of pseudo code perhaps ?
> Or point me to some documentation to get me on my way ?
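For readers with the same question about the raw compare signature, here is a minimal sketch of what the six arguments mean. It is not taken from Avro or Hadoop: the class name RawCompareSketch and the sample buffer are made up for illustration. The idea is that each serialized key is described by a buffer, a start offset, and a length, and the comparator should only look at that byte range. Plain unsigned byte order is an assumption chosen for simplicity; Avro itself walks the bytes field by field according to the key schema.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of Avro or Hadoop.
final class RawCompareSketch {

    /**
     * Compares the range [s1, s1 + l1) of b1 with the range [s2, s2 + l2) of b2,
     * byte by byte, treating each byte as unsigned. A shorter range that is a
     * prefix of the longer one sorts first.
     */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff; // mask to read the byte as 0..255
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(l1, l2);
    }

    public static void main(String[] args) {
        // Both keys live in one buffer, so only offsets and lengths tell them apart.
        byte[] buf = "xxappleyybanana".getBytes(StandardCharsets.US_ASCII);
        // "apple" is at offset 2, length 5; "banana" is at offset 9, length 6.
        int result = compare(buf, 2, 5, buf, 9, 6);
        System.out.println(result < 0 ? "apple sorts first" : "banana sorts first");
    }
}
```

The point of the sketch is only the contract: nothing outside the given offset and length is read, and no object is ever deserialized.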
>
>
> 2012/9/12 Jacob Metcalf <[email protected]>
>
>>  Frank
>>
>> I have spent a bit of time doing this recently but with MR2 and CDH4
>> which may not be appropriate to your use case. However assuming some
>> similarities, I suspect your problem is that you also need to override
>> compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) on
>> AvroKeyComparator.
>>
>> The advantage to Avro is that Hadoop does not need to deserialize to sort
>> in the shuffle. This function in RawComparator allows Hadoop to quickly
>> compare the bytes directly.
>>
>> Whilst this seems a bit daunting my trick to doing this in MR2 is to
>> leverage Avro's excellent support for projections - subsets of schemas. For
>> example let's say you want to "group" by attribute A but then "sort" by
>> attribute B. In this case I would use a composite key with schema {A, B}
>> and the out of the box AvroKeyComparator as the sort comparator. Then I
>> would implement my own grouping comparator which uses a schema of just {A}
>> then uses the BinaryData function to compare:
>>
>>
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
>>
>> I assume you can do something similar in MR1.
>>
>> Regards
>>
>> Jacob
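The projection idea in the message above can be sketched without any Avro or Hadoop dependency. This is an illustration under stated assumptions, not Jacob's code: the class ProjectionGroupingSketch and all of its methods are hypothetical, both A and B are assumed to be Avro longs, and A is assumed to be the leading field of the key record. The sketch encodes a key {A, B} the way Avro's binary format lays out a record of two longs (zig-zag varints, concatenated in schema order) and then compares either both fields (sorting) or only the first (grouping). In a real job, Avro's BinaryData.compare with the projected schema would take the place of the hand-written field walk.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class, not part of Avro or Hadoop.
final class ProjectionGroupingSketch {

    /** Appends one long in Avro's binary form: zig-zag, then base-128 varint. */
    static void writeLong(ByteArrayOutputStream out, long v) {
        long z = (v << 1) ^ (v >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    /** Serialized composite key {a, b}: the two fields back to back. */
    static byte[] encodeKey(long a, long b) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, a);
        writeLong(out, b);
        return out.toByteArray();
    }

    /** Decodes the long starting at pos[0] and advances pos[0] past it. */
    static long readLong(byte[] buf, int[] pos) {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos[0]++] & 0xff;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1); // undo zig-zag
    }

    /** Compares the first `fields` long fields of two serialized keys. */
    static int compareFields(byte[] b1, int s1, byte[] b2, int s2, int fields) {
        int[] p1 = {s1};
        int[] p2 = {s2};
        for (int i = 0; i < fields; i++) {
            int c = Long.compare(readLong(b1, p1), readLong(b2, p2));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    /** Sort order: the full key {a, b}. */
    static int sortCompare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return compareFields(b1, s1, b2, s2, 2);
    }

    /** Grouping: the projection {a}; the bytes of b are never read. */
    static int groupCompare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return compareFields(b1, s1, b2, s2, 1);
    }

    public static void main(String[] args) {
        byte[] k1 = encodeKey(7, 100);
        byte[] k2 = encodeKey(7, 200);
        // Same A, different B: ordered by the sort comparator, equal for grouping.
        System.out.println("sort:  " + sortCompare(k1, 0, k1.length, k2, 0, k2.length));
        System.out.println("group: " + groupCompare(k1, 0, k1.length, k2, 0, k2.length));
    }
}
```

Because the grouping comparison stops after field A, all keys with the same A reach one reduce call, while the sort comparison has already ordered them by B within that group. This only works if the projected fields come first in the key record.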
>>
>> > Subject: Secondary sort in hadoop with avro
>> > From: [email protected]
>> > Date: Tue, 11 Sep 2012 17:36:06 +0200
>> > To: [email protected]
>>
>> >
>> > I need to implement secondary sort within an Avro-based MR sequence. I
>> > however find little to no documentation or examples online.
>> > I would like to implement this by overriding the 'int
>> > compare(AvroWrapper<T> x, AvroWrapper<T> y)' method but I fail to have it
>> > invoked.
>> > Does anybody have experience implementing secondary sort on
>> > deserialised Avro objects?
>> >
>> > Some help, advice or pointers will be very much appreciated!
>>
>
>
>
> --
> Mvrgr. Frank
>



--
Mvrgr. Frank
