Re: Secondary sort in hadoop with avro

Frank Kootte Tue, 11 Sep 2012 23:52:01 -0700

Very interesting concept you mention there: Avro projections!
This does sound like a clever way to leverage Avro's ability to compare
serialized records without deserialisation, which will obviously be beneficial.
As with a lot of Avro-related Hadoop topics, I am not able to find a
clear example, but based on what I did manage to find I would like to get your
feedback on my questions:


Does Avro projection involve defining a secondary schema describing only
the desired subset of fields?
Does this then imply that when I define my own AvroKeyComparator<A>, the
byte arrays will only contain the data for the subset A?
How should the BinaryData compare function be used differently from the base
implementation in AvroKeyComparator?

Secondly, I've tried to implement a custom AvroKeyComparator, specifically the
compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method.
I am woefully unaware of how exactly to do this and cannot find many
examples on the topic.

Could you perhaps write me a small sample of pseudocode?
Or point me to some documentation to get me on my way?
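
To make the question concrete, this is roughly how I picture the idea. It is
only a sketch under my own assumptions: it uses no Avro or Hadoop classes, the
class name ProjectionCompareSketch and the two-int key layout are made up, and
the encoding is a stand-in rather than Avro's real binary format. The only
thing it borrows is the shape of the raw compare method. With Avro I assume
the field reading would be replaced by a call into BinaryData with either the
full {A, B} schema or the projected {A} schema.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for a serialized composite key {A, B}: two big-endian ints, A first.
// This is NOT Avro's binary encoding; it only mimics "fields written in schema order".
final class ProjectionCompareSketch {

    static final int FIELD_BYTES = Integer.BYTES;

    /** Serialize a key {a, b} as 8 bytes: a, then b. */
    static byte[] encode(int a, int b) {
        return ByteBuffer.allocate(2 * FIELD_BYTES).putInt(a).putInt(b).array();
    }

    /** Read one int field at the given offset without building a key object. */
    static int readInt(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf, offset, FIELD_BYTES).getInt();
    }

    /** Grouping comparison: plays the role of the projected schema {A}.
     *  It reads only the leading field and never looks at B. */
    static int compareGroup(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    /** Sort comparison: plays the role of the full schema {A, B}.
     *  Orders by A first, then by B within equal A. */
    static int compareFull(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int byA = compareGroup(b1, s1, l1, b2, s2, l2);
        if (byA != 0) {
            return byA;
        }
        return Integer.compare(readInt(b1, s1 + FIELD_BYTES), readInt(b2, s2 + FIELD_BYTES));
    }

    public static void main(String[] args) {
        List<byte[]> keys = new ArrayList<>(
                Arrays.asList(encode(2, 1), encode(1, 9), encode(1, 3)));

        // "Shuffle sort": order the raw keys by the full comparison.
        keys.sort((x, y) -> compareFull(x, 0, x.length, y, 0, y.length));
        for (byte[] k : keys) {
            System.out.println(readInt(k, 0) + "," + readInt(k, FIELD_BYTES));
        }

        // "Grouping": the first two keys share A, so they fall into one group.
        byte[] first = keys.get(0);
        byte[] second = keys.get(1);
        boolean sameGroup =
                compareGroup(first, 0, first.length, second, 0, second.length) == 0;
        System.out.println("first two keys in same group: " + sameGroup);
    }
}
```

If I understand the suggestion correctly, the byte arrays passed to the
grouping comparator would still hold the full {A, B} key, and the projected
schema just makes the comparison stop after A. Please correct me if that
reading is wrong.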


2012/9/12 Jacob Metcalf <[email protected]>

>  Frank
>
> I have spent a bit of time doing this recently but with MR2 and CDH4 which
> may not be appropriate to your use case. However assuming some
> similarities, I suspect your problem is that you also need to override
> compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) on
> AvroKeyComparator.
>
> The advantage to Avro is that Hadoop does not need to deserialize to sort
> in the shuffle. This function in RawComparator allows Hadoop to quickly
> compare the bytes directly.
>
> Whilst this seems a bit daunting my trick to doing this in MR2 is to
> leverage Avro's excellent support for projections - subsets of schemas. For
> example let's say you want to "group" by attribute A but then "sort" by
> attribute B. In this case I would use a composite key with schema {A, B}
> and the out of the box AvroKeyComparator as the sort comparator. Then I
> would implement my own grouping comparator which uses a schema of just {A}
> then uses the BinaryData function to compare:
>
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
>
> I assume you can do something similar in MR1.
>
> Regards
>
> Jacob
>
> > Subject: Secondary sort in hadoop with avro
> > From: [email protected]
> > Date: Tue, 11 Sep 2012 17:36:06 +0200
> > To: [email protected]
>
> >
> > I need to implement secondary sort within an Avro-based MR sequence.
> > However, I find little to no documentation or examples online.
> > I would like to implement this by overriding the 'int
> > compare(AvroWrapper<T> x, AvroWrapper<T> y)' method, but I fail to have it
> > invoked.
> > Does anybody have experience implementing secondary sort on deserialised
> > Avro objects?
> >
> > Some help, advice or pointers will be very much appreciated!
>



-- 
Mvrgr. Frank
