Hey Kidong,

The short answer is that we cheat. The class to look at for the
implementation details is:

https://github.com/apache/crunch/blob/master/crunch-spark/src/main/java/org/apache/crunch/impl/spark/collect/PGroupedTableImpl.java

...and you sort of have to walk through three different tricks we use to
make the MapReduce partitioner, sorting, and grouping classes (all of which
the secondary sort implementation relies on) work on Spark.

J

On Tue, Jun 23, 2015 at 6:57 AM, David Ortiz <[email protected]> wrote:

> Correct me if I'm wrong, but if you are using an Avro record or a Tuple
> data structure, couldn't you get a secondary sort just by putting the
> fields in the order you want the sort applied and then using the regular
> sort API? For example, if I had, say, itemid, itemprice, nosold and I
> wanted to do something like...
>
> select itemid, itemprice, sum(nosold) from table group by itemid,
> itemprice order by itemid, itemprice asc;
>
> I could implement that as...
>
> PTable<Pair<Integer, Double>, Long> items = {...some code to load the data
> into this
> structure...}.groupByKey().combineValues(Aggregators.SUM_LONGS).sort()
>
> and get something similar, right?
>
>
> On Tue, Jun 23, 2015 at 8:52 AM Kidong Lee <[email protected]> wrote:
>
>> Hi,
>>
>> I have been using Spark to implement our recommendation algorithm, for
>> which it was hard to get a secondary sort by value, so I implemented
>> that part with the help of Hive. I think Spark does not support
>> secondary sort yet.
>>
>> I have recently implemented the same recommendation algorithm in Crunch
>> running on Spark, using the Crunch secondary sort API.
>>
>> I am wondering how secondary sort is implemented in Crunch running on
>> Spark.
>>
>> Can anybody give me some explanation of the implementation of secondary
>> sort in Crunch on Spark?
>>
>> Thanks,
>>
>> - Kidong.
>>
>>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
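
A minimal sketch of the user-facing API Kidong mentions, Crunch's
org.apache.crunch.lib.SecondarySort. The field names (itemId, itemPrice,
noSold) are borrowed from David's example and are purely illustrative; the
point is that the values for each key reach the DoFn sorted by the first
element of the value pair.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.SecondarySort;
import org.apache.crunch.types.writable.Writables;

public class SecondarySortUsage {

  // items: itemId -> (itemPrice, noSold); for each itemId, the value pairs
  // arrive at the DoFn ordered by itemPrice.
  public static PCollection<String> sortedByPrice(
      PTable<Integer, Pair<Double, Long>> items) {
    return SecondarySort.sortAndApply(
        items,
        new DoFn<Pair<Integer, Iterable<Pair<Double, Long>>>, String>() {
          @Override
          public void process(Pair<Integer, Iterable<Pair<Double, Long>>> rec,
                              Emitter<String> emitter) {
            // Emit one tab-separated line per (itemId, itemPrice, noSold),
            // in price order within each itemId group.
            for (Pair<Double, Long> v : rec.second()) {
              emitter.emit(rec.first() + "\t" + v.first() + "\t" + v.second());
            }
          }
        },
        Writables.strings());
  }
}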
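
For comparison, a minimal sketch of the general secondary-sort pattern on
Spark's Java API. This is not Crunch's actual implementation (which lives in
PGroupedTableImpl, as Josh describes), and all class and field names here are
illustrative: partition by the primary key only, then sort within each
partition by the full composite key via repartitionAndSortWithinPartitions.

import java.io.Serializable;
import java.util.Comparator;

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class SecondarySortSketch {

  // Composite key: group on itemId, order by itemPrice within the group.
  public static class ItemKey implements Serializable {
    public int itemId;
    public double itemPrice;

    public ItemKey(int itemId, double itemPrice) {
      this.itemId = itemId;
      this.itemPrice = itemPrice;
    }
  }

  // Route records by the primary key only, so every record for an itemId
  // lands in the same partition regardless of its price.
  public static class PrimaryKeyPartitioner extends Partitioner {
    private final int partitions;

    public PrimaryKeyPartitioner(int partitions) {
      this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
      return partitions;
    }

    @Override
    public int getPartition(Object key) {
      int h = ((ItemKey) key).itemId;
      return (h % partitions + partitions) % partitions;
    }
  }

  // Order by the full composite key within each partition.
  public static class ItemKeyComparator
      implements Comparator<ItemKey>, Serializable {
    @Override
    public int compare(ItemKey a, ItemKey b) {
      int c = Integer.compare(a.itemId, b.itemId);
      return c != 0 ? c : Double.compare(a.itemPrice, b.itemPrice);
    }
  }

  public static JavaPairRDD<ItemKey, Long> secondarySort(
      JavaPairRDD<ItemKey, Long> input, int partitions) {
    // One shuffle: records are routed by itemId and arrive sorted by
    // (itemId, itemPrice) within each partition.
    return input.repartitionAndSortWithinPartitions(
        new PrimaryKeyPartitioner(partitions), new ItemKeyComparator());
  }
}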
