There has been some investigation into Crunch on Tez [1]. I don't believe anyone is actively working on it at the moment, but we'd love patches if someone has the time to work on it.
[1] - https://issues.apache.org/jira/browse/CRUNCH-441

On Wed, Jun 24, 2015 at 8:20 AM, Kidong Lee <[email protected]> wrote:

> Thanks for your reply, your answer is very helpful for understanding it.
>
> I have another question: is there any plan to support Tez as an engine
> that Crunch can run on?
>
> - Kidong.
>
>
> 2015-06-23 23:06 GMT+09:00 Josh Wills <[email protected]>:
>
>> Hey Kidong,
>>
>> The short answer is that we cheat. The class to look at for the
>> implementation details is:
>>
>> https://github.com/apache/crunch/blob/master/crunch-spark/src/main/java/org/apache/crunch/impl/spark/collect/PGroupedTableImpl.java
>>
>> ...and you sort of have to walk through three different tricks we do to
>> make the MapReduce partitioners, sorting classes, and grouping classes
>> (all of which we use in the secondary sort implementation) work on Spark.
>>
>> J
>>
>> On Tue, Jun 23, 2015 at 6:57 AM, David Ortiz <[email protected]> wrote:
>>
>>> Correct me if I'm wrong, but if you are using an Avro record or a Tuple
>>> data structure, couldn't you get a secondary sort just by putting the
>>> fields in the order you want the sort applied and then using the regular
>>> sort API? For example, if I had, say, itemid, itemprice, and nosold, and
>>> I wanted to do something like
>>>
>>> select itemid, itemprice, sum(nosold) from table group by itemid,
>>> itemprice order by itemid, itemprice asc;
>>>
>>> I could implement that as
>>>
>>> PTable<Pair<Integer, Double>, Long> items = {...some code to load the
>>> data into this
>>> structure...}.groupByKey().combineValues(Aggregators.SUM_LONGS).sort()
>>>
>>> and get something similar, right?
>>>
>>> On Tue, Jun 23, 2015 at 8:52 AM Kidong Lee <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have been using Spark to implement our recommendation algorithm, for
>>>> which it was hard to get a secondary sort by value, so I implemented
>>>> that part with the help of Hive. I think Spark does not support
>>>> secondary sort yet.
>>>>
>>>> I have recently implemented the same recommendation algorithm in
>>>> Crunch running on Spark, using Crunch's secondary sort API.
>>>>
>>>> I am wondering how secondary sort is implemented in Crunch running on
>>>> Spark. Can anybody give me an explanation of the implementation of
>>>> secondary sort in Crunch on Spark?
>>>>
>>>> Thanks,
>>>>
>>>> - Kidong.
>>>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
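
For anyone landing on this thread later: the secondary sort API Kidong mentions is org.apache.crunch.lib.SecondarySort. Below is a minimal sketch of how it is typically used, reusing the itemid/itemprice/nosold fields from David's example; the DoFn body, field names, and PTypes are illustrative, and the sortAndApply signature is written from memory rather than taken from the thread.

  import org.apache.crunch.DoFn;
  import org.apache.crunch.Emitter;
  import org.apache.crunch.PCollection;
  import org.apache.crunch.PTable;
  import org.apache.crunch.Pair;
  import org.apache.crunch.lib.SecondarySort;
  import org.apache.crunch.types.writable.Writables;

  public class SecondarySortSketch {
    // Group by itemid; within each group, the (price, count) value pairs
    // arrive ordered by the first element of the pair (the price), which is
    // the "secondary" part of the sort.
    public static PCollection<String> pricesPerItem(
        PTable<Integer, Pair<Double, Long>> sales) {
      return SecondarySort.sortAndApply(
          sales,
          new DoFn<Pair<Integer, Iterable<Pair<Double, Long>>>, String>() {
            @Override
            public void process(Pair<Integer, Iterable<Pair<Double, Long>>> group,
                                Emitter<String> emitter) {
              for (Pair<Double, Long> priceAndCount : group.second()) {
                emitter.emit(group.first() + "\t" + priceAndCount.first()
                    + "\t" + priceAndCount.second());
              }
            }
          },
          Writables.strings());
    }
  }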
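
A slightly more concrete version of the composite-key route David describes: his .sort() call would go through the Sort library rather than a method on PTable, so, assuming org.apache.crunch.lib.Sort.sort orders a PTable by its key (here the Pair of itemid and itemprice), the stage looks roughly like the sketch below. Field names and types are again illustrative.

  import org.apache.crunch.PTable;
  import org.apache.crunch.Pair;
  import org.apache.crunch.fn.Aggregators;
  import org.apache.crunch.lib.Sort;

  public class CompositeKeySketch {
    // Roughly: select itemid, itemprice, sum(nosold)
    //          group by itemid, itemprice order by itemid, itemprice asc
    public static PTable<Pair<Integer, Double>, Long> sumAndSort(
        PTable<Pair<Integer, Double>, Long> items) {
      PTable<Pair<Integer, Double>, Long> summed =
          items.groupByKey().combineValues(Aggregators.SUM_LONGS());
      // Sorting by the composite key gives the (itemid, itemprice) ordering.
      return Sort.sort(summed);
    }
  }

Note that this orders the whole table by the composite key, which matches the SQL above, but it is a total sort rather than a secondary sort of the values within each itemid group, which is what SecondarySort gives you.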

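As for the "tricks" Josh points at in PGroupedTableImpl: the general idea for getting MapReduce-style secondary sort semantics onto Spark is to partition on the natural key only while sorting on the full composite key. The sketch below is not Crunch's actual code, just a hand-rolled illustration of that pattern using Spark's repartitionAndSortWithinPartitions; the class names and key layout are made up for the example.

  import java.io.Serializable;
  import java.util.Comparator;

  import org.apache.spark.Partitioner;
  import org.apache.spark.api.java.JavaPairRDD;

  import scala.Tuple2;

  public class SparkSecondarySortSketch {

    // Partition only on the natural key (itemid) so that every record for an
    // item lands in the same partition, regardless of the secondary field.
    static class NaturalKeyPartitioner extends Partitioner {
      private final int numPartitions;

      NaturalKeyPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
      }

      @Override
      public int numPartitions() {
        return numPartitions;
      }

      @Override
      public int getPartition(Object key) {
        @SuppressWarnings("unchecked")
        Tuple2<Integer, Double> composite = (Tuple2<Integer, Double>) key;
        return (composite._1().hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // Compare on (itemid, itemprice) so that, within a partition, records come
    // out grouped by item and ordered by price.
    static class CompositeKeyComparator
        implements Comparator<Tuple2<Integer, Double>>, Serializable {
      @Override
      public int compare(Tuple2<Integer, Double> a, Tuple2<Integer, Double> b) {
        int byItem = a._1().compareTo(b._1());
        return byItem != 0 ? byItem : a._2().compareTo(b._2());
      }
    }

    public static JavaPairRDD<Tuple2<Integer, Double>, Long> secondarySort(
        JavaPairRDD<Tuple2<Integer, Double>, Long> sales, int partitions) {
      return sales.repartitionAndSortWithinPartitions(
          new NaturalKeyPartitioner(partitions), new CompositeKeyComparator());
    }
  }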