Thanks for your reply; your answer helped me understand it. I have another question: is there any plan to support Tez as an engine on which Crunch can run?
- Kidong.

2015-06-23 23:06 GMT+09:00 Josh Wills <[email protected]>:

> Hey Kidong,
>
> The short answer is that we cheat. The class to look at for the
> implementation details is:
>
> https://github.com/apache/crunch/blob/master/crunch-spark/src/main/java/org/apache/crunch/impl/spark/collect/PGroupedTableImpl.java
>
> ...and you sort of have to walk through three different tricks we do to
> make MapReduce partitioners, sorting classes, and grouping classes-- all of
> which we use in the secondary sort implementation-- to work on Spark.
>
> J
>
> On Tue, Jun 23, 2015 at 6:57 AM, David Ortiz <[email protected]> wrote:
>
>> Correct me if I'm wrong, but if you are using an avro record or a Tuple
>> data structure, couldn't you get a secondary sort by just sticking the
>> fields in the order you want to apply the sort, and then using the regular
>> sort api? For example, if I had say, itemid, itemprice, nosold and I
>> wanted to do something like....
>>
>> select itemid, itemprice, sum(nosold) from table group by itemid,
>> itemprice, order by itemid, itemprice asc;
>>
>> I could implement that as...
>> PTable<Pair<Integer, Double>, Long> items = {...some code to load the
>> data into this
>> structure...}.groupByKey().combineValues(Aggregators.SUM_LONGS).sort() and
>> get something similar right?
>>
>>
>> On Tue, Jun 23, 2015 at 8:52 AM Kidong Lee <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have been using spark to implement our recommendation algorithm, for
>>> which it was hard to get secondary sort by value, thus, I have implemented
>>> this algorithm with the help of hive.
>>> I think, spark does not support secondary sort yet.
>>>
>>> I have recently implemented the same recommendation algorithm in crunch
>>> running on spark with using crunch secondary sort API.
>>>
>>> I am wondering how to implement secondary sort in crunch running on
>>> spark.
>>>
>>> Anybody can give me some explanations about the implementation of
>>> secondary sort in crunch spark?
>>>
>>> thanks,
>>>
>>> - Kidong.
>>>
>>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
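
For anyone reading this thread later, here is a rough sketch of the composite-key idea David describes above: group on (itemid, itemprice), sum nosold, and read the results back ordered by itemid and then itemprice. It is plain Java with no Crunch or Spark dependency, so it only illustrates the ordering logic and says nothing about how Crunch implements secondary sort. The names CompositeKeySortSketch, Sale, sumSold and BY_ID_THEN_PRICE are made up for this example.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Crunch API.
class CompositeKeySortSketch {

    // One input row: itemid, itemprice, nosold (field names from the thread).
    static final class Sale {
        final int itemId;
        final double itemPrice;
        final long noSold;

        Sale(int itemId, double itemPrice, long noSold) {
            this.itemId = itemId;
            this.itemPrice = itemPrice;
            this.noSold = noSold;
        }
    }

    // Orders composite keys by itemid first, then by itemprice, both ascending.
    static final Comparator<Map.Entry<Integer, Double>> BY_ID_THEN_PRICE =
        Map.Entry.<Integer, Double>comparingByKey()
            .thenComparing(Map.Entry.<Integer, Double>comparingByValue());

    // Sums nosold per (itemid, itemprice); the TreeMap keeps keys in sorted order.
    static TreeMap<Map.Entry<Integer, Double>, Long> sumSold(List<Sale> sales) {
        TreeMap<Map.Entry<Integer, Double>, Long> totals = new TreeMap<>(BY_ID_THEN_PRICE);
        for (Sale s : sales) {
            Map.Entry<Integer, Double> key = new SimpleImmutableEntry<>(s.itemId, s.itemPrice);
            totals.merge(key, s.noSold, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
            new Sale(2, 9.99, 3L),
            new Sale(1, 5.00, 1L),
            new Sale(1, 4.50, 2L),
            new Sale(1, 5.00, 4L));
        // Prints one line per (itemid, itemprice) pair in key order.
        sumSold(sales).forEach((key, total) ->
            System.out.println(key.getKey() + "\t" + key.getValue() + "\t" + total));
    }
}
```

Note that this orders the aggregated keys themselves. A secondary sort in the MapReduce sense, where the values inside each key group arrive in a chosen order, is a different guarantee, which is why Josh points at the partitioner, sorting, and grouping classes.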
