Hey Kidong,

The short answer is that we cheat. The class to look at for the
implementation details is:

https://github.com/apache/crunch/blob/master/crunch-spark/src/main/java/org/apache/crunch/impl/spark/collect/PGroupedTableImpl.java

...and you sort of have to walk through three different tricks we use to
make the MapReduce partitioner, sorting, and grouping classes (all of which
the secondary sort implementation relies on) work on Spark.

J

On Tue, Jun 23, 2015 at 6:57 AM, David Ortiz <[email protected]> wrote:

> Correct me if I'm wrong, but if you are using an Avro record or a Tuple
> data structure, couldn't you get a secondary sort just by putting the
> fields in the order you want the sort applied and then using the regular
> sort API? For example, if I had, say, itemid, itemprice, nosold and I
> wanted to do something like...
>
> select itemid, itemprice, sum(nosold) from table group by itemid,
> itemprice order by itemid, itemprice asc;
>
> I could implement that as...
>
> PTable<Pair<Integer, Double>, Long> items = {...some code to load the data
> into this
> structure...}.groupByKey().combineValues(Aggregators.SUM_LONGS).sort()
>
> and get something similar, right?
>
>
> On Tue, Jun 23, 2015 at 8:52 AM Kidong Lee <[email protected]> wrote:
>
>> Hi,
>>
>> I have been using Spark to implement our recommendation algorithm, for
>> which it was hard to get a secondary sort by value, so I implemented
>> that part with the help of Hive. I think Spark does not support
>> secondary sort yet.
>>
>> I have recently implemented the same recommendation algorithm in Crunch
>> running on Spark, using the Crunch secondary sort API.
>>
>> I am wondering how secondary sort is implemented in Crunch running on
>> Spark.
>>
>> Can anybody give me some explanation of the implementation of secondary
>> sort in Crunch on Spark?
>>
>> Thanks,
>>
>> - Kidong.
>>
>>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
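
A minimal sketch of the user-facing API Kidong mentions, Crunch's
org.apache.crunch.lib.SecondarySort. The field names (itemId, itemPrice,
noSold) are borrowed from David's example and are purely illustrative; the
point is that the values for each key reach the DoFn sorted by the first
element of the value pair.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.SecondarySort;
import org.apache.crunch.types.writable.Writables;

public class SecondarySortUsage {

  // items: itemId -> (itemPrice, noSold); for each itemId, the value pairs
  // arrive at the DoFn ordered by itemPrice.
  public static PCollection<String> sortedByPrice(
      PTable<Integer, Pair<Double, Long>> items) {
    return SecondarySort.sortAndApply(
        items,
        new DoFn<Pair<Integer, Iterable<Pair<Double, Long>>>, String>() {
          @Override
          public void process(Pair<Integer, Iterable<Pair<Double, Long>>> rec,
                              Emitter<String> emitter) {
            // Emit one tab-separated line per (itemId, itemPrice, noSold),
            // in price order within each itemId group.
            for (Pair<Double, Long> v : rec.second()) {
              emitter.emit(rec.first() + "\t" + v.first() + "\t" + v.second());
            }
          }
        },
        Writables.strings());
  }
}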
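
For comparison, a minimal sketch of the general secondary-sort pattern on
Spark's Java API. This is not Crunch's actual implementation (which lives in
PGroupedTableImpl, as Josh describes), and all class and field names here are
illustrative: partition by the primary key only, then sort within each
partition by the full composite key via repartitionAndSortWithinPartitions.

import java.io.Serializable;
import java.util.Comparator;

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class SecondarySortSketch {

  // Composite key: group on itemId, order by itemPrice within the group.
  public static class ItemKey implements Serializable {
    public int itemId;
    public double itemPrice;

    public ItemKey(int itemId, double itemPrice) {
      this.itemId = itemId;
      this.itemPrice = itemPrice;
    }
  }

  // Route records by the primary key only, so every record for an itemId
  // lands in the same partition regardless of its price.
  public static class PrimaryKeyPartitioner extends Partitioner {
    private final int partitions;

    public PrimaryKeyPartitioner(int partitions) {
      this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
      return partitions;
    }

    @Override
    public int getPartition(Object key) {
      int h = ((ItemKey) key).itemId;
      return (h % partitions + partitions) % partitions;
    }
  }

  // Order by the full composite key within each partition.
  public static class ItemKeyComparator
      implements Comparator<ItemKey>, Serializable {
    @Override
    public int compare(ItemKey a, ItemKey b) {
      int c = Integer.compare(a.itemId, b.itemId);
      return c != 0 ? c : Double.compare(a.itemPrice, b.itemPrice);
    }
  }

  public static JavaPairRDD<ItemKey, Long> secondarySort(
      JavaPairRDD<ItemKey, Long> input, int partitions) {
    // One shuffle: records are routed by itemId and arrive sorted by
    // (itemId, itemPrice) within each partition.
    return input.repartitionAndSortWithinPartitions(
        new PrimaryKeyPartitioner(partitions), new ItemKeyComparator());
  }
}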
