i guess i could sort by (hashcode(key), key, secondarySortColumn) and then do mapPartitions?
sorry thinking out loud a bit here. ok i think that could work. thanks On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers <[email protected]> wrote: > thats an interesting thought about orderBy and mapPartitions. i guess i > could emulate a groupBy with secondary sort using those two. however isn't > using an orderBy expensive since it is a total sort? i mean a groupBy with > secondary sort is also a total sort under the hood, but its on > (hashCode(key), secondarySortColumn) which is easier to distribute and > therefore can be implemented more efficiently. > > > > > > On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust <[email protected]> > wrote: > >> It is still unclear to me why we should remember all these tricks (or add >>> lots of extra little functions) when this elegantly can be expressed in a >>> reduce operation with a simple one line lamba function. >>> >> I think you can do that too. KeyValueGroupedDataset has a reduceGroups >> function. This probably won't be as fast though because you end up >> creating objects where as the version I gave will get codgened to operate >> on binary data the whole way though. >> >>> The same applies to these Window functions. I had to read it 3 times to >>> understand what it all means. Maybe it makes sense for someone who has been >>> forced to use such limited tools in sql for many years but that's not >>> necessary what we should aim for. Why can I not just have the sortBy and >>> then an Iterator[X] => Iterator[Y] to express what I want to do? >>> >> We also have orderBy and mapPartitions. >> >>> All these functions (rank etc.) can be trivially expressed in this, plus >>> I can add other operations if needed, instead of being locked in like this >>> Window framework. >>> >> I agree that window functions would probably not be my first choice for >> many problems, but for people coming from SQL it was a very popular >> feature. My real goal is to give as many paradigms as possible in a single >> unified framework. Let people pick the right mode of expression for any >> given job :) >> > >
