Re: Why there is no top method in dataset api

2016-09-13 Thread Jakub Dubovsky
Thanks Sean, the important part of your answer for me is that orderBy + limit is doing only a "partial sort" because of the optimizer. That's what I was missing. I will give it a try... J.D. On Mon, Sep 5, 2016 at 2:26 PM, Sean Owen wrote: > No, I'm not advising you to use .rdd, just saying it

Re: Why there is no top method in dataset api

2016-09-05 Thread Sean Owen
No, I'm not advising you to use .rdd, just saying it is possible. Although I'd only use RDDs if you had a good reason to, given Datasets now, they are not gone or even deprecated. You do not need to order the whole data set to get the top element. That isn't what top does though. You might
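[Editor's sketch, not part of the original message.] To illustrate the point that finding the top elements does not require ordering the whole data set, here is a minimal plain-Scala sketch that keeps only n candidates in a bounded heap while scanning the input once. The names TopN and topN are hypothetical, and this shows the general idea only; it is not Spark's implementation.

```scala
import scala.collection.mutable

object TopN {
  // Returns the n largest elements, largest first, without sorting the whole input.
  def topN(xs: Iterator[Int], n: Int): List[Int] = {
    // With the reversed ordering the queue acts as a min-heap:
    // head is the smallest element currently kept.
    val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
    xs.foreach { x =>
      if (heap.size < n) {
        heap.enqueue(x)
      } else if (n > 0 && x > heap.head) {
        // x beats the smallest kept element, so it takes its place.
        heap.dequeue()
        heap.enqueue(x)
      }
    }
    // Only the (at most n) survivors are sorted here.
    heap.toList.sorted(Ordering[Int].reverse)
  }
}
```

For example, TopN.topN(Iterator(5, 1, 9, 3, 7), 2) keeps at most two values at any time and returns List(9, 7).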

Re: Why there is no top method in dataset api

2016-09-05 Thread Jakub Dubovsky
Thanks Sean, I was under the impression that the Spark creators are trying to persuade the user community not to use the RDD API directly. The Spark Summit I attended was full of this. So I am a bit surprised to hear use-rdd-api as advice from you. But if this is the way then I have a second question. For conver

Re: Why there is no top method in dataset api

2016-09-01 Thread Sean Owen
You can always call .rdd.top(n) of course. Although it's slightly clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an easier way. I don't think there's a strong reason other than it wasn't worth it to write this and many other utility wrappers that a) already exist on the und
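[Editor's sketch, not part of the original message.] The two options mentioned above can be put side by side as follows. This assumes Spark 2.x on the classpath and a Dataset[Int], whose single column is named "value"; the object name TopNSketch, the helper topBoth, the sample data, and the local master setting are all made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object TopNSketch {
  // Returns the top n values computed both ways so they can be compared.
  def topBoth(spark: SparkSession, data: Seq[Int], n: Int): (Seq[Int], Seq[Int]) = {
    import spark.implicits._ // provides toDS() and the $"..." column syntax

    val ds = data.toDS() // Dataset[Int] with one column called "value"

    // Option 1: drop down to the underlying RDD, which has top(n).
    val viaRdd = ds.rdd.top(n).toSeq

    // Option 2: stay in the Dataset API: sort descending, then take n.
    val viaDataset = ds.orderBy($"value".desc).take(n).toSeq

    (viaRdd, viaDataset)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("top-n-sketch").getOrCreate()
    val (viaRdd, viaDataset) = topBoth(spark, Seq(5, 1, 9, 3, 7), 2)
    println(viaRdd.mkString(", "))
    println(viaDataset.mkString(", "))
    spark.stop()
  }
}
```

Both calls return the n largest values in descending order; the second keeps the work inside the Dataset API.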