Thanks Sean,

I was under the impression that the Spark creators are trying to persuade the user community not to use the RDD API directly. The Spark Summit I attended was full of this, so I am a bit surprised to hear "use the RDD API" as advice from you.

But if this is the way, then I have a second question. To convert a Dataset to an RDD I would use the Dataset.rdd lazy val. Since it is a lazy val, it suggests that some computation goes on to create the RDD as a copy. How computationally expensive is this conversion? If there is significant overhead, then it is clear why one would want a top method directly on the Dataset class.
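To make concrete what a top method saves compared with a full sort, here is a minimal plain-Scala sketch (no Spark dependency) of the bounded-heap idea: keep at most n elements per partition, then merge the per-partition results. The names TopNSketch and topN and the two sample "partitions" are hypothetical, and this is only an illustration of the technique, not Spark's actual implementation.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep the n largest elements under `ord` using a bounded heap.
  // For m items this needs O(m log n) comparisons instead of the
  // O(m log m) of a full sort, which matters when comparing is costly.
  def topN[A](items: Iterator[A], n: Int)(implicit ord: Ordering[A]): List[A] = {
    if (n <= 0) Nil
    else {
      // PriorityQueue keeps its maximum at the head, so with the reversed
      // ordering the head is the smallest element currently kept.
      val heap = new mutable.PriorityQueue[A]()(ord.reverse)
      items.foreach { x =>
        if (heap.size < n) heap.enqueue(x)
        else if (ord.gt(x, heap.head)) {
          heap.dequeue() // drop the smallest kept element
          heap.enqueue(x)
        }
      }
      // Only the n survivors are sorted, largest first.
      heap.toList.sorted(ord.reverse)
    }
  }

  def main(args: Array[String]): Unit = {
    // Two hypothetical "partitions" standing in for distributed data.
    val partitions = Seq(Seq(5, 1, 9, 3), Seq(7, 8, 2, 6))
    val perPartition = partitions.map(p => topN(p.iterator, 3))
    // Merge step: the global top 3 must be among the per-partition top 3s.
    val merged = topN(perPartition.flatten.iterator, 3)
    println(merged) // List(9, 8, 7)
  }
}
```

With Spark itself, the equivalent call discussed in this thread would be ds.rdd.top(n) with an implicit Ordering in scope.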
Ordering the whole dataset only to take the top 10 or so records is not really an acceptable option for us. The comparison function can be expensive and the dataset is (unsurprisingly) big.

To be honest, I do not really understand what you mean by b). Since DataFrame is now only an alias for Dataset[Row], what do you mean by "DataFrame-like counterpart"?

Thanks

On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen <so...@cloudera.com> wrote:

> You can always call .rdd.top(n) of course. Although it's slightly
> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
> easier way.
>
> I don't think there's a strong reason other than it wasn't worth it
> to write this and many other utility wrappers that a) already exist on
> the underlying RDD API if you want them, and b) have a DataFrame-like
> counterpart already that doesn't really need wrapping in a different
> API.
>
> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
> <spark.dubovsky.ja...@gmail.com> wrote:
> > Hey all,
> >
> > in the RDD API there is a very useful method called top. It finds the
> > top n records according to a certain ordering without sorting all
> > records. Very useful!
> >
> > There is no top method nor similar functionality in the Dataset API.
> > Has anybody any clue why? Is there any specific reason for this?
> >
> > Any thoughts?
> >
> > thanks
> >
> > Jakub D.