hi Anton:

I checked the docs you mentioned and wrote code accordingly, but I hit an exception: "org.apache.spark.sql.AnalysisException: Window function row_number does not take a frame specification.;" By my understanding, the row_number API seems to give a global row number to every row across all frames. If that is wrong, please correct me. I checked all the window function APIs at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$, and row_number() seemed to be the only match, though I am not quite sure about it. Here is my code:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hivetable = hc.sql("select * from house_sale_pv_location")
val overLocation = Window.partitionBy(hivetable.col("location")).rowsBetween(0, 49)
val sortedDF = hivetable.withColumn("rowNumber", row_number().over(overLocation))
sortedDF.registerTempTable("sortedDF")
val top50 = hc.sql("select id,location from sortedDF where rowNumber<=50")
top50.registerTempTable("top50")
hc.sql("select * from top50 where location=30").collect.foreach(println)
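If I read the error correctly, a minimal sketch of the fix would be to drop the rowsBetween frame and give the window an explicit ordering, then filter on the row number. Ordering by pv descending is my assumption, since the table is already sorted that way; this is a sketch of what I plan to try, not a verified solution:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

// No rowsBetween here: row_number() rejects any frame specification.
// Ordering by pv desc is an assumption, matching how the table is pre-sorted.
val byLocation = Window.partitionBy("location").orderBy(desc("pv"))

// Number rows 1..N within each location partition, then keep the first 50 per group.
val top50PerLocation = hivetable
  .withColumn("rowNumber", row_number().over(byLocation))
  .filter("rowNumber <= 50")
  .select("id", "location")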
For context, hivetable is the DF I mentioned, with 3 columns "id, pv, location", already sorted by pv in descending order; that is why I didn't call orderBy in the 3rd line of my code. I just want the first 50 rows, by physical position, of each frame.

To Tal: I tried the rank API, but it is not what I want, because rows with the same pv value get the same rank. The first 50 rows of each frame is what I'm expecting; a toy example of the tie behaviour is sketched at the bottom of this mail. The attached file shows what I got by using rank. Thank you anyway, I learnt what rank can provide from your advice.

--------------------------------
Thanks & Best regards!
San.Luo

----- Original Message -----
From: Anton Okolnychyi <anton.okolnyc...@gmail.com>
To: user <user@spark.apache.org>
Cc: luohui20...@sina.com
Subject: Re: how to select first 50 value of each group after group by?
Date: 2016-07-06 23:22

The following resources should be useful:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html
The last link should have the exact solution.

2016-07-06 16:55 GMT+02:00 Tal Grynbaum <tal.grynb...@gmail.com>:

You can use the rank window function to rank each row in the group, and then filter the rows with rank <= 50.

On Wed, Jul 6, 2016, 14:07 <luohui20...@sina.com> wrote:

hi there,
I have a DF with 3 columns: id, pv, location. (The rows are already grouped by location and sorted by pv in descending order.) I want to get the first 50 id values of each group after grouping by location. I checked the APIs of DataFrame, GroupedData, and PairRDD, and found no match. Is there a way to do this naturally? Any info will be appreciated.

--------------------------------
Thanks & Best regards!
San.Luo
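As referenced above, here is a minimal, self-contained sketch of the tie behaviour I ran into, reusing the hc HiveContext from my first mail; the ids and pv values are made up, only for illustration:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, rank, row_number}
import hc.implicits._

// Hypothetical rows for one location; ids 2 and 3 tie on pv.
val sample = sc.parallelize(Seq(
  (1, 100, 30), (2, 90, 30), (3, 90, 30), (4, 80, 30)
)).toDF("id", "pv", "location")

val w = Window.partitionBy("location").orderBy(desc("pv"))

// rank() gives 1, 2, 2, 4 here: ties share a rank and the next rank is skipped,
// so "rank <= 50" may keep more or fewer than 50 rows per group.
// row_number() gives strictly 1, 2, 3, 4, which is the "first 50" I need.
sample.select($"id", $"pv",
  rank().over(w).as("rank"),
  row_number().over(w).as("rowNumber")
).show()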