Thanks, Silvio. If we write
    schemaRDD
      .map(row => (key, row))                 // key extraction (pseudocode)
      .groupByKey()
      .map { case (key, rows) => rows.head }  // take the first row from Iterable[Row]

we get an RDD[Row]; however, we need a SchemaRDD for the following query. In our case, a row has about 80 columns, which exceeds the 22-field case class limit, so we cannot convert it back the usual way.

2014-08-21 21:05 GMT+08:00 Silvio Fiorito <silvio.fior...@granturing.com>:

> Yeah, unfortunately Spark SQL is missing a lot of the nice analytical
> functions in Hive. But using a combination of SQL and Spark operations,
> you should be able to run the basic SQL, then do a groupBy on the
> SchemaRDD, and for each group just take the first record.
>
> From: Fengyun RAO <raofeng...@gmail.com>
> Date: Thursday, August 21, 2014 at 8:26 AM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: [Spark SQL] How to select first row in each GROUP BY group?
>
> Could anybody help? I googled and read a lot, but didn't find anything
> helpful.
>
> Or, to make the question simpler: how do we assign a row number to each
> row within its group?
>
> SELECT a,
>        ROW_NUMBER() OVER (PARTITION BY a) AS num
> FROM table
>
> 2014-08-20 15:52 GMT+08:00 Fengyun RAO <raofeng...@gmail.com>:
>
>> I have a table with 4 columns: a, b, c, time.
>>
>> What I need is something like:
>>
>> SELECT a, b, GroupFirst(c)
>> FROM t
>> GROUP BY a, b
>>
>> GroupFirst means "the first" item of column c in each group, and by
>> "the first" I mean the row with the minimal "time" in that group.
>>
>> In Oracle/SQL Server, we could write:
>>
>> WITH summary AS (
>>   SELECT a, b, c,
>>          ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY time) AS num
>>   FROM t)
>> SELECT s.*
>> FROM summary s
>> WHERE s.num = 1
>>
>> but in Spark SQL there is no such thing as ROW_NUMBER(), and I wonder
>> how to achieve the same result.
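One way to express GroupFirst without the window function is plain RDD operations. The following is only a sketch, assuming the columns (a, b, c, time) sit at indices 0..3 and that time is stored as a Long:

    import org.apache.spark.SparkContext._  // pair RDD functions (reduceByKey)

    // Keep only the earliest row per (a, b) group.
    val firstPerGroup = schemaRDD
      .map(row => ((row(0), row(1)), row))              // key by (a, b)
      .reduceByKey { (r1, r2) =>
        if (r1.getLong(3) <= r2.getLong(3)) r1 else r2  // row with minimal time wins
      }
      .map { case (_, row) => row }                     // back to RDD[Row]

Unlike groupByKey, reduceByKey keeps only the current minimum per key, so it never has to buffer a whole group in memory.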
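To then recover a SchemaRDD for the following query without going through a case class, one option (again a sketch, assuming Spark 1.1's applySchema and an existing sqlContext) is to reattach the original schema, since the rows themselves were not restructured:

    // Reuse the original 80-column schema directly;
    // no 22-field-limited case class is needed.
    val result = sqlContext.applySchema(firstPerGroup, schemaRDD.schema)
    result.registerTempTable("summary")  // queryable from Spark SQL again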