What you require is a secondary sort, which is not available as such for a DataFrame. The Window operator is what comes closest, but it is strangely limited in its abilities (probably because it was inspired by a SQL construct rather than a more generic programmatic transformation capability).
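For illustration, a window-based version of the "latest row per group" query from the quoted message could look roughly like this (a sketch only, using the column names from the question and assuming a `df` is already in scope; `row_number()` is available in the DataFrame functions from Spark 1.6 onward):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, desc}

// Partition by the grouping key and order each partition by timestamp,
// newest first, so the latest transaction gets row number 1.
val w = Window.partitionBy("mobileno").orderBy(desc("transaction_date"))

val latestPerGroup = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)   // keep only the newest row in each group
  .drop("rn")
  .select("customername", "service_type", "mobileno", "cust_addr")
```

Unlike orderBy-then-groupBy, this does not rely on any (unguaranteed) ordering surviving the aggregation: the ordering is part of the window specification itself, so the result is well-defined regardless of how the planner shuffles the data.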
On Nov 3, 2016 7:53 AM, "Rabin Banerjee" <[email protected]> wrote:
> Hi All,
>
> I want to do a dataframe operation to find the rows having the latest
> timestamp in each group using the below operation
>
> df.orderBy(desc("transaction_date")).groupBy("mobileno")
>   .agg(first("customername").as("customername"), first("service_type").as("service_type"), first("cust_addr").as("cust_abbr"))
>   .select("customername", "service_type", "mobileno", "cust_addr")
>
> Spark Version :: 1.6.x
>
> My Question is: "Will Spark guarantee the order while doing the groupBy, if
> the DF is ordered using orderBy previously, in Spark 1.6.x"?
>
> I referred to a blog here:
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>
> which claims it will work except in Spark 1.5.1 and 1.5.2.
>
> I need a bit of elaboration on how Spark handles it internally. Also, is it
> more efficient than using a Window function?
>
> Thanks in advance,
>
> Rabin Banerjee
