Oh okay, that makes sense. The trick is to take the max of a Tuple2 so you carry the other column along.
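A minimal sketch of the Tuple2-max trick in plain Scala (no Spark; the sample rows and object name are hypothetical): put the value being maximised first and the payload second, and lexicographic tuple ordering does the rest.

```scala
object ArgmaxTuple {
  // (transaction_date, customername) pairs for one mobileno group
  val rows = Seq((20161101, "alice"), (20161103, "bob"), (20161102, "carol"))

  // Tuples compare lexicographically, so max picks the latest date
  // and carries the matching name along with it.
  val latest: (Int, String) = rows.max   // (20161103, "bob")
}
```

In Spark the same idea applies per group: aggregate with max over a struct whose first field is the timestamp, then pull the carried columns back out.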
It is still unclear to me why we should memorize all these tricks (or add lots of extra little functions) when this can be expressed elegantly in a reduce operation with a simple one-line lambda function. The same applies to these Window functions: I had to read it three times to understand what it all means. Maybe it makes sense for someone who has been forced to use such limited tools in SQL for many years, but that's not necessarily what we should aim for. Why can I not just have the sortBy and then an Iterator[X] => Iterator[Y] to express what I want to do? All these functions (rank etc.) can be trivially expressed in this, plus I can add other operations if needed, instead of being locked into this Window framework.

On Nov 3, 2016 4:10 PM, "Michael Armbrust" <[email protected]> wrote:

You are looking to perform an *argmax*, which you can do with a single aggregation. Here is an example <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3170497669323442/2840265927289860/latest.html>.
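The reduce-with-a-one-line-lambda formulation the writer prefers can be sketched on plain Scala collections (no Spark; the Row shape and sample data are hypothetical), along with rank expressed as the Iterator[X] => Iterator[Y] transform he describes:

```scala
object ReduceArgmax {
  case class Row(mobileno: String, ts: Int, customer: String)

  val rows = Seq(
    Row("111", 1, "alice"), Row("111", 3, "bob"),
    Row("222", 2, "carol"), Row("222", 1, "dave"))

  // Argmax per group as a reduce with a one-line lambda:
  // keep whichever row has the larger timestamp.
  val latestPerGroup: Map[String, Row] =
    rows.groupBy(_.mobileno)
        .map { case (k, vs) => k -> vs.reduce((a, b) => if (a.ts >= b.ts) a else b) }

  // rank over an already-sorted group, written as Iterator[X] => Iterator[Y]:
  // zipWithIndex gives a 0-based position, so shift to 1-based rank.
  def withRank[A](it: Iterator[A]): Iterator[(A, Int)] =
    it.zipWithIndex.map { case (a, i) => (a, i + 1) }
}
```

On a Spark Dataset the analogous shape would be groupByKey followed by reduceGroups (or mapGroups for the iterator form), at the cost of losing the Catalyst optimizations that the declarative Window API keeps.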
On Thu, Nov 3, 2016 at 4:53 AM, Rabin Banerjee <[email protected]> wrote:
> Hi All,
>
> I want to do a dataframe operation to find the rows having the latest
> timestamp in each group, using the below operation:
>
> df.orderBy(desc("transaction_date")).groupBy("mobileno")
>   .agg(first("customername").as("customername"), first("service_type").as("service_type"), first("cust_addr").as("cust_abbr"))
>   .select("customername", "service_type", "mobileno", "cust_addr")
>
> *Spark Version :: 1.6.x*
>
> My question is: *"Will Spark guarantee the order while doing the groupBy, if
> the DF was ordered using orderBy previously, in Spark 1.6.x?"*
>
> *I referred to a blog here:
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/*
>
> *which claims it will work except in Spark 1.5.1 and 1.5.2.*
>
> *Could you elaborate a bit on how Spark handles this internally? Also, is it
> more efficient than using a Window function?*
>
> *Thanks in advance,*
>
> *Rabin Banerjee*
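What the Window alternative computes — row_number() over a partition ordered by descending timestamp, keeping row number 1 — can be modelled on plain Scala collections (no Spark; the Row shape and sample data are hypothetical). Unlike orderBy-then-groupBy-first, where the ordering guarantee is the whole point of the question, the Window form makes the per-group ordering explicit:

```scala
object WindowModel {
  case class Row(mobileno: String, ts: Int, customer: String)

  val rows = Seq(
    Row("111", 1, "alice"), Row("111", 3, "bob"),
    Row("222", 2, "carol"), Row("222", 1, "dave"))

  // Model of: row_number().over(Window.partitionBy("mobileno")
  //                                   .orderBy(desc("transaction_date"))) === 1
  // Sort each partition descending by timestamp, keep the head row.
  val latest: List[Row] =
    rows.groupBy(_.mobileno)
        .map { case (_, vs) => vs.sortBy(-_.ts).head }
        .toList
        .sortBy(_.mobileno)
}
```

The ordering here lives inside the per-group sort, so it cannot be disturbed by a shuffle between the sort and the aggregation, which is exactly the hazard the question is asking about.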
