Hi Ayan,

I have a DF constructed from the following case class Event:

case class State(attr1: String, ....)

case class Event(
  userId: String,
  time: Long,
  state: State
)

I would like to generate a DF which contains the latest state of each
userId. I could first compute the latest time for each user and join it
back to the original data frame, but that involves two shuffles. Hence I
would like to see if there is a way to improve the performance.
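
To make the idea concrete, here is a minimal sketch of the single-pass
alternative I have in mind: reduce the events per user with a function that
keeps the later one, instead of computing the max time and joining back. The
object name LatestState and the helpers later and latestPerUser are
hypothetical, State is cut down to one field, and the sketch uses plain Scala
collections rather than a DataFrame, so it only illustrates the logic.

```scala
// Hypothetical stand-ins for the case classes above; State is reduced to one field.
case class State(attr1: String)
case class Event(userId: String, time: Long, state: State)

object LatestState {
  // Pairwise merge: keep the event with the larger timestamp.
  // On a tie the first argument wins, so tie results depend on evaluation order.
  def later(a: Event, b: Event): Event = if (a.time >= b.time) a else b

  // Group by user, then fold each group down to its latest event. No join is needed.
  def latestPerUser(events: Seq[Event]): Map[String, State] =
    events.groupBy(_.userId).map { case (user, es) => user -> es.reduce(later).state }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("u1", 10L, State("a")),
      Event("u1", 30L, State("c")),
      Event("u2", 20L, State("b")),
      Event("u1", 20L, State("d"))
    )
    println(latestPerUser(events))
  }
}
```

Assuming the events are available as an RDD[Event] (here called events), the
same merge function could be passed to reduceByKey, as in
events.map(e => (e.userId, e)).reduceByKey(LatestState.later), which combines
within each partition before a single shuffle. Whether something equivalent
can be expressed directly on a DataFrame is exactly what I am asking.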

Thanks.

Justin


On Fri, May 15, 2015 at 6:32 AM, ayan guha <[email protected]> wrote:

> Can you kindly elaborate on this? It should be possible to write UDAFs
> along similar lines to sum/min etc.
>
> On Fri, May 15, 2015 at 5:49 AM, Justin Yip <[email protected]>
> wrote:
>
>> Hello,
>>
>> May I know if there is a way to implement an aggregate function for
>> grouped data in a DataFrame? I dug into the docs but didn't find anything
>> apart from the UDF functions, which apply to a Row. Maybe I have missed
>> something. Thanks.
>>
>> Justin
>>
>> ------------------------------
>> View this message in context: Custom Aggregate Function for DataFrame
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Aggregate-Function-for-DataFrame-tp22893.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
