Hi Pedro,

I could not think of a way to do this with an aggregate, but it's possible
with a window function, partitioned on user and ordered by time:

// Assuming "df" holds your DataFrame ...

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // for the $"..." column syntax (sqlContext.implicits._ on Spark 1.x)

val wSpec = Window.partitionBy("user").orderBy("time")

// Rank each user's events by time and keep only the earliest;
// add any other event columns you need to the select.
df.select($"user", $"time", rank().over(wSpec).as("rank"))
  .where($"rank" === 1)

Xinh

On Fri, Jul 8, 2016 at 12:57 PM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> Is there a way, on a GroupedData (from groupBy on a DataFrame), to have an
> aggregate that returns column A based on the min of column B? For example, I
> have a list of sites visited by a given user, and I would like to find the
> event with the minimum time (the first event).
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
