You can do what's called an *argmax/argmin*, where you take the min/max of
several columns that have been packed together into a struct. Structs compare
field by field in order, so you can put the timestamp first.

Here is an example:
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3170497669323442/2840265927289860/latest.html>
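
For instance, a minimal sketch of that trick, assuming a DataFrame df with
columns "user", "time", and "site" (the names are placeholders, not taken
from the notebook):

import org.apache.spark.sql.functions._

// Pack the columns into a struct with the timestamp first; min() compares
// structs field by field, so this keeps each user's earliest event.
val firstEvents = df
  .groupBy(col("user"))
  .agg(min(struct(col("time"), col("site"))).as("first"))
  .select(col("user"), col("first.time"), col("first.site"))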

On Sat, Jul 9, 2016 at 6:10 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> I implemented a more generic version which I posted here:
> https://gist.github.com/EntilZha/3951769a011389fef25e930258c20a2a
>
> I think I could generalize this by pattern matching on DataType to call the
> appropriate getLong/getDouble/etc. function (I'm not trying to use getAs[T]
> because getting T from Array[T] seems hard).
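>
> Roughly what I mean by pattern matching on DataType (a hypothetical helper,
> not the gist's actual code):
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> // Pick the typed getter that matches the column's DataType
> def extract(row: Row, i: Int, dt: DataType): Any = dt match {
>   case LongType      => row.getLong(i)
>   case DoubleType    => row.getDouble(i)
>   case StringType    => row.getString(i)
>   case TimestampType => row.getTimestamp(i)
>   case _             => row.get(i)
> }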
>
> Is there a way to go further and make the arguments unnecessary or
> inferable at runtime, particularly for the valueType, since it doesn't
> matter what it is? DataType is abstract so I can't instantiate it; is there
> a way to define the method so that it pulls the type from the user's input
> at runtime?
>
> Thanks,
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
>
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn
> <https://www.linkedin.com/in/pedrorodriguezscience>
>
> On July 9, 2016 at 1:33:18 AM, Pedro Rodriguez (ski.rodrig...@gmail.com)
> wrote:
>
> Hi Xinh,
>
> A co-worker also found that solution, but I thought it was possibly
> overkill/brittle, so I looked into UDAFs (user-defined aggregate functions).
> I don't have code, but Databricks has a post with an example:
> https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html.
> From that, I was able to write a MinLongByTimestamp function, but was having
> a hard time writing a generic aggregate of any column ordered by an
> orderable column.
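>
> For reference, a rough sketch of what that MinLongByTimestamp looked like
> (reconstructed, not the exact code), using the UserDefinedAggregateFunction
> API from that post with a Long timestamp and a Long value:
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
>
> class MinLongByTimestamp extends UserDefinedAggregateFunction {
>   // input: (timestamp, value); buffer: running (minTimestamp, valueAtMinTimestamp)
>   def inputSchema: StructType =
>     StructType(StructField("ts", LongType) :: StructField("value", LongType) :: Nil)
>   def bufferSchema: StructType =
>     StructType(StructField("minTs", LongType) :: StructField("minValue", LongType) :: Nil)
>   def dataType: DataType = LongType
>   def deterministic: Boolean = true
>
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
>     buffer(0) = Long.MaxValue
>     buffer(1) = 0L
>   }
>
>   // Keep the value whose timestamp is smallest
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
>     if (!input.isNullAt(0) && input.getLong(0) < buffer.getLong(0)) {
>       buffer(0) = input.getLong(0)
>       buffer(1) = input.getLong(1)
>     }
>   }
>
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
>     if (buffer2.getLong(0) < buffer1.getLong(0)) {
>       buffer1(0) = buffer2.getLong(0)
>       buffer1(1) = buffer2.getLong(1)
>     }
>   }
>
>   def evaluate(buffer: Row): Any = buffer.getLong(1)
> }
>
> Used roughly as df.groupBy("user").agg(new MinLongByTimestamp()(col("time"), col("value"))).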
>
> Does anyone know how you might go about using generics in a UDAF, or
> something that would mimic union types to express that orderable Spark SQL
> types are allowed?
>
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
>
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn
> <https://www.linkedin.com/in/pedrorodriguezscience>
>
> On July 8, 2016 at 6:06:32 PM, Xinh Huynh (xinh.hu...@gmail.com) wrote:
>
> Hi Pedro,
>
> I could not think of a way using an aggregate. It's possible with a window
> function, partitioned on user and ordered by time:
>
> // Assuming "df" holds your dataframe ...
> // (the $-column syntax also needs the implicits in scope,
> //  e.g. import sqlContext.implicits._)
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions.Window
>
> // Rank each user's events by time and keep only the earliest one
> val wSpec = Window.partitionBy("user").orderBy("time")
> df.select($"user", $"time", rank().over(wSpec).as("rank"))
>   .where($"rank" === 1)
>
> Xinh
>
> On Fri, Jul 8, 2016 at 12:57 PM, Pedro Rodriguez <ski.rodrig...@gmail.com>
> wrote:
>
>> Is there a way, on a GroupedData (from groupBy on a DataFrame), to compute
>> an aggregate that returns column A based on the min of column B? For
>> example, I have a list of sites visited by a given user, and I would like
>> to find the event with the minimum time (the first event).
>>
>> Thanks,
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>
