Nathan,

I achieve this using the rowNumber window function.  Here is a PySpark
DataFrame example:

from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rowNumber

# Rank each user's rows oldest-first ("first") and newest-first ("last").
# Note: rowNumber() was renamed to row_number() in later Spark releases.
yourOutputDF = (
    yourInputDF
    .withColumn("first", rowNumber()
                .over(Window.partitionBy("userID").orderBy("datetime")))
    .withColumn("last", rowNumber()
                .over(Window.partitionBy("userID").orderBy(desc("datetime"))))
)

You can get the first url like this:
yourOutputDF.filter("first=1").select("userID", "url")

...and the last like this:
yourOutputDF.filter("last=1").select("userID", "url")

If you wanted the first and last url as columns with one row per userID,
you could do a groupBy and take the max of a when column that returns the
url if last is 1, or null otherwise.  (You would need a similar column
where first is 1.)  Since max ignores nulls, each aggregate keeps exactly
the one url from its matching row.
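
Roughly, something like this (an untested sketch; urlsByUserDF and the
firstUrl/lastUrl aliases are just placeholder names):

from pyspark.sql.functions import col, when, max as max_

urlsByUserDF = (
    yourOutputDF
    .groupBy("userID")
    .agg(
        # when(...) without otherwise() yields null for non-matching rows,
        # and max ignores nulls, so each column keeps one url per user.
        max_(when(col("first") == 1, col("url"))).alias("firstUrl"),
        max_(when(col("last") == 1, col("url"))).alias("lastUrl"),
    )
)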

Regards,
Dan


On Fri, Aug 21, 2015 at 4:09 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Did you try sorting it by datetime and doing a groupBy on the userID?
> On Aug 21, 2015 12:47 PM, "Nathan Skone" <nat...@skone.org> wrote:
>
>> Raghavendra,
>>
>> Thanks for the quick reply! I don’t think I included enough information
>> in my question. I am hoping to get fields that are not directly part of the
>> aggregation. Imagine a dataframe representing website views with a userID,
>> datetime, and a webpage address. How could I find the oldest or newest
>> webpage address that a user visited? As I understand it, you can only
>> access fields that are part of the aggregation itself.
>>
>> Thanks,
>> Impact
>>
>>
>> On Aug 21, 2015, at 11:11 AM, Raghavendra Pandey <
>> raghavendra.pan...@gmail.com> wrote:
>>
>> Impact,
>> You can group the data by key, then sort it by timestamp and take the
>> max to select the oldest value.
>> On Aug 21, 2015 11:15 PM, "Impact" <nat...@skone.org> wrote:
>>
>>> I am also looking for a way to achieve the reducebykey functionality on
>>> data
>>> frames. In my case I need to select one particular row (the oldest,
>>> based on
>>> a timestamp column value) by key.
>>>