Nathan, I achieve this using the rowNumber window function. Here is a PySpark DataFrame example:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rowNumber

yourOutputDF = (
    yourInputDF
    .withColumn("first", rowNumber().over(
        Window.partitionBy("userID").orderBy("datetime")))
    .withColumn("last", rowNumber().over(
        Window.partitionBy("userID").orderBy(desc("datetime"))))
)

(In later Spark releases this function is named row_number.)

You can get the first url like this:

    yourOutputDF.filter("first = 1").select("userID", "url")

...and the last like this:

    yourOutputDF.filter("last = 1").select("userID", "url")

If you wanted the first and last url as columns with one row per userID, you could do a groupBy and take the max of a when column that returns the url if last is 1, or null otherwise. (You would need a similar column for where first is 1.) I hope this makes sense; I don't have time right now to provide a code example.

Regards,
Dan

On Fri, Aug 21, 2015 at 4:09 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Did you try sorting it by datetime and doing a groupBy on the userID?
> On Aug 21, 2015 12:47 PM, "Nathan Skone" <nat...@skone.org> wrote:
>
>> Raghavendra,
>>
>> Thanks for the quick reply! I don't think I included enough information
>> in my question. I am hoping to get fields that are not directly part of
>> the aggregation. Imagine a dataframe representing website views with a
>> userID, a datetime, and a webpage address. How could I find the oldest or
>> newest webpage address that a user visited? As I understand it, you can
>> only access fields that are part of the aggregation itself.
>>
>> Thanks,
>> Impact
>>
>> On Aug 21, 2015, at 11:11 AM, Raghavendra Pandey
>> <raghavendra.pan...@gmail.com> wrote:
>>
>> Impact,
>> You can group the data and then sort it by timestamp and take the max to
>> select the oldest value.
>> On Aug 21, 2015 11:15 PM, "Impact" <nat...@skone.org> wrote:
>>
>>> I am also looking for a way to achieve the reduceByKey functionality on
>>> data frames.
>>> In my case I need to select one particular row (the oldest,
>>> based on a timestamp column value) by key.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Aggregate-to-array-or-slice-by-key-with-DataFrames-tp23636p24399.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.