Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Nathan Skone
Raghavendra,

Thanks for the quick reply! I don’t think I included enough information in my 
question. I am hoping to get fields that are not directly part of the 
aggregation. Imagine a dataframe representing website views with a userID, 
datetime, and a webpage address. How could I find the oldest or newest webpage 
address that a user visited? As I understand it, you can only access fields 
that are part of the aggregation itself.

Thanks,
Impact


 On Aug 21, 2015, at 11:11 AM, Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:
 
 Impact,
 You can group the data by key, sort each group by timestamp, and take the 
 min (or max) of the timestamp to select the oldest (or newest) value.
 
 On Aug 21, 2015 11:15 PM, Impact nat...@skone.org wrote:
 I am also looking for a way to achieve the reduceByKey functionality on
 DataFrames. In my case I need to select one particular row (the oldest,
 based on a timestamp column value) by key.
 
 
 
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Aggregate-to-array-or-slice-by-key-with-DataFrames-tp23636p24399.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Akhil Das
Did you try sorting it by datetime and doing a groupBy on the userID?
On Aug 21, 2015 12:47 PM, Nathan Skone nat...@skone.org wrote:

 Raghavendra,

 Thanks for the quick reply! I don’t think I included enough information in
 my question. I am hoping to get fields that are not directly part of the
 aggregation. Imagine a dataframe representing website views with a userID,
 datetime, and a webpage address. How could I find the oldest or newest
 webpage address that a user visited? As I understand it, you can only
 access fields that are part of the aggregation itself.

 Thanks,
 Impact







Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Raghavendra Pandey
Impact,
You can group the data by key, sort each group by timestamp, and take the
min (or max) of the timestamp to select the oldest (or newest) value.
On Aug 21, 2015 11:15 PM, Impact nat...@skone.org wrote:

 I am also looking for a way to achieve the reduceByKey functionality on
 DataFrames. In my case I need to select one particular row (the oldest,
 based on a timestamp column value) by key.







Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Impact
I am also looking for a way to achieve the reduceByKey functionality on
DataFrames. In my case I need to select one particular row (the oldest, based
on a timestamp column value) by key.






Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Dan LaBar
Nathan,

I achieve this using rowNumber.  Here is a Python DataFrame example:

from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rowNumber

yourOutputDF = (
    yourInputDF
    .withColumn("first", rowNumber()
        .over(Window.partitionBy("userID").orderBy("datetime")))
    .withColumn("last", rowNumber()
        .over(Window.partitionBy("userID").orderBy(desc("datetime"))))
)

You can get the first url like this:
yourOutputDF.filter("first = 1").select("userID", "url")

...and the last like this:
yourOutputDF.filter("last = 1").select("userID", "url")

If you wanted the first and last url as columns with one row per userID,
you could do a groupBy and take the max of a when column that returns the
url if last is 1, or null otherwise.  (You would need a similar column
where first is 1.)  Not sure if this makes sense, but I don't have time now
to provide a code example.

Regards,
Dan


On Fri, Aug 21, 2015 at 4:09 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Did you try sorting it by datetime and doing a groupBy on the userID?





Aggregate to array (or 'slice by key') with DataFrames

2015-07-05 Thread Alex Beatson
Hello,

I'm migrating some RDD-based code to using DataFrames. We've seen massive
speedups so far!

One of the operations in the old code creates an array of the values for
each key, as follows:

val collatedRDD = valuesRDD
  .mapValues(value => Array(value))
  .reduceByKey((array1, array2) => array1 ++ array2)

I was wondering if there is a similar way to achieve this with a DataFrame
via the DataFrame API, or whether we need to use RDD operations on the
DataFrame to get this functionality?

From what I've seen all the SQL aggregations output a single value, and
slices output a single array of rows. To rephrase my question I guess I'm
wondering if there is some way to use aggregation or slicing on a DataFrame
to output some collection (rdd / array / etc) of arrays, with one array for
each distinct value in a given column of the DataFrame.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Aggregate-to-array-or-slice-by-key-with-DataFrames-tp23636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
