Re: Aggregate to array (or 'slice by key') with DataFrames
Raghavendra,

Thanks for the quick reply! I don't think I included enough information in my question. I am hoping to get fields that are not directly part of the aggregation. Imagine a DataFrame representing website views with a userID, a datetime, and a webpage address. How could I find the oldest or newest webpage address that a user visited? As I understand it, you can only access fields that are part of the aggregation itself.

Thanks,
Impact

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Aggregate-to-array-or-slice-by-key-with-DataFrames-tp23636p24399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Aggregate to array (or 'slice by key') with DataFrames
Did you try sorting it by datetime and doing a groupBy on the userID?
Re: Aggregate to array (or 'slice by key') with DataFrames
Impact,

You can group the data by key, sort each group by timestamp, and take the min (for the oldest) or max (for the newest) of the timestamp to select the row you want.
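[Editor's note] The group-then-pick approach above can be sketched outside Spark in plain Python; the (userID, timestamp, url) rows below are hypothetical, and this illustrates the logic only, not the DataFrame API:

```python
from collections import defaultdict

# Hypothetical page-view rows: (userID, timestamp, url).
views = [
    ("u1", 3, "/c"),
    ("u1", 1, "/a"),
    ("u2", 2, "/x"),
    ("u1", 2, "/b"),
]

# Group rows by key, then pick per group: min over the (timestamp, url)
# tuples gives the oldest view, max gives the newest.
groups = defaultdict(list)
for user, ts, url in views:
    groups[user].append((ts, url))

oldest = {user: min(rows)[1] for user, rows in groups.items()}
newest = {user: max(rows)[1] for user, rows in groups.items()}
print(oldest)  # {'u1': '/a', 'u2': '/x'}
print(newest)  # {'u1': '/c', 'u2': '/x'}
```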
Re: Aggregate to array (or 'slice by key') with DataFrames
I am also looking for a way to achieve the reduceByKey functionality on DataFrames. In my case I need to select one particular row (the oldest, based on a timestamp column value) by key.
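[Editor's note] The per-key selection asked for here is a pairwise reduction, which is the contract reduceByKey implements. A plain-Python sketch over hypothetical (userID, (timestamp, url)) pairs:

```python
# Hypothetical (userID, (timestamp, url)) pairs, standing in for a keyed RDD.
pairs = [
    ("u1", (3, "/c")),
    ("u1", (1, "/a")),
    ("u2", (2, "/x")),
]

# reduceByKey needs an associative, commutative combiner; keeping the
# row with the smaller timestamp selects the oldest row per key.
def older(a, b):
    return a if a[0] <= b[0] else b

result = {}
for key, value in pairs:
    result[key] = older(result[key], value) if key in result else value

print(result)  # {'u1': (1, '/a'), 'u2': (2, '/x')}
```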
Re: Aggregate to array (or 'slice by key') with DataFrames
Nathan,

I achieve this using rowNumber. Here is a Python DataFrame example:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import desc, rowNumber

    yourOutputDF = (
        yourInputDF
        .withColumn("first", rowNumber().over(
            Window.partitionBy("userID").orderBy("datetime")))
        .withColumn("last", rowNumber().over(
            Window.partitionBy("userID").orderBy(desc("datetime"))))
    )

You can get the first url like this:

    yourOutputDF.filter("first = 1").select("userID", "url")

...and the last like this:

    yourOutputDF.filter("last = 1").select("userID", "url")

If you wanted the first and last url as columns with one row per userID, you could do a groupBy and take the max of a when column that returns the url if last is 1, or null otherwise. (You would need a similar column where first is 1.) Not sure if this makes sense, but I don't have time now to provide a code example.

Regards,
Dan
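[Editor's note] The groupBy-plus-when pivot described in prose at the end of the message above can be illustrated outside Spark. A plain-Python sketch over hypothetical rows already tagged with the "first"/"last" window row numbers:

```python
# Hypothetical rows tagged with window row numbers:
# (userID, url, first, last).
tagged = [
    ("u1", "/a", 1, 3),
    ("u1", "/b", 2, 2),
    ("u1", "/c", 3, 1),
    ("u2", "/x", 1, 1),
]

# Emulates groupBy("userID").agg(max(when(first == 1, url)),
#                                max(when(last == 1, url))):
# a when() that does not match yields null, and max() ignores nulls,
# so only the matching url survives in each group.
pivot = {}
for user, url, first, last in tagged:
    row = pivot.setdefault(user, {"first_url": None, "last_url": None})
    if first == 1:
        row["first_url"] = url
    if last == 1:
        row["last_url"] = url

print(pivot["u1"])  # {'first_url': '/a', 'last_url': '/c'}
```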
Aggregate to array (or 'slice by key') with DataFrames
Hello,

I'm migrating some RDD-based code to DataFrames. We've seen massive speedups so far! One of the operations in the old code creates an array of the values for each key, as follows:

    val collatedRDD = valuesRDD
      .mapValues(value => Array(value))
      .reduceByKey((array1, array2) => array1 ++ array2)

I was wondering if there is a similar way to achieve this via the DataFrame API, or whether we need to use RDD operations on the DataFrame to get this functionality. From what I've seen, all the SQL aggregations output a single value, and slices output a single array of rows. To rephrase my question: is there some way to use aggregation or slicing on a DataFrame to output some collection (RDD / array / etc.) of arrays, with one array for each distinct value in a given column of the DataFrame?
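[Editor's note] The Scala snippet above builds one array per key by concatenating singleton arrays. The same collect-to-list logic, sketched in plain Python over hypothetical key/value pairs (later Spark releases also expose a collect_list aggregate for this on DataFrames):

```python
from collections import defaultdict

# Hypothetical key/value pairs, standing in for valuesRDD.
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

# Equivalent of mapValues(value => Array(value)) followed by
# reduceByKey(_ ++ _): one list per key, grown as pairs arrive.
collated = defaultdict(list)
for key, value in pairs:
    collated[key].append(value)

print(dict(collated))  # {'a': [1, 3, 4], 'b': [2]}
```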