Hello, I'm migrating some RDD-based code to DataFrames. We've seen massive speedups so far!
One of the operations in the old code builds an array of the values for each key, as follows:

    val collatedRDD = valuesRDD
      .mapValues(value => Array(value))
      .reduceByKey((array1, array2) => array1 ++ array2)

I was wondering whether there is a similar way to achieve this with the DataFrame API, or whether we need to fall back to RDD operations on the DataFrame to get this functionality. From what I've seen, all the SQL aggregations output a single value, and slices output a single array of rows.

To rephrase my question: is there some way to use aggregation or slicing on a DataFrame to output some collection (RDD / array / etc.) of arrays, with one array for each distinct value in a given column of the DataFrame?
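For reference, here is a minimal, self-contained sketch of the pattern, runnable in spark-shell (the sample data, column names, and variable names are made up purely for illustration), including the RDD-on-DataFrame fallback I mention above:

    // Assuming a spark-shell session where `sc` and `sqlContext` are already defined.
    import sqlContext.implicits._

    // Made-up sample data: (key, value) pairs.
    val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4))).toDF("key", "value")

    // Fallback approach: drop to the underlying RDD[Row] and apply the old pattern.
    val collatedRDD = df.rdd
      .map(row => (row.getString(0), row.getInt(1)))
      .mapValues(value => Array(value))
      .reduceByKey((array1, array2) => array1 ++ array2)

    // Result: RDD[(String, Array[Int])], e.g. ("a", Array(1, 2)), ("b", Array(3, 4))
    collatedRDD.mapValues(_.mkString("[", ",", "]")).collect().foreach(println)

What I'm hoping for is a way to express that last grouping step directly in the DataFrame API instead of going through the RDD.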