You can use a Hive UDAF.

import org.apache.spark.sql.functions._
import sqlContext.implicits._  // for the $"..." column syntax

callUDF("collect_set", $"columnName")

or just SELECT collect_set(columnName) FROM ... GROUP BY ... (it's an aggregate, so it needs a GROUP BY).
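Putting that together on your example schema, a rough sketch (column names user/score are taken from your example; assumes a HiveContext, since collect_set is a Hive UDAF — this is illustrative, not tested):

```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._  // for the $"..." column syntax

// df has columns "user" and "score", as in the example below.
val grouped = df.groupBy($"user")
  .agg(callUDF("collect_set", $"score").as("scores"))
// Note: collect_set deduplicates; Hive also has collect_list,
// which keeps duplicates, if that matters for your data.
```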

Note that in 1.5 I think this does not actually use Tungsten.  In 1.6 it
should, though.

I'll add that the experimental Dataset API (previewed in 1.6) might have a
better fit for what you are looking for, in the form of mapGroups.
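Roughly, that could look like the following (sketch only — the API is experimental in the 1.6 preview, so names and encoders may change; Score is a hypothetical case class matching your example schema):

```scala
import sqlContext.implicits._

// Hypothetical case class mirroring the (user, score) table below.
case class Score(user: String, score: Int)

val ds = df.as[Score]
// Group by user and collect each group's scores into a Seq.
val arrays = ds.groupBy(_.user).mapGroups { (user, rows) =>
  (user, rows.map(_.score).toSeq)
}
```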


On Thu, Oct 29, 2015 at 2:19 AM, saurfang <[email protected]> wrote:

> Sorry if this functionality already exists or has been asked before, but I'm
> looking for an aggregate function in SparkSQL that allows me to collect a
> column into an array per group in a grouped dataframe.
>
> For example, if I have the following table
> user, score
> --------
> user1, 1
> user2, 2
> user1, 3
> user2, 4
> user1, 5
>
> I want to produce a dataframe that is like
>
> user1, [1, 3, 5]
> user2, [2, 4]
>
> (possibly via select collect(score) from table group by user)
>
>
> I realize I can probably implement this as a UDAF, but I just want to double
> check whether such a thing already exists. If not, would there be interest
> in having this in SparkSQL?
>
> To give some context, I am trying to do a cogroup on datasets and then
> persist as parquet, but I want to take advantage of Tungsten. So the plan is
> to: collapse each row into a (key, value) of struct => group by key and
> collect values as array[struct] => outer join dataframes.
>
> p.s. Here are some resources I have found so far, but all of them concern
> top K per key instead:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-way-to-get-top-K-values-per-key-in-key-value-RDD-td20370.html
> and
> https://issues.apache.org/jira/browse/SPARK-5954 (where Reynold mentioned
> this could be an API in dataframe)
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Collect-Column-as-Array-in-Grouped-DataFrame-tp25223.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
