I have a Spark DataFrame that looks like:
| id | value | bin |
|----+-------+-----|
|  1 |   3.4 |   2 |
|  2 |   2.6 |   1 |
|  3 |   1.8 |   1 |
|  4 |   9.6 |   2 |
I have a function `f` that takes an array of values and returns a number. I
want to add a column to the above DataFrame where the value for the new
column in each row is the value of `f` for all the `value` entries that have
the same `bin` entry, i.e:
| id | value | bin | f_value       |
|----+-------+-----+---------------|
|  1 |   3.4 |   2 | f([3.4, 9.6]) |
|  2 |   2.6 |   1 | f([2.6, 1.8]) |
|  3 |   1.8 |   1 | f([2.6, 1.8]) |
|  4 |   9.6 |   2 | f([3.4, 9.6]) |
Since I need to aggregate all `value`s per `bin`, I cannot use `withColumn`
alone to add this new column. What is the best way to do this until
user-defined aggregation functions make their way into Spark?
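The workaround I'm considering is: group the rows by `bin`, apply `f` to the
collected `value`s of each group, and join the per-bin result back onto the
original DataFrame (in PySpark, roughly a `groupBy("bin")` aggregation
followed by a `join` on `bin`). Here is the same logic in plain Python, just
to illustrate the pattern without a Spark runtime; the mean is a hypothetical
stand-in for `f`:

```python
from collections import defaultdict

def f(values):
    # Hypothetical aggregate standing in for the real `f`:
    # takes a list of values, returns one number.
    return sum(values) / len(values)

rows = [
    {"id": 1, "value": 3.4, "bin": 2},
    {"id": 2, "value": 2.6, "bin": 1},
    {"id": 3, "value": 1.8, "bin": 1},
    {"id": 4, "value": 9.6, "bin": 2},
]

# Step 1: the "groupBy(bin)" -- collect every value per bin.
values_per_bin = defaultdict(list)
for row in rows:
    values_per_bin[row["bin"]].append(row["value"])

# Step 2: apply f once per bin (the aggregation).
f_per_bin = {b: f(vals) for b, vals in values_per_bin.items()}

# Step 3: the "join" back -- attach the per-bin result to each row.
result = [dict(row, f_value=f_per_bin[row["bin"]]) for row in rows]
```

Every row in `result` carries the `f_value` computed from all `value`s
sharing its `bin`, which is exactly the shape of the second table above.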
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Add-Custom-Aggregate-Column-to-Spark-DataFrame-tp23075.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.