I have a Spark DataFrame that looks like:

    | id | value | bin |
    |----+-------+-----|
    |  1 |   3.4 |   2 |
    |  2 |   2.6 |   1 |
    |  3 |   1.8 |   1 |
    |  4 |   9.6 |   2 |

I have a function `f` that takes an array of values and returns a number.  I
want to add a column to the above DataFrame where the value for the new
column in each row is the value of `f` applied to all the `value` entries that
share that row's `bin` entry, i.e.:

    | id | value | bin | f_value       |
    |----+-------+-----+---------------|
    |  1 |   3.4 |   2 | f([3.4, 9.6]) |
    |  2 |   2.6 |   1 | f([2.6, 1.8]) |
    |  3 |   1.8 |   1 | f([2.6, 1.8]) |
    |  4 |   9.6 |   2 | f([3.4, 9.6]) |

Since the new value depends on an aggregate over all `value`s sharing a
`bin`, a plain `withColumn` expression won't work, as it only sees one row at
a time.  What is the best way to do this until user-defined aggregation
functions make their way into Spark?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Add-Custom-Aggregate-Column-to-Spark-DataFrame-tp23075.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
