Hello,
I am using the following transformations on an RDD:

from pyspark.sql import Row

rddAgg = df.map(lambda l: (Row(a=l.a, b=l.b, c=l.c), l)) \
    .aggregateByKey([],
                    lambda accumulatorList, value: accumulatorList + [value],
                    lambda list1, list2: list1 + list2)
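
For concreteness, here is a small self-contained version of the above with
made-up sample data (the SparkContext setup, the extra column d and the
values are only for illustration). The result is one record per distinct
(a, b, c) key, holding the list of all rows for that key:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local", "aggregate-by-key-sketch")
sqlContext = SQLContext(sc)

# Made-up sample rows; d is just an illustrative payload column.
df = sqlContext.createDataFrame([
    Row(a=1, b=1, c=1, d="x"),
    Row(a=1, b=1, c=1, d="y"),
    Row(a=2, b=2, c=2, d="z"),
])

rddAgg = df.map(lambda l: (Row(a=l.a, b=l.b, c=l.c), l)) \
    .aggregateByKey([],
                    lambda acc, value: acc + [value],  # append a row to the per-partition list
                    lambda l1, l2: l1 + l2)            # merge lists coming from different partitions

print(rddAgg.collect())
# roughly: [(Row(a=1, b=1, c=1), [<the two d='x'/d='y' rows>]),
#           (Row(a=2, b=2, c=2), [<the d='z' row>])]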
I want to use the DataFrame groupBy + agg transformation instead of
map + aggregateByKey, because as far as I know DataFrame transformations
are faster than the equivalent RDD transformations.
I just can't figure out how to use custom aggregate functions with agg.
*First step is clear:*
groupedData = df.groupBy("a","b","c")
*Second step is not very clear to me:*
dfAgg = groupedData.agg(<should I call a UDF here that collects each
group's rows into a list and merges them?>)
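
One route that might work (only a sketch, and it assumes a HiveContext is
available and that Hive's collect_list UDAF can be used; collect_list is not
among the built-in aggregate functions listed below) is to fall back to SQL
instead of agg:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext("local", "collect-list-sketch")
hc = HiveContext(sc)

# Hypothetical: df would have to come from (or be re-created in) the same
# HiveContext so that the temp table below is visible to it; the column d
# is again just an illustrative payload column.
df = hc.createDataFrame([(1, 1, 1, "x"), (1, 1, 1, "y"), (2, 2, 2, "z")],
                        ["a", "b", "c", "d"])
df.registerTempTable("t")

# collect_list is a Hive UDAF that gathers all values of d for each
# (a, b, c) group into an array.
dfAgg = hc.sql("SELECT a, b, c, collect_list(d) AS ds FROM t GROUP BY a, b, c")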
The agg documentation says the following:

agg(*exprs)
<https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.GroupedData.agg>

Compute aggregates and returns the result as a DataFrame.

The available aggregate functions are avg, max, min, sum, count.

If exprs is a single dict mapping from string to string, then the key is
the column to perform aggregation on, and the value is the aggregate
function.

Alternatively, exprs can also be a list of aggregate Column expressions.

Parameters: *exprs* – a dict mapping from column name (string) to aggregate
functions (string), or a list of Column.
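
For reference, the two calling conventions described above would look
roughly like this (assuming df also has an extra column d to aggregate over):

from pyspark.sql import functions as F

# Dict form: {column to aggregate: name of a built-in aggregate function}.
dfAgg1 = df.groupBy("a", "b", "c").agg({"d": "max"})

# Column-expression form: a list of aggregate Column expressions.
dfAgg2 = df.groupBy("a", "b", "c").agg(F.max(df.d), F.min("d"))

Neither form seems to take a Python function directly, which is why I am
stuck on the list-collecting case.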
Thanks for help!
--
Viktor