>>> itioning(key#0, 200)
>>>    +- SortAggregate(key=[key#0], functions=[partial_aggregatefunction(key#0, nested#1, nestedArray#2, nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)], output=[key#0,internal
>>>       +- Scan ExistingRDD[key#0,nested#1,nestedArray#2,nestedObjectArray#3,value#4L]
>>>
>>> How can I make Spark use HashAggregate (like it does for the count(*) expression)
>>> instead of SortAggregate with my UDAF?
>>>
>>> Is it int

Hi Takeshi,
Thanks for the answer. My UDAF aggregates data into an array of rows.
Apparently this makes it ineligible for hash-based aggregation, based on
the logic at:
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationM
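
For what it's worth, the eligibility check can be probed directly. A rough
sketch below (the buffer schemas are made up for illustration, and the method
name is taken from the file above, so please verify it against your Spark
version):

import org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap
import org.apache.spark.sql.types._

object BufferSchemaCheck {
  def main(args: Array[String]): Unit = {
    // A fixed-width buffer (e.g. a single Long counter) is mutable inside an
    // UnsafeRow, so it is eligible for hash-based aggregation.
    val fixedWidthBuffer = StructType(Seq(StructField("count", LongType)))

    // A buffer holding an array of rows is not fixed-width, which is why the
    // planner falls back to SortAggregate for a UDAF like mine.
    val arrayOfRowsBuffer = StructType(Seq(
      StructField("rows", ArrayType(StructType(Seq(StructField("value", LongType)))))))

    println(UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema(fixedWidthBuffer))  // true
    println(UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema(arrayOfRowsBuffer)) // false
  }
}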

Hi,
Spark always uses hash-based aggregates if the types of the aggregated data
are supported there; otherwise, it cannot use hash-based aggregation and
falls back to sort-based aggregation.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.s
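
As a quick sanity check, explain() shows which physical aggregate was chosen.
A minimal sketch, assuming a SparkSession named spark as in spark-shell
(column names are just examples):

import org.apache.spark.sql.functions.count

// A built-in aggregate such as count uses a fixed-width (Long) buffer,
// so the printed plan contains HashAggregate rather than SortAggregate.
val df = spark.range(100).selectExpr("id % 10 AS key", "id AS value")
df.groupBy("key").agg(count("value")).explain()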

Hi all,
It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
which is very inefficient for certain aggregations.
The code is very simple:
- I have a UDAF
- What I want to do is: dataset.groupBy(cols).agg(udaf).count()
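
For concreteness, here is a stripped-down sketch of the kind of UDAF involved
(class and column names are made up; it uses the older
UserDefinedAggregateFunction API and simply collects values into an array of
rows):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Collects the input values into an array of rows; the array-typed
// aggregation buffer is the part that matters for the physical plan.
class CollectRowsUdaf extends UserDefinedAggregateFunction {
  private val rowType = StructType(Seq(StructField("value", LongType)))

  override def inputSchema: StructType = StructType(Seq(StructField("value", LongType)))
  override def bufferSchema: StructType = StructType(Seq(StructField("rows", ArrayType(rowType))))
  override def dataType: DataType = ArrayType(rowType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Row]

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[Row](0) :+ Row(input.getLong(0))

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Row](0) ++ buffer2.getSeq[Row](0)

  override def evaluate(buffer: Row): Any = buffer.getSeq[Row](0)
}

// Usage, matching the query above:
//   import org.apache.spark.sql.functions.col
//   dataset.groupBy("key").agg(new CollectRowsUdaf()(col("value"))).count()
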
The physical plan I got was:
*HashAggregate(keys=[], functions