Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-10 Thread Andy Dang
Exchange hashpartitioning(key#0, 200)
+- SortAggregate(key=[key#0], functions=[partial_aggregatefunction(key#0, nested#1, nestedArray#2, nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)], output=[key#0,internal…
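
A plan fragment like the one above can also be checked programmatically instead of eyeballing explain() output. This is a minimal sketch, assuming Spark 2.x; HashAggregateExec and SortAggregateExec are the actual physical operator classes, while aggregateKind is a hypothetical helper name introduced here:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.aggregate.{HashAggregateExec, SortAggregateExec}

    // Walk the executed physical plan and report the first aggregate
    // operator found; collectFirst comes from Spark's TreeNode API.
    def aggregateKind(df: DataFrame): String =
      df.queryExecution.executedPlan.collectFirst {
        case _: HashAggregateExec => "hash-based"
        case _: SortAggregateExec => "sort-based"
      }.getOrElse("no aggregate operator found")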

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Liang-Chi Hsieh
+- Scan ExistingRDD[key#0,nested#1,nestedArray#2,nestedObjectArray#3,value#4L]

How can I make Spark use HashAggregate (like the count(*) expression) instead of SortAggregate with my UDAF? Is it int…
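
For reference, per-field eligibility comes down to whether a type can be updated in place inside an UnsafeRow. A minimal sketch, assuming Spark 2.x on the classpath; UnsafeRow.isMutable is the real static check used by the aggregation-map code:

    import org.apache.spark.sql.catalyst.expressions.UnsafeRow
    import org.apache.spark.sql.types._

    // Fixed-width primitives can be overwritten in place -> hash-eligible.
    UnsafeRow.isMutable(LongType)             // true
    // DecimalType is special-cased as mutable.
    UnsafeRow.isMutable(DecimalType(18, 2))   // true
    // Variable-length types cannot be updated in place -> sort-based fallback.
    UnsafeRow.isMutable(ArrayType(LongType))  // false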

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi Takeshi, thanks for the answer. My UDAF aggregates data into an array of rows. Apparently this makes it ineligible for hash-based aggregation, per the logic at: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java
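
For illustration, a hypothetical UDAF of this shape (collecting values into an array buffer; the class and field names are invented here) would hit that restriction, because ArrayType is not a mutable UnsafeRow field type:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class CollectValuesUdaf extends UserDefinedAggregateFunction {
      override def inputSchema: StructType =
        StructType(Seq(StructField("value", LongType)))
      // The variable-length array buffer is what disqualifies hash aggregation.
      override def bufferSchema: StructType =
        StructType(Seq(StructField("values", ArrayType(LongType))))
      override def dataType: DataType = ArrayType(LongType)
      override def deterministic: Boolean = true

      override def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Seq.empty[Long]

      override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        buffer(0) = buffer.getSeq[Long](0) :+ input.getLong(0)

      override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getSeq[Long](0) ++ buffer2.getSeq[Long](0)

      override def evaluate(buffer: Row): Any = buffer.getSeq[Long](0)
    }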

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Takeshi Yamamuro
Hi, Spark always uses hash-based aggregates if the types of the aggregated data are supported there; otherwise, it cannot use hash-based ones and falls back to sort-based aggregation. See: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
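
The check that gates this choice can be called directly. A minimal sketch, assuming Spark 2.x; supportsAggregationBufferSchema is the real static method that the planner path relies on, applied here to two made-up buffer schemas:

    import org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap
    import org.apache.spark.sql.types._

    // All fields fixed-width and mutable -> eligible for HashAggregate.
    val fixedWidthBuffer = StructType(Seq(
      StructField("sum", LongType), StructField("cnt", LongType)))
    // An array field is variable-length -> falls back to SortAggregate.
    val arrayBuffer = StructType(Seq(
      StructField("rows", ArrayType(StringType))))

    UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema(fixedWidthBuffer) // true
    UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema(arrayBuffer)      // false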

How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi all, it appears to me that Dataset.groupBy().agg(udaf) requires a full sort, which is very inefficient for certain aggregations. The code is very simple:
- I have a UDAF
- What I want to do is: dataset.groupBy(cols).agg(udaf).count()

The physical plan I got was:
*HashAggregate(keys=[], functions…
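
To make the comparison concrete, here is a minimal sketch, assuming a local Spark 2.x session and some UDAF instance myUdaf (a hypothetical name) defined elsewhere; explain() prints the physical plan where the HashAggregate/SortAggregate choice shows up:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.count

    val spark = SparkSession.builder()
      .appName("agg-plan-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val dataset = Seq(("a", 1L), ("b", 2L), ("a", 3L)).toDF("key", "value")

    // Built-in count() keeps a fixed-width buffer, so the plan shows HashAggregate:
    dataset.groupBy($"key").agg(count($"value")).explain()

    // A UDAF with a non-mutable buffer type would show SortAggregate instead:
    // dataset.groupBy($"key").agg(myUdaf($"value")).explain()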