Yanbo Liang created SPARK-21591:
-----------------------------------

             Summary: Implement treeAggregate on Dataset API
                 Key: SPARK-21591
                 URL: https://issues.apache.org/jira/browse/SPARK-21591
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Yanbo Liang

The Tungsten execution engine substantially improved the memory and CPU efficiency of Spark applications. However, in MLlib we have still not migrated the internal computing workload from {{RDD}} to {{DataFrame}}. The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. As we all know, {{RDD}}-based {{treeAggregate}} reduces the aggregation time by an order of magnitude for many MLlib algorithms (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).

I am opening this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} API and to track the related performance benchmarking.
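
For readers unfamiliar with the idea behind {{treeAggregate}}, here is a minimal single-process sketch in plain Python. It is not Spark's implementation and does not use any Spark API: the function name {{tree_aggregate}} and the {{fan_in}} parameter are hypothetical (the {{RDD}} method controls the tree shape through a {{depth}} argument instead). It only illustrates the pattern: each partition is folded locally, and the partial results are then merged level by level, so that no single node has to combine every partial result itself.

```python
from functools import reduce


def tree_aggregate(partitions, zero, seq_op, comb_op, fan_in=2):
    """Hypothetical, single-process illustration of tree aggregation.

    partitions: list of lists, standing in for the partitions of a dataset
    zero:       neutral start value (assumed immutable in this sketch)
    seq_op:     folds one record into a partition-local accumulator
    comb_op:    merges two accumulators
    fan_in:     assumed name; how many partial results each merge step combines
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")

    # Step 1: fold every partition into a single partial result.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    if not partials:
        return zero

    # Step 2: merge partial results in groups of `fan_in`, one level at a
    # time, until a single value is left. In a cluster, each group would be
    # merged on an executor rather than all at once on the driver.
    while len(partials) > 1:
        partials = [
            reduce(comb_op, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]


# Usage: sum the values held in four partitions.
partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
total = tree_aggregate(partitions, 0, lambda acc, x: acc + x, lambda a, b: a + b)
print(total)  # prints 55
```

With four partitions and {{fan_in=2}}, the final merge combines two values instead of four. For the large accumulators common in MLlib (for example gradient vectors), that reduction in work at the last step is where the benefit is expected to come from.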