[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-21 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 Yeah! Let me close this for now. We can discuss this further in the Jira ticket then. --- - To unsubscribe, e-mail:

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-19 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20806 great! maybe we can hold this PR for a real SQL tree aggregate in the future, with some proper design and discussion. --- -

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 @cloud-fan @WeichenXu123 Ok. I've setup a Spark cluster with 5 nodes for the benchmark. The used data: ``` val r = new Random val ds = (0 to 1).map { _ => val a =

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 @WeichenXu123 As the discussion with @cloud-fan at https://github.com/apache/spark/pull/20806#discussion_r174277864, I'd like to see some performance gain it, but I need to run benchmark to see if

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-16 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20806 @viirya ok. but there're already a class in ML use `TypedImperativeAggregator`, see `Summarizer`. And do you benchmark and compare this PR and `df.rdd.treeAggregate`? Seems they're

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-16 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 @WeichenXu123 The `seqOp`/`comOp` can be arbitrary and works on domain objects, I'm not sure if built-in agg functions can satisfy all the use of it. For example, it seems hard to express

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20806 But I haven't benchmark. Maybe it do not worth to do codegen for treeAggregate. --- - To unsubscribe, e-mail:

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20806 @viirya Yes. `treeAggregate` should only apply to global aggregate. But in this PR the API have to use `seqOp`/`combOp`. What I expect is that the dataframe version treeAggregate can

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-14 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 @WeichenXu123 I feel `groupBy` is more SQL-like aggregation by which we can specify a key to grouping by. At least `rdd.treeAggregate` does not support key-specified aggregation. For typed

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20806 The API seems not dataframe style. What I expect is something like: ``` dataset.groupBy().setAggregateLevel(2).agg(Map("age" -> "max", "salary" -> "avg")) ``` ---

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88200/ Test PASSed. ---

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20806 **[Test build #88200 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88200/testReport)** for PR 20806 at commit

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20806 **[Test build #88200 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88200/testReport)** for PR 20806 at commit

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1482/

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88194/ Test FAILed. ---

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20806 **[Test build #88194 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88194/testReport)** for PR 20806 at commit

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-12 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20806 **[Test build #88194 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88194/testReport)** for PR 20806 at commit

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1478/

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-12 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 cc @dbtsai @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: