[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-21 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20806
  
Yeah! Let me close this for now. We can discuss this further in the Jira 
ticket then.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-19 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20806
  
great! maybe we can hold this PR for a real SQL tree aggregate in the 
future, with some proper design and discussion.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20806
  
@cloud-fan @WeichenXu123 Ok. I've setup a Spark cluster with 5 nodes for 
the benchmark. 

The used data:
```
val r = new Random
val ds = (0 to 1).map { _ =>
  val a = Array.fill(1)(if (r.nextDouble() > 0.5) 1.0 else 0.0 )
  Tuple1(Vectors.dense(a))
}.toDS
```

Two versions of `treeAggregate` perform very close. Thus, directly using 
`RDD.treeAggregate` can be much simpler.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20806
  
@WeichenXu123 As the discussion with @cloud-fan at 
https://github.com/apache/spark/pull/20806#discussion_r174277864, I'd like to 
see some performance gain it, but I need to run benchmark to see if there is 
any.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-16 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/20806
  
@viirya ok. but there're already a class in ML use 
`TypedImperativeAggregator`, see `Summarizer`.

And do you benchmark and compare this PR and `df.rdd.treeAggregate`?
Seems they're almost the same. Is there some difference which can make 
remarkable performance improvement ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-16 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20806
  
@WeichenXu123 The `seqOp`/`comOp` can be arbitrary and works on domain 
objects, I'm not sure if built-in agg functions can satisfy all the use of it. 
For example, it seems hard to express `IDF.DocumentFrequencyAggregator` in 
built-in agg functions if any. One possible way is to use `Aggregator`  and 
developers can write their aggregation function when doing treeAggregate.

One advantage of `seqOp`/`comOp` is that ML developers don't need to learn 
how to write `Aggregator`. It may let them exposed to some concepts like 
`Encoder`.  I have concerned that ML developer should know this or not.

Anyway, to work with built-in agg functions or `Aggregator`, because it 
uses SQL aggregation system, we may need to overhaul the current aggregation 
system to support tree-style aggregation. Although it can benefit more 
situations not just ML, it needs more thinking and design.

You can think of this as a workaround for now. Thus it is only for private 
use.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/20806
  
But I haven't benchmark. Maybe it do not worth to do codegen for 
treeAggregate.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/20806
  
@viirya Yes. `treeAggregate` should only apply to global aggregate.
But in this PR the API have to use `seqOp`/`combOp`.
What I expect is that the dataframe version treeAggregate can exploit 
built-in agg function (suppose in the future we have built-in agg function for 
vector type)

`dataset.groupBy()` if do not given any key column then it will group the 
whole dataset so it can match the case of treeAggregate, or do you have some 
better design ?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-14 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20806
  
@WeichenXu123 I feel `groupBy` is more SQL-like aggregation by which we can 
specify a key to grouping by. At least `rdd.treeAggregate` does not support 
key-specified aggregation.

For typed grouping `groupByKey`, it constructs `KeyValueGroupedDataset` by 
which we rely on SQL `Aggregate` execution to grouping data. Currently it 
doesn't support tree-based aggregation.

This work doesn't intend to overhaul SQL aggregation to support tree-based 
aggregation. So the API will looks more like as is.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/20806
  
The API seems not dataframe style. What I expect is something like:
```
dataset.groupBy().setAggregateLevel(2).agg(Map("age" -> "max", "salary" -> 
"avg"))
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20806
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20806
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88200/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20806
  
**[Test build #88200 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88200/testReport)**
 for PR 20806 at commit 
[`a254d15`](https://github.com/apache/spark/commit/a254d1501c0119b4881c0443f28c263f0c9dec0e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20806
  
**[Test build #88200 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88200/testReport)**
 for PR 20806 at commit 
[`a254d15`](https://github.com/apache/spark/commit/a254d1501c0119b4881c0443f28c263f0c9dec0e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20806
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20806
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1482/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20806
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20806
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20806
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88194/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20806
  
**[Test build #88194 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88194/testReport)**
 for PR 20806 at commit 
[`a254d15`](https://github.com/apache/spark/commit/a254d1501c0119b4881c0443f28c263f0c9dec0e).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20806
  
**[Test build #88194 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88194/testReport)**
 for PR 20806 at commit 
[`a254d15`](https://github.com/apache/spark/commit/a254d1501c0119b4881c0443f28c263f0c9dec0e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20806
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1478/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20806
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...

2018-03-12 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20806
  
cc @dbtsai @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org