[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I am going to close this PR, since this is being taken over by @WeichenXu123 in #18798 . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-07-22 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17419 The copy problem is fixed in https://github.com/apache/spark/pull/18483 , I think we can remove this workaround in `ObjectHashAggregateExec`.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-07-20 Thread liancheng
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17419 @WeichenXu123 and I did some profiling using `jvisualvm` and found that 40% of the time is spent in the copy performed by [this `safeProjection`][1]. This is a known issue used to fight against

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-07-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17419 Since the DataFrame version is much slower than the RDD version (currently tested against vectors of size 1), I also guess there is some performance issue in

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75406/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Merged build finished. Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75406/ Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-30 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I looked a bit deeper into the performance aspect. Here are some quick insights: - there was an immediate bottleneck in `VectorUDT`, which boosts the performance already by 3x - it is

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75406/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-29 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17419 > RDD = [2482 ~ 46150 ~ 48354] records / milli the number is so varied?

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75285/ Test PASSed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Merged build finished. Test PASSed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75285/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I have added a small perf test to find the performance bottlenecks. Note that this test works on the worst case (vectors of size 1) from the perspective of overhead. Here are the numbers I

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75285 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75285/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread thunterdb
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 @sethah it would have been nice, but I do not think we should merge it this late into the release cycle.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17419 Is this being targeted for Spark 2.2?

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75268/ Test PASSed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Merged build finished. Test PASSed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75268 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75268/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread dbtsai
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17419 I would be really interested to see the performance benchmark during the QA period, so users can know when to use the DataFrame APIs or the existing RDD APIs. Thanks.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75268 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75268/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17419 I'll take a look later today. Also CCing @dbtsai I know you worked on the old summarizer, so I thought I'd ping you here. + @yanboliang and @holdenk since you commented on the design doc.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #3614 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3614/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #3614 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3614/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75190/ Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Merged build finished. Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75190/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Merged build finished. Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75189/ Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75189/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Merged build finished. Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75187/ Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75187/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75188/ Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17419 Merged build finished. Test FAILed.

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75188/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75190/testReport)** for PR 17419 at commit

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17419 **[Test build #75189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75189/testReport)** for PR 17419 at commit