[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-26 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71585503 Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have t

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-26 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4073

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-26 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71584396 @mengxr This can also be viewed as a bugfix that prevents overwriting of the param `subSamplingRate`, which was hardcoded to 1.0.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-26 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71510669 @MechCoder Thanks! LGTM CC: @mengxr Note this is sort of an API change: RandomForest can now be run with subsampled rows. (But this seems fine to me since us

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71352330 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71352326 [Test build #26061 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26061/consoleFull) for PR 4073 at commit [`8012fb2`](https://gith

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71349963 [Test build #26061 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26061/consoleFull) for PR 4073 at commit [`8012fb2`](https://githu

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-24 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71349920 @jkbradley Fixed. I can haz merge?

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-24 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71338606 @MechCoder This is an addition instead of a correction, but I just realized that Strategy.assertValid() does not check subsamplingRate. Would you mind adding that ch
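A minimal sketch of the kind of check being requested, for readers following the thread. `SamplingParams` is a hypothetical stand-in and not MLlib's `Strategy` class, and the accepted range (0, 1] is an assumption, since the truncated message does not state it.

```scala
// Hypothetical stand-in for a parameter holder; NOT MLlib's Strategy class.
final case class SamplingParams(subsamplingRate: Double) {
  // Assumed valid range: a fraction of the rows, strictly positive, at most 1.0.
  def assertValid(): Unit = {
    require(
      subsamplingRate > 0.0 && subsamplingRate <= 1.0,
      s"subsamplingRate must be in (0, 1], but was $subsamplingRate")
  }
}

object SamplingParamsDemo {
  def main(args: Array[String]): Unit = {
    // A rate inside the assumed range passes silently.
    SamplingParams(0.5).assertValid()
    // A rate outside the range makes require throw IllegalArgumentException.
    val rejected = scala.util.Try(SamplingParams(1.5).assertValid()).isFailure
    println(s"rate 1.5 rejected: $rejected")
  }
}
```

Failing fast in a validity check like this would surface a bad rate before any training work starts, rather than letting it reach the sampling code.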

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-24 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-71335512 ping @jkbradley Could you please have a final look?

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70998086 [Test build #25961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25961/consoleFull) for PR 4073 at commit [`e0e0d9c`](https://gith

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70998096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70990119 [Test build #25961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25961/consoleFull) for PR 4073 at commit [`e0e0d9c`](https://githu

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70989958 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70989944 [Test build #25955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25955/consoleFull) for PR 4073 at commit [`d5d68e7`](https://gith

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70983710 [Test build #25955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25955/consoleFull) for PR 4073 at commit [`d5d68e7`](https://githu

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70983304 Repushed after fixing the style checks.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70982966 [Test build #25953 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25953/consoleFull) for PR 4073 at commit [`8a0acb5`](https://gith

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70982967 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70982888 [Test build #25953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25953/consoleFull) for PR 4073 at commit [`8a0acb5`](https://githu

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70982719 @jkbradley Thanks for the tip. Fixed. Anything more?

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4073#discussion_r23330643 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala --- @@ -196,6 +196,24 @@ class RandomForestSuite extends FunSuite with M

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4073#discussion_r23329625 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala --- @@ -132,6 +132,7 @@ private class RandomForest ( timer.start("ini

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4073#discussion_r23327520 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala --- @@ -196,6 +196,24 @@ class RandomForestSuite extends FunSuite with M

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-20 Thread MechCoder
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/4073#discussion_r23250726 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala --- @@ -196,6 +196,24 @@ class RandomForestSuite extends FunSuite with M

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-19 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70538422 @mengxr @jkbradley Any more comments? Sorry for spamming, but I would like to work on other issues related to GBRT and RandomForests as well.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70378420 [Test build #25705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25705/consoleFull) for PR 4073 at commit [`d1df1b2`](https://gith

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70378423 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70375392 [Test build #25705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25705/consoleFull) for PR 4073 at commit [`d1df1b2`](https://githu

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-17 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70375368 @jkbradley I've added a test modeled on the other tests in the `RandomForestSuite`. Let me know if there is anything left.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70372791 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70372786 [Test build #25704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25704/consoleFull) for PR 4073 at commit [`a7bfc70`](https://gith

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70370361 [Test build #25704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25704/consoleFull) for PR 4073 at commit [`a7bfc70`](https://githu

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-17 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70369703 Could you please tell me what the preferred way to generate random data in Spark is?

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70307249 I'd vote for not adding it to train since that part of the API is so unwieldy.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70306545 Thanks. Also, a design question: is it worth adding this as an option to `train`, given that it is now within the "style limit"?

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70305493 Also, as far as testing goes, it's hard. One way might be to: * Run RF with a random seed and subsampling rate 1.0 * Run it the same way, but with rate < 1.0
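The two visible steps can be illustrated on a toy scale. The sketch below does not train a RandomForest; `subsample` is a hypothetical helper that keeps each row with probability `rate` under a fixed seed. The message above is truncated, so what the two runs should be compared on is not shown; the size and reproducibility checks here are assumptions chosen only for illustration.

```scala
import scala.util.Random

object SubsampleSketch {
  // Hypothetical helper: keep each row independently with probability `rate`,
  // using a generator seeded with `seed` so that repeated runs agree.
  def subsample[A](rows: Seq[A], rate: Double, seed: Long): Seq[A] = {
    require(rate > 0.0 && rate <= 1.0, s"rate must be in (0, 1], but was $rate")
    val rng = new Random(seed)
    // nextDouble() lies in [0, 1), so rate = 1.0 keeps every row.
    rows.filter(_ => rng.nextDouble() < rate)
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 1000).toVector
    val full = subsample(rows, 1.0, seed = 42L)  // step 1: rate 1.0
    val half = subsample(rows, 0.5, seed = 42L)  // step 2: same seed, rate < 1.0
    println(s"rate 1.0 kept ${full.size} rows, rate 0.5 kept ${half.size} rows")
  }
}
```

Fixing the seed is what makes the two runs comparable: any difference between them then comes from the rate rather than from the random draws.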

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70305015 [Test build #25672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25672/consoleFull) for PR 4073 at commit [`6685b44`](https://gith

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70305019 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70304959 Good point, yes, I think it's worth fixing.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70304874 [Test build #25672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25672/consoleFull) for PR 4073 at commit [`6685b44`](https://githu

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
GitHub user MechCoder reopened a pull request: https://github.com/apache/spark/pull/4073 [SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 I've added support for sampling_rate not equal to 1.0. I have two major questions. 1. A Scala style test is failing, since the

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70304290 Oh well, but still, if I'm not mistaken, the `subSamplingRate` is overridden by the condition `numTrees > 1`. This should not be the case as having a lower sampling, mig

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
Github user MechCoder closed the pull request at: https://github.com/apache/spark/pull/4073

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70303258 @MechCoder Taking a closer look, I now realize that part of this functionality is already there...see the JIRA & let me know what you think.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70302451 @jkbradley Oops, the comments got deleted somehow. I meant that this is because there are 10 arguments in `trainClassifier` and `trainRegressor`

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70300939 @jkbradley, the issue is that the function `train` has more than 10 args.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70300152 You can run dev/scalastyle locally to see what the issues are.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70297536 I've made changes such that this does not break anything, i.e. everything is backward compatible.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70283762 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70283759 [Test build #25666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25666/consoleFull) for PR 4073 at commit [`6685b44`](https://gith

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70283608 @jkbradley @mengxr it would be great if you could have a look.

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70283599 [Test build #25666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25666/consoleFull) for PR 4073 at commit [`6685b44`](https://githu

[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-16 Thread MechCoder
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/4073 [SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 I've added support for sampling_rate not equal to 1.0. I have two major questions. 1. A Scala style test is failing, since the n