[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @thesuperzapper unfortunately I haven't been able to keep up-to-date with Spark over the past year (first year of grad school has been occupying me). I don't think I can make any contributions right now or for a while. Are you thinking about taking over? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user thesuperzapper commented on the issue: https://github.com/apache/spark/pull/14547 @vlad17 sorry to bump, but what is the status of this and, by proxy, of https://issues.apache.org/jira/browse/SPARK-4240 and https://issues.apache.org/jira/browse/SPARK-16718? We have been suggesting to the community for some time that TreeBoost (Friedman, 1999), which this PR effectively implements, will be added to SparkML.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @HyukjinKwon sorry for the inactivity (I have some free time now). @jkbradley is SPARK-4240 still on the roadmap? I can resume work on this (and the subsequent GBT work) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14547 @vlad17 any update on, or opinion about, the last review comment?
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547 I'd recommend overriding setImpurity in the relevant concrete classes. In those, you can add warnings in the Scala doc and also add logWarning messages about deprecation. That's almost as good as deprecation annotations.
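The override-and-warn approach suggested above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Spark code: `TreeParamsSketch`, `DecisionTreeSketch`, `GBTSketch`, and the local `logWarning` are hypothetical stand-ins for the shared tree params trait, the concrete estimators, and Spark's logging helper.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the params trait shared by all tree estimators.
trait TreeParamsSketch {
  private var impurityValue: String = "gini"
  def setImpurity(value: String): this.type = { impurityValue = value; this }
  def getImpurity: String = impurityValue
}

// An estimator where setting impurity stays fully supported.
class DecisionTreeSketch extends TreeParamsSketch

// An estimator where setting impurity is discouraged.
class GBTSketch extends TreeParamsSketch {
  // Collected in memory only so the sketch is observable; real code would log.
  val warnings: ArrayBuffer[String] = ArrayBuffer.empty[String]
  private def logWarning(msg: String): Unit = warnings += msg

  /**
   * WARNING: setting impurity on this estimator is discouraged and may be
   * removed in a future release.
   */
  override def setImpurity(value: String): this.type = {
    logWarning(s"GBTSketch.setImpurity($value) is discouraged and may be removed in a future release.")
    super.setImpurity(value)
  }
}
```

Because the override carries no `@deprecated` annotation, the base method stays untouched for the other estimators, and callers of the GBT variant still see a runtime warning plus the Scaladoc note.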
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547
@jkbradley There seem to be more issues with deprecating impurity:

    [error] [warn] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala:114: method setImpurity overrides concrete, non-deprecated symbol(s):setImpurity
    [error] [warn] override def setImpurity(value: String): this.type = super.setImpurity(value)
    [error] [warn]
    [error] [warn] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala:111: method setImpurity overrides concrete, non-deprecated symbol(s):setImpurity
    [error] [warn] override def setImpurity(value: String): this.type = super.setImpurity(value)
    [error] [warn]

The shared superclass for GBT* (Tree*Params) can't have setImpurity deprecated, because it's shared with derived classes that should still allow impurity-setting. I find it weird that a derived class can't add a deprecation, though. Why is that rule there? Can I disable it?
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67908/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #67908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67908/consoleFull)** for PR 14547 at commit [`4e20a70`](https://github.com/apache/spark/commit/4e20a709e9278e18302835070a148f891e42a3c1). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #67908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67908/consoleFull)** for PR 14547 at commit [`4e20a70`](https://github.com/apache/spark/commit/4e20a709e9278e18302835070a148f891e42a3c1).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @jkbradley it seems I can only deprecate `setImpurity`: the value itself can't be deprecated since it's used internally, which triggers a fatal warning, and `getImpurity` has scaladoc shared with other classes where it's valid to use. In any case, `setImpurity` is the only one that needs to have the warning.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #67858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67858/consoleFull)** for PR 14547 at commit [`5f54f4d`](https://github.com/apache/spark/commit/5f54f4dbf94addf8b4df1af13a417f0fd0971633). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67858/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #67858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67858/consoleFull)** for PR 14547 at commit [`5f54f4d`](https://github.com/apache/spark/commit/5f54f4dbf94addf8b4df1af13a417f0fd0971633).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547
I don't think the impurity question is a huge deal because of what you have pointed out: it's an expert param for GBT.
* Let's put it in group ```expertParam``` in the documentation.
* Removing the Param entirely seems like a good idea for the future. I'd be OK with deprecating it, to be removed in Spark 3.0. (We probably should not remove it earlier, since I bet a fair number of people set it because they copied decision tree examples.)
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #67400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67400/consoleFull)** for PR 14547 at commit [`66d3396`](https://github.com/apache/spark/commit/66d33963fcba05b4303d34891635607f54e10364). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67400/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #67400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67400/consoleFull)** for PR 14547 at commit [`66d3396`](https://github.com/apache/spark/commit/66d33963fcba05b4303d34891635607f54e10364).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547
@sethah You raise good points. Regarding (1), I don't know if it is actually true. I don't want to speak for @jkbradley, but I was just going off of "software engineering intuition" about backwards compatibility of the algorithm's behavior. But let's consider an analogous example: if LogisticRegression were using regular batch GD and we moved it to L-BFGS, it wouldn't make much sense to offer a new option for "gd".

I think the question is whether reverting to the original behavior is common enough to merit a larger, clunkier, and more confusing API. And as the notion of "original" will be changing over time, I'm starting to see the attractiveness of @sethah's original proposition to get rid of this option entirely and let us do whatever we want under the hood impurity-wise.

**TL;DR:** At no point can I see a data scientist saying "you know what will help my l1 error? A mean predictor!"

The strongest point in favor of this that comes to me is the following: the people who would change the impurity metric are people who are tuning a GBT model, but there's no good reason to use variance-based impurity with mean predictions for a loss that isn't optimized by those choices. If model tuning that compares `.setImpurity("variance")` against `.setImpurity("loss-based")` happens to show that you do better with variance under CV, then all you've done is grid search over GBT model parameters to overfit to noise in your data.
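The "mean predictor for l1 error" point can be checked with a small worked example. The sketch below is illustrative only and assumes a made-up set of labels falling into one leaf; `LeafPredictionSketch` and its members are hypothetical names, not part of the PR. It compares the absolute-error loss of predicting the leaf mean with that of predicting the leaf median, which is the minimizer of absolute error.

```scala
object LeafPredictionSketch {
  // Made-up labels for the examples that land in a single leaf.
  val leafLabels: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0, 100.0)

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  def median(xs: Seq[Double]): Double = {
    val sorted = xs.sorted
    val n = sorted.size
    if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  // Total absolute error when every example in the leaf gets the same prediction.
  def l1Loss(xs: Seq[Double], prediction: Double): Double =
    xs.map(x => math.abs(x - prediction)).sum
}
```

For these labels the mean is 22.0 and the median is 3.0; the total absolute error is 156.0 with the mean and 101.0 with the median, so a mean leaf prediction is clearly not what an l1 loss wants.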
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14547
A few observations:
* Before this patch, users could not set an impurity. (In fact, if you call `getImpurity` on a GBT classifier it returns "gini", which is not true. That seems to be an unrelated bug.)
* After this patch, users can technically set an impurity, but there are really only two options: "loss-based" (which is ambiguous to me) and "variance". Setting "variance" for a classifier could be confusing without an understanding of GBT internals.
* Scikit GBT and R's gbm do not expose an impurity API.
* After this patch, the impurity defaults to "loss-based", when it in fact may not be loss-based at all. For the case of logistic loss in classification, we use a variance impurity but indicate to the user that we are using a loss-based impurity. This is part of why I think it's confusing/unclear. I realize we _intend_ to have a fully loss-based solution in the future, but we don't have it right now. It seems quite misleading to say the impurity is "loss-based" when it truly is not.

If we feel that we _must_ provide users the option to use the terminal node refinements or not use them (it seems that is the consensus), then exposing the impurity as a set-able param is one way. But impurity is really a binary choice right now: use terminal node refinements or don't (I'm omitting the special case of variance). We could alternatively expose an `expertParam` which could support "treeBoost" and "gradientBoost" for now, and potentially "xgboost" in the future.

You can argue that being confusing isn't much of a detriment, since the only users who touch this will probably be those who understand it well enough, which may be true. I guess I want to settle 1.) whether we feel we have to expose this as an option and 2.) what the best way to do it is, given that 1.) is true.
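One way to picture the alternative `expertParam` mentioned above is a validated string setting. The sketch below is purely hypothetical: `BoostingTypeParamSketch`, its `supported` set, and `validate` are invented for illustration, the accepted values are the two names floated in the comment, and nothing here corresponds to an existing Spark Param.

```scala
import java.util.Locale

object BoostingTypeParamSketch {
  // Hypothetical accepted values, compared case-insensitively.
  val supported: Set[String] = Set("treeboost", "gradientboost")

  // Returns the normalized value, or an error message for unsupported input.
  def validate(value: String): Either[String, String] = {
    val normalized = value.toLowerCase(Locale.ROOT)
    if (supported.contains(normalized)) Right(normalized)
    else Left(s"Unsupported boosting type '$value'; supported: ${supported.toSeq.sorted.mkString(", ")}")
  }
}
```

With this shape, adding a further mode later would only mean extending the `supported` set, while an unknown value such as "xgboost" is rejected with a message until then.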
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66907/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #66907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66907/consoleFull)** for PR 14547 at commit [`b7d66df`](https://github.com/apache/spark/commit/b7d66df8b6376d9c4ad5a13bf3a9f2e7bda9410d). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #66907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66907/consoleFull)** for PR 14547 at commit [`b7d66df`](https://github.com/apache/spark/commit/b7d66df8b6376d9c4ad5a13bf3a9f2e7bda9410d).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @jkbradley Re test scripts: `res8: Double = 0.5193104784040287` is the value output by `counts.max / counts.sum`. Indeed, it's just a sanity check that the value isn't 1, i.e., that we don't have a model that just predicts 1 or 0 for everything. Also, indeed, I chose the minimum observations per node in Spark in accordance with `gbm`'s default `n.minobsinnode = 10`.
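The `counts.max / counts.sum` check described above amounts to measuring the share of the most frequent predicted label. A minimal sketch follows, assuming the predictions are available as a plain in-memory sequence; `PredictionBalanceSketch` and `majorityFraction` are hypothetical names, and the original test script itself is not reproduced here.

```scala
object PredictionBalanceSketch {
  // Fraction of predictions taken by the most common predicted label.
  // A value of 1.0 means the model predicts a single class for everything.
  def majorityFraction(predictions: Seq[Double]): Double = {
    val counts = predictions.groupBy(identity).values.map(_.size.toDouble).toSeq
    counts.max / counts.sum
  }
}
```

For example, `majorityFraction(Seq(0.0, 1.0, 1.0, 0.0))` gives 0.5, while a degenerate model that always predicts 1.0 gives 1.0. The function expects at least one prediction.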
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14547 The merge conflict with MimaExcludes will keep this from being tested in Jenkins :)
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @sethah do you have any opinion on "loss-based" vs. "auto", or @jkbradley do you feel strongly about this? I think the trade-off is between being explicit and possibly confusing the user. I prefer being explicit. @sethah one important thing to note is that option 2 is only strictly better than 1 if we have converged to an optimal terminal node prediction. For log loss, for example, I only take a single Newton-Raphson (NR) step, per Friedman. Finally, apologies to everyone for the delay. I've had some deadlines at school and am currently traveling, but should be able to address comments when I get back.
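The "single NR step" for log loss can be illustrated with the leaf update from Friedman's two-class formulation. The sketch below assumes labels in {-1, +1} and the loss log(1 + exp(-2 y F)); under those assumptions one Newton-Raphson step from zero gives a leaf value of sum(r) / sum(|r| (2 - |r|)), where r is the pseudo-residual. `NewtonLeafSketch` and its members are hypothetical names, and this is not claimed to match the PR's code.

```scala
object NewtonLeafSketch {
  // Negative gradient of log(1 + exp(-2 y F)) with respect to F, for y in {-1, +1}.
  def pseudoResidual(y: Double, f: Double): Double =
    2.0 * y / (1.0 + math.exp(2.0 * y * f))

  // One Newton-Raphson step, starting from 0, for the examples in a single leaf.
  // The numerator sums the negative gradients; the denominator sums the second
  // derivatives, written as |r| * (2 - |r|). Expects a non-empty leaf.
  def leafValue(labels: Seq[Double], currentScores: Seq[Double]): Double = {
    val residuals = labels.zip(currentScores).map { case (y, f) => pseudoResidual(y, f) }
    val numerator = residuals.sum
    val denominator = residuals.map(r => math.abs(r) * (2.0 - math.abs(r))).sum
    numerator / denominator
  }
}
```

With all current scores at 0.0 and labels (1, 1, 1, -1), each pseudo-residual equals its label and the step yields 0.5. Since this is one step rather than an iteration to convergence, the resulting leaf value is an approximation to the loss-optimal one, which is the caveat raised in the comment.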
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14547 Just a heads up: there is a merge conflict with the excludes that you might want to resolve so that Jenkins can run its tests on this PR :)
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547
@sethah AFAIK, the original gradient boosting algorithm was generic, not specific to trees. That's Algorithm 1 from [https://statweb.stanford.edu/~jhf/ftp/trebst.pdf] and is what MLlib has currently.

I agree with your intuition about options 3 > 2 > 1 and encouraging users to use option 3 via our API. I'd be OK with disallowing option 1. As a software engineer, I'd want to allow 1 for backwards API compatibility, where behavior and algorithms are part of the API. But as an ML person, I'd be OK with not even allowing 1 in the future to prevent users from doing the wrong thing. Combining these, I'd recommend:
* For now, we make 2 the default behavior but still allow 1 (as in this PR).
* In the future, we make 3 the default behavior, maybe allow 2, and do not allow 1.

> "loss-based" What exactly does that mean to the user?

If this is unclear, then let's make the documentation for that Param clearer and/or use a more intuitive name such as "auto."
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14547
So, taking a look at the current patch, the API for this "loss-based" impurity feels clunky and a bit confusing. To enumerate, we have the following scenarios:

**1. Completely decoupled loss and impurity** - Spark 2.0 and before
**2. Terminal node updates** - trees trained with _some_ impurity, but predictions are optimal (this patch), aka "TreeBoost"
**3. Train trees with loss directly** - this incorporates both terminal node refinements and minimizing the loss directly during tree training. This could be done in the future and is how XGBoost works.

I don't exactly see why we need to maintain 1 at all. Terminal node refinements should be a strict improvement over 1. By attempting to maintain both options, we expose an "impurity" parameter for GBTs, which IMO does not need to be public. To me, it is also confusing, e.g.

    scala> gbtc.getImpurity
    res1: String = loss-based

What exactly does that mean to the user? I think that, ideally in the future, we would support 2 and 3 and users could select between the two. I can envision there being trade-offs between 2 and 3, but in theory 2 should be a strict superset of 1.

AFAIK, scikit does not expose the option to perform terminal-node updates to the user, and I am not even sure there is a paper documenting gradient tree boosting without explicitly performing the terminal node updates. For instance, we have previously referred to this paper https://statweb.stanford.edu/~jhf/ftp/stobst.pdf (please correct me if I'm wrong here) in the code as justification for describing the current approach as "Stochastic Gradient Boosting." But in the paper both sections 1 and 2 use terminal node refinements.

Silently changing the algorithm could be tricky, but if we can be certain that this change is a strict improvement then I'm not sure it's a problem. I'm curious to hear others' thoughts on this.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547
@sethah I agree with you that the original TreeBoost does not use the loss to choose the structure of the tree; it only uses the loss to recompute example labels and to choose predicted values at leaf nodes. But as Vlad said, XGBoost uses the loss to choose tree structure, which intuitively should help the GBM to fit the data faster. Vlad's design allows testing vs. TreeBoost by setting impurity and loss separately, as well as testing vs. XGBoost by setting impurity to be "loss-based."
* One question is whether we should change the default impurity to be "loss-based," which will change behavior to be closer to XGBoost.
@vlad17 Test gists: I had a few questions about the gists you referenced in the PR description for comparing MLlib with R's gbm.
* ```setMinInstancesPerNode(10)```: For MLlib, you set minInstancesPerNode=10. Is this the same value used by gbm by default? I'm trying to match up how the tests were run.
* At one point, you have the MLlib script output the value ```counts.max / counts.sum```. I wasn't sure what the value was for. My guess was that it was a sanity check to verify that the train/test splits are identical across tests, but I don't see it output by the gbm script.
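The phrase "uses the loss to recompute example labels" refers to fitting each new tree to the negative gradient of the loss at the current prediction. A small sketch of that step in plain Scala follows. The three losses are written in their textbook forms, which is an assumption made for illustration; Spark's own loss classes may use a different label encoding or scaling. The object and method names are hypothetical.

```scala
object PseudoResidualSketch {
  // Squared error L = (y - f)^2 / 2: the negative gradient is the plain residual.
  def squared(y: Double, f: Double): Double = y - f

  // Absolute error L = |y - f|: the negative gradient is the sign of the residual.
  def absolute(y: Double, f: Double): Double = math.signum(y - f)

  // Bernoulli log-loss with y in {0, 1} and f the predicted log-odds:
  // L = log(1 + exp(f)) - y * f, so -dL/df = y - sigmoid(f).
  def bernoulli(y: Double, f: Double): Double =
    y - 1.0 / (1.0 + math.exp(-f))
}
```

Each boosting iteration would replace every label `y` with one of these values, computed at the current ensemble prediction `f`, before training the next tree.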
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65320/ Test PASSed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test PASSed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #65320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65320/consoleFull)** for PR 14547 at commit [`517b082`](https://github.com/apache/spark/commit/517b082590855b080c69e16f41ecd572c89618f6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #65320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65320/consoleFull)** for PR 14547 at commit [`517b082`](https://github.com/apache/spark/commit/517b082590855b080c69e16f41ecd572c89618f6).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65297/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #65297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65297/consoleFull)** for PR 14547 at commit [`3d052dc`](https://github.com/apache/spark/commit/3d052dc64e8a080e706d65b7d6c04f534978771e). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #65297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65297/consoleFull)** for PR 14547 at commit [`3d052dc`](https://github.com/apache/spark/commit/3d052dc64e8a080e706d65b7d6c04f534978771e).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @jkbradley I addressed your comments (will be pushing new version after tests run), but I didn't understand what you were referring to in the "test gists" comment. Would you mind clarifying?
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14547 TBH, I'm not certain after having read many of those papers exactly what constitutes "TreeBoost". From the following excerpt, it seems to me like TreeBoost is simply defined by making terminal node updates to minimize boosting loss, and *not* by minimizing the loss when splitting the tree nodes: "The terminal node updates are based on medians. An alternative approach would be to build a tree directly to minimize the loss criterion." That being said, I'm not certain about it and I don't think there's a much better way to implement this than coupling the loss and impurity, since we need to collect certain sufficient statistics to make terminal node updates anyway. Thanks for your notes and clarification!
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547
@sethah Was that coupling not already there beforehand? I didn't really change any of the implementation classes' interfaces; I just added the Bernoulli impurity to the existing Impurity framework, which itself couples the Impurity class with the ImpurityAggregator, which necessarily returns an ImpurityCalculator, which makes the prediction. It seems like the existing design is already doing the coupling. As for the interface making this coupling explicit, yes, I completely agree I'm doing that. But I think this is a good thing.
1. The coupled loss function / splitting impurity is the whole point of TreeBoost. The papers themselves say to construct intermediate trees to minimize loss. They only offer using other impurity measures for ease of implementation. XGBoost, for instance, splits on (an approximation of) the losses directly.
2. The fact that the underlying impurity and predictions are all computed by the same class (though not my choice) is also probably better from an implementation perspective. Both need to gather summary statistics about each leaf's partition of the data, so it's easiest to just do it in one place.
3. I don't think we're giving up the "decoupled version" either. If we choose to in the future, setting impurity to "variance" but loss function to "absolute" can use a new ImpurityAggregator that offers the variance for splitting but the median for predicting.
My goal with this PR was to make as minimal a change as possible (it's mostly an API change introducing the loss-based impurity, which also makes loss-based terminal node predictions). I'm not trying to change the GBT design here at all (though if it appears to be the case because of something I'm missing, please let me know).
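The point about gathering summary statistics in one place can be illustrated with a small, self-contained sketch. It is not Spark's `ImpurityAggregator` or `ImpurityCalculator` API: the `LeafStats` class and its fields are hypothetical, and the Bernoulli leaf value is assumed to be the usual single Newton step, sum(y - p) / sum(p(1 - p)), for labels in {0, 1}. One mergeable record of running sums serves both the variance impurity used for splitting and the loss-based leaf prediction.

```scala
object LeafStatsSketch {
  // Hypothetical per-node sufficient statistics. All fields are plain sums,
  // so partial results from different partitions can be merged.
  final case class LeafStats(
      count: Long,
      sumResidual: Double,   // sum of pseudo-residuals (y - p)
      sumSqResidual: Double, // sum of squared pseudo-residuals
      sumHessian: Double) {  // sum of p * (1 - p)

    def add(residual: Double, hessian: Double): LeafStats =
      LeafStats(count + 1, sumResidual + residual,
        sumSqResidual + residual * residual, sumHessian + hessian)

    def merge(other: LeafStats): LeafStats =
      LeafStats(count + other.count, sumResidual + other.sumResidual,
        sumSqResidual + other.sumSqResidual, sumHessian + other.sumHessian)

    // Variance of the pseudo-residuals: usable as the splitting impurity.
    def variance: Double =
      if (count == 0) 0.0
      else {
        val m = sumResidual / count
        sumSqResidual / count - m * m
      }

    // Decoupled leaf value: mean of the pseudo-residuals.
    def meanPrediction: Double =
      if (count == 0) 0.0 else sumResidual / count

    // Loss-based leaf value: one Newton step for Bernoulli log-loss.
    def newtonPrediction: Double =
      if (sumHessian == 0.0) 0.0 else sumResidual / sumHessian
  }

  val empty: LeafStats = LeafStats(0L, 0.0, 0.0, 0.0)
}
```

Note that a median-based prediction for absolute loss, as mentioned in point 3, cannot be computed from fixed-size sums like these, so an aggregator for that case would need a different (for example, sorted or sketched) representation.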
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14547 One question I had: this PR creates an inherent coupling between the impurity used to train the tree and the loss used for boosting. This is not how I understood TreeBoost. My impression was that, regardless of how the tree was trained (i.e. what impurity was used), TreeBoost would simply modify the leaf node predictions to minimize the *boosting loss*. In fact, there is no real coupling done in this PR, but the framework is there. In scikit, there is no implied coupling. They simply train the tree, and modify the leaf node predictions after training. It may be hard to do this in a performant way here, so I'm not sure what is best. Just wanted to get some clarification on the design.
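The "train the tree, then modify the leaf predictions afterwards" approach can be sketched as a separate pass over the data. This is an illustration in plain Scala under stated assumptions, not scikit's or Spark's code: `leafOf` stands in for routing an example through an already-trained tree, and `optimalConstant` stands in for whatever constant minimizes the boosting loss within a leaf. Both names are hypothetical.

```scala
object LeafRefitSketch {
  // Recompute one prediction per leaf after the tree structure is fixed.
  //   examples, residuals: parallel sequences, one residual per example
  //   leafOf:              maps an example to the id of the leaf it lands in
  //   optimalConstant:     loss-specific minimizer over a leaf's residuals
  def refitLeaves[A](examples: Seq[A], residuals: Seq[Double])(
      leafOf: A => Int)(
      optimalConstant: Seq[Double] => Double): Map[Int, Double] = {
    require(examples.size == residuals.size,
      "need exactly one residual per example")
    examples.zip(residuals)
      .groupBy { case (x, _) => leafOf(x) }
      .map { case (leaf, pairs) => leaf -> optimalConstant(pairs.map(_._2)) }
  }
}
```

The extra pass groups every example by leaf, which hints at why doing this efficiently on distributed data is a concern, and why collecting the needed statistics during tree training is attractive.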
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547 @vlad17 Thanks for the PR! I'm not done with a review pass, but I'll go ahead and send comments from a partial pass.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547
Test gists
* ```setMinInstancesPerNode(10)```: Is this the same value used by gbm by default?
* Is ```counts.max / counts.sum``` meant to verify that the train/test splits are identical? I don't see it computed for gbm.
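The thread never says what `counts` holds, so the following is only one possible reading, offered as a sketch: if `counts` were the number of examples per class label, then `counts.max / counts.sum` would be the majority-class fraction, i.e. the accuracy of always predicting the most common label. The function name and its input are hypothetical and are not taken from the referenced gists.

```scala
object MajorityBaselineSketch {
  // Fraction of examples that carry the most frequent label.
  // Assumption: `counts` is the per-label example count of one data split.
  def majorityFraction(labels: Seq[Int]): Double = {
    require(labels.nonEmpty, "need at least one label")
    val counts = labels.groupBy(identity).values.map(_.size.toDouble).toSeq
    counts.max / counts.sum
  }
}
```

Under that reading the value is a baseline to compare model accuracy against. It would only be a weak check that two splits match, since different splits can share the same label proportions.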
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #3224 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3224/consoleFull)** for PR 14547 at commit [`a040da5`](https://github.com/apache/spark/commit/a040da5ea64778d766720ecd6a8859893d7204f0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #3224 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3224/consoleFull)** for PR 14547 at commit [`a040da5`](https://github.com/apache/spark/commit/a040da5ea64778d766720ecd6a8859893d7204f0).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63875/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63875/consoleFull)** for PR 14547 at commit [`a040da5`](https://github.com/apache/spark/commit/a040da5ea64778d766720ecd6a8859893d7204f0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63875/consoleFull)** for PR 14547 at commit [`a040da5`](https://github.com/apache/spark/commit/a040da5ea64778d766720ecd6a8859893d7204f0).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test PASSed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63500/ Test PASSed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63500 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63500/consoleFull)** for PR 14547 at commit [`233c6cc`](https://github.com/apache/spark/commit/233c6cc267efc98de4b694a82711cf568e124b93). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63500/consoleFull)** for PR 14547 at commit [`233c6cc`](https://github.com/apache/spark/commit/233c6cc267efc98de4b694a82711cf568e124b93).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #3210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3210/consoleFull)** for PR 14547 at commit [`fe256f7`](https://github.com/apache/spark/commit/fe256f736a3a11625b6c3983a1a27ef9c5543280). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547 CC: @hhbyyh Would you mind taking a look at this since you're familiar with GBTs? Thanks in advance! This should be one of the most important improvements in terms of accuracy, especially once we get soft predictions (for AUC measurements) from GBTs.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #3210 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3210/consoleFull)** for PR 14547 at commit [`fe256f7`](https://github.com/apache/spark/commit/fe256f736a3a11625b6c3983a1a27ef9c5543280).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63447/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63447/consoleFull)** for PR 14547 at commit [`fe256f7`](https://github.com/apache/spark/commit/fe256f736a3a11625b6c3983a1a27ef9c5543280). * This patch **fails from timeout after a configured wait of `250m`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63447/consoleFull)** for PR 14547 at commit [`fe256f7`](https://github.com/apache/spark/commit/fe256f736a3a11625b6c3983a1a27ef9c5543280).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63428/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63428/consoleFull)** for PR 14547 at commit [`b4e5e6c`](https://github.com/apache/spark/commit/b4e5e6cc6a48ba5160c9aa8a0e03800f193b561e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63428/consoleFull)** for PR 14547 at commit [`b4e5e6c`](https://github.com/apache/spark/commit/b4e5e6cc6a48ba5160c9aa8a0e03800f193b561e).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63407/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63407/consoleFull)** for PR 14547 at commit [`cfaee0f`](https://github.com/apache/spark/commit/cfaee0fa9d00e4d763eb9a9af32e75a4ea800b50). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63407/consoleFull)** for PR 14547 at commit [`cfaee0f`](https://github.com/apache/spark/commit/cfaee0fa9d00e4d763eb9a9af32e75a4ea800b50).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @sethah Thanks for the FYI. I'm pretty confident that it'll help, since we're now directly optimizing the loss function. However, it would be nice to prove this. Unfortunately, the example I linked above uses a skewed dataset. The only estimator whose behavior changed is GBTClassifier (the Bernoulli predictions now use a Newton-Raphson (NR) step rather than guessing the mean). Since the raw prediction column is unavailable for GBTClassifier, I can't sensibly compare the classifiers on skewed datasets, because AUC is out of the question. I'll have to spend some time either finding a "real" dataset that's not skewed but large enough to be meaningful, or making an artificial one. spark-perf will also need to be re-run. Regarding the binary incompatibility failure: part of that was my fault, and part of it was due to an incompatibility with a package-private method. I added an exception for the binary incompatibility for the package-private method - is that OK?
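For readers unfamiliar with the change described in the comment above, the following is a minimal Python sketch of the difference between a mean-based leaf value and a single Newton-Raphson step for a Bernoulli (logistic) loss. It is not the code from this PR: the function names, the 0/1 label encoding, and the unscaled logistic loss are all assumptions made for illustration, and Spark's own implementation may use a different label encoding and loss scaling.

```python
import math


def mean_leaf_value(residuals):
    """Leaf value as the plain mean of the pseudo-residuals in the leaf."""
    return sum(residuals) / len(residuals)


def newton_leaf_value(labels, margins):
    """Leaf value from one Newton-Raphson step on the logistic loss.

    labels:  0/1 targets of the rows that fall into this leaf (assumed encoding)
    margins: current raw model scores F(x) for the same rows
    """
    # Predicted probability of the positive class for each row.
    probs = [1.0 / (1.0 + math.exp(-f)) for f in margins]
    # Sum of negative gradients of the loss with respect to the margin: y - p.
    grad = sum(y - p for y, p in zip(labels, probs))
    # Sum of second derivatives of the loss with respect to the margin: p(1 - p).
    hess = sum(p * (1.0 - p) for p in probs)
    # Guard against a degenerate leaf where every probability saturates.
    if hess == 0.0:
        return 0.0
    return grad / hess


# Hypothetical leaf with three rows and an initial margin of 0 for each.
labels = [1, 1, 0]
margins = [0.0, 0.0, 0.0]
residuals = [y - 0.5 for y in labels]  # y - p with p = 0.5 at margin 0
mean_value = mean_leaf_value(residuals)
newton_value = newton_leaf_value(labels, margins)
```

In this toy leaf both values have the same sign, but the Newton step divides by the summed curvature rather than the row count, so it moves the margin further when the predicted probabilities are close to 0 or 1.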
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14547 @vlad17 As an FYI, I do not get alerted when you comment on the squashed PR. I was using the Databricks spark-perf package for performance testing. I'd be interested to see whether the TreeBoost algorithm provides "better" results than the non-TreeBoost version, if that's possible. I think we need some provable improvement to show before we proceed with merging this patch. (It sounds like you are working on that currently.) Thanks for the PR! I'll try to have a look sometime, but it may not be immediately.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63388/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63388/consoleFull)** for PR 14547 at commit [`06fc4a9`](https://github.com/apache/spark/commit/06fc4a917c92403c50eb7906f8d5bbef8662f427). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @hhbyyh Would you mind reviewing this?
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #63388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63388/consoleFull)** for PR 14547 at commit [`06fc4a9`](https://github.com/apache/spark/commit/06fc4a917c92403c50eb7906f8d5bbef8662f427).
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547 ok to test
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Can one of the admins verify this patch?