Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@thesuperzapper unfortunately I haven't been able to keep up-to-date with
Spark over the past year (first year of grad school has been occupying me). I
don't think I can make any contributions right
Github user thesuperzapper commented on the issue:
https://github.com/apache/spark/pull/14547
@vlad17 sorry to bump, but what is the status of this and, by proxy, of the following issues?
https://issues.apache.org/jira/browse/SPARK-4240
and
https://issues.apache.org/jira/browse/SPARK-16718
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@HyukjinKwon sorry for the inactivity (I have some free time now).
@jkbradley is SPARK-4240 still on the roadmap? I can resume work on this (and
the subsequent GBT work)
---
If your project is set
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14547
@vlad17 any update and opinion for the last review comment?
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
I'd recommend overriding setImpurity in the relevant concrete classes. In
those, you can add warnings in the Scala doc and also add logWarning messages
about deprecation. That's almost as good
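A minimal sketch of the override pattern suggested above, written without any Spark dependency. Every name here (`SketchLogging`, `SketchTreeParams`, `SketchGBTClassifier`) is hypothetical and stands in for the real traits and classes; the sketch also assumes the deprecated setter still applies the value, which the thread does not confirm.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a logging mixin; it only records messages.
trait SketchLogging {
  val warnings: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def logWarning(msg: String): Unit = warnings += msg
}

// Hypothetical shared params trait: the setter here stays undeprecated
// so that other estimators mixing it in are unaffected.
trait SketchTreeParams {
  private var impurity: String = "gini"
  def getImpurity: String = impurity
  def setImpurity(value: String): this.type = { impurity = value; this }
}

// Concrete class: the override exists only to attach the doc warning,
// the @deprecated annotation, and a runtime warning.
class SketchGBTClassifier extends SketchTreeParams with SketchLogging {
  /** Deprecated in this sketch; callers get a compiler and a runtime warning. */
  @deprecated("setImpurity should not be used on this estimator", "sketch")
  override def setImpurity(value: String): this.type = {
    logWarning("SketchGBTClassifier.setImpurity is deprecated")
    super.setImpurity(value)
    this
  }
}
```

Overriding in the concrete class keeps the deprecation local: the shared getter, its Scala doc, and the underlying value are left alone.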
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@jkbradley There seem to be more issues with deprecating impurity:
[error] [warn]
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67908/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #67908 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67908/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #67908 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67908/consoleFull)**
for PR 14547 at commit
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@jkbradley it seems I can only deprecate `setImpurity`: the value can't be
deprecated since it's used internally, which triggers a fatal warning, and
getImpurity has scaladoc shared between other
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #67858 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67858/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67858/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #67858 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67858/consoleFull)**
for PR 14547 at commit
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
I don't think the impurity question is a huge deal because of what you have
pointed out: it's an expert param for GBT.
* Let's put it in group ```expertParam``` in the documentation.
*
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #67400 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67400/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67400/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #67400 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67400/consoleFull)**
for PR 14547 at commit
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah You raise good points.
Regarding (1), I don't know if it is actually true. I don't want to speak
for @jkbradley, but I was just going off of "software engineering intuition"
about
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/14547
A few observations:
* Before this patch, users could not set an impurity (in fact, if you call
`getImpurity` on a gbt classifier it returns "gini", which is not true. Seems
an unrelated
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66907/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #66907 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66907/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #66907 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66907/consoleFull)**
for PR 14547 at commit
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@jkbradley Re test scripts:
`res8: Double = 0.5193104784040287` is the value output by `counts.max /
counts.sum`. Indeed, it's just a sanity check that the value isn't 1 - i.e., we
don't
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14547
A merge conflict with MimaExcludes will keep this from being tested in
Jenkins :)
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah do you have any opinion on "loss-based" vs. "auto", or @jkbradley do
you feel strongly about this? I think the trade-off is between being explicit
vs. possibly confusing the user. I prefer
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14547
Just a heads up: there is a merge conflict with the excludes that you might
want to resolve so that Jenkins can run its tests on this PR :)
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah AFAIK, the original gradient boosting algorithm was generic, not
specific to trees. That's Algorithm 1 from
[https://statweb.stanford.edu/~jhf/ftp/trebst.pdf] and is what MLlib has
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/14547
So, taking a look at the current patch, the API for this "loss-based"
impurity feels clunky and a bit confusing. To enumerate, we have the following
scenarios:
**1. Completely decoupled
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah I agree with you that the original TreeBoost does not use the loss
to choose the structure of the tree; it only uses the loss to recompute example
labels and to choose predicted values at
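To make the two roles of the loss described above concrete, here is a small sketch using absolute-error loss, chosen only because its formulas are short. It is not MLlib's implementation and not the Bernoulli case this PR targets; `TreeBoostSketch` and its method names are hypothetical. The tree would be fit to the recomputed labels (the negative gradient), while each terminal node's prediction comes from a separate one-dimensional minimisation of the loss.

```scala
object TreeBoostSketch {
  // Recomputed labels for L(y, F) = |y - F|:
  // the negative gradient with respect to F is sign(y - F).
  def pseudoResiduals(labels: Seq[Double], current: Seq[Double]): Seq[Double] =
    labels.zip(current).map { case (y, f) => math.signum(y - f) }

  // Terminal-node value: the constant gamma minimising the sum of
  // |y - (F + gamma)| over the examples in one leaf, i.e. the median
  // of the raw residuals y - F.
  def leafValue(labels: Seq[Double], current: Seq[Double]): Double = {
    val residuals = labels.zip(current).map { case (y, f) => y - f }.sorted
    val n = residuals.length
    require(n > 0, "a leaf must contain at least one example")
    if (n % 2 == 1) residuals(n / 2)
    else (residuals(n / 2 - 1) + residuals(n / 2)) / 2.0
  }
}
```

In this sketch the split structure never sees the loss directly: it only sees the recomputed labels, which is the decoupling discussed in the comment above.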
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65320/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #65320 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65320/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #65320 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65320/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65297/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #65297 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65297/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #65297 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65297/consoleFull)**
for PR 14547 at commit
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@jkbradley I addressed your comments (will be pushing new version after
tests run), but I didn't understand what you were referring to in the "test
gists" comment. Would you mind clarifying?
---
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/14547
TBH, I'm not certain after having read many of those papers exactly what
constitutes "TreeBoost". From the following excerpt, it seems to me like
TreeBoost is simply defined by making terminal node
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah Was that coupling not already there beforehand? I didn't really
change any of the implementation class' interfaces, I just added the Bernoulli
impurity to the existing Impurity framework,
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/14547
One question I had - this PR creates an inherent coupling between the
impurity used to train the tree and the loss used for boosting. This is not how
I understood tree boost. My impression was
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
@vlad17 Thanks for the PR! I'm not done with a review pass, but I'll go
ahead and send comments from a partial pass.
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
Test gists
* ```setMinInstancesPerNode(10)```: Is this the same value used by gbm by
default?
* Is ```counts.max / counts.sum``` meant to verify that the train/test
splits are identical?
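For readers unfamiliar with the expression in question, one plausible reading of `counts.max / counts.sum` is the fraction of examples in the most common class; elsewhere in this thread it is described as a sanity check that the value isn't 1. The sketch below assumes `counts` is a sequence of per-class label counts. The numbers are made up for illustration, and `MajorityBaselineSketch` is a hypothetical name.

```scala
object MajorityBaselineSketch {
  // Fraction of examples that belong to the most common class. A value
  // of 1.0 would mean every example carries the same label.
  def majorityFraction(counts: Seq[Double]): Double = {
    require(counts.nonEmpty && counts.sum > 0, "need at least one labelled example")
    counts.max / counts.sum
  }

  // Made-up per-class counts, not taken from the PR's test gists.
  val exampleCounts: Seq[Double] = Seq(520.0, 480.0)
  val exampleBaseline: Double = majorityFraction(exampleCounts)
}
```

Under this reading, a value close to 0.5 on a two-class dataset indicates roughly balanced labels, so a trivial majority-class predictor would not look artificially accurate.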
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #3224 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3224/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #3224 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3224/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63875/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63875 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63875/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63875 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63875/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63500/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63500 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63500/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63500 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63500/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #3210 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3210/consoleFull)**
for PR 14547 at commit
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
CC: @hhbyyh Would you mind taking a look at this since you're familiar with
GBTs? Thanks in advance! This should be one of the most important
improvements in terms of accuracy, especially once
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #3210 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3210/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63447/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63447 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63447/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63447 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63447/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63428/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63428 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63428/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63428 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63428/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63407/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63407 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63407/consoleFull)**
for PR 14547 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63407 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63407/consoleFull)**
for PR 14547 at commit
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah Thanks for the FYI. I'm pretty confident that it'll help since now
we're directly optimizing the loss function. However, it would be nice to prove
this. Unfortunately, the example I linked
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/14547
@vlad17 I do not get alerted when you comment on the squashed PR, as an
FYI. I was using the databricks spark-perf package for performance testing.
I'd be interested to see that TreeBoost
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63388/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63388 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63388/consoleFull)**
for PR 14547 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Merged build finished. Test FAILed.
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@hhbyyh Would you mind reviewing this?
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14547
**[Test build #63388 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63388/consoleFull)**
for PR 14547 at commit
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
ok to test
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14547
Can one of the admins verify this patch?