[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-07-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15376056#comment-15376056
 ] 

Joseph K. Bradley commented on SPARK-4240:
--

Thanks for the detailed survey and design description!  Answers (with numbers, 
but not corresponding to your numbering):

1. Calculating the median: I don't think it's worth modifying the API or trying 
to access a node's entire set of examples.  I'd prefer that the aggregator do 
something approximate, like approxQuantile (see QuantileSummaries).  We could 
document that it's approximate, and perhaps make it exact or provide an API for 
setting the precision of the approximation later on.
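
For reference, a minimal sketch of the kind of approximate median computation 
meant here, using the public {{approxQuantile}} API (which is backed by 
QuantileSummaries). The column name, toy data, and error tolerance are 
illustrative only; the real aggregator would use QuantileSummaries directly on 
the rows reaching a node:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("approx-median-sketch").getOrCreate()
import spark.implicits._

// Toy residuals; in the GBT aggregator these would be the per-row
// (label - prediction) values reaching a terminal node.
val residuals = Seq(-3.0, -1.0, 0.5, 2.0, 7.0).toDF("residual")

// approxQuantile(col, probabilities, relativeError): relativeError = 0.0
// forces an exact answer; a small positive value trades accuracy for memory.
val Array(approxMedian) = residuals.stat.approxQuantile("residual", Array(0.5), 0.001)
{code}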

2. Second-order approximation: If it's simple to add, then I'd say go ahead and 
add it, especially since it's not a public API.

3. naming loss/impurity: I would not add aliases for now.  That can be a task 
for later.

4. Regularization: Can happen later.

5. Weighted data: We'll add this in the future for sure, but don't worry about 
it for now.

6. Leaf weights: This can still be done using the loss, even without 
regularization, right?  It'd be nice to have.

7. requiredSamples: It's low priority.  Before that, we should consider 
choosing new bins for each node, rather than once at the beginning.

8. different losses: Sounds good, for later.

> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> 
>
> Key: SPARK-4240
> URL: https://issues.apache.org/jira/browse/SPARK-4240
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sung Chung
>
> The gradient boosting as currently implemented estimates the loss-gradient in 
> each iteration using regression trees. At every iteration, the regression 
> trees are trained/split to minimize predicted gradient variance. 
> Additionally, the terminal node predictions are computed to minimize the 
> prediction variance.
> However, such predictions won't be optimal for loss functions other than the 
> mean-squared error. The TreeBoosting refinement can help mitigate this issue 
> by modifying terminal node prediction values so that those predictions would 
> directly minimize the actual loss function. Although this still doesn't 
> change the fact that the tree splits were done through variance reduction, it 
> should still lead to improvement in gradient estimations, and thus better 
> performance.
> The details of this can be found in the R vignette. This paper also shows how 
> to refine the terminal node predictions.
> http://www.saedsayad.com/docs/gbm2.pdf






[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-07-07 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366502#comment-15366502
 ] 

Vladimir Feinberg commented on SPARK-4240:
--

Pending some dramatic response from \[~sethah\] telling me to back off, I'll 
take over this one. \[~josephkb\], mind reviewing the below outline?

I propose that this JIRA be resolved in the following manner:

API Change: Since true "TreeBoost" chooses splits using an impurity based on 
loss reduction, the impurity calculator should be derived from the loss 
function itself.
 * Set a new default of 'auto' for the impurity param in GBTs, which uses the 
loss-based impurity by default but can be overridden to use the standard RF 
impurities if desired.
 * Create a generic loss-reduction calculator which works by reducing a 
parametrizable loss criterion (or, rather, a Taylor approximation of it, as 
recommended by Friedman \[1\] and implemented to second order by XGBoost 
\[2\] \[code: 5\]); a minimal standalone sketch of the second-order leaf 
computation follows this list.
 * Instantiate the generic loss-reduction calculator (that supports different 
orders of losses) for regression:
 ** Add squared and absolute losses
 ** 'auto' induces a second-order approximation for squared loss, and only a 
first-order approximation for absolute loss
 ** The former should perform better than LS_Boost from \[1\] (which only uses 
the first-order approximation) and the latter is equivalent to LAD_TreeBoost 
from \[1\]. It may be worthwhile to add an LS_Boost impurity and check that it 
performs worse. Both of these "generic loss" instantiations become new 
impurities that the user could set, just like 'gini' or 'entropy'. This 
calculator will implement the corresponding terminal-leaf predictions, either 
the mean or the median of the leaf's sample. Computing the median may require 
modifications to the internal developer API so that, at some point, the 
calculator can access the entire set of training samples that a terminal 
node's partition corresponds to.
 * On the classifier side we need to do the same thing, with a logistic loss 
inducing a new impurity. Second order here is again feasible. First order 
corresponds to L2_TreeBoost from \[1\].
 * Because the new impurities apply only to GBTs, they'll only be available for 
them.
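
A minimal, standalone sketch of the second-order leaf computation referenced 
above, under the usual Taylor-expansion formulation from \[1\] and \[2\]; the 
names {{GradHess}} and {{leafValue}} are illustrative only and this is not the 
actual Spark internal impurity-calculator API:

{code:scala}
// Per-example first and second derivatives of the loss at the current prediction.
case class GradHess(grad: Double, hess: Double)

// Squared error, L(y, f) = 0.5 * (y - f)^2  =>  g = f - y, h = 1
def squaredLossGradHess(label: Double, pred: Double): GradHess =
  GradHess(grad = pred - label, hess = 1.0)

// Logistic loss on labels in {0, 1} with raw-margin predictions
// =>  g = p - y, h = p * (1 - p)
def logisticLossGradHess(label: Double, pred: Double): GradHess = {
  val p = 1.0 / (1.0 + math.exp(-pred))
  GradHess(grad = p - label, hess = p * (1.0 - p))
}

// Second-order (Newton-step) leaf value for the examples routed to one node:
// minimize sum_i [ g_i * w + 0.5 * h_i * w^2 ]  =>  w* = -sum(g) / sum(h).
def leafValue(stats: Seq[GradHess]): Double = {
  val g = stats.map(_.grad).sum
  val h = stats.map(_.hess).sum
  if (h == 0.0) 0.0 else -g / h
}
{code}

For squared loss this Newton step reduces to the mean of the residuals in the 
leaf, so the second-order and exact leaf values coincide there.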

Questions for \[~josephkb\]:
1. Should I ditch the second-order approximation that \[2\] does? It won't make 
the code any simpler, but it might make the theoretical offerings of the new 
code easier to grasp. This would add another task ("try out second-order 
Taylor approx") to the list below, and it also means we won't perform as well 
as xgboost until the second-order work happens.

Note that the L2 loss corresponds to Gaussian regression, L1 to Laplace, and 
logistic to Bernoulli. I'll add these aliases to the loss param.

Differences between this and \[2\]:
* No leaf weight regularization, besides the default constant shrinkage, is 
implemented.

Differences between this and \[3\]:
* \[3\] uses variance impurity for split selection \[code: 6\]. I don't think 
this is even technically TreeBoost. Such behavior should be emulatable in the 
new code by overriding impurity='variance' (would be nice to see if we have 
comparable perf here).
* \[3\] implements GBTs for weighted input data. We don't support data 
weights, so for both the L1 and L2 losses the terminal node computations don't 
need Newton-Raphson optimization (see the worked note after this list).
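
(A worked note on why, following \[1\]: with no data weights the terminal-node 
update is gamma_j = argmin over gamma of the sum over x_i in R_j of 
L(y_i, F(x_i) + gamma). For squared loss this is minimized by the mean of the 
node's residuals y_i - F(x_i), and for absolute loss by their median, so both 
have closed forms and no iterative Newton-Raphson step is required.)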

Probably not for this JIRA:
1. Implementing leaf weights (and leaf weight regularization) - probably 
involves adding a regularization param to GBTs, creating new 
regularization-aware impurity calculators.
2. In {{RandomForest.scala}} the line {{val requiredSamples = 
math.max(metadata.maxBins * metadata.maxBins, 1)}} performs row subsampling 
on our data. I don't know if it's sound from a statistical learning 
perspective, but this is something that we should take a look at (i.e., does 
performing a precise sample complexity calculation in the PAC sense lead to 
better perf)?
3. Add different "losses" corresponding to residual distributions - see all 
the ones supported in \[3\] \[4\] \[7\]. Depending on what we add, we may need 
to implement NR optimization. Huber loss is the only one mentioned in \[1\] 
that we don't yet have.

\[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
\[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf
\[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf
\[4\] xgboost docs: https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters
\[5\] xgboost regression objective code: https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3d236060387a/src/objective/regression_obj.cc
\[6\] gbm split criterion code: https://github.com/gbm-developers/gbm/blob/master/src/node_parameters.h
\[7\] gbm api: https://cran.r-project.org/web/packages/gbm/gbm.pdf



[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-07-05 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362721#comment-15362721
 ] 

Vladimir Feinberg commented on SPARK-4240:
--

Sorry for the delay in responding - I was on vacation for the long weekend. 
Would you mind pushing or linking what you have done so far? I'll get back to 
you tomorrow on whether I have the bandwidth to tackle this right now.







[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-07-01 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358983#comment-15358983
 ] 

Seth Hendrickson commented on SPARK-4240:
-

I had done some work on this in the past, but haven't looked at it for a while 
now. I may have some time to pick it back up again in a few weeks, but if you 
are interested in working on it then feel free (please do indicate as such 
here, though). Thanks!







[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-06-30 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357868#comment-15357868
 ] 

Vladimir Feinberg commented on SPARK-4240:
--

[~sethah] Hi Seth, it seems like your comment is outdated now that GBT is 
indeed in ML. Are you currently working on this?








[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2015-11-05 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992764#comment-14992764
 ] 

Seth Hendrickson commented on SPARK-4240:
-

I think we should create a separate JIRA, blocking this one, for moving the 
GBT implementation to spark.ml. Once that's done, we can implement the 
TreeBoost modification to GBTs.

I can create the JIRA and begin work on it if we decide that it's appropriate. 
Note that this would be very similar to [PR 
7294|https://github.com/apache/spark/pull/7294/]. I'd like to continue working 
on this JIRA once the implementation has been moved, since I've already spent 
some time on it :)

ping [~josephkb] [~dbtsai] [~jbabcock]







[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2015-10-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963648#comment-14963648
 ] 

Joseph K. Bradley commented on SPARK-4240:
--

This conversation slipped under my radar somehow; my apologies!

I think it'd be fine to copy the implementation of GBTs to spark.ml, especially 
if we want to restructure it to support TreeBoost.  As far as updating or 
replacing the spark.mllib implementation, I'd say: Ideally it would eventually 
be a wrapper for the spark.ml implementation, but we should focus on the 
spark.ml API and implementation for now, even if it means temporarily having a 
copy of the code.

I think it'd be hard to combine this work with generic boosting because 
TreeBoost relies on the fact that trees are a space-partitioning algorithm, but 
we could discuss feasibility if there is a way to leverage the same 
implementation.

[~dbtsai] expressed interest in this work, so I'll ping him here.







[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2015-08-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717209#comment-14717209
 ] 

Seth Hendrickson commented on SPARK-4240:
-

[~josephkb] I think there needs to be some discussion of how and where this 
fits into the current boosting package architecture. Right now, the ML GBT 
algorithm just calls the MLlib implementation of GBTs. While the random forest 
algorithm has already been moved into the ML package, the GBT algorithm has 
not, and I assume this is because we are waiting on the implementation/result 
of [SPARK-7129|https://issues.apache.org/jira/browse/SPARK-7129], which calls 
for a generic boosting algorithm.

While this JIRA is specific to gradient boosted trees, it is still affected by 
the overall boosting architecture. I've got some code that implements the 
terminal node refinements in the MLlib implementation, but I suspect that there 
might be some resistance to changing MLlib's implementation. I can continue 
implementing this in MLlib if we decide that is the route we'd like to take. 
Otherwise, I think this work needs to wait until GBTs are moved to the ML 
package.







[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2015-08-19 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14703402#comment-14703402
 ] 

Seth Hendrickson commented on SPARK-4240:
-

[~pprett] MLlib's current implementation of Gradient Boosted Trees does not 
perform a terminal node prediction update. Instead, the predicted value for 
each terminal node is determined by the impurity used to train the decision 
tree. The {{Variance}} impurity, for example, just averages the labels found 
in the terminal node. Terminal node predictions should instead be determined 
by the loss function used for gradient boosting (e.g., AbsoluteError, 
SquaredError).
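
A rough sketch of the kind of refinement described above, outside of Spark's 
internal tree APIs (the names {{Row}} and {{refinedLeafValues}} are 
illustrative only): after a tree is fit, each terminal node's prediction is 
recomputed to minimize the boosting loss over the examples in that node, i.e. 
the mean of the residuals for squared error and their median for absolute 
error.

{code:scala}
case class Row(label: Double, currentPrediction: Double, leafId: Int)

def median(xs: Seq[Double]): Double = {
  val sorted = xs.sorted
  val n = sorted.length
  if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}

/** Refined prediction per terminal node, keyed by leaf id. */
def refinedLeafValues(rows: Seq[Row], loss: String): Map[Int, Double] =
  rows.groupBy(_.leafId).map { case (leaf, group) =>
    val residuals = group.map(r => r.label - r.currentPrediction)
    val value = loss match {
      case "squared"  => residuals.sum / residuals.length  // mean minimizes squared error
      case "absolute" => median(residuals)                 // median minimizes absolute error
      case other      => throw new IllegalArgumentException(s"Unknown loss: $other")
    }
    leaf -> value
  }
{code}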

I'd like to work on this if no one else has started it.







[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2015-01-07 Thread Peter Prettenhofer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267645#comment-14267645
 ] 

Peter Prettenhofer commented on SPARK-4240:
---

[~codedeft] I'm not sure I understand correctly: is it that (a) the 
line-search step in gradient boosting is missing from MLlib's GBM 
implementation, or (b) the line search (leaf update) is done but does not 
affect the next residual?

thanks, 
 Peter



