[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376056#comment-15376056 ]

Joseph K. Bradley commented on SPARK-4240:
------------------------------------------

Thanks for the detailed survey and design description! Answers (numbered, but not corresponding to your numbering):
1. Calculating the median: I don't think it's worth modifying the API or trying to access a node's entire set of examples. I'd prefer that the aggregator do something approximate like approxQuantile (see QuantileSummaries) -- see the sketch at the end of this comment. We could document that it's approximate, and perhaps make it exact or provide an API for setting the precision of the approximation later on.
2. Second-order approximation: If it's simple to add, then I'd say go ahead and add it, especially since it's not a public API.
3. Naming loss/impurity: I would not add aliases for now. That can be a task for later.
4. Regularization: Can happen later.
5. Weighted data: We'll add this in the future for sure, but don't worry about it for now.
6. Leaf weights: This can still be done using the loss, even without regularization, right? It'd be nice to have.
7. requiredSamples: It's low priority. Before that, we should consider choosing new bins for each node, rather than once at the beginning.
8. Different losses: Sounds good, for later.

> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-4240
>                 URL: https://issues.apache.org/jira/browse/SPARK-4240
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Sung Chung
>
> The gradient boosting, as currently implemented, estimates the loss gradient in each iteration using regression trees. At every iteration, the regression trees are trained/split to minimize the predicted gradient variance. Additionally, the terminal node predictions are computed to minimize the prediction variance.
>
> However, such predictions won't be optimal for loss functions other than the mean-squared error. The TreeBoost refinement can help mitigate this issue by modifying terminal node prediction values so that those predictions directly minimize the actual loss function. Although this still doesn't change the fact that the tree splits were done through variance reduction, it should still lead to improved gradient estimates, and thus better performance.
>
> The details of this can be found in the R vignette. This paper also shows how to refine the terminal node predictions: http://www.saedsayad.com/docs/gbm2.pdf
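For concreteness, here is a minimal, standalone sketch of the approximate-median idea from answer 1 above, using the public {{approxQuantile}} API on DataFrames (inside the tree aggregator one would use {{QuantileSummaries}} directly; the column name, the sample residuals, and the error tolerance are illustrative assumptions, not the proposed internals):

{code:scala}
import org.apache.spark.sql.SparkSession

object ApproxMedianSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("approx-median-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical residuals for the examples falling into one terminal node.
    val residuals = Seq(0.5, -1.2, 0.3, 2.1, -0.7).toDF("residual")

    // approxQuantile(col, probabilities, relativeError): relativeError = 0.0
    // computes the exact quantile; larger values trade accuracy for speed.
    val Array(median) = residuals.stat.approxQuantile("residual", Array(0.5), 0.01)

    println(s"approximate median leaf prediction: $median")
    spark.stop()
  }
}
{code}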
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366502#comment-15366502 ]

Vladimir Feinberg commented on SPARK-4240:
------------------------------------------

Pending some dramatic response from \[~sethah\] telling me to back off, I'll take over this one. \[~josephkb\], mind reviewing the outline below?

I propose that this JIRA be resolved in the following manner:

API change: Since a true "TreeBoost" splits on the reduction of the loss itself as its impurity measure, the impurity calculator should be derived from the loss function.
* Set a new default for the impurity param in GBTs, 'auto', which uses the loss-based impurity by default but can be overridden to use the standard RF impurities if desired.
* Create a generic loss-reduction calculator which works by reducing a parametrizable loss criterion (or, rather, a Taylor approximation of it, as recommended by Friedman \[1\] and implemented to the second order by XGBoost \[2\] \[code: 5\]). A rough sketch of the second-order version appears at the end of this comment.
* Instantiate the generic loss-reduction calculator (which supports different orders of losses) for regression:
** Add squared and absolute losses.
** 'auto' induces a second-order approximation for squared loss, and only a first-order approximation for absolute loss.
** The former should perform better than LS_Boost from \[1\] (which only uses the first-order approximation), and the latter is equivalent to LAD_TreeBoost from \[1\]. It may be worthwhile to add an LS_Boost impurity and check whether it performs worse. Both of these "generic loss" instantiations become new impurities that the user can set, just like 'gini' or 'entropy'. This calculator will implement the corresponding terminal-leaf predictions, either the mean or the median of the leaf's sample. Computing the median may require modifications to the internal developer API so that at some point the calculator can access the entire set of training samples that a terminal node's partition corresponds to.
* On the classifier side we need to do the same thing, with a logistic loss inducing a new impurity. Second order is again feasible here. First order corresponds to L2_TreeBoost from \[1\].
* Because the new impurities apply only to GBTs, they'll only be available for them.

Questions for \[~josephkb\]:
1. Should I ditch the second-order approximation that \[2\] makes? Dropping it won't make the code any simpler, but it might make the theoretical offerings of the new code easier to grasp. This would add another task, "try out second-order Taylor approx", to the list below, and it also means we won't perform as well as xgboost until the second-order work happens.
2. Note that the L2 loss corresponds to Gaussian regression, L1 to Laplace, and logistic to Bernoulli. I'll add the aliases to loss.

Differences between this and \[2\]:
* No leaf-weight regularization, besides the default constant shrinkage, is implemented.

Differences between this and \[3\]:
* \[3\] uses the variance impurity for split selection \[code: 6\]. I don't think this is even technically TreeBoost. Such behavior should be emulatable in the new code by overriding impurity='variance' (it would be nice to see whether we have comparable perf here).
* \[3\] implements GBTs for weighted input data. We don't support data weights, so for both L1 and L2 losses the terminal node computations don't need Newton-Raphson optimization.

Probably not for this JIRA:
1. Implementing leaf weights (and leaf-weight regularization) - probably involves adding a regularization param to GBTs and creating new regularization-aware impurity calculators.
2. In {{RandomForest.scala}}, the line {{val requiredSamples = math.max(metadata.maxBins * metadata.maxBins, 1)}} performs row subsampling on our data. I don't know whether it's sound from a statistical learning perspective, but it's something we should take a look at (i.e., does performing a precise sample-complexity calculation in the PAC sense lead to better perf?).
3. Add different "losses" corresponding to residual distributions - see all the ones supported in \[3\] \[4\] \[7\]. Depending on what we add, we may need to implement NR optimization. Huber loss is the only one mentioned in \[1\] that we don't yet have.

\[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
\[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf
\[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf
\[4\] xgboost docs: https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters
\[5\] xgboost code: https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3d236060387a/src/objective/regression_obj.cc
\[6\] gbm code: https://github.com/gbm-developers/gbm/blob/master/src/node_parameters.h
\[7\] gbm api: https://cran.r-project.org/web/packages/gbm/gbm.pdf
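To make the second-order option concrete, here is a rough, standalone sketch of how a twice-differentiable loss yields terminal-leaf values under the Taylor approximation of \[2\] \[code: 5\]. The {{TwiceDiffLoss}} trait and the unregularized leaf formula are illustrative assumptions, not the proposed internal API:

{code:scala}
// Second-order (Newton) leaf values for a differentiable loss, following
// the XGBoost-style Taylor approximation in [2][5]. Illustrative sketch only.
trait TwiceDiffLoss {
  def gradient(prediction: Double, label: Double): Double
  def hessian(prediction: Double, label: Double): Double
}

object SquaredLoss extends TwiceDiffLoss {
  // L(y, f) = (y - f)^2 / 2  =>  g = f - y,  h = 1
  def gradient(prediction: Double, label: Double): Double = prediction - label
  def hessian(prediction: Double, label: Double): Double = 1.0
}

object LeafValue {
  // Optimal constant for a leaf under the second-order approximation:
  // gamma* = -sum(g_i) / sum(h_i). With no regularization this is a plain
  // Newton step; [2] adds a lambda term to the denominator.
  // Each element of `leaf` is (currentModelPrediction, label).
  def newtonStep(loss: TwiceDiffLoss, leaf: Seq[(Double, Double)]): Double = {
    val (gSum, hSum) = leaf.foldLeft((0.0, 0.0)) {
      case ((g, h), (prediction, label)) =>
        (g + loss.gradient(prediction, label), h + loss.hessian(prediction, label))
    }
    -gSum / hSum
  }
}
{code}

For squared loss this reduces to the mean residual in the leaf, which is exactly what the variance-based prediction already computes; the payoff comes from losses where the two disagree.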
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362721#comment-15362721 ]

Vladimir Feinberg commented on SPARK-4240:
------------------------------------------

Sorry for the delay in response - I was on vacation for the long weekend. Would you mind pushing or linking what you have done so far? I'll get back to you tomorrow on whether I have the bandwidth to tackle this right now.
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358983#comment-15358983 ]

Seth Hendrickson commented on SPARK-4240:
-----------------------------------------

I had done some work on this in the past, but haven't looked at it for a while now. I may have some time to pick it back up again in a few weeks, but if you are interested in working on it then feel free (please do indicate as such here, though). Thanks!
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357868#comment-15357868 ]

Vladimir Feinberg commented on SPARK-4240:
------------------------------------------

[~sethah] Hi Seth, it seems like your comment is outdated now that GBT is indeed in ML. Are you currently working on this?
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992764#comment-14992764 ]

Seth Hendrickson commented on SPARK-4240:
-----------------------------------------

I think we should create a separate JIRA, blocking this one, for moving the GBT implementation to ml. Once that's done, we can implement the TreeBoost modification to GBTs. I can create the JIRA and begin work on it if we decide that's appropriate. Note that this would be very similar to [PR 7294|https://github.com/apache/spark/pull/7294/]. I'd like to continue working on this JIRA once the implementation has been moved, since I've already spent some time on it :) ping [~josephkb] [~dbtsai] [~jbabcock]
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963648#comment-14963648 ]

Joseph K. Bradley commented on SPARK-4240:
------------------------------------------

This conversation slipped under my radar somehow; my apologies! I think it'd be fine to copy the implementation of GBTs to spark.ml, especially if we want to restructure it to support TreeBoost. As far as updating or replacing the spark.mllib implementation, I'd say: ideally it would eventually be a wrapper for the spark.ml implementation, but we should focus on the spark.ml API and implementation for now, even if it means temporarily having a copy of the code. I think it'd be hard to combine this work with generic boosting, because TreeBoost relies on the fact that trees are a space-partitioning algorithm, but we could discuss feasibility if there is a way to leverage the same implementation. [~dbtsai] expressed interest in this work, so I'll ping him here.
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717209#comment-14717209 ]

Seth Hendrickson commented on SPARK-4240:
-----------------------------------------

[~josephkb] I think there needs to be some discussion of how and where this fits into the current boosting package architecture. Right now, the ML GBT algorithm just calls the MLlib implementation of GBTs. While the random forest algorithm has already been moved into the ML package, the GBT algorithm has not, and I assume this is because we are waiting on the implementation/result of [SPARK-7129|https://issues.apache.org/jira/browse/SPARK-7129], which calls for a generic boosting algorithm. While this JIRA is specific to gradient boosted trees, it is still affected by the overall boosting architecture. I've got some code that implements the terminal node refinements in the MLlib implementation, but I suspect there might be some resistance to changing MLlib's implementation. I can continue implementing this in MLlib if we decide that is the route we'd like to take. Otherwise, I think this work needs to wait until GBTs are moved to the ML package.
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703402#comment-14703402 ]

Seth Hendrickson commented on SPARK-4240:
-----------------------------------------

[~pprett] MLlib's current implementation of Gradient Boosted Trees does not perform a terminal node prediction update. Instead, the predicted value for each terminal node is determined by the impurity used to train the decision tree. The {{Variance}} impurity, for example, just averages the labels found in the terminal node. Terminal node predictions should instead be determined by the loss function used for gradient boosting (e.g. AbsoluteError, SquaredError, etc.) - a standalone sketch of that refinement follows below. I'd like to work on this if no one else has started it.
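As an illustration of the refinement described above (a minimal standalone sketch of the loss-minimizing leaf constants from Friedman's gradient boosting paper, not MLlib internals; the object and method names here are hypothetical): for absolute error the optimal constant for a leaf is the median of the residuals in that leaf, while for squared error it is their mean.

{code:scala}
// Loss-based terminal node refinement, per leaf. Each element of `leaf`
// is (label, currentModelPrediction). Illustrative sketch only.
object LeafRefinement {
  // AbsoluteError: argmin over gamma of sum |y_i - (f_i + gamma)| is the
  // median of the residuals y_i - f_i (LAD_TreeBoost in Friedman's paper).
  def absoluteErrorLeaf(leaf: Seq[(Double, Double)]): Double = {
    val residuals = leaf.map { case (label, pred) => label - pred }.sorted
    val n = residuals.length
    if (n % 2 == 1) residuals(n / 2)
    else (residuals(n / 2 - 1) + residuals(n / 2)) / 2.0
  }

  // SquaredError: the minimizer is the mean residual, which coincides with
  // what the Variance impurity already predicts -- no refinement needed.
  def squaredErrorLeaf(leaf: Seq[(Double, Double)]): Double = {
    val residuals = leaf.map { case (label, pred) => label - pred }
    residuals.sum / residuals.length
  }
}
{code}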
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267645#comment-14267645 ]

Peter Prettenhofer commented on SPARK-4240:
-------------------------------------------

[~codedeft] I'm not sure I understand correctly: is (a) the line-search step in gradient boosting missing from mllib's GBM implementation, or (b) the line-search (leaf update) done, but it does not affect the next residual? Thanks, Peter