[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569855#comment-16569855 ] Julian King commented on SPARK-9478:

Has there been any progress on this in recent times? It looks like there are multiple pull requests for this but no comments in more than a year :(

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.1
> Reporter: Patrick Crenshaw
> Priority: Major
>
> Currently, this implementation of random forest does not support sample
> (instance) weights. Weights are important when there is imbalanced training
> data or the evaluation metric of a classifier is imbalanced (e.g. true
> positive rate at some false positive threshold). Sample weights generalize
> class weights, so this could be used to add class weights later on.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970541#comment-15970541 ] Joseph K. Bradley commented on SPARK-9478:

By the way, one design choice which has come up is whether the current minInstancesPerNode Param should take instance weights into account.

Pros of using instance weights with minInstancesPerNode:
* This maintains the semantics of instance weights. The algorithm should treat these 2 datasets identically: (a) {{[(weight 1.0, example A), (weight 1.0, example B), (weight 1.0, example B)]}} vs. (b) {{[(weight 1.0, example A), (weight 2.0, example B)]}}.
* By maintaining these semantics, we avoid confusion about how RandomForest and GBT should treat the instance weights introduced by subsampling. (Currently, these use instance weights with minInstancesPerNode, so this choice is consistent with our previous choices.)

Pros of not using instance weights with minInstancesPerNode:
* AFAIK, scikit-learn does not use instance weights with {{min_samples_leaf}}.

I vote for the first choice (taking weights into account).
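The equivalence between datasets (a) and (b) above can be sketched in a few lines. This is an illustrative sketch of the weighted semantics, not Spark's implementation; the helper names (`weighted_count`, `split_allowed`) are hypothetical.

```python
# Hypothetical sketch: if minInstancesPerNode counts total instance weight
# rather than raw row count, duplicating a row and doubling its weight
# are indistinguishable to the split-eligibility check.

def weighted_count(node_rows):
    """Total instance weight at a node; rows are (weight, example) pairs."""
    return sum(w for w, _ in node_rows)

def split_allowed(node_rows, min_instances_per_node):
    # Weighted semantics: compare total weight, not len(node_rows).
    return weighted_count(node_rows) >= min_instances_per_node

a = [(1.0, "A"), (1.0, "B"), (1.0, "B")]  # three unit-weight rows
b = [(1.0, "A"), (2.0, "B")]              # same data, weights merged

assert weighted_count(a) == weighted_count(b) == 3.0
assert split_allowed(a, 2) == split_allowed(b, 2)
```

Under the unweighted (row-count) semantics, by contrast, `len(a) = 3` but `len(b) = 2`, so the two datasets could be treated differently at the same threshold.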
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951685#comment-15951685 ] Joseph K. Bradley commented on SPARK-9478:

[~clamus] The current vote is to *not use* weights during sampling and then to *use* weights when growing the trees. That will simplify the sampling process, so we hopefully won't have to deal with the complexity you're mentioning. Note that we'll have to weight the trees in the forest to make this approach work.

I'm also guessing that it will give better-calibrated probability estimates in the final forest, though this is based on intuition rather than analysis. E.g., given the 4-instance dataset in [~sethah]'s example above, I'd imagine:
* If we use weights during sampling but not when growing trees...
** Say we want 10 trees. We pick 10 sets of 4 rows. The probability of always picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... (current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5. This means we'll almost surely have at least one tree with the weight-1000 row, so it will dominate our predictions (giving good accuracy).
** The probability of having at least 1 tree with only weight-1 rows is ~0.98. This means it's pretty likely we'll have some tree predicting label1, so we'll keep our probability predictions away from 0 and 1.

This is really hand-wavy, but it does alleviate my fears of having extreme log losses. On the other hand, maybe it could be handled by adding smoothing to predictions...
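The probabilities in the comment can be reproduced with a back-of-envelope calculation. This assumes the referenced 4-instance example has weights [1000, 1, 1, 1] (an assumption; the original example is not reproduced in this thread) and that each tree's bootstrap draws 4 rows with replacement.

```python
# Back-of-envelope check of the three probabilities quoted above,
# assuming weights [1000, 1, 1, 1] over 4 rows and 10 trees of 4 draws each.
n_rows, n_trees = 4, 10
heavy_w, light_w = 1000.0, 1.0
total_w = heavy_w + 3 * light_w

# Weighted sampling: chance that every one of the 10*4 draws picks the
# heavy row, so no tree ever sees a weight-1 row.
p_heavy = heavy_w / total_w
p_all_heavy = p_heavy ** (n_trees * n_rows)                    # ~0.89

# Unweighted sampling: chance a single tree's 4 draws all miss the heavy row.
p_tree_all_light = (3 / 4) ** n_rows                           # ~0.316
# Chance that *every* tree misses the heavy row entirely.
p_every_tree_all_light = p_tree_all_light ** n_trees           # ~1e-5
# Chance that *at least one* tree contains only weight-1 rows.
p_some_tree_all_light = 1 - (1 - p_tree_all_light) ** n_trees  # ~0.98
```

The last figure is why the current proposal keeps probability predictions away from 0 and 1: some tree almost certainly trains only on the minority rows.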
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870490#comment-15870490 ] Camilo Lamus commented on SPARK-9478:

[~sethah] It is very exciting that you guys are working on adding a weighted version of the random forest. I am really looking forward to using it in Spark ML RF and other algorithms. As [~josephkb] mentioned, adding weights to data points (samples/instances) has a myriad of applications in data analysis.

I have a question about the way you are thinking of using the weights. Are you thinking of using the weights both in the bootstrap sampling step as well as in growing the trees? Using them in both steps might make the weights overly "important". If you are using them in the tree-growing process, are you doing something like what is shown in slide 4 here (http://www.stat.cmu.edu/~ryantibs/datamining/lectures/25-boost.pdf)?

In the case where you would use the weights in constructing the bootstrap samples, as you mention, the (marginal) distribution of the number of times each data point is selected in a bootstrap sample is binomial. However, the joint distribution of the counts is multinomial. Specifically, if you draw N samples with replacement from the original N data points, selecting each with probability p_i = 1/N, the joint distribution is Multinomial(N, p_i = 1/N, i=1,2,…,N), and this is not the same as drawing independently N times from Binomial(N, 1/N). For one thing, you might end up with more or fewer than N samples. In regard to the Poisson approximation, I think this might be more problematic, since I think it requires one of the counts to dominate (i.e., happen with high probability) (see here: http://www.jstor.org/stable/3314676?seq=1#page_scan_tab_contents). This is a theoretical issue, which might not matter in practice. But who knows, it might. And after all, it might just be better to get the counts from Multinomial(N, p_i = w_i / sum(w_j)).
Either way, if the Poisson approximation is good enough, it does make more sense to use what you suggest at the end, which is to sample from Poisson(lambda_i = N w_i / sum(w_j)). Sampling from Poisson(lambda_i = 1) and then multiplying by N w_i / sum(w_j) can worsen the Poisson approximation to the binomial, since the variance of the multiplied version is lambda_i^2, and not lambda_i, as it should be for a Poisson rv.
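The variance point above can be checked with a quick simulation (the numbers here are illustrative, not from the thread, and the stdlib-only Poisson sampler below is Knuth's algorithm, not anything from Spark). Drawing directly from Poisson(lambda_i) keeps variance equal to lambda_i, whereas drawing from Poisson(1) and scaling by lambda_i inflates the variance to lambda_i^2:

```python
import random

random.seed(0)
N = 1000            # dataset size (illustrative)
w_frac = 0.004      # w_i / sum(w_j) for one instance (illustrative)
lam = N * w_frac    # expected count for this instance: 4.0

def poisson(lmbda):
    # Knuth's multiplication algorithm; fine for small lambda.
    threshold, k, p = pow(2.718281828459045, -lmbda), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

draws = 100_000
direct = [poisson(lam) for _ in range(draws)]        # Poisson(lambda_i)
scaled = [lam * poisson(1.0) for _ in range(draws)]  # lam * Poisson(1)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Both have mean ~lam, but direct has variance ~lam while the scaled
# version has variance ~lam^2 (here ~4 vs. ~16), overdispersed whenever
# lambda_i > 1 -- exactly the concern raised above.
```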
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865079#comment-15865079 ] Seth Hendrickson commented on SPARK-9478:

[~josephkb] Done. Thanks for your feedback on sampling!
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861714#comment-15861714 ] Joseph K. Bradley commented on SPARK-9478:

[~sethah] Thanks for researching this! +1 for not using weights during bagging and using importance weights to compensate. Intuitively, that seems like it should give better estimators for class conditional probabilities than the other option.

If you're splitting this into trees and forests, could you please target your PR against a subtask for trees?
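The "use weights while growing" half of the proposal amounts to letting each instance contribute its weight, rather than a raw count of 1, to the class totals used by the impurity measure. A minimal sketch with a weighted Gini impurity (the function name is illustrative, not Spark's API):

```python
# Sketch: weighted Gini impurity, where class totals accumulate instance
# weights instead of counts. Names are hypothetical, not Spark's API.

def weighted_gini(rows):
    """rows: iterable of (weight, label) pairs reaching a node."""
    totals = {}
    for w, label in rows:
        totals[label] = totals.get(label, 0.0) + w
    total_w = sum(totals.values())
    return 1.0 - sum((t / total_w) ** 2 for t in totals.values())

# Duplicating a row and doubling its weight give the same impurity,
# preserving the instance-weight semantics discussed earlier in the thread.
dup = [(1.0, "a"), (1.0, "b"), (1.0, "b")]
wtd = [(1.0, "a"), (2.0, "b")]
assert abs(weighted_gini(dup) - weighted_gini(wtd)) < 1e-12
```

Because the bootstrap itself stays unweighted under this proposal, the weights influence only which splits look attractive, not which rows each tree sees.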
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843675#comment-15843675 ] Apache Spark commented on SPARK-9478:

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/16722
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675770#comment-15675770 ] Seth Hendrickson commented on SPARK-9478:

I'm going to work on submitting a PR for adding sample weights for 2.2. That PR is for adding class weights, which I think we decided against.