[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2018-08-06 Thread Julian King (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569855#comment-16569855
 ] 

Julian King commented on SPARK-9478:


Has there been any progress on this in recent times? It looks like there are 
multiple pull requests for this but no comments in more than a year :(

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>Priority: Major
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.
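To make the last point of the description concrete: any per-class weighting can be
expressed as per-instance weights. A minimal sketch (the names here are purely
illustrative, not an existing Spark API):

{code:scala}
object ClassToInstanceWeights extends App {
  // Hypothetical per-class weights, e.g. up-weighting the rare positive class.
  val classWeights = Map(0 -> 1.0, 1 -> 10.0)
  val labels = Seq(0, 0, 1, 0, 1)

  // Each instance simply inherits the weight of its class.
  val instanceWeights = labels.map(classWeights)
  println(instanceWeights.mkString(", "))  // 1.0, 1.0, 10.0, 1.0, 10.0
}
{code}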






[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970541#comment-15970541
 ] 

Joseph K. Bradley commented on SPARK-9478:
--

By the way, one design choice which has come up is whether the current 
minInstancesPerNode Param should take instance weights into account.

Pros of using instance weights with minInstancesPerNode:
* This maintains the semantics of instance weights.  The algorithm should treat 
these 2 datasets identically: (a) {{[(weight 1.0, example A), (weight 1.0, 
example B), (weight 1.0, example B)]}} vs. (b) {{[(weight 1.0, example A), 
(weight 2.0, example B)]}}.
* By maintaining these semantics, we avoid confusion about how RandomForest and 
GBT should treat the instance weights introduced by subsampling.  (Currently, 
these use instance weights with minInstancesPerNode, so this choice is 
consistent with our previous choices.)

Pros of not using instance weights with minInstancesPerNode:
* AFAIK, scikit-learn does not use instance weights with {{min_samples_leaf}}.

I vote for the first choice (taking weights into account).
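As a rough sketch of the difference (this only illustrates the semantics above,
not Spark's actual split-validation code; the names are made up):

{code:scala}
object MinInstancesSketch extends App {
  // A weighted training instance; features/label omitted for brevity.
  case class WeightedInstance(weight: Double)

  // Choice 1 (proposed): a node is big enough if the *total weight* reaches the threshold.
  def enoughByWeight(node: Seq[WeightedInstance], minInstancesPerNode: Double): Boolean =
    node.map(_.weight).sum >= minInstancesPerNode

  // Choice 2 (scikit-learn-like): count raw rows and ignore weights.
  def enoughByCount(node: Seq[WeightedInstance], minInstancesPerNode: Int): Boolean =
    node.size >= minInstancesPerNode

  // Datasets (a) and (b) from above: (a) three weight-1.0 rows; (b) the same data
  // with example B's two copies collapsed into one weight-2.0 row.
  val a = Seq(WeightedInstance(1.0), WeightedInstance(1.0), WeightedInstance(1.0))
  val b = Seq(WeightedInstance(1.0), WeightedInstance(2.0))

  println(enoughByWeight(a, 3.0) == enoughByWeight(b, 3.0))  // true: identical semantics
  println(enoughByCount(a, 3) == enoughByCount(b, 3))        // false: 3 rows vs. 2 rows
}
{code}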







[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-03-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951685#comment-15951685
 ] 

Joseph K. Bradley commented on SPARK-9478:
--

[~clamus] The current vote is to *not use* weights during sampling and then to 
*use* weights when growing the trees.  That will simplify the sampling process 
so we hopefully won't have to deal with the complexity you're mentioning.  Note 
that we'll have to weight the trees in the forest to make this approach work.

I'm also guessing that it will give better calibrated probability estimates in 
the final forest, though this is based on intuition rather than analysis.  
E.g., given the 4-instance dataset in [~sethah]'s example above, I'd imagine:
* If we use weights during sampling but not when growing trees...
** Say we want 10 trees.  We pick 10 sets of 4 rows.  The probability of always 
picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... 
(current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5.  This means 
we'll have at least one tree with the weight-1000 row, so it will dominate our 
predictions (giving good accuracy).
** The probability of having at least 1 tree with only weight-1 rows is ~0.98 
(equivalently, the chance that no tree consists only of weight-1 rows is ~0.02).  
This means it's pretty likely we'll have some tree predicting label1, so we'll 
keep our probability predictions away from 0 and 1.

This is really hand-wavy, but it does alleviate my fears of having extreme log 
losses.  On the other hand, maybe it could be handled by adding smoothing to the 
predictions...
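
A quick back-of-the-envelope check of the numbers above, assuming (as I read
[~sethah]'s example) one weight-1000 instance plus three weight-1 instances, 10
trees, and bootstrap samples of size 4:

{code:scala}
object BootstrapProbabilities extends App {
  val numTrees = 10
  val sampleSize = 4
  val draws = numTrees * sampleSize  // 40 independent draws in total

  // (a) Weighted sampling: each draw picks the weight-1000 row with prob 1000/1003.
  val pHeavy = 1000.0 / 1003.0
  println(f"P(every draw is the weight-1000 row) = ${math.pow(pHeavy, draws)}%.2f")  // ~0.89

  // (b) Unweighted sampling: each draw picks a weight-1 row with prob 3/4.
  val pLight = 3.0 / 4.0
  println(f"P(every draw is a weight-1 row)      = ${math.pow(pLight, draws)}%.1e")  // ~1e-5

  // A single tree sees only weight-1 rows with prob (3/4)^4 ~= 0.32,
  // so the chance that *no* tree does is (1 - 0.32)^10 ~= 0.02.
  val pTreeAllLight = math.pow(pLight, sampleSize)
  println(f"P(no tree has only weight-1 rows)    = ${math.pow(1 - pTreeAllLight, numTrees)}%.2f")  // ~0.02
}
{code}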







[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-02-16 Thread Camilo Lamus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870490#comment-15870490
 ] 

Camilo Lamus commented on SPARK-9478:
-

[~sethah] It is very exciting that you guys are working on adding a weighted 
version of the random forest. I am really looking forward to using it in Spark ML 
RF and other algorithms. As [~josephkb] mentioned, adding weights to data points 
(samples/instances) has a myriad of applications in data analysis.

I have a question about the way you are thinking of using the weights. Are you 
planning to use the weights both in the bootstrap sampling step and in growing 
the trees? Using them in both steps might make the weights overly “important”.

If you are using them in the tree-growing process, are you doing something like 
what is shown in slide 4 here 
(http://www.stat.cmu.edu/~ryantibs/datamining/lectures/25-boost.pdf)?
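
For example (this is just my reading of that slide and only a sketch, not a claim
about the actual PR): a weighted impurity computation replaces raw label counts
by sums of instance weights:

{code:scala}
object WeightedGiniSketch extends App {
  // Weighted instances at a candidate node: (label, weight).
  val node: Seq[(Int, Double)] = Seq((0, 1.0), (0, 1.0), (1, 2.0), (1, 0.5))

  // Gini impurity where every "count" is a sum of instance weights.
  def weightedGini(instances: Seq[(Int, Double)]): Double = {
    val total = instances.map(_._2).sum
    val weightByLabel = instances.groupBy(_._1).map { case (_, xs) => xs.map(_._2).sum }
    1.0 - weightByLabel.map(w => (w / total) * (w / total)).sum
  }

  println(f"weighted Gini at node = ${weightedGini(node)}%.3f")
}
{code}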

In the case where you use the weights in constructing the bootstrap samples, as 
you mention, the marginal distribution of the number of times each data point is 
selected in a bootstrap sample is binomial. However, the joint distribution of 
the counts is multinomial. Specifically, if you draw N samples with replacement 
from the original N data points, selecting each with probability p_i = 1/N, the 
joint distribution is Multinomial(N, p_i = 1/N, i=1,2,…,N), and this is not the 
same as drawing independently N times from Binomial(N, 1/N). For one thing, you 
might end up with more or fewer than N samples. Regarding the Poisson 
approximation, I think this might be more problematic since I think it requires 
one of the counts to dominate (i.e., happen with high probability) (see here: 
http://www.jstor.org/stable/3314676?seq=1#page_scan_tab_contents). This is a 
theoretical issue which might not matter in practice. But who knows, it might. 
And after all, it might just be better to get the counts from 
Multinomial(N, p_i = w_i / sum(w_j)).

Either way, if the Poisson approximation is good enough, it does make more sense 
to use what you suggest at the end, which is to sample from 
Poisson(lambda_i = N * w_i / sum(w_j)). Sampling from Poisson(1) and then 
multiplying by N * w_i / sum(w_j) can worsen the Poisson approximation to the 
binomial, since the variance of the multiplied version is lambda_i^2 rather than 
lambda_i, as it should be for a Poisson random variable.
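
To illustrate the variance point with a quick simulation (the setup and names
here are mine, just a sketch, not from any PR):

{code:scala}
object PoissonScalingSketch extends App {
  import scala.util.Random
  val rng = new Random(42)

  // Simple Knuth-style Poisson sampler; fine for the small lambdas used here.
  def poisson(lambda: Double): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) { k += 1; p *= rng.nextDouble() }
    k
  }

  val lambda = 3.0  // stands in for N * w_i / sum(w_j)
  val n = 100000

  // Scheme 1: draw counts directly from Poisson(lambda).
  val direct = Seq.fill(n)(poisson(lambda).toDouble)
  // Scheme 2: draw from Poisson(1) and scale the draw by lambda.
  val scaled = Seq.fill(n)(poisson(1.0) * lambda)

  def meanVar(xs: Seq[Double]): (Double, Double) = {
    val m = xs.sum / xs.size
    (m, xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }

  println(s"direct: (mean, var) = ${meanVar(direct)}")  // ~ (lambda, lambda)
  println(s"scaled: (mean, var) = ${meanVar(scaled)}")  // ~ (lambda, lambda^2)
}
{code}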








[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-02-13 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865079#comment-15865079
 ] 

Seth Hendrickson commented on SPARK-9478:
-

[~josephkb] Done. Thanks for your feedback on sampling!







[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-02-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861714#comment-15861714
 ] 

Joseph K. Bradley commented on SPARK-9478:
--

[~sethah] Thanks for researching this!  +1 for not using weights during bagging 
and using importance weights to compensate.  Intuitively, that seems like it 
should give better estimators for class conditional probabilities than the 
other option.

If you're splitting this into trees and forests, could you please target your 
PR against a subtask for trees?







[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843675#comment-15843675
 ] 

Apache Spark commented on SPARK-9478:
-

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/16722







[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2016-11-17 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675770#comment-15675770
 ] 

Seth Hendrickson commented on SPARK-9478:
-

I'm going to work on submitting a PR for adding sample weights for 2.2. That PR 
is for adding class weights, which I think we decided against.



