[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492839#comment-14492839
 ] 

Max Kaznady commented on SPARK-3727:


I implemented the same thing, but for PySpark. Since there is no existing 
function, should I just name the function predict_proba, as in sklearn? 

Also, does it make sense to open a new ticket for this, since it's so specific?

Thanks,
Max

 DecisionTree, RandomForest: More prediction functionality
 -

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.
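
 The aggregate predictions listed above can be sketched roughly as follows.
 This is only an illustration in plain Python, not MLlib code: the function
 name aggregate_tree_predictions is hypothetical, the input is assumed to be
 the list of values that the individual trees predict for one input, and the
 use of the population variance (rather than the sample variance) is an
 assumption.

```python
import statistics
from typing import Dict, Sequence


def aggregate_tree_predictions(per_tree_predictions: Sequence[float]) -> Dict[str, float]:
    """Summarize the predictions that individual trees make for one input.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    if not per_tree_predictions:
        raise ValueError("need at least one tree prediction")
    return {
        "mean": statistics.mean(per_tree_predictions),
        "median": statistics.median(per_tree_predictions),
        # Population variance across trees (an assumption); the sample
        # variance, statistics.variance, would be an equally valid choice.
        "variance": statistics.pvariance(per_tree_predictions),
    }


if __name__ == "__main__":
    print(aggregate_tree_predictions([2.0, 4.0, 9.0]))
```

 The median is less sensitive than the mean to a single tree with an extreme
 prediction, which is one reason to offer both.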



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492887#comment-14492887
 ] 

Joseph K. Bradley commented on SPARK-3727:
--

Thanks for your initial work on this ticket!  The main issue with this 
extension is API stability: modifying the existing classes would also require 
us to update model save/load versioning, default constructors to ensure 
binary compatibility, etc.

I just linked a JIRA which discusses updating the tree and ensemble APIs under 
the spark.ml package, which will permit us to redesign the APIs (and make it 
easier to specify class probabilities or stats for regression).  What I'd like 
to do is get the tree API updates in (this week), and then we could work 
together to make the class probabilities available under the new API.

Does that sound good?

Also, if you're new to contributing to Spark, please make sure to check out: 
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

Thanks!




[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492906#comment-14492906
 ] 

Max Kaznady commented on SPARK-3727:


Yes, probabilities have to be added to other models too, like 
LogisticRegression. Right now they are hardcoded in two places but not 
exposed in PySpark.

I think it makes sense to split the work into PySpark, then classification, 
then probabilities, and then group the different types of algorithms that 
output probabilities: Logistic Regression, Random Forest, etc.

We can also add probabilities for trees by counting the number of 1's and 0's 
in each leaf.
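
A minimal sketch of that counting idea for a binary tree, in plain Python. 
The function name binary_leaf_probabilities is hypothetical, and the input is 
assumed to be the 0/1 labels of the training samples that ended up in one 
leaf; this is not an existing PySpark function.

```python
from collections import Counter
from typing import Dict, Iterable


def binary_leaf_probabilities(leaf_labels: Iterable[int]) -> Dict[int, float]:
    """Estimate P(label) at a leaf from the 0/1 labels of its training samples.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    counts = Counter(leaf_labels)
    unexpected = set(counts) - {0, 1}
    if unexpected:
        raise ValueError(f"expected only labels 0 and 1, got {sorted(unexpected)}")
    total = counts[0] + counts[1]
    if total == 0:
        raise ValueError("leaf must contain at least one sample")
    # The probability of each class is its share of the samples in the leaf.
    return {0: counts[0] / total, 1: counts[1] / total}


if __name__ == "__main__":
    print(binary_leaf_probabilities([1, 1, 0, 1]))
```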

What do you think?




[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492871#comment-14492871
 ] 

Max Kaznady commented on SPARK-3727:


I thought it would be more fitting to separate this: 
https://issues.apache.org/jira/browse/SPARK-6884




[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-12 Thread Michael Kuhlen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491919#comment-14491919
 ] 

Michael Kuhlen commented on SPARK-3727:
---

Hello!

I've implemented predictWithProbabilities() methods for DecisionTreeModel and 
treeEnsembleModels in Scala. These methods return both the most likely class 
and the probabilities of each of the classes. As in scikit-learn, the 
probabilities are defined as the mean predicted class probabilities of the 
trees in the forest [, where the] class probability of a single tree is the 
fraction of samples of the same class in a leaf. 
([sklearn.ensemble.RandomForestClassifier.predict_proba|http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba])

My approach was to modify the Predict class to hold the class probabilities for 
all classes (as opposed to just that of the majority class), and I use these 
probabilities to compute the means over all trees. I believe this should work 
for GBTrees as well, since I'm taking care to weight the probabilities by the 
weight of each tree (=1.0 for RandomForest).
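
A rough sketch of the weighted averaging described above, written in plain 
Python rather than Scala and independent of the actual changes in the fork. 
The function names and the data layout (one dictionary of per-class sample 
counts for the leaf that each tree routes the input to) are assumptions made 
for illustration only.

```python
from typing import Dict, Sequence, Tuple


def leaf_class_probabilities(class_counts: Dict[int, float]) -> Dict[int, float]:
    """Class probabilities of one tree: the fraction of leaf samples per class."""
    total = sum(class_counts.values())
    if total <= 0:
        raise ValueError("leaf must contain at least one sample")
    return {label: count / total for label, count in class_counts.items()}


def predict_with_probabilities(
    leaf_counts_per_tree: Sequence[Dict[int, float]],
    tree_weights: Sequence[float],
) -> Tuple[int, Dict[int, float]]:
    """Return (most likely class, weighted mean class probabilities).

    Hypothetical helper for illustration; not the method from the fork.
    """
    if len(leaf_counts_per_tree) != len(tree_weights):
        raise ValueError("need exactly one weight per tree")
    weight_sum = sum(tree_weights)
    if weight_sum <= 0:
        raise ValueError("tree weights must sum to a positive value")
    averaged: Dict[int, float] = {}
    for counts, weight in zip(leaf_counts_per_tree, tree_weights):
        for label, prob in leaf_class_probabilities(counts).items():
            # Each tree contributes in proportion to its weight.
            averaged[label] = averaged.get(label, 0.0) + weight * prob / weight_sum
    # Ties are broken in favor of the smallest label (an arbitrary choice).
    best_label = max(sorted(averaged), key=averaged.get)
    return best_label, averaged


if __name__ == "__main__":
    # Two equally weighted trees, as in a random forest (weight 1.0 each).
    print(predict_with_probabilities([{0: 3, 1: 1}, {0: 1, 1: 1}], [1.0, 1.0]))
```

With all weights equal to 1.0 this reduces to the plain mean over trees; 
dividing by the sum of the weights keeps the result normalized when the 
weights differ.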

Here's a [link to my 
fork|https://github.com/apache/spark/compare/master...mqk:master] showing my 
modifications. I would be happy to issue a pull request for these changes, if 
that would be of interest to the community. Although I haven't done so yet, I 
believe it should be straightforward to extend this to also calculate the 
variance of estimates for regression algorithms, as suggested in this ticket.

Best, 

Mike

