[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648565#comment-14648565
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

Yes, that's the approach I was planning on taking.  I realized that this JIRA 
is blocked by the one I just linked; I'll try to get it merged ASAP.  I 
appreciate your offer to contribute it, and I would normally say yes, but I'm a 
bit panicked by the code freeze.  Since I'm very familiar with the tree code, 
would you mind letting me send it, and helping to review my PR once I send one?

If you are interested, it would be great to collaborate to add better methods 
of measuring feature importance (or other features) for the next release cycle.

Thanks again!

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-30 Thread Nam Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648656#comment-14648656
 ] 

Nam Ma commented on SPARK-5133:
---

Definitely, Joseph. Please go ahead sending the PR. There are many things we 
can collaborate to improve the tree-based methods, such as adding other methods 
of measuring feature importance, as you said, or adding partial dependence. 
These are not challenging in terms of implementation/scalability/efficiency, 
but would help a lot for the usability of the methods. 

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646535#comment-14646535
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

That's my hope...  I'll try to send a PR later today.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-29 Thread Parv Oberoi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646493#comment-14646493
 ] 

Parv Oberoi commented on SPARK-5133:


[~josephkb]: is the plan to still include this in SPARK 1.5 considering that 
the code cutoff is this week?

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-29 Thread Nam Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647115#comment-14647115
 ] 

Nam Ma commented on SPARK-5133:
---

As far as I know, feature importance in scikit-learn is based on Gini 
Importance of Leo Breiman (originally given in his 1983's paper). Using the 
current existing stats information in DecisionTreeModel, it is fairly 
straightforward to compute it. However, we need to modify the 
InformationGainStats and the execution of Decision Tree a bit to store number 
of datapoints at that node during the split. I implemented it a long time ago 
and found the result quite consistent with what scikit-learn provides. 

Please correct me if I miss something.


 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14643730#comment-14643730
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

[~pprett] I'd really like to get this into Spark 1.5, and the code cutoff is at 
the end of this week.  Would you be able to send a PR within ~1 day?  If not, I 
can send one instead.  Thanks!

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Peter Prettenhofer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614799#comment-14614799
 ] 

Peter Prettenhofer commented on SPARK-5133:
---

[~yalamart] For some reason i cannot assign it to me -- [~manishamde] can you 
help me with that? thanks!

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614812#comment-14614812
 ] 

Sean Owen commented on SPARK-5133:
--

We don't generally Assign JIRAs while they're being worked on. Anyone's free to 
work on anything, with an encouragement to please try to coordinate with anyone 
else working on it. (Look for an open PR or comments; there is none linked here 
though.)

(Also, JIRA has a problem now wherein we can't add new Contributors, and unless 
you're added to that group you can't be Assigned :( )

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Peter Prettenhofer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614682#comment-14614682
 ] 

Peter Prettenhofer commented on SPARK-5133:
---

[~yalamart] I'm already working on it -- havent published a PR yet

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Venkata Vineel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614715#comment-14614715
 ] 

Venkata Vineel commented on SPARK-5133:
---

[~peter.prettenhofer]  Please assign it to yourself, so that others don't get 
confused and take it up.Can you help to choose and start working on similar 
bugs/features that others haven't already submitted PRs for.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615678#comment-14615678
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

Definitely.  Please look at the list of starters linked from 
[https://issues.apache.org/jira/browse/SPARK-8445].  If a JIRA is not assigned 
and no one has said they are working on it yet, then please comment to say 
you'd like to work on it.  We're working on finding more starter issues.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Venkata Vineel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615528#comment-14615528
 ] 

Venkata Vineel commented on SPARK-5133:
---

[~josephkb]  If you don't mind, can you please give me a couple of 
bugs/features to start with. I see some things not assigned to any one, but 
people are already working on them, so there is no clarity which ones are ready 
to be picked up. Please consider.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615507#comment-14615507
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

[~yalamart] I see you're looking into quite a few PRs.  It's good to ping 
people to see the status, but please work on only 1 at a time, especially while 
getting started.  Also, please get familiar with 
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] as 
you get started.  Thanks!

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Venkata Vineel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614622#comment-14614622
 ] 

Venkata Vineel commented on SPARK-5133:
---

[~peter.prettenhofer]  Is this still open and can I work on it ?

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-06 Thread Venkata Vineel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614620#comment-14614620
 ] 

Venkata Vineel commented on SPARK-5133:
---

Can I work on this feature ? Please let me know. I m interested.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-06-25 Thread Peter Prettenhofer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601385#comment-14601385
 ] 

Peter Prettenhofer commented on SPARK-5133:
---

[~josephkb] definitely - will start compiling a PR for feature importance via 
decrease in impurity.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-06-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601657#comment-14601657
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

Great, thank you!  Please ping me on the PR when you submit it.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-06-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600140#comment-14600140
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

It's high time we add this to MLlib, so I'm adding this to the 1.5 roadmap.  
[~peter.prettenhofer] If you are still interested in this, please feel free to 
take it.  Or if others are interested, please comment on this JIRA.

The initial API should be quite simple; I'm imagining a single method returning 
importance for each feature, modeled after what R or other libraries return.

I think we should calculate importance based on the learned model.  The 
permutation test would be nice in the future but would be much more expensive 
(shuffling data).

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-06-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596681#comment-14596681
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

Linking related umbrella which discusses adding stats like this to models.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
Priority: Minor

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-04-15 Thread Parv Oberoi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496625#comment-14496625
 ] 

Parv Oberoi commented on SPARK-5133:


this would be a really useful feature to have in MLLib Ensemble Tree Training. 
Exporting Feature Importance is a highly useful during feature engineering.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
Priority: Minor

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org