[
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-13677.
---
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 25383
[https://github.com/apache/spark/pull/25383]
> Support Tree-Based Feature Transformation for ML
>
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
> Fix For: 3.0.0
>
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First, fit an ensemble of trees (RF, GBT, or another TreeEnsembleModel) on
> the training set. Each leaf of each tree in the ensemble is then assigned a
> fixed, arbitrary feature index in a new feature space. These leaf indices are
> then one-hot encoded.
> This method was first introduced by
> Facebook [http://www.herbrich.me/papers/adclicksfacebook.pdf] and is
> implemented in several well-known libraries:
> sklearn: apply
> [http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]
> xgboost: predict_leaf_index
> [https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]
> lightgbm: predict_leaf_index
> [https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index]
> catboost: calc_leaf_index
> [https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation]
>
>
> Referring to the design of the implementations above, I propose the following API:
> val model1: DecisionTreeClassificationModel = ...
> model1.setLeafCol("leaves")
> model1.transform(df)
>
> val model2: GBTClassificationModel = ...
> model2.getLeafCol
> model2.transform(df)
>
> The detailed design doc:
> [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]
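
The leaf-index encoding described in the issue can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions, not the code merged in the pull request: the `Node`, `Leaf`, and `Split` types, the `LeafEncoding` object, and the `Example` trees are all hypothetical names invented here. Each tree's leaves are given a contiguous block of positions in the output vector, and the position of the leaf that a sample reaches is set to 1.0.

```scala
// Hypothetical minimal binary decision tree, for illustration only.
// Leaves carry an index that is local to their own tree (0, 1, 2, ...).
sealed trait Node
final case class Leaf(index: Int) extends Node
final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

object LeafEncoding {
  // Walk one tree and return the local index of the leaf the sample lands in.
  @annotation.tailrec
  def leafIndex(node: Node, x: Array[Double]): Int = node match {
    case Leaf(i) => i
    case Split(f, t, l, r) =>
      if (x(f) <= t) leafIndex(l, x) else leafIndex(r, x)
  }

  // Count the leaves of a tree; this is the width of its one-hot block.
  def numLeaves(node: Node): Int = node match {
    case Leaf(_)           => 1
    case Split(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  // One-hot encode the leaves reached in every tree of the ensemble.
  // Tree k owns positions offsets(k) until offsets(k + 1) of the output.
  def oneHot(trees: Seq[Node], x: Array[Double]): Array[Double] = {
    val offsets = trees.map(numLeaves).scanLeft(0)(_ + _)
    val out = Array.fill(offsets.last)(0.0)
    trees.zip(offsets).foreach { case (tree, offset) =>
      out(offset + leafIndex(tree, x)) = 1.0
    }
    out
  }
}

object Example {
  // Two small hand-built trees with 2 and 3 leaves respectively.
  val trees: Seq[Node] = Seq(
    Split(0, 0.5, Leaf(0), Leaf(1)),
    Split(1, 2.0, Leaf(0), Split(0, 1.0, Leaf(1), Leaf(2)))
  )

  // The sample reaches leaf 1 in the first tree and leaf 1 in the second,
  // so positions 1 and 2 + 1 = 3 of the 5-element vector are set.
  val encoded: Array[Double] = LeafEncoding.oneHot(trees, Array(0.7, 3.0))
}
```

The resulting vector has exactly one non-zero entry per tree, so its length equals the total number of leaves in the ensemble. A sparse representation would be the natural choice for large ensembles; the dense array is used here only to keep the sketch short.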
--
This message was sent by Atlassian Jira
(v8.3.2#803003)