[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen reassigned SPARK-13677:
---------------------------------

    Assignee: zhengruifeng

> Support Tree-Based Feature Transformation for ML
> ------------------------------------------------
>
>                 Key: SPARK-13677
>                 URL: https://issues.apache.org/jira/browse/SPARK-13677
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> It would be nice to be able to use RF and GBT for feature transformation:
> first fit an ensemble of trees (RF, GBT, or another TreeEnsembleModel) on
> the training set. Each leaf of each tree in the ensemble is then assigned a
> fixed, arbitrary feature index in a new feature space, and these leaf
> indices are encoded in a one-hot fashion.
> This method was first introduced by Facebook
> (http://www.herbrich.me/papers/adclicksfacebook.pdf) and is implemented in
> several well-known libraries:
> sklearn apply:
> http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py
> xgboost predict_leaf_index:
> https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py
> lightgbm predict_leaf_index:
> https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index
> catboost calc_leaf_index:
> https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation
>
> Referring to the design of the implementations above, I propose the
> following API:
> val model1: DecisionTreeClassificationModel = ...
> model1.setLeafCol("leaves")
> model1.transform(df)
>
> val model2: GBTClassificationModel = ...
> model2.getLeafCol
> model2.transform(df)
>
> The detailed design doc:
> https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
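
The transformation described in the issue (map each row to one leaf per tree, then one-hot encode those leaf indices) can be illustrated with a minimal plain-Scala sketch. It does not use Spark and is not the proposed implementation: `Node`, `Leaf`, `Split`, `LeafEncoding`, and `Demo` are hypothetical names, the trees are built by hand rather than fitted, and leaves are assumed to be numbered from 0 within each tree.

```scala
// Hypothetical, hand-built tree structure for illustration only.
// A real ensemble would come from fitting RF or GBT on training data.
sealed trait Node
final case class Leaf(index: Int) extends Node
final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

object LeafEncoding {
  // Walk one tree and return the index of the leaf the row lands in.
  @annotation.tailrec
  def leafIndex(node: Node, row: Array[Double]): Int = node match {
    case Leaf(i)           => i
    case Split(f, t, l, r) => leafIndex(if (row(f) <= t) l else r, row)
  }

  // Count the leaves of a tree (assumes leaves are numbered 0 until n).
  def numLeaves(node: Node): Int = node match {
    case Leaf(_)           => 1
    case Split(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  // One leaf index per tree: the content a "leaves" column would hold for a row.
  def leaves(trees: Seq[Node], row: Array[Double]): Seq[Int] =
    trees.map(leafIndex(_, row))

  // Concatenate one one-hot block per tree into a single feature vector.
  def oneHot(trees: Seq[Node], row: Array[Double]): Array[Double] = {
    val offsets = trees.map(numLeaves).scanLeft(0)(_ + _) // start of each tree's block
    val out = Array.fill(offsets.last)(0.0)
    trees.zip(offsets).foreach { case (tree, offset) =>
      out(offset + leafIndex(tree, row)) = 1.0
    }
    out
  }
}

object Demo {
  // Two small trees with 3 and 2 leaves respectively.
  val trees: Seq[Node] = Seq(
    Split(0, 0.5, Leaf(0), Split(1, 2.0, Leaf(1), Leaf(2))),
    Split(1, 1.0, Leaf(0), Leaf(1))
  )
  val row: Array[Double] = Array(0.9, 1.5)
}
```

With these hand-built trees, `LeafEncoding.leaves(Demo.trees, Demo.row)` gives one leaf index per tree, and `LeafEncoding.oneHot(Demo.trees, Demo.row)` gives a vector of length 5 (3 + 2 leaves) with exactly one non-zero entry per tree. The resulting sparse binary features are what a downstream linear model would consume in the approach the issue describes.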