[ https://issues.apache.org/jira/browse/SPARK-23409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Solimando updated SPARK-23409:
-----------------------------------------
    Description:
Improvement: redundancy elimination from decision trees where all the leaves of a given subtree share the same prediction.

Benefits:
* Model interpretability
* Faster unitary model invocation (relevant for a massive number of invocations)
* Smaller model memory footprint

For instance, consider the following decision tree.

{panel:title=Original Decision Tree}
{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
  If (feature 1 <= 0.5)
    If (feature 2 <= 0.5)
      If (feature 0 <= 0.5)
        Predict: 0.0
      Else (feature 0 > 0.5)
        Predict: 0.0
    Else (feature 2 > 0.5)
      If (feature 0 <= 0.5)
        Predict: 0.0
      Else (feature 0 > 0.5)
        Predict: 0.0
  Else (feature 1 > 0.5)
    If (feature 2 <= 0.5)
      If (feature 0 <= 0.5)
        Predict: 1.0
      Else (feature 0 > 0.5)
        Predict: 1.0
    Else (feature 2 > 0.5)
      If (feature 0 <= 0.5)
        Predict: 0.0
      Else (feature 0 > 0.5)
        Predict: 0.0
{noformat}
{panel}

Given the first tree as input, the proposed method aims to produce the following (semantically equivalent) tree:

{panel:title=Pruned Decision Tree}
{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5 nodes
  If (feature 1 <= 0.5)
    Predict: 0.0
  Else (feature 1 > 0.5)
    If (feature 2 <= 0.5)
      Predict: 1.0
    Else (feature 2 > 0.5)
      Predict: 0.0
{noformat}
{panel}

  was:
Improvement: redundancy elimination from decision trees where all the leaves of a given subtree share the same prediction.

Benefits:
* Model interpretability
* Faster unitary model invocation (relevant for massive )
* Smaller model memory footprint

For instance, consider the following decision tree.

{panel:title=Original Decision Tree}
{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
  If (feature 1 <= 0.5)
    If (feature 2 <= 0.5)
      If (feature 0 <= 0.5)
        Predict: 0.0
      Else (feature 0 > 0.5)
        Predict: 0.0
    Else (feature 2 > 0.5)
      If (feature 0 <= 0.5)
        Predict: 0.0
      Else (feature 0 > 0.5)
        Predict: 0.0
  Else (feature 1 > 0.5)
    If (feature 2 <= 0.5)
      If (feature 0 <= 0.5)
        Predict: 1.0
      Else (feature 0 > 0.5)
        Predict: 1.0
    Else (feature 2 > 0.5)
      If (feature 0 <= 0.5)
        Predict: 0.0
      Else (feature 0 > 0.5)
        Predict: 0.0
{noformat}
{panel}

Given the first tree as input, the proposed method aims to produce the following (semantically equivalent) tree:

{panel:title=Pruned Decision Tree}
{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5 nodes
  If (feature 1 <= 0.5)
    Predict: 0.0
  Else (feature 1 > 0.5)
    If (feature 2 <= 0.5)
      Predict: 1.0
    Else (feature 2 > 0.5)
      Predict: 0.0
{noformat}
{panel}


> RandomForest/DecisionTree (syntactic) pruning of redundant subtrees
> -------------------------------------------------------------------
>
>                 Key: SPARK-23409
>                 URL: https://issues.apache.org/jira/browse/SPARK-23409
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.2.1
>         Environment:
>            Reporter: Alessandro Solimando
>            Priority: Minor
>
> Improvement: redundancy elimination from decision trees where all the leaves of a given subtree share the same prediction.
>
> Benefits:
> * Model interpretability
> * Faster unitary model invocation (relevant for a massive number of invocations)
> * Smaller model memory footprint
>
> For instance, consider the following decision tree.
>
> {panel:title=Original Decision Tree}
> {noformat}
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
>   If (feature 1 <= 0.5)
>     If (feature 2 <= 0.5)
>       If (feature 0 <= 0.5)
>         Predict: 0.0
>       Else (feature 0 > 0.5)
>         Predict: 0.0
>     Else (feature 2 > 0.5)
>       If (feature 0 <= 0.5)
>         Predict: 0.0
>       Else (feature 0 > 0.5)
>         Predict: 0.0
>   Else (feature 1 > 0.5)
>     If (feature 2 <= 0.5)
>       If (feature 0 <= 0.5)
>         Predict: 1.0
>       Else (feature 0 > 0.5)
>         Predict: 1.0
>     Else (feature 2 > 0.5)
>       If (feature 0 <= 0.5)
>         Predict: 0.0
>       Else (feature 0 > 0.5)
>         Predict: 0.0
> {noformat}
> {panel}
>
> Given the first tree as input, the proposed method aims to produce the following (semantically equivalent) tree:
>
> {panel:title=Pruned Decision Tree}
> {noformat}
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5 nodes
>   If (feature 1 <= 0.5)
>     Predict: 0.0
>   Else (feature 1 > 0.5)
>     If (feature 2 <= 0.5)
>       Predict: 1.0
>     Else (feature 2 > 0.5)
>       Predict: 0.0
> {noformat}
> {panel}
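
The pruning idea in the description can be sketched as a bottom-up pass that collapses any split whose two (already pruned) children are leaves with the same prediction. This is only an illustrative sketch in plain Python, not Spark's implementation: the `Leaf` and `Split` classes and the `prune`, `predict`, `num_nodes`, `depth` and `on_feature_0` helpers are hypothetical names introduced here, and the sketch compares predicted labels only, ignoring any per-leaf statistics a real model may carry.

```python
from dataclasses import dataclass
from itertools import product
from typing import Sequence, Union


@dataclass(frozen=True)
class Leaf:
    prediction: float


@dataclass(frozen=True)
class Split:
    feature: int
    threshold: float
    left: "Node"   # followed when features[feature] <= threshold
    right: "Node"  # followed when features[feature] > threshold


Node = Union[Leaf, Split]


def prune(node: Node) -> Node:
    """Collapse, bottom-up, every split whose pruned children are equal-valued leaves."""
    if isinstance(node, Leaf):
        return node
    left, right = prune(node.left), prune(node.right)
    if (isinstance(left, Leaf) and isinstance(right, Leaf)
            and left.prediction == right.prediction):
        return Leaf(left.prediction)
    return Split(node.feature, node.threshold, left, right)


def predict(node: Node, features: Sequence[float]) -> float:
    """Walk from the root to a leaf and return its prediction."""
    while isinstance(node, Split):
        node = node.left if features[node.feature] <= node.threshold else node.right
    return node.prediction


def num_nodes(node: Node) -> int:
    if isinstance(node, Leaf):
        return 1
    return 1 + num_nodes(node.left) + num_nodes(node.right)


def depth(node: Node) -> int:
    if isinstance(node, Leaf):
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def on_feature_0(prediction: float) -> Split:
    """A redundant split on feature 0: both branches predict the same value."""
    return Split(0, 0.5, Leaf(prediction), Leaf(prediction))


# The "Original Decision Tree" from the description.
original = Split(1, 0.5,
                 Split(2, 0.5, on_feature_0(0.0), on_feature_0(0.0)),
                 Split(2, 0.5, on_feature_0(1.0), on_feature_0(0.0)))
pruned = prune(original)

print(num_nodes(original), depth(original))  # 15 3
print(num_nodes(pruned), depth(pruned))      # 5 2

# Both trees agree on every combination of the three binary features.
assert all(predict(original, x) == predict(pruned, x)
           for x in product([0.0, 1.0], repeat=3))
```

Because children are pruned before their parent is examined, a whole subtree of equal-valued leaves collapses in a single traversal, which is what turns the left half of the example tree into one `Predict: 0.0` leaf.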