Barry Becker created SPARK-24394:
------------------------------------

Summary: Nodes in decision tree sometimes have negative impurity values
Key: SPARK-24394
URL: https://issues.apache.org/jira/browse/SPARK-24394
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.3.0
Environment: Spark 2.3.0
ML linux
Reporter: Barry Becker

After doing some reading about Gini- and entropy-based impurity (see [https://spark.apache.org/docs/2.2.0/mllib-decision-tree.html]), it seems that impurity values should never be negative (and, for a two-class problem, should lie between 0 and 1). However, some leaf nodes (usually, but not always, those with the minimum number of records) have negative impurity values (usually -1, but not always).

This seems like a bug in the impurity calculation, but I am not sure. It happens for both Gini and entropy impurity, at slightly different nodes.

I can reproduce this with almost any dataset using fairly standard parameters like the following:

    new DecisionTreeClassifier()
      .setLabelCol(targetName)
      .setMaxBins(100)
      .setMaxDepth(5)
      .setMinInfoGain(0.01)
      .setMinInstancesPerNode(5)
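
As a point of reference for the expected bounds, here is a minimal sketch of the textbook Gini and entropy definitions computed from per-class record counts. It is not Spark's internal implementation; the object name `ImpuritySketch` and its methods are hypothetical and used only for illustration. With valid (non-negative) counts, neither formula can produce a negative value, which is why a reported impurity of -1 looks suspicious.

```scala
// Illustrative sketch only: hypothetical names, not Spark's internal code.
object ImpuritySketch {

  /** Gini impurity: 1 - sum(p_i^2) over the class proportions p_i. */
  def gini(counts: Seq[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else 1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  /** Entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes. */
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else counts.filter(_ > 0.0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2.0)
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // A balanced two-class node, a pure node, and a small mixed leaf.
    val examples = Seq(Seq(5.0, 5.0), Seq(7.0, 0.0), Seq(4.0, 1.0))
    examples.foreach { counts =>
      println(s"counts=$counts gini=${gini(counts)} entropy=${entropy(counts)}")
    }
  }
}
```

For k classes, Gini computed this way stays within [0, 1 - 1/k] and entropy within [0, log2(k)], so the upper bound of 1 holds for entropy only in the two-class case, while the lower bound of 0 holds in general.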