Barry Becker created SPARK-24394:
------------------------------------

             Summary: Nodes in decision tree sometimes have negative impurity values
                 Key: SPARK-24394
                 URL: https://issues.apache.org/jira/browse/SPARK-24394
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
         Environment: Spark 2.3.0, ML, Linux
            Reporter: Barry Becker


After doing some reading about Gini- and entropy-based impurity (see 
[https://spark.apache.org/docs/2.2.0/mllib-decision-tree.html]), it seems that 
impurity values should always be bounded by 0 and 1. However, some leaf nodes 
(usually, but not always, those with the minimum number of records) have 
negative impurity values (usually -1, but not always). This seems like a bug 
in the impurity calculation, but I am not sure. It happens for both Gini and 
entropy impurity, at slightly different nodes. 
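
For reference, here is a minimal sketch of the two impurity formulas from the 
linked documentation, computed from per-class record counts. It is not Spark's 
implementation; the object and method names are made up for illustration. By 
these formulas neither measure can be negative for any counts: Gini stays 
below 1, and entropy is at most log2 of the number of classes.

```scala
object ImpuritySketch {
  // Gini impurity: 1 - sum_i p_i^2, where p_i = count_i / total.
  def gini(counts: Seq[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else 1.0 - counts.map(c => (c / total) * (c / total)).sum
  }

  // Entropy: -sum_i p_i * log2(p_i); empty classes contribute nothing.
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else counts.filter(_ > 0.0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2.0)
    }.sum
  }
}
```

A pure node (all records in one class) gives 0 under both measures, and an 
even two-class split gives the binary maximum, so a value such as -1 cannot 
come out of the formulas themselves.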

I can reproduce this with almost any dataset using pretty standard parameters 
like the following:

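import org.apache.spark.ml.classification.DecisionTreeClassifier

// targetName is the name of the label column in the training data
val dt = new DecisionTreeClassifier()
  .setLabelCol(targetName)
  .setMaxBins(100)
  .setMaxDepth(5)
  .setMinInfoGain(0.01)
  .setMinInstancesPerNode(5)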
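
To list the affected nodes in a fitted tree, something like the following 
sketch could be used. The walk is written generically so it runs without 
Spark; the names NegativeImpurityScan and negativeNodes are hypothetical, and 
the Spark wiring in the trailing comment assumes a fitted 
DecisionTreeClassificationModel held in a variable called model.

```scala
object NegativeImpurityScan {
  // Depth-first walk that returns every node whose impurity is below zero.
  // The caller supplies `impurity` and `children`, so the walk itself has
  // no dependency on Spark.
  def negativeNodes[N](root: N)(impurity: N => Double, children: N => Seq[N]): List[N] = {
    @annotation.tailrec
    def loop(stack: List[N], acc: List[N]): List[N] = stack match {
      case Nil => acc.reverse
      case n :: rest =>
        val next = children(n).toList ::: rest
        loop(next, if (impurity(n) < 0.0) n :: acc else acc)
    }
    loop(List(root), Nil)
  }
}

// Assumed wiring for a fitted Spark ML model (needs Spark on the classpath):
//   import org.apache.spark.ml.tree.{InternalNode, Node}
//   val bad = NegativeImpurityScan.negativeNodes[Node](model.rootNode)(
//     _.impurity,
//     { case n: InternalNode => Seq(n.leftChild, n.rightChild); case _ => Seq.empty })
```

Passing the accessors as functions keeps the traversal testable on a toy tree 
before pointing it at a real model.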

 


