[jira] [Commented] (SPARK-28222) Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version
[ https://issues.apache.org/jira/browse/SPARK-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884412#comment-16884412 ]

Marco Gaido commented on SPARK-28222:
-------------------------------------

[~eneriwrt] do you have a simple repro for this? I can try and check it if I have an example to debug.

> Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28222
>                 URL: https://issues.apache.org/jira/browse/SPARK-28222
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>            Reporter: eneriwrt
>            Priority: Minor
>
> The feature importance values obtained in a binary classification project
> differ depending on whether version 2.3.3 or 2.4.0 is used. This happens with
> both Random Forest and GBT. It turns out that the values matching the sklearn
> output are the ones from version 2.3.3.
> As an example:
> *SPARK 2.4*
> MODEL RandomForestClassifier_gini [0.0, 0.4117930839002269,
> 0.06894132653061226, 0.15857667209786705, 0.2974447311021076,
> 0.06324418636918638]
> MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694,
> 0.06578883597468652, 0.17433924485055197, 0.31754597164210124,
> 0.055888697733790925]
> MODEL GradientBoostingClassifier [0.0, 0.7556,
> 0.24438, 0.0, 1.4602196686471875e-17, 0.0]
> *SPARK 2.3.3*
> MODEL RandomForestClassifier_gini [0.0, 0.40957086167800455,
> 0.06894132653061226, 0.16413222765342259, 0.2974447311021076,
> 0.05991085303585305]
> MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694,
> 0.06578883597468652, 0.18789704501922055, 0.30398817147343266,
> 0.055888697733790925]
> MODEL GradientBoostingClassifier [0.0, 0.7555,
> 0.24438, 0.0, 2.4326753518951276e-17, 0.0]

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28222) Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version
[ https://issues.apache.org/jira/browse/SPARK-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881855#comment-16881855 ]

eneriwrt commented on SPARK-28222:
----------------------------------

Yes, Spark 2.3.3 and Spark 2.3.0 produce the same values as sklearn. So I guess that, until 3.0, versions 2.3.x and 2.4.x will keep producing different values, and the correct ones are probably the 2.3.x results.
[jira] [Commented] (SPARK-28222) Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version
[ https://issues.apache.org/jira/browse/SPARK-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877267#comment-16877267 ]

Marco Gaido commented on SPARK-28222:
-------------------------------------

Mmmmh, there has been a bug fix for it (see SPARK-26721), but it should be in 3.0 only AFAIK. The question is: which is the right value? Can you compare it with other libs like sklearn?
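
To narrow down where the two versions disagree, the importance vectors quoted in the issue description can be compared element by element. The following is only a minimal sketch in plain Python: the Random Forest (gini) values are copied from the report, while the helper name `differing_indices` and the tolerance are illustrative assumptions, not part of Spark or of the reporter's code.

```python
import math

# Random Forest (gini) feature importances, copied from the issue description.
rf_gini_24 = [0.0, 0.4117930839002269, 0.06894132653061226,
              0.15857667209786705, 0.2974447311021076, 0.06324418636918638]
rf_gini_233 = [0.0, 0.40957086167800455, 0.06894132653061226,
               0.16413222765342259, 0.2974447311021076, 0.05991085303585305]


def differing_indices(a, b, tol=1e-9):
    """Hypothetical helper: return (index, a - b) for each feature whose
    importance differs between the two vectors by more than tol."""
    return [(i, x - y) for i, (x, y) in enumerate(zip(a, b)) if abs(x - y) > tol]


# Print only the features whose importance changed between versions.
for i, delta in differing_indices(rf_gini_24, rf_gini_233):
    print(f"feature {i}: 2.4 minus 2.3.3 = {delta:+.6f}")

# Sanity check: each vector is normalized to sum to 1, so the shifts must cancel.
assert math.isclose(sum(rf_gini_24), 1.0, abs_tol=1e-9)
assert math.isclose(sum(rf_gini_233), 1.0, abs_tol=1e-9)
```

For the gini vectors only features 1, 3 and 5 differ, and features 0, 2 and 4 are identical in both versions. The same helper could be applied to the entropy and GBT vectors from the report.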