[ https://issues.apache.org/jira/browse/SPARK-24866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Evan Zamir updated SPARK-24866:
-------------------------------
    Description:
I'm encountering very strange behavior that I can't explain other than as a bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core instance. On these models, I have been consistently getting ROC scores (during CV) of ~0.55-0.60 (not good models, obviously, but that's not the point here). After learning that Spark 2.3 introduced a parallelism parameter for the CV class, I decided to use it and see whether increasing the number of Core instances could also help speed up model training (which takes several hours, sometimes up to a full day).

To make a long story short, on some of my datasets, simply increasing the number of Core instances (e.g. to 2) makes the ROC scores (*bestValidationMetric*) increase tremendously, to the range of 0.85-0.95.

For the life of me, I can't figure out why simply increasing the number of instances (with absolutely no changes to code) would have this effect. I don't know if this is a Spark problem or somehow an EMR one, but I figured I'd post here and see if anyone has an idea for me.

> Artifactual ROC scores when scaling up Random Forest classifier
> ---------------------------------------------------------------
>
>                 Key: SPARK-24866
>                 URL: https://issues.apache.org/jira/browse/SPARK-24866
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Evan Zamir
>            Priority: Minor