[ 
https://issues.apache.org/jira/browse/SPARK-24866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Zamir updated SPARK-24866:
-------------------------------
    Description: I'm encountering very strange behavior that I can't explain
other than as a bug somewhere. I'm creating RF models on Amazon EMR, normally
using 1 Core instance. On these models, I have consistently been getting ROC
scores (during CV) of ~0.55-0.60 (not good models, obviously, but that's not
the point here). After learning that Spark 2.3 introduced a parallelism
parameter for the CV class, I decided to use it and see whether increasing the
number of Core instances could also help speed up model training (which takes
several hours, sometimes up to a full day). To make a long story short, on
some of my datasets, simply increasing the number of Core instances (i.e., to
2) raises the ROC scores (*bestValidationMetric*) tremendously, to the range
of 0.85-0.95. For the life of me, I can't figure out why simply increasing the
number of instances (with absolutely no changes to code) would have this
effect. I don't know if this is a Spark problem or somehow an EMR one, but I
figured I'd post here and see if anyone has an idea.   (was: I'm encountering
very strange behavior that I can't explain other than as a bug somewhere. I'm
creating RF models on Amazon EMR, normally using 1 Core instance. On these
models, I have consistently been getting ROC scores (during CV) of ~0.55-0.60
(not good models, obviously, but that's not the point here). After learning
that Spark 2.3 introduced a parallelism parameter for the CV class, I decided
to use it and see whether increasing the number of Core instances could also
help speed up model training (which takes several hours, sometimes up to a
full day). To make a long story short, on some of my datasets, simply
increasing the number of Core instances (i.e., to 2) raises the ROC scores
tremendously, to the range of 0.85-0.95. For the life of me, I can't figure
out why simply increasing the number of instances (with absolutely no changes
to code) would have this effect. I don't know if this is a Spark problem or
somehow an EMR one, but I figured I'd post here and see if anyone has an
idea. )

> Artifactual ROC scores when scaling up Random Forest classifier
> ---------------------------------------------------------------
>
>                 Key: SPARK-24866
>                 URL: https://issues.apache.org/jira/browse/SPARK-24866
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Evan Zamir
>            Priority: Minor
>
> I'm encountering very strange behavior that I can't explain other than as a
> bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core
> instance. On these models, I have consistently been getting ROC scores
> (during CV) of ~0.55-0.60 (not good models, obviously, but that's not the
> point here). After learning that Spark 2.3 introduced a parallelism
> parameter for the CV class, I decided to use it and see whether increasing
> the number of Core instances could also help speed up model training (which
> takes several hours, sometimes up to a full day). To make a long story
> short, on some of my datasets, simply increasing the number of Core
> instances (i.e., to 2) raises the ROC scores (*bestValidationMetric*)
> tremendously, to the range of 0.85-0.95. For the life of me, I can't figure
> out why simply increasing the number of instances (with absolutely no
> changes to code) would have this effect. I don't know if this is a Spark
> problem or somehow an EMR one, but I figured I'd post here and see if anyone
> has an idea.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
