[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-24 Thread Zsolt Tóth (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906008#comment-14906008 ]

Zsolt Tóth commented on SPARK-10487:


Increasing the perm size on the driver fixes the OOM: 
spark.driver.extraJavaOptions="-XX:MaxPermSize=128m"
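In cluster deploy mode this needs to be passed at submit time (a quick sketch; the script name is just a placeholder):

{code}
spark-submit \
  --master yarn-cluster \
  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=128m \
  my_app.py
{code}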

> MLlib model fitting causes DataFrame write to break with OutOfMemory exception
> -------------------------------------------------------------------------------
>
> Key: SPARK-10487
> URL: https://issues.apache.org/jira/browse/SPARK-10487
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tried on a CentOS-based 1-node YARN cluster in Docker and on a 
> real-world CDH5 cluster
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest 
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 
> -DzincPort=3034
> I'm using the default resource setup
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor 
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: <memory:1408, vCores:1>)
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: <memory:1408, vCores:1>)
>Reporter: Zoltan Toth
>
> After fitting a _spark.ml_ or _mllib_ model in *cluster* deploy mode, no 
> dataframes can be written to HDFS. The driver receives an OutOfMemory 
> exception during the write. It seems, however, that the file gets written 
> successfully.
>  * This happens both in SparkR and pyspark
>  * Only happens in cluster deploy mode
>  * The write fails regardless of the size of the dataframe and whether the 
> dataframe is associated with the ml model.
> REPRO:
> {code}
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
> from pyspark.ml.classification import LogisticRegression
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vector, Vectors
> conf = SparkConf().setAppName("LogRegTest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
> sqlContext.setConf("park.sql.parquet.compression.codec", "uncompressed")
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
> df = training.toDF()
> reg = LogisticRegression().setMaxIter(10).setRegParam(0.01)
> model = reg.fit(df)
> # Note that this is a brand new dataframe:
> one_df = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF()
> one_df.write.mode("overwrite").parquet("/tmp/df.parquet")
> {code}






[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-24 Thread Joseph K. Bradley (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906802#comment-14906802 ]

Joseph K. Bradley commented on SPARK-10487:
---

As far as I can tell, there isn't a big difference between R and Scala, or between 
fitting and not fitting.  I think it's because of the small amount of memory 
available.  I'll close this for now.




[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-24 Thread Joseph K. Bradley (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906656#comment-14906656 ]

Joseph K. Bradley commented on SPARK-10487:
---

Ohh, that's very helpful.  I suspect it's because Parquet allocates large 
buffers for each column.  It's still surprising to me since there are only 2 
columns.  (I've only seen this problem before with saving decision trees, which 
creates 13+ columns.)  I'm wondering if some of the data from model fitting is 
still cached and does not get kicked out of the cache when needed.  I'll try 
running this and will monitor the Spark UI to see if some temp data are staying 
cached unnecessarily.
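One quick way to test that from the repro script itself (a sketch; `SQLContext.clearCache()` is the only extra call):

{code}
# Hypothetical check: drop anything still cached from model fitting
# before attempting the write, and see if the OOM goes away.
sqlContext.clearCache()
one_df.write.mode("overwrite").parquet("/tmp/df.parquet")
{code}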

Note though that you're using a very small amount of memory.  In general, I try 
to use about 20GB for the driver and 8GB for executors for common jobs.
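At submit time that rule of thumb would look something like this (a sketch, same placeholder script as above):

{code}
spark-submit \
  --master yarn-cluster \
  --driver-memory 20g \
  --executor-memory 8g \
  my_app.py
{code}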




[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-09 Thread Joseph K. Bradley (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737790#comment-14737790 ]

Joseph K. Bradley commented on SPARK-10487:
---

Does this failure require there to be an ML model at all?  Or can you reproduce 
it using only dataframes?

Also, can you reproduce it using nothing from ML (not using LabeledPoint or 
Vector)?
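
For the second check, a minimal script along these lines (a sketch; plain Rows, nothing imported from ML) would exercise the same write path:

{code}
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row

sc = SparkContext(conf=SparkConf().setAppName("NoMLRepro"))
sqlContext = SQLContext(sc)

# Two plain rows -- no LabeledPoint or Vector involved.
df = sqlContext.createDataFrame([Row(label=1.0, x=0.1),
                                 Row(label=0.0, x=-0.5)])
df.write.mode("overwrite").parquet("/tmp/no_ml_df.parquet")
{code}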




[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-09 Thread Zoltan Toth (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738194#comment-14738194 ]

Zoltan Toth commented on SPARK-10487:
-

Yes, it only happens if you use mllib or ML and fit a model. You don't need a 
LabeledPoint; e.g., if you fit a trivial `glm` linear regression on the built-in 
`iris` dataset in SparkR, it also fails.

{code}
library(SparkR)

sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

training <- createDataFrame(sqlContext, iris)
test <- select(training, "Sepal_Length")
model <- glm(Sepal_Width ~ Sepal_Length, training, family = "gaussian")
prediction <- predict(model, test)

SparkR:::saveAsParquetFile(prediction, "/tmp/SparkR-logreg-prediction-data")
{code} 

Again, this only happens in `cluster` deploy mode.
