Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-24 Thread Arun Luthra
Also, for the record, turning on Kryo did not help.
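
For reference, "turning on Kryo" here just means switching the data
serializer (a minimal sketch, not the real job code):

  import org.apache.spark.SparkConf

  // Sketch: switch the data serializer to Kryo (class registration omitted).
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

As far as I can tell, closures are serialized with the Java serializer
regardless of this setting (which matches the JavaSerializer frames in the
stack trace below), so perhaps it is not surprising that it made no difference.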

On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra  wrote:



Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-23 Thread Arun Luthra
Splitting up the Maps into separate objects did not help.

However, I was able to work around the problem by reimplementing it with
RDD joins.
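
Roughly, the workaround looks like this (a sketch with stand-in names, not the
real code; the idea is to ship the lookup data as an RDD and join on the key
instead of capturing a driver-side Map in the map() closure):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("join-workaround"))

  // Stand-ins for the real data:
  val lookupTable: Map[String, Int] = Map("a" -> 1, "b" -> 2)
  val records: RDD[(String, String)] = sc.parallelize(Seq("a" -> "x", "b" -> "y"))

  // Before: records.map { case (k, v) => (k, v, lookupTable(k)) }  // captures lookupTable
  // After: turn the lookup table into an RDD and join on the key instead.
  val lookupRdd: RDD[(String, Int)] = sc.parallelize(lookupTable.toSeq)
  val joined: RDD[(String, (String, Int))] = records.join(lookupRdd)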

On Aug 18, 2016 5:16 PM, "Arun Luthra"  wrote:



Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-18 Thread Arun Luthra
This might be caused by a few large Map objects that Spark is trying to
serialize. These are not broadcast variables or anything special; they are
just regular objects.

Would it help if I further indexed these maps into a two-level Map, i.e.
Map[String, Map[String, Int]]? Or would this still count against me?

What if I manually split them up into numerous Map variables?
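
For concreteness, the two restructurings I have in mind look roughly like this
(a sketch; flat is a stand-in for one of the real Maps and the keys are made
up):

  // Stand-in for one of the large Maps captured by the closure.
  val flat: Map[String, Int] = Map("us|ca" -> 1, "us|ny" -> 2, "fr|75" -> 3)

  // Re-indexed as a two-level Map, split on the first component of the key.
  val nested: Map[String, Map[String, Int]] =
    flat.groupBy { case (k, _) => k.split('|')(0) }
        .mapValues(_.map { case (k, v) => k.split('|')(1) -> v })

  // Or split manually into several smaller Map variables.
  val usMap = flat.filter { case (k, _) => k.startsWith("us") }
  val frMap = flat.filter { case (k, _) => k.startsWith("fr") }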

On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra  wrote:



Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-15 Thread Arun Luthra
I got this OOM error in Spark local mode. The error seems to have occurred at
the start of a stage (all of the stages shown on the UI were complete; there
were more stages to run, but they had not yet appeared on the UI).

There appeared to be ~100G of free memory at the time of the error.

Spark 2.0.0
200G driver memory
local[30]
8 /mntX/tmp directories for spark.local.dir
"spark.sql.shuffle.partitions", "500"
"spark.driver.maxResultSize","500"
"spark.default.parallelism", "1000"

The line number in the error points to an RDD map operation where some
potentially large Map objects are going to be accessed by each record. Does it
matter whether they are broadcast variables or not? I imagine not, because in
local mode they should be available in memory to every executor/core anyway.
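
To make the shape of that line concrete, it is roughly the following pattern
(a sketch; records and bigMap are stand-ins for the real RDD and Maps), with
the broadcast variant shown for comparison:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))
  val records = sc.parallelize(Seq("a", "b", "c"))          // stand-in for the real RDD
  val bigMap: Map[String, Int] = Map("a" -> 1, "b" -> 2)    // stand-in for the large Maps

  // Pattern at the failing line: bigMap is captured by the closure, so
  // SparkContext.clean() serializes it with the Java serializer (the
  // ClosureCleaner / JavaSerializer frames in the trace below).
  val direct = records.map(r => (r, bigMap.getOrElse(r, 0)))

  // Broadcast variant: the closure only captures the lightweight Broadcast
  // handle; the map itself is shipped once through the broadcast machinery.
  val bigMapBc = sc.broadcast(bigMap)
  val viaBroadcast = records.map(r => (r, bigMapBc.value.getOrElse(r, 0)))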

Possibly related:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ClosureCleaner-or-java-serializer-OOM-when-trying-to-grow-td24796.html

Exception in thread "main" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.map(RDD.scala:365)
at abc.Abc$.main(abc.scala:395)
at abc.Abc.main(abc.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)