Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS
Also for the record, turning on Kryo did not help.

On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra wrote:

> Splitting up the Maps into separate objects did not help.
>
> However, I was able to work around the problem by reimplementing it with
> RDD joins.
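For context, "turning on Kryo" presumably means switching the serializer via SparkConf, roughly as below. One plausible reason it did not help: in Spark 2.0, task closures are serialized by a dedicated closure serializer that is always Java serialization regardless of this setting (note the JavaSerializer frames in the trace), so spark.serializer only affects shuffled/cached data. The config keys are real; the app name is illustrative.

```scala
import org.apache.spark.SparkConf

// Sketch of enabling Kryo for data serialization (not closures).
val conf = new SparkConf()
  .setAppName("abc") // illustrative
  .setMaster("local[30]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```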
Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS
Splitting up the Maps into separate objects did not help.

However, I was able to work around the problem by reimplementing it with RDD joins.

On Aug 18, 2016 5:16 PM, "Arun Luthra" wrote:

> This might be caused by a few large Map objects that Spark is trying to
> serialize. These are not broadcast variables or anything, they're just
> regular objects.
>
> Would it help if I further indexed these maps into a two-level Map, i.e.
> Map[String, Map[String, Int]]? Or would this still count against me?
>
> What if I manually split them up into numerous Map variables?
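A minimal sketch of what "reimplementing it with RDD joins" might look like (all names here are illustrative, not from the original code): rather than looking records up in a driver-side Map inside a map() closure, the Map becomes an RDD and the lookup becomes a join. parallelize still ships the Map's entries from the driver, but partition by partition, rather than as one giant serialized closure.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical workaround: join against the lookup data instead of
// capturing a large Map in the map() closure.
def withLookup(sc: SparkContext,
               records: RDD[(String, Long)],
               lookup: Map[String, Int]): RDD[(String, (Long, Int))] = {
  // The Map's entries become a keyed RDD; nothing large rides in a closure.
  val lookupRdd: RDD[(String, Int)] = sc.parallelize(lookup.toSeq)
  records.join(lookupRdd) // inner join: drops keys absent from the Map
}
```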
Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS
This might be caused by a few large Map objects that Spark is trying to serialize. These are not broadcast variables or anything, they're just regular objects.

Would it help if I further indexed these maps into a two-level Map, i.e. Map[String, Map[String, Int]]? Or would this still count against me?

What if I manually split them up into numerous Map variables?

On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra wrote:

> I got this OOM error in Spark local mode. The error seems to have occurred
> at the start of a stage (all of the stages on the UI showed as complete;
> there were more stages to do, but they had not shown up on the UI yet).
>
> There appears to be ~100G of free memory at the time of the error.
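On the broadcast question raised in this thread: a broadcast variable would sidestep closure serialization, because the closure then captures only a small Broadcast handle rather than the Map itself. A sketch under assumed names (the two-level Map type is from the message above; everything else is illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical use of a broadcast variable for the large two-level Map.
def tagRecords(sc: SparkContext,
               records: RDD[String],
               bigMap: Map[String, Map[String, Int]]): RDD[Int] = {
  val bc = sc.broadcast(bigMap) // shipped once, not per closure
  records.map { key =>
    // Closure captures only bc, a small handle to the broadcast value.
    bc.value.get(key).map(_.size).getOrElse(0)
  }
}
```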
Spark 2.0.0 OOM error at beginning of RDD map on AWS
I got this OOM error in Spark local mode. The error seems to have occurred at the start of a stage (all of the stages on the UI showed as complete; there were more stages to do, but they had not shown up on the UI yet).

There appears to be ~100G of free memory at the time of the error.

Spark 2.0.0
200G driver memory
local[30]
8 /mntX/tmp directories for spark.local.dir
"spark.sql.shuffle.partitions", "500"
"spark.driver.maxResultSize", "500"
"spark.default.parallelism", "1000"

The line number for the error points at an RDD map operation where some potentially large Map objects are going to be accessed by each record. Does it matter whether they are broadcast variables or not? I imagine not, because it's in local mode, so they should be available in memory to every executor/core.

Possibly related:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ClosureCleaner-or-java-serializer-OOM-when-trying-to-grow-td24796.html

Exception in thread "main" java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
        at org.apache.spark.rdd.RDD.map(RDD.scala:365)
        at abc.Abc$.main(abc.scala:395)
        at abc.Abc.main(abc.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
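A note on the trace: the ClosureCleaner$.ensureSerializable frames show the failure happens while Java-serializing the map() closure itself, before the stage runs, and ByteArrayOutputStream.hugeCapacity throws OutOfMemoryError when the requested buffer size overflows the maximum Java array size (about 2 GB). That would mean the serialized closure outgrew a single byte array, not that the heap was exhausted, which fits the ~100G-free observation. A sketch of the pattern that triggers this (names are illustrative, not from abc.scala):

```scala
import org.apache.spark.rdd.RDD

// Referencing a driver-side Map inside rdd.map() makes
// SparkContext.clean() / ClosureCleaner Java-serialize the closure,
// including the captured Map, before the stage is scheduled.
object Sketch {
  val bigMap: Map[String, Int] = Map.empty // imagine many GB of entries

  def tag(records: RDD[String]): RDD[Int] =
    records.map { r =>
      bigMap.getOrElse(r, 0) // closure captures Sketch, and bigMap with it
    }
}
```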