I'm running on an AWS cluster of 10 x m1.large (64 bit, 7.5 GiB RAM). FWIW
I'm using the Flambo Clojure wrapper which uses the Java API but I don't
think that should make any difference. I'm running with the following
command:

spark/bin/spark-submit --class mything.core --name "My Thing" --conf
spark.yarn.executor.memoryOverhead=4096 --conf
spark.executor.extraJavaOptions="-XX:+CMSClassUnloadingEnabled
-XX:+CMSPermGenSweepingEnabled" /root/spark/code/myjar.jar
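Since the error below is a PermGen OOM, one variant I'm considering trying next (untested, and the -XX:MaxPermSize value is just a guess on my part) is the same submit command with the PermGen cap raised on the executors:

```shell
# Untested variant of the command above: identical, except the executor
# JVM options also raise the PermGen ceiling (-XX:MaxPermSize=512m is a
# guessed value, not something I've validated).
spark/bin/spark-submit \
  --class mything.core \
  --name "My Thing" \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled" \
  /root/spark/code/myjar.jar
```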

For one of the stages I'm getting errors:

 - ExecutorLostFailure (executor lost)
 - Resubmitted (resubmitted due to lost executor)

And I think they're caused by the slave executor JVMs dying with this error:

java.lang.OutOfMemoryError: PermGen space
        java.lang.Class.getDeclaredConstructors0(Native Method)
        java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
        java.lang.Class.getConstructor0(Class.java:2885)
        java.lang.Class.newInstance(Class.java:350)
        sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
        sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:396)
        java.security.AccessController.doPrivileged(Native Method)
        sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:395)
        sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:113)
        sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:331)
        java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
        java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
        java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
        java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        java.security.AccessController.doPrivileged(Native Method)
        java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)


One stage out of 14 (so far) is failing. The failing stage shows 1768 succeeded out of 1862 tasks, with 940 failed: 7 tasks failed with the OOM above, and 919 were "Resubmitted (resubmitted due to lost executor)".

"Aggregated Metrics by Executor" now shows "CANNOT FIND ADDRESS" for 10 out of 16 executors, which I imagine means those JVMs blew up and haven't been restarted. The 'Executors' tab now shows only 7 executors.

 - Is this normal?
 - Any ideas why this is happening?
 - Any other measures I can take to prevent this?
 - Is the rest of my app going to run on a reduced number of executors?
 - Can I re-start the executors mid-application? This is a long-running
job, so I'd like to do what I can whilst it's running, if possible.
 - Am I correct in thinking that the --conf arguments are supplied to the
JVMs of the slave executors, so they will be receiving the extraJavaOptions
and memoryOverhead?
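On that last point, one way I could imagine verifying what the executor JVMs actually received would be to log the JVM's input arguments from inside a task. A minimal standalone sketch of just the check (the class name is mine; run inside an executor it would show whether the extraJavaOptions arrived, here it simply inspects the current JVM):

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Minimal sketch: print the flags the current JVM was started with.
// Inside a Spark task this output would land in the executor's stdout log,
// so the extraJavaOptions (or their absence) would be visible there.
public class ShowJvmArgs {
    public static void main(String[] args) {
        List<String> jvmArgs =
            ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String arg : jvmArgs) {
            System.out.println(arg);
        }
        System.out.println("total args: " + jvmArgs.size());
    }
}
```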

Thanks very much!

Joe
