I'm running on an AWS cluster of 10 × m1.large instances (64-bit, 7.5 GiB RAM each). FWIW, I'm using the Flambo Clojure wrapper, which sits on top of the Java API, but I don't think that should make any difference. I'm launching with the following command:
    spark/bin/spark-submit --class mything.core --name "My Thing" \
      --conf spark.yarn.executor.memoryOverhead=4096 \
      --conf spark.executor.extraJavaOptions="-XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled" \
      /root/spark/code/myjar.jar

For one of the stages I'm getting these errors:

- ExecutorLostFailure (executor lost)
- Resubmitted (resubmitted due to lost executor)

I think they're caused by the slave executor JVMs dying with this error:

    java.lang.OutOfMemoryError: PermGen space
    java.lang.Class.getDeclaredConstructors0(Native Method)
    java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
    java.lang.Class.getConstructor0(Class.java:2885)
    java.lang.Class.newInstance(Class.java:350)
    sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
    sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:396)
    java.security.AccessController.doPrivileged(Native Method)
    sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:395)
    sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:113)
    sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:331)
    java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
    java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
    java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
    java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
    java.security.AccessController.doPrivileged(Native Method)
    java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
    java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
    java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
    java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
    java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

So far, 1 stage out of 14 is failing. The failing stage shows 1768 succeeded / 1862 (940 failed): 7 tasks failed with the OOM above, and 919 were "Resubmitted (resubmitted due to lost executor)".

"Aggregated Metrics by Executor" shows "CANNOT FIND ADDRESS" for 10 out of 16 executors, which I imagine means those JVMs blew up and haven't been restarted. The 'Executors' tab now shows only 7 executors.

- Is this normal?
- Any ideas why this is happening?
- Are there any other measures I can take to prevent this?
- Will the rest of my app run on a reduced number of executors?
- Can I restart the executors mid-application? This is a long-running job, so I'd like to do what I can whilst it's running, if possible.
- Am I correct in thinking that the --conf arguments are supplied to the JVMs of the slave executors, so they will be receiving the extraJavaOptions and memoryOverhead settings?

Thanks very much!

Joe
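P.S. Since the error is specifically "OutOfMemoryError: PermGen space", one thing I'm planning to try (untested on my part, so treat this as an assumption rather than a known fix) is raising the PermGen cap explicitly via spark.executor.extraJavaOptions; the Java 7 default is quite small, and the 512m value below is just a guess:

```shell
# Same launch command as above, but with an explicit -XX:MaxPermSize
# added to the executor JVM options (512m is an arbitrary starting point).
spark/bin/spark-submit --class mything.core --name "My Thing" \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled" \
  /root/spark/code/myjar.jar
```

If the --conf arguments do reach the executor JVMs (my last question above), this should at least tell me whether the executors are simply running out of class-metadata space during deserialization.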