Hi,
I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running Spark
1.2.0 in yarn-client mode with the following layout:
spark.executor.cores=4
spark.executor.memory=28G
spark.yarn.executor.memoryOverhead=4096
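
(Programmatically, these settings amount to roughly the sketch below; in the actual job they are passed via spark-defaults.conf / spark-submit rather than hard-coded, and the app name is just a placeholder.)

  import org.apache.spark.{SparkConf, SparkContext}

  // Equivalent programmatic form of the settings listed above.
  val conf = new SparkConf()
    .setMaster("yarn-client")
    .setAppName("als-implicit")  // placeholder name
    .set("spark.executor.cores", "4")
    .set("spark.executor.memory", "28g")
    .set("spark.yarn.executor.memoryOverhead", "4096")
  val sc = new SparkContext(conf)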

I am submitting a fairly big ALS trainImplicit job (rank=100, iterations=15) on a
dataset with ~3 billion ratings, using 25 executors.
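The call is essentially the following sketch (the input path, parsing, and variable names are placeholders, and the real job may set a few more ALS parameters such as lambda or alpha):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}
  import org.apache.spark.rdd.RDD

  // Build the implicit-feedback ratings RDD (~3 billion entries) using the
  // SparkContext configured above; path and parsing here are placeholders.
  val ratings: RDD[Rating] = sc.textFile("hdfs:///path/to/ratings")
    .map { line =>
      val Array(user, item, count) = line.split(',')
      Rating(user.toInt, item.toInt, count.toDouble)
    }

  // rank = 100, 15 iterations, as described above.
  val model = ALS.trainImplicit(ratings, 100, 15)
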
At some point one of the executors crashes with:
15/02/19 05:41:06 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:398)

15/02/19 05:41:06 ERROR executor.Executor: Exception in task 131.0 in stage 51.0 (TID 7259)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.reflect.Array.newInstance(Array.java:75)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1671)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1707)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
So the "GC overhead limit exceeded" is pretty clear and suggests running out of
memory. But since I have 1TB of RAM available across the cluster, I suspect this
is caused by a suboptimal configuration rather than a genuine lack of memory.
Can anyone please point me in the right direction on how to tackle this?
Thanks,
Antony.
