Hi,

I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge machines. The cluster is running Spark standalone and was launched with the ec2 scripts. In my Spark job, I am using ephemeral HDFS to checkpoint some of my RDDs. I'm also reading from and writing to S3. My jobs also involve a large number of shuffles.

I run the same job on multiple sets of data, and for 50-70% of these runs the job completes with no issues. On the remaining 30-50% of runs, I see a bunch of different kinds of issues pop up. They typically go away if I rerun the same job.

(1) Checkpointing silently fails (I assume). The checkpoint dir exists in HDFS, but no data files are written out. A later step in the job then tries to reload these RDDs, and I get a failure about not being able to read from HDFS. Usually a stop-dfs / start-dfs restart "cures" this.

*Q: What could be the cause of this? Timeouts?*

(2) Other times I get the following, and I have no idea who or what is causing it.

In the master's /spark/logs:

2014-08-21 16:46:15 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:7077] -> [akka.tcp://[email protected]:37681]: Error [Association failed with [akka.tcp://spark@ip-10-34-2-246.us-west-2.compute.internal:37681]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:37681]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-34-2-246.us-west-2.compute.internal/10.34.2.246:37681
]

Slave log:

2014-08-21 16:46:47 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
2014-08-21 16:46:47 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
    at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

*Q: Where do I even start debugging this kind of issue? Are the machines too loaded, so that timeouts are getting hit? Am I not setting some configuration value correctly? I would be grateful for some hints on where to start looking!*

(3) Often (2) is preceded by the following in the Spark logs:

2014-08-21 16:34:10 WARN TaskSetManager: Lost TID 102135 (task 398.0:147)
2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)

I am not sure whether this is an indication of the underlying problem.

I'll be very grateful for any ideas on how to start debugging these. Is there anything I should be noting: CPU usage on the master/slaves, number of executors per CPU, akka threads, etc.?

Cheers,
shay
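
P.S. As a possible first triage step for (3), here is a minimal sketch in Python (separate from the Spark job itself) that tallies the "fetch failure" warnings by host, to see whether they cluster on one slave. It assumes the log line format shown in (3); the function name count_fetch_failures and the regular expression are my own illustrative choices, not anything provided by Spark.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in (3):
#   ... WARN TaskSetManager: Loss was due to fetch failure from
#   BlockManagerId(<executor id>, <host>, <port>, <n>)
# The capture group grabs the <host> field.
FETCH_FAILURE = re.compile(
    r"Loss was due to fetch failure from BlockManagerId\(\s*[^,]+,\s*([^,\s]+)\s*,"
)


def count_fetch_failures(lines):
    """Return a Counter mapping host -> number of fetch-failure warnings.

    `lines` can be any iterable of strings, e.g. an open log file.
    (Hypothetical helper, not part of Spark.)
    """
    counts = Counter()
    for line in lines:
        match = FETCH_FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open driver log file to count_fetch_failures and printing the result of .most_common() would list the hosts with the most fetch failures first. If one host dominates, that slave's own logs are probably the next place to look.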
