Hi, I am using Spark 1.2 and facing network related issues while performing simple computations.
This is a custom cluster set up using ec2 machines and spark prebuilt binary from apache site. The problem is only when we have workers on other machines(networking involved). Having a single node for the master and the slave works correctly. The error log from slave node is attached below. It is reading textFile from local FS(copied each node) and counting it. The first 30 tasks get completed within 5 seconds. Then, it takes several minutes to complete another 10 tasks and eventually dies. Sometimes, one of the workers completes all the tasks assigned to it. Different workers have different behavior at different times(non-deterministic). Is it related to something specific to EC2? 15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID 117) 15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast variable 1 15/09/24 13:04:41 INFO SendingConnection: Initiating connection to [master_ip:56305] 15/09/24 13:04:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:56305], 1 messages pending 15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast variable 1 15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0 (TID 77) java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918) at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917) at scala.Option.foreach(Option.scala:236) at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917) at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581) at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656) at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367) at java.lang.Thread.run(Thread.java:745) 15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task 122 15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122) 15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0 (TID 113) java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918) at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917) at scala.Option.foreach(Option.scala:236) at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917) at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581) at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656) at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367) at java.lang.Thread.run(Thread.java:745) 15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast variable 1 15/09/24 13:06:41 INFO SendingConnection: Initiating connection to [master_ip/master_ip_address:44427] 15/09/24 13:06:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:44427], 1 messages pending 15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0 (TID 37) java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918) at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917) at scala.Option.foreach(Option.scala:236) at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917) at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581) at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656) at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367) at java.lang.Thread.run(Thread.java:745) I checked the network speed between the master and the slave and it is able to scp large files at a speed of 60 MB/s. Any leads on how this can be fixed? Thanks and Regards, Suraj Sheth