Hi, are you using EMR?
Natu

On Sat, Sep 26, 2015 at 6:55 AM, SURAJ SHETH <shet...@gmail.com> wrote:

> Hi Ankur,
> Thanks for the reply.
> This is already done.
> If I wait for a long time (10 minutes), a few tasks succeed even on
> slave nodes. Sometimes a fraction of the tasks (20%) complete on all
> the machines in the initial 5 seconds, and then it slows down
> drastically.
>
> Thanks and Regards,
> Suraj Sheth
>
> On Fri, Sep 25, 2015 at 2:10 AM, Ankur Srivastava <ankur.srivast...@gmail.com> wrote:
>
>> Hi Suraj,
>>
>> Spark uses a lot of ports to communicate between nodes. Probably your
>> security group is restrictive and does not allow instances to
>> communicate on all networks. The easiest way to resolve it is to add a
>> rule to allow all inbound traffic on all ports (0-65535) to instances
>> in the same security group, like this:
>>
>>   Type:        All TCP
>>   Protocol:    TCP
>>   Port range:  0 - 65535
>>   Source:      your security group
>>
>> Hope this helps!!
>>
>> Thanks
>> Ankur
>>
>> On Thu, Sep 24, 2015 at 7:09 AM SURAJ SHETH <shet...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using Spark 1.2 and facing network-related issues while
>>> performing simple computations.
>>>
>>> This is a custom cluster set up using EC2 machines and the prebuilt
>>> Spark binary from the Apache site. The problem occurs only when we
>>> have workers on other machines (networking involved). Having a single
>>> node for the master and the slave works correctly.
>>>
>>> The error log from a slave node is included below. The job reads a
>>> textFile from the local FS (copied to each node) and counts it. The
>>> first 30 tasks complete within 5 seconds. Then it takes several
>>> minutes to complete another 10 tasks, and eventually the job dies.
>>>
>>> Sometimes one of the workers completes all the tasks assigned to it.
>>> Different workers behave differently at different times
>>> (non-deterministic).
>>>
>>> Is it related to something specific to EC2?
>>>
>>> 15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID 117)
>>> 15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>> 15/09/24 13:04:41 INFO SendingConnection: Initiating connection to [master_ip:56305]
>>> 15/09/24 13:04:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:56305], 1 messages pending
>>> 15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>> 15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0 (TID 77)
>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>     at scala.Option.foreach(Option.scala:236)
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> 15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task 122
>>> 15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122)
>>> 15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0 (TID 113)
>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>     at scala.Option.foreach(Option.scala:236)
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> 15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>> 15/09/24 13:06:41 INFO SendingConnection: Initiating connection to [master_ip/master_ip_address:44427]
>>> 15/09/24 13:06:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:44427], 1 messages pending
>>> 15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0 (TID 37)
>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>     at scala.Option.foreach(Option.scala:236)
>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> I checked the network speed between the master and the slave and it is
>>> able to scp large files at a speed of 60 MB/s.
>>>
>>> Any leads on how this can be fixed?
>>>
>>> Thanks and Regards,
>>> Suraj Sheth
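
One way to test the security-group theory from Ankur's reply is to probe TCP ports between the nodes directly, in both directions (slave to master and master to slave). The following is only a minimal sketch using the Python standard library; the function names `probe_port` and `probe_ports` are made up for illustration, and the host and port values you pass in are your own. A result of "timeout" would be consistent with traffic being silently dropped (for example by a restrictive security group), while "refused" usually means the packet got through but nothing was listening on that port.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Try one TCP connect and classify the outcome.

    Returns "open" (connect succeeded), "refused" (host reachable, no
    listener), "timeout" (no answer within `timeout` seconds), or
    "error" (any other socket-level failure, e.g. name resolution).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


def probe_ports(host, ports, timeout=3.0):
    """Probe several ports on one host; returns {port: outcome}."""
    return {port: probe_port(host, port, timeout) for port in ports}
```

For example, `probe_ports("master_ip", [56305, 44427])` run from a slave would probe the two master ports that appear in the log above; those ports are chosen per application run, so they would have to be read from the current log each time. Running the same probe from the master against ports the slave is listening on covers the reverse direction.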