Hi Ankur,
Thanks for the reply.
This is already done.
If I wait long enough (10 minutes), a few tasks succeed even on the slave
nodes. Sometimes a fraction of the tasks (20%) complete on all the machines
in the first 5 seconds, and then everything slows down drastically.
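
To double-check that the security group rule is really in effect, a quick
TCP probe from one node to another can help. The following is only a minimal
sketch in Python, not anything Spark provides: the function names
port_is_reachable and probe_ports are made up for illustration, and the host
and port values have to be replaced with real ones (the master ports in the
log, 56305 and 44427, differ between connections, so they will likely be
different on the next run).

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_ports(host, ports, timeout=3.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

Calling probe_ports("master_ip", [56305, 44427]) from a slave node would
return a dict mapping each port to True or False. A successful connect only
shows that the TCP handshake works in that one direction, so it is worth
running the same check from the master towards each slave as well.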

Thanks and Regards,
Suraj Sheth

On Fri, Sep 25, 2015 at 2:10 AM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi Suraj,
>
> Spark uses a lot of ports to communicate between nodes. Your security
> group is probably restrictive and does not allow the instances to
> communicate on all of those ports. The easiest way to resolve this is to
> add a rule allowing all inbound traffic on all ports (0-65535) from
> instances in the same security group, like this:
>
> Type:       All TCP
> Protocol:   TCP
> Port range: 0 - 65535
> Source:     your security group
>
> Hope this helps!!
>
> Thanks
> Ankur
>
> On Thu, Sep 24, 2015 at 7:09 AM SURAJ SHETH <shet...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using Spark 1.2 and facing network-related issues while performing
>> simple computations.
>>
>> This is a custom cluster set up on EC2 machines using the prebuilt Spark
>> binary from the Apache site. The problem occurs only when there are
>> workers on other machines (networking involved). A single node hosting
>> both the master and the slave works correctly.
>>
>> The error log from a slave node is included below. The job reads a text
>> file from the local FS (copied to each node) with textFile and counts it.
>> The first 30 tasks complete within 5 seconds. Then it takes several
>> minutes to complete another 10 tasks, and eventually it dies.
>>
>> Sometimes one of the workers completes all the tasks assigned to it.
>> Different workers behave differently at different times
>> (non-deterministic).
>>
>> Is this related to something specific to EC2?
>>
>>
>>
>> 15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID 117)
>> 15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>> 15/09/24 13:04:41 INFO SendingConnection: Initiating connection to [master_ip:56305]
>> 15/09/24 13:04:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:56305], 1 messages pending
>> 15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>> 15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0 (TID 77)
>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>         at scala.Option.foreach(Option.scala:236)
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>         at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>         at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>         at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>         at java.lang.Thread.run(Thread.java:745)
>> 15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task 122
>> 15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122)
>> 15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0 (TID 113)
>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>         at scala.Option.foreach(Option.scala:236)
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>         at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>         at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>         at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>         at java.lang.Thread.run(Thread.java:745)
>> 15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>> 15/09/24 13:06:41 INFO SendingConnection: Initiating connection to [master_ip/master_ip_address:44427]
>> 15/09/24 13:06:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:44427], 1 messages pending
>> 15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0 (TID 37)
>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>         at scala.Option.foreach(Option.scala:236)
>>         at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>         at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>         at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>         at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> I checked the network speed between the master and the slave, and scp can
>> transfer large files between them at 60 MB/s.
>>
>> Any leads on how this can be fixed?
>>
>> Thanks and Regards,
>> Suraj Sheth
>>
>
