[ 
https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3923.
----------------------------
          Resolution: Fixed
       Fix Version/s: 1.2.0
            Assignee: Aaron Davidson
    Target Version/s: 1.2.0

> All Standalone Mode services time out with each other
> -----------------------------------------------------
>
>                 Key: SPARK-3923
>                 URL: https://issues.apache.org/jira/browse/SPARK-3923
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.2.0
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>            Priority: Blocker
>             Fix For: 1.2.0
>
>
> I'm seeing an issue where all components in Standalone Mode (Worker, Master, 
> Driver, and Executor) appear to time out with each other after around 1000 
> seconds. Here is an example log:
> {code}
> 14/10/13 06:43:55 INFO Master: Registering worker 
> ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
> 14/10/13 06:43:55 INFO Master: Registering worker 
> ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
> 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
> 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID 
> app-20141013064356-0000
> ... precisely 1000 seconds later ...
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
> system 
> [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has 
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO Master: 
> akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got 
> disassociated, removing it.
> 14/10/13 07:00:35 INFO LocalActorRef: Message 
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245]
>  was not delivered. [2] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO Master: 
> akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
> disassociated, removing it.
> 14/10/13 07:00:35 INFO Master: Removing worker 
> worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on 
> ip-10-0-175-214.us-west-2.compute.internal:42918
> 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
> 14/10/13 07:00:35 INFO Master: 
> akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
> disassociated, removing it.
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
> system 
> [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has 
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO LocalActorRef: Message 
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
>  was not delivered. [3] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO LocalActorRef: Message 
> [akka.remote.transport.AssociationHandle$Disassociated] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
>  was not delivered. [4] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake 
> timed out or transport failure detector triggered.
> 14/10/13 07:00:36 INFO Master: 
> akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got 
> disassociated, removing it.
> 14/10/13 07:00:36 INFO LocalActorRef: Message 
> [akka.remote.transport.AssociationHandle$InboundPayload] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249]
>  was not delivered. [5] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-0000
> 14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote 
> system 
> [akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259] has 
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:36 INFO LocalActorRef: Message 
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249]
>  was not delivered. [6] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO LocalActorRef: Message 
> [akka.remote.transport.AssociationHandle$Disassociated] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249]
>  was not delivered. [7] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO Master: 
> akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got 
> disassociated, removing it.
> {code}
> Note that the driver and master are running on the same machine, and there was 
> no load to speak of at the time (so GC pauses are unlikely to be the cause). 
> Also, everything disconnecting exactly 1000 seconds after the initial 
> connection is pretty suspicious.
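
The "exactly 1000 seconds" observation can be checked directly from the timestamps in the log above. The following is only a minimal sketch: the helper name `seconds_between` and the `yy/MM/dd HH:mm:ss` reading of the timestamp prefix are assumptions made for illustration, and the four timestamps are copied from the log lines for worker registration, worker disassociation, app registration, and driver disassociation.

```python
from datetime import datetime

# Assumed layout of the log timestamp prefix: yy/MM/dd HH:mm:ss
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def seconds_between(first: str, second: str) -> float:
    """Return the number of seconds elapsed between two log timestamps."""
    start = datetime.strptime(first, LOG_TIME_FORMAT)
    end = datetime.strptime(second, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# Worker registration -> first worker disassociation warning
worker_gap = seconds_between("14/10/13 06:43:55", "14/10/13 07:00:35")

# App registration -> driver disassociation
app_gap = seconds_between("14/10/13 06:43:56", "14/10/13 07:00:36")

print(worker_gap, app_gap)
```

Both gaps being identical and a round number points toward a fixed timeout rather than load or a transient network problem, which matches the suspicion raised in the report.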



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
