I've also seen worker loss, and that's why I asked a question about worker re-spawn.
My typical case: some job hits an OOM exception, and then on the master UI a worker's state becomes DEAD. In the master's log there's an error like:

```
14/05/21 15:38:02 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkmas...@ec2-23-20-189-111.compute-1.amazonaws.com:7077] -> [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]: Error [Association failed with [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-186-156-22.ec2.internal/10.186.156.22:38572
]
14/05/21 15:38:02 INFO master.Master: akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572 got disassociated, removing it.
```

On the `DEAD` worker machine, there are two Spark processes, the worker and an executor backend:

```
16280 org.apache.spark.deploy.worker.Worker
25989 org.apache.spark.executor.CoarseGrainedExecutorBackend
```

The bad thing is that in this case a sbin/stop-all.sh and sbin/start-all.sh cannot bring back the DEAD worker, since the worker process cannot be terminated (maybe due to the executor backend). I have to log in and kill -9 both the worker process and the executor backend; a rough sketch of that cleanup is at the end of this mail. I'm on 0.9.1 and using the ec2-script.

2014-05-21 11:42 GMT+02:00 sagi <zhpeng...@gmail.com>:

> If you saw an exception message like the one mentioned in the JIRA
> https://issues.apache.org/jira/browse/SPARK-1886 in the worker's log
> file, you are welcome to have a try: https://github.com/apache/spark/pull/827
>
>
> On Wed, May 21, 2014 at 11:21 AM, Josh Marcus <jmar...@meetup.com> wrote:
>
>> Aaron:
>>
>> I see this in the Master's logs:
>>
>> 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
>> 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
>>
>> There was an executor that launched and did fail, such as:
>>
>> 14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2 on worker worker-20140519155427-hdn3.int.meetup.com-50038
>> 14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2 because it is FAILED
>>
>> ... but other executors on other machines also failed without permanently disassociating.
>>
>> There are also these messages, which I don't know whether they are related:
>>
>> 14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>> 14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>>
>>
>> On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point in time).
>>>
>>> Do you happen to have the logs from the Master that indicate that the Worker terminated? Is it just an Akka disassociation, or some exception?
>>>
>>>
>>> On Tue, May 20, 2014 at 12:53 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> This isn't helpful of me to say, but I see the same sorts of problems and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit later.
>>>>
>>>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmar...@meetup.com> wrote:
>>>> > So, for example, I have two disassociated worker machines at the moment.
>>>> > The last messages in the spark logs are akka association error messages,
>>>> > like the following:
>>>> >
>>>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] -> [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
>>>> > akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
>>>> > Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>>>> > ]
>>>> >
>>>> > On the master side, there are lots and lots of messages of the form:
>>>> >
>>>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
>>>> >
>>>> > --j
>
> --
> ---------------------------------
> Best Regards

--
*JU Han*

Data Engineer @ Botify.com

+33 0619608888
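For anyone hitting the same zombie-process state, here is roughly the cleanup I end up doing by hand on the affected worker node, as a minimal sketch. It assumes the spark-ec2 default layout (SPARK_HOME under /root/spark) and the 0.9.x sbin/start-slave.sh signature; the class names match the jps output shown above. Adjust to your own layout.

```bash
#!/usr/bin/env bash
# Sketch: force-kill the stale Worker and CoarseGrainedExecutorBackend JVMs
# on a DEAD worker node, then restart the worker daemon. Run this as the
# user that owns the Spark processes (jps only lists that user's JVMs).
set -euo pipefail

SPARK_HOME="${SPARK_HOME:-/root/spark}"            # assumption: spark-ec2 default install path
MASTER_URL="${1:?usage: $0 spark://<master-host>:7077}"

for class in org.apache.spark.deploy.worker.Worker \
             org.apache.spark.executor.CoarseGrainedExecutorBackend; do
  # jps -l prints "<pid> <fully.qualified.main.class>", so match on column 2.
  for pid in $(jps -l | awk -v c="$class" '$2 == c {print $1}'); do
    echo "kill -9 $pid ($class)"
    kill -9 "$pid"
  done
done

# On 0.9.x, start-slave.sh takes the worker instance number and the master URL.
"$SPARK_HOME/sbin/start-slave.sh" 1 "$MASTER_URL"
```

The -9 is the point: as far as I can tell, stop-all.sh goes through spark-daemon.sh, which only sends a plain kill (TERM), and that is exactly what these wedged JVMs ignore.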