I found the reason for the weird behaviour.

The executor throws an exception on startup because of a bug in my 
application code (I forgot to set, on every machine, an environment variable 
that the application code uses).
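To make this kind of mistake show up with a clear message instead of an 
obscure executor failure, the application could check the variable up front. 
This is only a minimal sketch in Python: the helper require_env, the exception 
MissingEnvError, and the variable name APP_DATA_DIR are hypothetical 
placeholders, not names from my actual application or from Spark.

```python
import os


class MissingEnvError(RuntimeError):
    """Raised when a required environment variable is unset or empty."""


def require_env(name, environ=None):
    """Return the value of `name`, or raise with an explicit message.

    `environ` defaults to os.environ; passing a plain dict makes the
    check easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise MissingEnvError(
            "environment variable %s is not set on this machine" % name
        )
    return value


# Hypothetical usage at application startup:
#     data_dir = require_env("APP_DATA_DIR")
```

Calling something like this first thing in the application at least makes the 
log name the missing variable and the machine it is missing on.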

The master then seems to remove the worker from its list (?), but the worker 
keeps sending heartbeats and gets no reply; eventually all workers are dead…
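One thing visible in the quoted master log is that the worker IDs in the 
"Launching executor" lines carry the timestamp 20140115013734, while the IDs 
in the "Got heartbeat from unregistered worker" lines carry 20140115013844 
for the same host and port. Below is a small Python sketch for spotting that 
mismatch. It assumes the ID layout worker-<timestamp>-<host>-<port> seen in 
the log; the function names are my own and nothing here is Spark code.

```python
import re

# Assumed layout, taken from the log lines: worker-<14 digits>-<host>-<port>
WORKER_ID = re.compile(r"worker-(\d{14})-(.+)-(\d+)")


def split_worker_id(worker_id):
    """Split a worker ID into (timestamp, host, port)."""
    match = WORKER_ID.fullmatch(worker_id)
    if match is None:
        raise ValueError("unexpected worker ID format: %r" % worker_id)
    timestamp, host, port = match.groups()
    return timestamp, host, int(port)


def find_id_mismatches(launched_ids, heartbeat_ids):
    """Return (host, port, old_ts, new_ts) for every heartbeat whose
    host and port match a launched worker but whose timestamp differs."""
    known = {}
    for worker_id in launched_ids:
        timestamp, host, port = split_worker_id(worker_id)
        known[(host, port)] = timestamp

    mismatches = []
    for worker_id in heartbeat_ids:
        timestamp, host, port = split_worker_id(worker_id)
        old = known.get((host, port))
        if old is not None and old != timestamp:
            mismatches.append((host, port, old, timestamp))
    return mismatches
```

Fed with the IDs from the log, this reports every worker as having changed 
its ID at 01:38:44, which would be consistent with the workers restarting or 
re-registering under a new ID that the master never accepted. That is only my 
reading of the log so far, not something I have confirmed in the source.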

Obviously it should not work this way: faulty application code should not 
kill all the workers.

I’m checking the source code to find the reason.

Best,

--  
Nan Zhu


On Tuesday, January 14, 2014 at 8:53 PM, Nan Zhu wrote:

> Hi, all  
>  
> I’m trying to deploy Spark in standalone mode, and at first everything goes 
> as usual:  
>  
> the web UI is accessible, and the master node's log says all workers are 
> registered:
>  
> 14/01/15 01:37:30 INFO Slf4jEventHandler: Slf4jEventHandler started  
> 14/01/15 01:37:31 INFO ActorSystemImpl: 
> RemoteServerStarted@akka://sparkMaster@172.31.36.93:7077
> 14/01/15 01:37:31 INFO Master: Starting Spark master at 
> spark://172.31.36.93:7077
> 14/01/15 01:37:31 INFO MasterWebUI: Started Master web UI at 
> http://ip-172-31-36-93.us-west-2.compute.internal:8080
> 14/01/15 01:37:31 INFO Master: I have been elected leader! New state: ALIVE
> 14/01/15 01:37:34 INFO ActorSystemImpl: 
> RemoteClientStarted@akka://sparkwor...@ip-172-31-34-61.us-west-2.compute.internal:37914
> 14/01/15 01:37:34 INFO ActorSystemImpl: 
> RemoteClientStarted@akka://sparkwor...@ip-172-31-40-28.us-west-2.compute.internal:43055
> 14/01/15 01:37:34 INFO Master: Registering worker 
> ip-172-31-34-61.us-west-2.compute.internal:37914 with 2 cores, 6.3 GB RAM
> 14/01/15 01:37:34 INFO ActorSystemImpl: 
> RemoteClientStarted@akka://sparkwor...@ip-172-31-45-211.us-west-2.compute.internal:55355
> 14/01/15 01:37:34 INFO Master: Registering worker 
> ip-172-31-40-28.us-west-2.compute.internal:43055 with 2 cores, 6.3 GB RAM
> 14/01/15 01:37:34 INFO Master: Registering worker 
> ip-172-31-45-211.us-west-2.compute.internal:55355 with 2 cores, 6.3 GB RAM
> 14/01/15 01:37:34 INFO ActorSystemImpl: 
> RemoteClientStarted@akka://sparkwor...@ip-172-31-41-251.us-west-2.compute.internal:47709
> 14/01/15 01:37:34 INFO Master: Registering worker 
> ip-172-31-41-251.us-west-2.compute.internal:47709 with 2 cores, 6.3 GB RAM
> 14/01/15 01:37:34 INFO ActorSystemImpl: 
> RemoteClientStarted@akka://sparkwor...@ip-172-31-43-78.us-west-2.compute.internal:36257
> 14/01/15 01:37:34 INFO Master: Registering worker 
> ip-172-31-43-78.us-west-2.compute.internal:36257 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO ActorSystemImpl: 
> RemoteClientStarted@akka://sp...@ip-172-31-37-160.us-west-2.compute.internal:43086
>  
>  
>  
>  
> However, when I launched an application, the master first “attempted to 
> re-register the worker” and then reported that all heartbeats were from 
> “unregistered” workers. Can anyone tell me what happened here?
>  
> 14/01/15 01:38:44 INFO Master: Registering app ALS  
> 14/01/15 01:38:44 INFO Master: Registered app ALS with ID 
> app-20140115013844-0000
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/0 
> on worker 
> worker-20140115013734-ip-172-31-43-78.us-west-2.compute.internal-36257
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/1 
> on worker 
> worker-20140115013734-ip-172-31-40-28.us-west-2.compute.internal-43055
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/2 
> on worker 
> worker-20140115013734-ip-172-31-34-61.us-west-2.compute.internal-37914
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/3 
> on worker 
> worker-20140115013734-ip-172-31-45-211.us-west-2.compute.internal-55355
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/4 
> on worker 
> worker-20140115013734-ip-172-31-41-251.us-west-2.compute.internal-47709
> 14/01/15 01:38:44 INFO Master: Registering worker 
> ip-172-31-40-28.us-west-2.compute.internal:43055 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same 
> address: akka://sparkwor...@ip-172-31-40-28.us-west-2.compute.internal:43055
> 14/01/15 01:38:44 INFO Master: Registering worker 
> ip-172-31-34-61.us-west-2.compute.internal:37914 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same 
> address: akka://sparkwor...@ip-172-31-34-61.us-west-2.compute.internal:37914
> 14/01/15 01:38:44 INFO Master: Registering worker 
> ip-172-31-41-251.us-west-2.compute.internal:47709 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same 
> address: akka://sparkwor...@ip-172-31-41-251.us-west-2.compute.internal:47709
> 14/01/15 01:38:44 INFO Master: Registering worker 
> ip-172-31-45-211.us-west-2.compute.internal:55355 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same 
> address: akka://sparkwor...@ip-172-31-45-211.us-west-2.compute.internal:55355
> 14/01/15 01:38:44 INFO Master: Registering worker 
> ip-172-31-43-78.us-west-2.compute.internal:36257 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same 
> address: akka://sparkwor...@ip-172-31-43-78.us-west-2.compute.internal:36257
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker 
> worker-20140115013844-ip-172-31-34-61.us-west-2.compute.internal-37914
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker 
> worker-20140115013844-ip-172-31-45-211.us-west-2.compute.internal-55355
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker 
> worker-20140115013844-ip-172-31-40-28.us-west-2.compute.internal-43055
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker 
> worker-20140115013844-ip-172-31-43-78.us-west-2.compute.internal-36257
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker 
> worker-20140115013844-ip-172-31-41-251.us-west-2.compute.internal-47709
> 14/01/15 01:38:50 WARN Master: Got heartbeat from unregistered worker 
> worker-20140115013844-ip-172-31-45-211.us-west-2.compute.internal-55355
>  
> Thank you very much!
>  
> Best,
>  
> --  
> Nan Zhu
>  
