I installed Spark 0.9.0 from the CDH parcel yesterday in standalone mode on top of a 6-node cluster running CDH4.6 on CentOS.
What I'm seeing is that when jobs fail, the worker process often crashes. The worker appears to restart on the node, but the Master never uses the restarted worker, and it doesn't show up in the web interface. Has anyone seen anything like this? Is there an obvious workaround or fix other than manually restarting the workers?

In the Master log I see the following repeated many times, filer being the "lost" node. What it looks like to me is that when the worker actor is restarted by Akka, it gets a new ID and for whatever reason does not register with the master. Any ideas?

14/03/04 20:04:44 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078
14/03/04 20:04:54 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078
14/03/04 20:04:59 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078

On filer itself I can see that the worker shut down with the following exception, and I can see that it has been restarted and is running.

14/03/04 18:37:09 INFO worker.Worker: Executor app-20140304183705-0036/0 finished with state KILLED
14/03/04 18:37:09 INFO worker.CommandUtils: Redirection to /var/run/spark/work/app-20140304183705-0036/0/stderr closed: Bad file descriptor
14/03/04 18:37:09 ERROR actor.OneForOneStrategy: key not found: app-20140304183705-0036/0
java.util.NoSuchElementException: key not found: app-20140304183705-0036/0
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/03/04 18:37:09 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkwor...@filer.maana.io:7078] -> [akka.tcp://sparkexecu...@filer.maana.io:58331]: Error [Association failed with [akka.tcp://sparkexecu...@filer.maana.io:58331]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@filer.maana.io:58331]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: filer.maana.io/192.168.1.33:58331
]
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{*,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/logPage,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/metrics/json,null}
14/03/04 18:37:09 INFO worker.Worker: Starting Spark worker filer.maana.io:7078 with 4 cores, 30.3 GB RAM
14/03/04 18:37:09 INFO worker.Worker: Spark home: /opt/cloudera/parcels/SPARK/lib/spark
14/03/04 18:37:09 INFO server.Server: jetty-7.6.8.v20121106
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/logPage,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{*,null}
14/03/04 18:37:09 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18081
14/03/04 18:37:09 INFO ui.WorkerWebUI: Started Worker web UI at http://filer.maana.io:18081
14/03/04 18:37:09 INFO worker.Worker: Connecting to master spark://Master.maana.io:7077...
14/03/04 18:37:09 INFO worker.Worker: Successfully registered with master spark://Master.maana.io:7077
14/03/05 08:53:35 INFO actor.LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40192.168.1.33%3A37859-60#-1831633323] was not delivered. [24] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Worker-crashing-and-Master-not-seeing-recovered-worker-tp2312.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
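
P.S. In case it helps anyone checking for the same symptom, here is a minimal sketch of how the Master log could be scanned for workers that are heartbeating without being registered. It is not part of Spark. The regular expressions are inferred only from the three warning lines quoted above, and the assumed worker ID layout (worker-<timestamp>-<host>-<port>) is likewise a guess from that one sample. The names unregistered_workers and split_worker_id are hypothetical helpers.

```python
import re
from collections import Counter

# Pattern inferred from the Master warnings quoted in this post; it is an
# assumption based on those sample lines, not taken from the Spark source.
UNREGISTERED = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} WARN master\.Master: "
    r"Got heartbeat from unregistered worker (?P<worker>\S+)"
)

# Assumed ID layout: worker-<14-digit timestamp>-<host>-<port>.
WORKER_ID = re.compile(r"^worker-(?P<ts>\d{14})-(?P<host>.+)-(?P<port>\d+)$")


def unregistered_workers(lines):
    """Count 'unregistered worker' heartbeat warnings per worker ID."""
    counts = Counter()
    for line in lines:
        match = UNREGISTERED.match(line.strip())
        if match:
            counts[match.group("worker")] += 1
    return counts


def split_worker_id(worker_id):
    """Split a worker ID into (timestamp, host, port), or None if it does not fit."""
    match = WORKER_ID.match(worker_id)
    if match is None:
        return None
    return match.group("ts"), match.group("host"), int(match.group("port"))


# The three Master log lines from this post, used as sample input.
SAMPLE = [
    "14/03/04 20:04:44 WARN master.Master: Got heartbeat from unregistered "
    "worker worker-20140304183709-filer.maana.io-7078",
    "14/03/04 20:04:54 WARN master.Master: Got heartbeat from unregistered "
    "worker worker-20140304183709-filer.maana.io-7078",
    "14/03/04 20:04:59 WARN master.Master: Got heartbeat from unregistered "
    "worker worker-20140304183709-filer.maana.io-7078",
]

if __name__ == "__main__":
    for worker, hits in unregistered_workers(SAMPLE).items():
        print(worker, hits, split_worker_id(worker))
```

To run it against a real log, pass an open file instead of SAMPLE, for example unregistered_workers(open(path)), where path is wherever the Master log lives on your install. The timestamp extracted from the ID can then be compared by hand against the "Starting Spark worker" line in that node's worker log.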