Yes, you are right: I initially started the workers from the master node. What I am interested in knowing is what suddenly happened after 2 days that made the workers die. Is it possible that the workers got disconnected because of some network issue, then tried to restart themselves but kept failing?
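One way to test the network theory is to pull the disconnect events out of each worker's log and compare the timestamps across workers. Below is a minimal Python sketch, assuming the log lines follow the same layout as the excerpt quoted further down in this thread; `master_disconnects` is a hypothetical helper name, not part of Spark.

```python
import re
from datetime import datetime

# Matches lines shaped like the ones in the quoted worker log, e.g.
# "2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed! ..."
# (the layout is an assumption based on that excerpt only)
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\S+?):\d+ - (.*)$"
)


def master_disconnects(lines):
    """Return the timestamps of 'Connection to master failed' events
    logged by the Worker class."""
    events = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group(3) == "Worker" and "Connection to master failed" in m.group(4):
            events.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return events


# The three log lines from the quoted excerpt (the third one shortened)
sample = [
    "2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed! Waiting for master to reconnect...",
    "2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed! Waiting for master to reconnect...",
    "2016-02-14 01:13:10 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread",
]

for ts in master_disconnects(sample):
    print(ts.isoformat(sep=" "))
```

If the two dead workers report a disconnect at the same time and the four surviving workers report none, that would point at something local to those two nodes rather than at the master.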
On Sun, Feb 14, 2016 at 11:21 PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:

> Kartik,
>
> Spark Workers won't start if SPARK_MASTER_IP is wrong. Maybe you used
> start-slaves.sh from the Master node to start all worker nodes, in which
> case the Workers would have got the correct SPARK_MASTER_IP initially.
> Any later restart from the slave nodes would then have failed because of
> the wrong SPARK_MASTER_IP on the worker nodes.
>
> Check the logs of the other workers that are still running to see which
> SPARK_MASTER_IP they have connected to. I don't think they are using a
> wrong Master IP.
>
> Thanks,
> Prabhu Joseph
>
> On Mon, Feb 15, 2016 at 12:34 PM, Kartik Mathur <kar...@bluedata.com>
> wrote:
>
>> Thanks Prabhu,
>>
>> I had wrongly configured SPARK_MASTER_IP on the worker nodes to
>> `hostname -f`, which resolves to the worker and not the master.
>>
>> But now the question is *why the cluster was up initially for 2 days*
>> and the workers only noticed this invalid configuration after 2 days.
>> And why are the other workers still up even though they have the same
>> setting?
>>
>> Really appreciate your help.
>>
>> Thanks,
>> Kartik
>>
>> On Sun, Feb 14, 2016 at 10:53 PM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Kartik,
>>>
>>> The exception stack trace
>>> *java.util.concurrent.RejectedExecutionException* will happen if
>>> SPARK_MASTER_IP on the worker nodes is configured wrongly, for example
>>> if SPARK_MASTER_IP is the hostname of the Master node and the workers
>>> are trying to connect to the IP of the Master node. Check whether
>>> SPARK_MASTER_IP on the Worker nodes is exactly the same as what the
>>> Spark Master GUI shows.
>>>
>>> Thanks,
>>> Prabhu Joseph
>>>
>>> On Mon, Feb 15, 2016 at 11:51 AM, Kartik Mathur <kar...@bluedata.com>
>>> wrote:
>>>
>>>> On Spark 1.5.2:
>>>>
>>>> I have a Spark standalone cluster with 6 workers. I left the cluster
>>>> idle for 3 days, and after 3 days I saw only 4 workers on the Spark
>>>> master UI; 2 workers had died with the same exception.
>>>>
>>>> The strange part is that the cluster was running stable for 2 days,
>>>> but on the third day 2 workers abruptly died. I am seeing this error
>>>> in one of the affected workers. No job ran for 2 days.
>>>>
>>>> 2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed! Waiting for master to reconnect...
>>>> 2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed! Waiting for master to reconnect...
>>>> 2016-02-14 01:13:10 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]
>>>> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@514b13ad rejected from java.util.concurrent.ThreadPoolExecutor@17f8ec8d[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 3]
>>>>     at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>>>>     at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>>>>     at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:269)
>>>>     at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
>>>>     at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
>>>>     at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
>>>>     at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
>>>>     at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
>>>>     at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
>>>>     at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
>>>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>>     at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>>>     at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>>     at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>>>     at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
>>>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>     at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>>     at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>>>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
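Prabhu's suggestion above, comparing SPARK_MASTER_IP on each worker with what the Master GUI shows, can be scripted. The following is a rough Python sketch and not a Spark tool: it assumes the variable is set in the worker's `conf/spark-env.sh` and that the master URL has the `spark://host:port` form shown on the Master UI. The function names and the example host `master-host` are hypothetical.

```python
import re


def read_master_ip(spark_env_text):
    """Return the raw value assigned to SPARK_MASTER_IP, or None if unset.
    Commented-out lines are ignored; the last assignment wins."""
    value = None
    for line in spark_env_text.splitlines():
        m = re.match(r"\s*(?:export\s+)?SPARK_MASTER_IP=(.*)$", line)
        if m:
            value = m.group(1).strip().strip("\"'")
    return value


def check_worker_config(spark_env_text, master_url):
    """Compare a worker's SPARK_MASTER_IP with the host part of the master URL."""
    # "spark://master-host:7077" -> "master-host"
    host = master_url.split("://", 1)[-1].rsplit(":", 1)[0]
    value = read_master_ip(spark_env_text)
    if value is None:
        return "SPARK_MASTER_IP is not set"
    if "`" in value or "$(" in value:
        # A command substitution is evaluated on the worker itself, so
        # `hostname -f` yields the worker's own name, as described in this thread.
        return "SPARK_MASTER_IP is computed on the worker: " + value
    if value != host:
        return "mismatch: worker has %s, master URL has %s" % (value, host)
    return "ok"


# Hypothetical worker config reproducing the setting described in this thread
worker_env = "export SPARK_MASTER_IP=`hostname -f`\n"
print(check_worker_config(worker_env, "spark://master-host:7077"))
```

The comparison is a plain string match on purpose: as noted above, a hostname on one side and an IP address on the other counts as a mismatch even when both refer to the same machine.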