yeah, just causes them to hang. the first "deadLetters" message shows up about the same time. Oddly, after it first happens, I keep getting some results trickling in from those executors. (maybe they were just queued up on the driver already, I dunno.) but then it just hangs. the stage has a few more tasks to be run, but the executors are just idle, they're not running anything.
I'm gonna try manually listening for more Association events listed here & logging them http://doc.akka.io/docs/akka/2.2.3/scala/remoting.html#remote-events imran On Wed, Oct 30, 2013 at 11:27 AM, Prashant Sharma <[email protected]>wrote: > I am guessing something wrong with using Dissociation event then. > > Try applying something on the lines of this patch. This might cause the > executors to hang so be prepared for that. > > diff --git > a/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala > b/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala > index 4e8052a..1ec5d19 100644 > --- > a/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala > +++ > b/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala > @@ -74,9 +74,13 @@ private[spark] class StandaloneExecutorBackend( > executor.launchTask(this, taskDesc.taskId, > taskDesc.serializedTask) > } > > - case DisassociatedEvent(_, _, _) => > - logError("Driver terminated or disconnected! Shutting down.") > + case Terminated(actor) => > + logError("Driver terminated Shutting down.") > System.exit(1) > + > + // case DisassociatedEvent(_, _, _) => > + // logError("Driver terminated or disconnected! Shutting down.") > + // System.exit(1) > } > > override def statusUpdate(taskId: Long, state: TaskState, data: > ByteBuffer) { > diff --git > a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala > b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala > index b6f0ec9..9955484 100644 > --- > a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala > +++ > b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala > @@ -102,8 +102,8 @@ class StandaloneSchedulerBackend(scheduler: > ClusterScheduler, actorSystem: Actor > case Terminated(actor) => > actorToExecutorId.get(actor).foreach(removeExecutor(_, "Akka > actor terminated")) > > - case DisassociatedEvent(_, remoteAddress, _) => > - addressToExecutorId.get(remoteAddress).foreach(removeExecutor(_, > "remote Akka client disconnected")) > + // case DisassociatedEvent(_, remoteAddress, _) => > + // > addressToExecutorId.get(remoteAddress).foreach(removeExecutor(_, "remote > Akka client disconnected")) > > case AssociationErrorEvent(_, _, remoteAddress, _) => > addressToExecutorId.get(remoteAddress).foreach(removeExecutor(_, > "remote Akka client shutdown")) > @@ -132,7 +132,7 @@ class StandaloneSchedulerBackend(scheduler: > ClusterScheduler, actorSystem: Actor > // Remove a disconnected slave from the cluster > def removeExecutor(executorId: String, reason: String) { > if (executorActor.contains(executorId)) { > - logInfo("Executor " + executorId + " disconnected, so removing > it") > + logInfo("Executor " + executorId + " disconnected, so removing > it, reason:" + reason) > val numCores = freeCores(executorId) > actorToExecutorId -= executorActor(executorId) > addressToExecutorId -= executorAddress(executorId) > > > > On Wed, Oct 30, 2013 at 9:42 PM, Imran Rashid <[email protected]>wrote: > >> ok, so I applied a few patches >> >> https://github.com/quantifind/incubator-spark/pull/1/files >> >> and ran it again, with these options: >> >> -Dspark.akka.stdout-loglevel=DEBUG \ >> -Dspark.akkaExtra.akka.logLevel=DEBUG\ >> -Dspark.akkaExtra.akka.actor.debug.receive=on \ >> -Dspark.akkaExtra.akka.actor.debug.autoreceive=on \ >> -Dspark.akkaExtra.akka.actor.debug.lifecycle=on \ >> -Dspark.akkaExtra.akka.remote.log-sent-messages=on \ >> -Dspark.akkaExtra.akka.remote.log-received-messages=on\ >> -Dspark.akkaExtra.akka.log-config-on-start=on >> >> On the driver, I see: >> >> 2013-10-30 08:44:31,034 [spark-akka.actor.default-dispatcher-19] INFO >> akka.actor.LocalActorRef - Message >> [akka.remote.transport.AssociationHandle$Disassociated] from >> Actor[akka://spark/deadLetters] to >> Actor[akka://spark/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%4010.10.5.64%3A52400-2#-837892141] >> was not delivered. [1] dead letters encountered. This logging can be turned >> off or adjusted with configuration settings 'akka.log-dead-letters' and >> 'akka.log-dead-letters-during-shutdown'. >> >> 2013-10-30 08:44:31,058 [spark-akka.actor.default-dispatcher-13] INFO >> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Executor 1 >> disconnected, so removing it, reason:remote Akka client disconnected >> >> 2013-10-30 08:44:31,059 [spark-akka.actor.default-dispatcher-13] ERROR >> org.apache.spark.scheduler.cluster.ClusterScheduler - Lost executor 1 on >> dhd2.quantifind.com: remote Akka client disconnected >> >> >> on the worker, stderr: >> >> 13/10/30 08:44:28 INFO executor.Executor: Finished task ID 934 >> >> 13/10/30 08:44:31 ERROR executor.StandaloneExecutorBackend: Driver >> terminated or disconnected! Shutting down.Disassociated [akka.tcp:// >> [email protected]:38021] -> [akka.tcp:// >> [email protected]:36730] >> >> and unfortunately, all those akka debug options give me *no* useful info >> in the worker stdout: >> >> Starting akka system "sparkExecutor" using config: >> >> akka.daemonic = on >> akka.loggers = [""akka.event.slf4j.Slf4jLogger""] >> akka.stdout-loglevel = "DEBUG" >> akka.actor.provider = "akka.remote.RemoteActorRefProvider" >> akka.remote.netty.tcp.transport-class = >> "akka.remote.transport.netty.NettyTransport" >> akka.remote.netty.tcp.hostname = "dhd2.quantifind.com" >> akka.remote.netty.tcp.port = 0 >> akka.remote.netty.tcp.connection-timeout = 60 s >> akka.remote.netty.tcp.maximum-frame-size = 10MiB >> akka.remote.netty.tcp.execution-pool-size = 4 >> akka.actor.default-dispatcher.throughput = 15 >> akka.remote.log-remote-lifecycle-events = off >> akka.remote.log-sent-messages = on >> akka.remote.log-received-messages = on >> akka.logLevel = DEBUG >> akka.actor.debug.autoreceive = on >> akka.actor.debug.lifecycle = on >> akka.actor.debug.receive = on >> akka.log-config-on-start = on >> akka.remote.quarantine-systems-for = off >> [DEBUG] [10/30/2013 08:40:30.230] [main] [EventStream] StandardOutLogger >> started >> [DEBUG] [10/30/2013 08:40:30.438] >> [sparkExecutor-akka.actor.default-dispatcher-2] [akka://sparkExecutor/] >> started (akka.actor.LocalActorRefProvider$Guardian@4bf54c5f) >> [DEBUG] [10/30/2013 08:40:30.446] >> [sparkExecutor-akka.actor.default-dispatcher-3] [akka://sparkExecutor/user] >> started (akka.actor.LocalActorRefProvider$Guardian@72608760) >> [DEBUG] [10/30/2013 08:40:30.447] >> [sparkExecutor-akka.actor.default-dispatcher-4] >> [akka://sparkExecutor/system] started >> (akka.actor.LocalActorRefProvider$SystemGuardian@1f57ea4a) >> [DEBUG] [10/30/2013 08:40:30.454] >> [sparkExecutor-akka.actor.default-dispatcher-2] [akka://sparkExecutor/] now >> supervising Actor[akka://sparkExecutor/user] >> [DEBUG] [10/30/2013 08:40:30.454] >> [sparkExecutor-akka.actor.default-dispatcher-2] [akka://sparkExecutor/] now >> supervising Actor[akka://sparkExecutor/system] >> [DEBUG] [10/30/2013 08:40:30.468] >> [sparkExecutor-akka.actor.default-dispatcher-3] [akka://sparkExecutor/user] >> now monitoring Actor[akka://sparkExecutor/system] >> [DEBUG] [10/30/2013 08:40:30.468] >> [sparkExecutor-akka.actor.default-dispatcher-4] >> [akka://sparkExecutor/system] now monitoring Actor[akka://sparkExecutor/] >> [DEBUG] [10/30/2013 08:40:30.476] >> [sparkExecutor-akka.actor.default-dispatcher-3] >> [akka://sparkExecutor/system/log1-Slf4jLogger] started >> (akka.event.slf4j.Slf4jLogger@24988707) >> [DEBUG] [10/30/2013 08:40:30.477] >> [sparkExecutor-akka.actor.default-dispatcher-4] >> [akka://sparkExecutor/system] now supervising >> Actor[akka://sparkExecutor/system/log1-Slf4jLogger#719056881] >> >> (followed by similar mesages for the "spark" system) >> >> I dunno if this means much more to you, but it seems to me that for some >> reason the executor decides to disconnect from the master -- unfortunately >> we don't know why. I think my logging configuration is not getting applied >> correctly, or "log-sent-messages" & "log-received-messages" don't do what I >> think they do ... something conflicting must be turing that logging off. >> There are a zillion different remoting settings: >> http://doc.akka.io/docs/akka/snapshot/scala/remoting.html >> >> I feel like I really need to get those messages on why it disconnected to >> know which ones to play with. Any ideas for config changes to see those >> messages? >> >> thanks >> >> >> >> >> On Wed, Oct 30, 2013 at 10:09 AM, Prashant Sharma >> <[email protected]>wrote: >> >>> Can you apply this patch too and check the logs of Driver and worker. >>> >>> diff --git >>> a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala >>> b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala >>> index b6f0ec9..ad0ebf7 100644 >>> --- >>> a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala >>> +++ >>> b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala >>> @@ -132,7 +132,7 @@ class StandaloneSchedulerBackend(scheduler: >>> ClusterScheduler, actorSystem: Actor >>> // Remove a disconnected slave from the cluster >>> def removeExecutor(executorId: String, reason: String) { >>> if (executorActor.contains(executorId)) { >>> - logInfo("Executor " + executorId + " disconnected, so removing >>> it") >>> + logInfo("Executor " + executorId + " disconnected, so removing >>> it, reason:" + reason) >>> val numCores = freeCores(executorId) >>> actorToExecutorId -= executorActor(executorId) >>> addressToExecutorId -= executorAddress(executorId) >>> >>> >>> >>> >>> On Wed, Oct 30, 2013 at 8:18 PM, Imran Rashid <[email protected]>wrote: >>> >>>> I just realized something about the failing stages -- they generally >>>> occur in steps like this: >>>> >>>> rdd.mapPartitions{itr => >>>> val myCounters = initializeSomeDataStructure() >>>> itr.foreach{ >>>> //update myCounter in here >>>> ... >>>> } >>>> >>>> myCounters.iterator.map{ >>>> //some other transformation here ... >>>> } >>>> } >>>> >>>> that is, as a partition is processed, nothing gets output, we just >>>> accumulate some values. Only at the end of the partition do we output some >>>> accumulated values. >>>> >>>> These stages don't always fail, and generally they do succeed after the >>>> executor has died and a new one has started -- so I'm pretty confident its >>>> not a problem w/ the code. But maybe we need to add something like a >>>> periodic heartbeat in this kind of operation? >>>> >>>> >>>> >>>> On Wed, Oct 30, 2013 at 8:56 AM, Imran Rashid <[email protected]>wrote: >>>> >>>>> I'm gonna try turning on more akka debugging msgs as described at >>>>> http://akka.io/faq/ >>>>> and >>>>> >>>>> http://doc.akka.io/docs/akka/current/scala/testing.html#Tracing_Actor_Invocations >>>>> >>>>> unfortunately that will require a patch to spark, but hopefully that >>>>> will give us more info to go on ... >>>>> >>>>> >>>>> On Wed, Oct 30, 2013 at 8:10 AM, Prashant Sharma <[email protected] >>>>> > wrote: >>>>> >>>>>> I have things running (from scala 2.10 branch) for over 3-4 hours now >>>>>> without a problem and my jobs write data about the same as you suggested. >>>>>> My cluster size is 7 nodes and not *congested* for memory. I going to >>>>>> leave >>>>>> jobs running all night long. Meanwhile I had encourage you to try to spot >>>>>> the problem such that it is reproducible that can help a ton in fixing >>>>>> the >>>>>> issue. >>>>>> >>>>>> Thanks for testing and reporting your experience. I still feel there >>>>>> is something else wrong !. About tolerance for network connection >>>>>> timeouts, >>>>>> setting those properties should work, but I am afraid about >>>>>> Disassociation >>>>>> Event though. I will have to check this is indeed hard to reproduce bug >>>>>> if >>>>>> it is, I mean how do I simulate network delays ? >>>>>> >>>>>> >>>>>> On Wed, Oct 30, 2013 at 6:05 PM, Imran Rashid >>>>>> <[email protected]>wrote: >>>>>> >>>>>>> This is a spark-standalone setup (not mesos), on our own cluster. >>>>>>> >>>>>>> At first I thought it must be some temporary network problem too -- >>>>>>> but the times between receiving task completion events from an executor >>>>>>> and >>>>>>> declaring it failed are really small, so I didn't think that could >>>>>>> possibly >>>>>>> be it. Plus we tried increasing various akka timeouts, but that didn't >>>>>>> help. Or maybe there are some other spark / akka properities we should >>>>>>> be >>>>>>> setting? It certainly should be resilient to such a temporary network >>>>>>> issue, if that is the problem. >>>>>>> >>>>>>> btw, I think I've noticed this happens most often during >>>>>>> ShuffleMapTasks. The tasks write out very small amounts of data (64 MB >>>>>>> total for the entire stage). >>>>>>> >>>>>>> thanks >>>>>>> >>>>>>> On Wed, Oct 30, 2013 at 6:47 AM, Prashant Sharma < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Are you using mesos ? I admit to have not properly tested things on >>>>>>>> mesos though. >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Oct 30, 2013 at 11:31 AM, Prashant Sharma < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Those log messages are new to the Akka 2.2 and are usually seen >>>>>>>>> when a node is disassociated with other by either a network failure >>>>>>>>> or even >>>>>>>>> clean shutdown. This suggests some network issue to me, are you >>>>>>>>> running on >>>>>>>>> EC2 ? It might be a temporary thing in that case. >>>>>>>>> >>>>>>>>> I had like to have more details on the long jobs though, how long >>>>>>>>> ? >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Oct 30, 2013 at 1:29 AM, Imran Rashid < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> We've been testing out the 2.10 branch of spark, and we're >>>>>>>>>> running into some issues were akka disconnects from the executors >>>>>>>>>> after a >>>>>>>>>> while. We ran some simple tests first, and all was well, so we >>>>>>>>>> started >>>>>>>>>> upgrading our whole codebase to 2.10. Everything seemed to be >>>>>>>>>> working, but >>>>>>>>>> then we noticed that when we run long jobs, and then things start >>>>>>>>>> failing. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> The first suspicious thing is that we get akka warnings about >>>>>>>>>> undeliverable messages sent to deadLetters: >>>>>>>>>> >>>>>>>>>> 22013-10-29 11:03:54,577 [spark-akka.actor.default-dispatcher-17] >>>>>>>>>> INFO akka.actor.LocalActorRef - Message >>>>>>>>>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] >>>>>>>>>> from >>>>>>>>>> Actor[akka://spark/deadLetters] to >>>>>>>>>> Actor[akka://spark/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%4010.10.5.81%3A46572-3#656094700] >>>>>>>>>> was not delivered. [4] dead letters encountered. This logging can be >>>>>>>>>> turned >>>>>>>>>> off or adjusted with configuration settings 'akka.log-dead-letters' >>>>>>>>>> and >>>>>>>>>> 'akka.log-dead-letters-during-shutdown'. >>>>>>>>>> >>>>>>>>>> 2013-10-29 11:03:54,579 [spark-akka.actor.default-dispatcher-19] >>>>>>>>>> INFO akka.actor.LocalActorRef - Message >>>>>>>>>> [akka.remote.transport.AssociationHandle$Disassociated] from >>>>>>>>>> Actor[akka://spark/deadLetters] to >>>>>>>>>> Actor[akka://spark/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark%4010.10.5.81%3A46572-3#656094700] >>>>>>>>>> was not delivered. [5] dead letters encountered. This logging can be >>>>>>>>>> turned >>>>>>>>>> off or adjusted with configuration settings 'akka.log-dead-letters' >>>>>>>>>> and >>>>>>>>>> 'akka.log-dead-letters-during-shutdown'. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Generally within a few seconds after the first such message, >>>>>>>>>> there are a bunch more, and then the executor is marked as failed, >>>>>>>>>> and a >>>>>>>>>> new one is started: >>>>>>>>>> >>>>>>>>>> 2013-10-29 11:03:58,775 [spark-akka.actor.default-dispatcher-3] >>>>>>>>>> INFO akka.actor.LocalActorRef - Message >>>>>>>>>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] >>>>>>>>>> from >>>>>>>>>> Actor[akka://spark/deadLetters] to >>>>>>>>>> Actor[akka://spark/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkExecutor% >>>>>>>>>> 40dhd2.quantifind.com%3A45794-6#-890135716] was not delivered. >>>>>>>>>> [10] dead letters encountered, no more dead letters will be logged. >>>>>>>>>> This >>>>>>>>>> logging can be turned off or adjusted with configuration settings >>>>>>>>>> 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. >>>>>>>>>> >>>>>>>>>> 2013-10-29 11:03:58,778 [spark-akka.actor.default-dispatcher-17] >>>>>>>>>> INFO org.apache.spark.deploy.client.Client$ClientActor - Executor >>>>>>>>>> updated: >>>>>>>>>> app-20131029110000-0000/1 is now FAILED (Command exited with code 1) >>>>>>>>>> >>>>>>>>>> 2013-10-29 11:03:58,784 [spark-akka.actor.default-dispatcher-17] >>>>>>>>>> INFO org.apache.spark.deploy.client.Client$ClientActor - Executor >>>>>>>>>> added: >>>>>>>>>> app-20131029110000-0000/2 on >>>>>>>>>> worker-20131029105824-dhd2.quantifind.com-51544 ( >>>>>>>>>> dhd2.quantifind.com:51544) with 24 cores >>>>>>>>>> >>>>>>>>>> 2013-10-29 11:03:58,784 [spark-akka.actor.default-dispatcher-18] >>>>>>>>>> ERROR akka.remote.EndpointWriter - AssociationError [akka.tcp:// >>>>>>>>>> [email protected]:43068] -> [akka.tcp:// >>>>>>>>>> [email protected]:45794]: Error [Association >>>>>>>>>> failed with [akka.tcp://[email protected]:45794]] >>>>>>>>>> [ >>>>>>>>>> akka.remote.EndpointAssociationException: Association failed with >>>>>>>>>> [akka.tcp://[email protected]:45794] >>>>>>>>>> Caused by: >>>>>>>>>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: >>>>>>>>>> Connection refused: dhd2.quantifind.com/10.10.5.64:45794] >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Looking in the logs of the failed executor, there are some >>>>>>>>>> similar messages about undeliverable messages, but I don't see any >>>>>>>>>> reason: >>>>>>>>>> >>>>>>>>>> 13/10/29 11:03:52 INFO executor.Executor: Finished task ID 943 >>>>>>>>>> >>>>>>>>>> 13/10/29 11:03:53 INFO actor.LocalActorRef: Message >>>>>>>>>> [akka.actor.FSM$Timer] from Actor[akka://sparkExecutor/deadLetters] >>>>>>>>>> to >>>>>>>>>> Actor[akka://sparkExecutor/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark% >>>>>>>>>> 40ddd0.quantifind.com%3A43068-1#772172548] was not delivered. >>>>>>>>>> [1] dead letters encountered. This logging can be turned off or >>>>>>>>>> adjusted >>>>>>>>>> with configuration settings 'akka.log-dead-letters' and >>>>>>>>>> 'akka.log-dead-letters-during-shutdown'. >>>>>>>>>> >>>>>>>>>> 13/10/29 11:03:53 INFO actor.LocalActorRef: Message >>>>>>>>>> [akka.remote.transport.AssociationHandle$Disassociated] from >>>>>>>>>> Actor[akka://sparkExecutor/deadLetters] to >>>>>>>>>> Actor[akka://sparkExecutor/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark% >>>>>>>>>> 40ddd0.quantifind.com%3A43068-1#772172548] was not delivered. >>>>>>>>>> [2] dead letters encountered. This logging can be turned off or >>>>>>>>>> adjusted >>>>>>>>>> with configuration settings 'akka.log-dead-letters' and >>>>>>>>>> 'akka.log-dead-letters-during-shutdown'. >>>>>>>>>> >>>>>>>>>> 13/10/29 11:03:53 INFO actor.LocalActorRef: Message >>>>>>>>>> [akka.remote.transport.AssociationHandle$Disassociated] from >>>>>>>>>> Actor[akka://sparkExecutor/deadLetters] to >>>>>>>>>> Actor[akka://sparkExecutor/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark% >>>>>>>>>> 40ddd0.quantifind.com%3A43068-1#772172548] was not delivered. >>>>>>>>>> [3] dead letters encountered. This logging can be turned off or >>>>>>>>>> adjusted >>>>>>>>>> with configuration settings 'akka.log-dead-letters' and >>>>>>>>>> 'akka.log-dead-letters-during-shutdown'. >>>>>>>>>> >>>>>>>>>> 13/10/29 11:03:53 ERROR executor.StandaloneExecutorBackend: >>>>>>>>>> Driver terminated or disconnected! Shutting down. >>>>>>>>>> >>>>>>>>>> 13/10/29 11:03:53 INFO actor.LocalActorRef: Message >>>>>>>>>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] >>>>>>>>>> from >>>>>>>>>> Actor[akka://sparkExecutor/deadLetters] to >>>>>>>>>> Actor[akka://sparkExecutor/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fspark% >>>>>>>>>> 40ddd0.quantifind.com%3A43068-1#772172548] was not delivered. >>>>>>>>>> [4] dead letters encountered. This logging can be turned off or >>>>>>>>>> adjusted >>>>>>>>>> with configuration settings 'akka.log-dead-letters' and >>>>>>>>>> 'akka.log-dead-letters-during-shutdown'. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> After this happens, spark does launch a new executor >>>>>>>>>> successfully, and continue the job. Sometimes, the job just >>>>>>>>>> continues >>>>>>>>>> happily and there aren't any other problems. However, that executor >>>>>>>>>> may >>>>>>>>>> have to run a bunch of steps to re-compute some cached RDDs -- and >>>>>>>>>> during >>>>>>>>>> that time, another executor may crash similarly, and then we end up >>>>>>>>>> in a >>>>>>>>>> never ending loop, of one executor crashing, then trying to reload >>>>>>>>>> data, >>>>>>>>>> while the others sit around. >>>>>>>>>> >>>>>>>>>> I have no idea what is triggering this behavior -- there isn't >>>>>>>>>> any particular point in the job that it regularly occurs at. >>>>>>>>>> Certain steps >>>>>>>>>> seem more prone to this, but there isn't any step which regularly >>>>>>>>>> causes >>>>>>>>>> the problem. In a long pipeline of steps, though, that loop becomes >>>>>>>>>> very >>>>>>>>>> likely. I don't think its a timeout issue -- the initial failing >>>>>>>>>> executors >>>>>>>>>> can be actively completing stages just seconds before this failure >>>>>>>>>> happens. We did try adjusting some of the spark / akka timeouts: >>>>>>>>>> >>>>>>>>>> -Dspark.storage.blockManagerHeartBeatMs=300000 >>>>>>>>>> -Dspark.akka.frameSize=150 >>>>>>>>>> -Dspark.akka.timeout=120 >>>>>>>>>> -Dspark.akka.askTimeout=30 >>>>>>>>>> -Dspark.akka.logLifecycleEvents=true >>>>>>>>>> >>>>>>>>>> but those settings didn't seem to help the problem at all. I >>>>>>>>>> figure it must be some configuration with the new version of akka >>>>>>>>>> that >>>>>>>>>> we're missing, but we haven't found anything. Any ideas? >>>>>>>>>> >>>>>>>>>> our code works fine w/ the 0.8.0 release on scala 2.9.3. The >>>>>>>>>> failures occur on the tip of the scala-2.10 branch (5429d62d) >>>>>>>>>> >>>>>>>>>> thanks, >>>>>>>>>> Imran >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> s >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> s >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> s >>>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> s >>> >> >> > > > -- > s >
