Unfortunately, that change wasn't the silver bullet I was hoping for. Even with

1) ignoring DisassociatedEvent,
2) the executor using ReliableProxy to send messages back to the driver, and
3) akka.remote.watch-failure-detector.threshold turned up to 12,

there is still a lot of weird behavior. First, there are a few DisassociatedEvents, but some of them are followed by AssociatedEvents, so that seems OK. But sometimes the re-associations are immediately followed by this:

13/10/31 18:51:10 INFO executor.StandaloneExecutorBackend: got lifecycleevent: AssociationError [akka.tcp://sparkExecutor@<executor>:41441] -> [akka.tcp://spark@<driver>:41321]: Error [Invalid address: akka.tcp://spark@<driver>:41321] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://spark@<driver>:41321
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
]

On the driver, there are messages like:

[INFO] [10/31/2013 18:51:07.838] [spark-akka.actor.default-dispatcher-3] [Remoting] Address [akka.tcp://sparkExecutor@<executor>:46123] is now quarantined, all messages to this address will be delivered to dead letters.
[WARN] [10/31/2013 18:51:10.845] [spark-akka.actor.default-dispatcher-20] [akka://spark/system/remote-watcher] Detected unreachable: [akka.tcp://sparkExecutor@<executor>:41441]

And when the driver does decide that the executor has been terminated, it removes the executor but doesn't start another one.

There are also a ton of messages about messages to the block manager master ...

I'm wondering if there are other parts of the system that need to use a reliable proxy (or some sort of acknowledgement). I really don't think this was working properly even w/ previous versions of Spark / Akka. I'm still learning about Akka, but I think you always need an ack to be confident w/ remote communication. Perhaps the old version of Akka just had more robust defaults or something, but I bet it could still have the same problems.
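To make the "you always need an ack" point concrete, here is a minimal sketch of ack-plus-resend, written in plain Python rather than Akka so it stands on its own. Everything in it (Driver, Executor, lossy_link, the scripted drops) is hypothetical and made up for illustration; it is not Spark or Akka API, and it is not meant to describe how ReliableProxy is implemented. The sender keeps each task-finished message until it sees an ack, and the receiver dedupes because a resend can arrive after the original was already processed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Driver:
    """Hypothetical receiver: records each task id once and acknowledges it."""
    finished: List[int] = field(default_factory=list)
    _seen: Set[int] = field(default_factory=set)

    def receive(self, task_id: int) -> int:
        # A resend may arrive after the original was processed, so dedupe.
        if task_id not in self._seen:
            self._seen.add(task_id)
            self.finished.append(task_id)
        return task_id  # the ack is just the task id echoed back


@dataclass
class Executor:
    """Hypothetical sender: keeps every message until its ack comes back."""
    send: Callable[[int], Optional[int]]  # returns the ack, or None if lost
    pending: Set[int] = field(default_factory=set)

    def task_finished(self, task_id: int) -> None:
        self.pending.add(task_id)

    def flush(self, max_rounds: int = 10) -> bool:
        """Resend unacked messages; True once nothing is left pending."""
        for _ in range(max_rounds):
            for task_id in sorted(self.pending):
                ack = self.send(task_id)
                if ack is not None:
                    self.pending.discard(ack)
            if not self.pending:
                return True
        return False


def lossy_link(driver: Driver, script: List[str]) -> Callable[[int], Optional[int]]:
    """Simulated network. Each send consumes one scripted outcome:
    "drop" loses the message, "drop_ack" delivers it but loses the ack,
    and anything after the script runs out is delivered normally."""
    outcomes = iter(script)

    def send(task_id: int) -> Optional[int]:
        outcome = next(outcomes, "ok")
        if outcome == "drop":
            return None
        ack = driver.receive(task_id)
        return None if outcome == "drop_ack" else ack

    return send


# Three tasks finish; the first send is lost and the second loses its ack.
driver = Driver()
executor = Executor(send=lossy_link(driver, ["drop", "drop_ack"]))
for t in (1, 2, 3):
    executor.task_finished(t)
delivered = executor.flush()
```

Without the pending set and the resend loop, task 1 would simply vanish and the driver would wait on it forever, which matches the hang described above. The dedupe on the receiving side is what keeps task 2 from being counted twice after its ack is lost.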
Even before, I have seen the driver thinking there were running tasks, but nothing happening on any executor -- it was just rare enough (and hard to reproduce) that I never bothered looking into it more.

I will keep digging ...

On Thu, Oct 31, 2013 at 4:36 PM, Matei Zaharia <[email protected]> wrote:

> BTW the problem might be the Akka failure detector settings that seem new
> in 2.2: http://doc.akka.io/docs/akka/2.2.3/scala/remoting.html
>
> Their timeouts seem pretty aggressive by default -- around 10 seconds. This
> can easily be too little if you have large garbage collections. We should
> make sure they are higher than our own node failure detection timeouts.
>
> Matei
>
> On Oct 31, 2013, at 1:33 PM, Imran Rashid <[email protected]> wrote:
>
> pretty sure I found the problem -- two problems actually. And I think one
> of them has been a general lurking problem w/ spark for a while.
>
> 1) we should ignore disassociation events, as you suggested earlier.
> They seem to just indicate a temporary problem, and can generally be
> ignored. I've found that they're regularly followed by AssociatedEvents,
> and it seems communication really works fine at that point.
>
> 2) Task finished messages get lost. When this message gets sent, we don't
> know it actually gets there:
>
> https://github.com/apache/incubator-spark/blob/scala-2.10/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala#L90
>
> (this is so incredible, I feel I must be overlooking something -- but
> there is no ack somewhere else that I'm overlooking, is there??) So, after
> the patch, spark wasn't hanging b/c of the unhandled DisassociatedEvent.
> It hangs b/c the executor has sent some taskFinished messages that never
> get received by the driver. So the driver is waiting for some tasks to
> finish, but the executors think they are all done.
>
> I'm gonna add the reliable proxy pattern for this particular interaction
> and see if it fixes the problem
>
> http://doc.akka.io/docs/akka/2.2.3/contrib/reliable-proxy.html#introducing-the-reliable-proxy
>
> imran
