Igor Kamyshnikov created SPARK-27967:
----------------------------------------
             Summary: Fault tolerance broken: race condition: a supervised Driver is sometimes completely removed instead of relaunched in a Standalone cluster when its Worker shuts down gracefully
                 Key: SPARK-27967
                 URL: https://issues.apache.org/jira/browse/SPARK-27967
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.2, 2.1.2
            Reporter: Igor Kamyshnikov


Synthetic test:
 1) run ZK
 2) run Master
 3) run Worker with a remote debugging agent (required to set the breakpoint that demonstrates the race condition)
 4) submit a long-running Driver with the --supervise flag
 5) connect to the Worker via a remote debugger
 6) set a breakpoint in the method org.apache.spark.deploy.worker.DriverRunner#kill:
{code:java}
  /** Terminate this driver (or prevent it from ever starting if not yet started) */
  private[worker] def kill(): Unit = {
    logInfo("Killing driver process!")
    killed = true
    synchronized {
      process.foreach { p =>
        val exitCode = Utils.terminateProcess(p, DRIVER_TERMINATE_TIMEOUT_MS)
        if (exitCode.isEmpty) { // <<<<<<<<<<<<< BREAKPOINT <<<<<<<<<<<<<<
          logWarning("Failed to terminate driver process: " + p +
              ". This process will likely be orphaned.")
        }
      }
    }
  }
{code}
 7) send SIGTERM to the Worker (or CTRL+C on Windows)
 8) check the Spark Master web UI: the Driver appears in the *Completed Drivers* section with state *KILLED*

Without the breakpoint, it is more likely that a new row with state *RELAUNCHING* would appear in the *Completed Drivers* section and a row with state *SUBMITTED* would remain in the *Running Drivers* section.

Explanation:
 1) The Spark Master relaunches a supervised driver in response to the "channelInactive" callback (org.apache.spark.rpc.netty.NettyRpcHandler#channelInactive), which is triggered when the Worker process terminates.
 2) DriverRunner registers a shutdown hook in org.apache.spark.deploy.worker.DriverRunner#start which calls the "kill" method shown above. Killing the driver can then lead to reaching the following lines in the DriverRunner.start method:
{noformat}
// notify worker of final driver state, possible exception
worker.send(DriverStateChanged(driverId, finalState.get, finalException))
{noformat}
If this notification reaches the Master before the Master notices the closed channel, the Driver is removed from the cluster as KILLED instead of being relaunched.

Real-world scenario (ver. 2.1.2): ZK, two Masters; the active one loses its leadership and another becomes the new leader. The Workers attempt to re-register with the new Master, but they report that they failed to do so and execute *System.exit(1)* from org.apache.spark.deploy.worker.Worker#registerWithMaster. This System.exit results in the shutdown hooks being executed, and somehow the DriverStateChanged message still reaches the new Master.
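To make the hook ordering concrete, here is a minimal, self-contained JVM sketch (illustrative code, not Spark internals; the object name and printed message are invented for this example). It shows that System.exit runs every registered shutdown hook to completion before the process dies, which is why the final-state notification can still be sent while the Worker is going away:
{code:java}
// Standalone Scala sketch of the shutdown-hook ordering (not Spark source).
object ShutdownHookRaceSketch {
  def main(args: Array[String]): Unit = {
    // Analogue of the hook registered in DriverRunner#start:
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit =
        // Stands in for worker.send(DriverStateChanged(driverId, KILLED, ...))
        println("hook: driver killed, final state reported")
    }))
    // Analogue of Worker#registerWithMaster giving up on re-registration:
    System.exit(1) // the hook above runs (and "reports") before the JVM exits
  }
}
{code}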
*Worker* logs:
{noformat}
19/06/03 14:05:30 INFO Worker: Retrying connection to master (attempt # 5)
19/06/03 14:05:30 INFO Worker: Connecting to master 10.0.0.16:7077...
19/06/03 14:05:33 INFO Worker: Master has changed, new master is at spark://10.0.0.17:7077
19/06/03 14:05:33 ERROR TransportResponseHandler: Still have 4 requests outstanding when connection from /10.0.0.16:7077 is closed
19/06/03 14:05:33 ERROR Worker: Cannot register with master: 10.0.0.16:7077
java.io.IOException: Connection from /10.0.0.16:7077 closed
	at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
	at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
19/06/03 14:05:33 INFO DriverRunner: Worker shutting down, killing driver driver-20190508103758-0002
19/06/03 14:05:33 INFO DriverRunner: Killing driver process!
...
{noformat}
*Master* logs:
{noformat}
19/06/03 14:05:33 INFO Master: I have been elected leader! New state: RECOVERING
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190524081659-0002
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190524081656-0001
19/06/03 14:05:33 INFO Master: Trying to recover app: app-20190522170759-0002
19/06/03 14:05:33 INFO Master: Trying to recover worker: worker-20190522164806-10.0.0.16-34107
19/06/03 14:05:33 INFO Master: Trying to recover worker: worker-20190522164804-10.0.0.17-35662
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.16:37487 after 10 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:34591 after 10 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:45169 after 11 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.16:34107 after 2 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO TransportClientFactory: Successfully created connection to /10.4.148.17:35662 after 4 ms (0 ms spent in bootstraps)
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190524081656-0001
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190524081659-0002
19/06/03 14:05:33 INFO Master: Application has been re-registered: app-20190522170759-0002
19/06/03 14:05:33 INFO Master: Worker has been re-registered: worker-20190522164806-10.0.0.16-34107
19/06/03 14:05:33 INFO Master: Worker has been re-registered: worker-20190522164804-10.0.0.17-35662
19/06/03 14:05:33 INFO Master: Recovery complete - resuming operations!
19/06/03 14:05:34 INFO Master: Received unregister request from application app-20190524081659-0002
19/06/03 14:05:34 INFO Master: Removing app app-20190524081659-0002
19/06/03 14:05:35 INFO Master: 10.0.0.16:37487 got disassociated, removing it.
19/06/03 14:05:35 INFO Master: 10.0.0.16:37487 got disassociated, removing it.
19/06/03 14:05:38 INFO Master: Removing driver: driver-20190508103758-0002
19/06/03 14:05:39 WARN Master: Got status update for unknown executor app-20190524081659-0002/0
19/06/03 14:05:39 WARN Master: Got status update for unknown executor app-20190524081659-0002/1
19/06/03 14:05:43 INFO Master: 10.0.0.17:34591 got disassociated, removing it.
19/06/03 14:05:43 INFO Master: Removing app app-20190522170759-0002
19/06/03 14:05:43 INFO Master: 10.0.0.17:34591 got disassociated, removing it.
19/06/03 14:05:45 INFO Master: Removing driver: driver-20190522170754-0002
{noformat}
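The Master log line "Removing driver: driver-20190508103758-0002" (with no matching relaunch) shows the losing ordering: the shutdown hook's KILLED notification reached the new Master before the Worker's disconnect was processed. The following runnable toy model illustrates the Master-side decision; the names relaunchDriver/removeDriver echo org.apache.spark.deploy.master.Master, but the code is a simplified sketch, not verbatim Spark source:
{code:java}
// Toy model of the two competing Master paths (illustrative only).
object MasterRaceSketch {
  sealed trait Event
  // Path A: the Worker's RPC channel goes inactive (onDisconnected -> removeWorker).
  case object WorkerChannelInactive extends Event
  // Path B: the shutdown hook's final-state message arrives first.
  case class DriverStateChanged(state: String) extends Event

  // Whichever event the Master processes first decides the driver's fate.
  def decide(firstEvent: Event, supervise: Boolean): String = firstEvent match {
    case WorkerChannelInactive if supervise =>
      "RELAUNCHING" // removeWorker -> relaunchDriver(driver)
    case WorkerChannelInactive =>
      "ERROR"       // non-supervised drivers are simply removed
    case DriverStateChanged(state) =>
      state         // removeDriver(driverId, KILLED, ...): never relaunched
  }

  def main(args: Array[String]): Unit = {
    println(decide(WorkerChannelInactive, supervise = true))        // RELAUNCHING (desired)
    println(decide(DriverStateChanged("KILLED"), supervise = true)) // KILLED (the bug)
  }
}
{code}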