Hello,

We are using spark-jobserver to spawn jobs in a Spark cluster. We have recently faced issues with zombie jobs in the cluster. This normally happens when a job accesses external resources such as Kafka or C* and something goes wrong while consuming them, for example if a topic that is being consumed is suddenly deleted in Kafka, or the connection to the whole Kafka cluster breaks.
Within spark-jobserver, we have the option to delete the context/jobs in such scenarios. When we delete a job <https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/main/scala/spark/jobserver/JobManagerActor.scala#L228>, internally context.cancelJobGroup(<jobId>) is used. When we delete a context <https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/main/scala/spark/jobserver/JobManagerActor.scala#L148>, internally context.stop(true, true) is executed. In both cases, even after we delete the job/context, the application on the Spark cluster (sometimes) keeps running and some jobs continue executing within Spark.

Here are the logs of one such scenario. The job context was stopped, but the job kept on running and became a zombie:

2018-02-28 15:36:50,931 INFO ForkJoinPool-3-worker-13 org.apache.kafka.common.utils.AppInfoParser []: Kafka version : 0.11.0.1-SNAPSHOT
2018-02-28 15:36:50,931 INFO ForkJoinPool-3-worker-13 org.apache.kafka.common.utils.AppInfoParser []: Kafka commitId : de8225b66d494cd
2018-02-28 15:36:51,144 INFO dispatcher-event-loop-5 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint []: Registered executor NettyRpcEndpointRef(null) (10.10.10.15:46224) with ID 1
2018-02-28 15:38:58,254 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:41:05,485 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:42:07,074 WARN JobServer-akka.actor.default-dispatcher-3 akka.cluster.ClusterCoreDaemon []: Cluster Node [akka.tcp://JobServer@127.0.0.1:43319] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://JobServer@127.0.0.1:37343, status = Up)]. Node roles [manager]

Later at some point, we see the following logs.
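For reference, the two deletion paths above boil down to roughly the following (a simplified sketch of our understanding, not jobserver's actual code; the jobId and the two boolean flags on the streaming stop mirror the linked lines):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext

// Sketch of what happens on job deletion: jobs started under a group id
// are cancelled as a group. Running tasks are only interrupted if they
// were submitted with interruptOnCancel = true.
def deleteJob(sc: SparkContext, jobId: String): Unit =
  sc.cancelJobGroup(jobId)

// Sketch of what happens on context deletion for a streaming context:
// stop(stopSparkContext = true, stopGracefully = true) waits for the
// processing of already-received data to finish before shutting down.
def deleteContext(ssc: StreamingContext): Unit =
  ssc.stop(true, true)
```

Our concern is that neither path seems to forcibly terminate tasks that are blocked inside external clients (e.g. a Kafka consumer retrying unreachable brokers).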
It seems that none of the Kafka nodes were reachable from the Spark job. The job kept on retrying and became a zombie:

2018-02-28 15:43:12,717 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:45:19,949 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:47:27,180 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:49:34,412 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:51:41,644 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:53:48,877 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:55:56,109 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:58:03,340 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 16:00:10,572 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 16:02:17,804 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.

We have a similar scenario with zombie contexts. The logs are in the gist below.
https://gist.github.com/bsikander/697d85e2352a650437a922752328a90f

In the gist, you can see that the topic was never created, yet the job tried to use it. When we then tried to delete the job, it became a zombie and kept logging:

"Block rdd_13011_0 already exists on this machine; not re-adding it"

So my question is: what is the right way to kill the jobs running within a context, or the context/application itself, without leaving these zombies behind?

Regards,
Behroz
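One thing we are unsure about: our understanding is that cancelJobGroup only interrupts running task threads if the jobs were submitted with interruptOnCancel = true; otherwise a task blocked in I/O (e.g. a Kafka poll against unreachable brokers) may never notice the cancellation. A minimal sketch of what we mean (group id and app name are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("cancel-demo").setMaster("local[2]"))

// interruptOnCancel = true asks Spark to Thread.interrupt() the task
// threads when the group is cancelled; without it, cancellation only
// marks the tasks as killed and blocked tasks can keep running.
sc.setJobGroup("my-group", "cancellable work", interruptOnCancel = true)

// ... submit jobs under this group here ...

sc.cancelJobGroup("my-group")
sc.stop()
```

Is this flag (or something like spark.task.reaper) the intended mechanism for this situation, or is there a better approach?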