Hello,

We are using spark-jobserver to spawn jobs in a Spark cluster. We have recently faced issues with zombie jobs in the cluster. This normally happens when a job accesses external resources such as Kafka or C* and something goes wrong while consuming them, for example if a topic that is being consumed is suddenly deleted in Kafka, or the connection to the whole Kafka cluster breaks.
Within spark-jobserver, we have the option to delete the context/jobs in such scenarios. When we delete a job <https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/main/scala/spark/jobserver/JobManagerActor.scala#L228>, internally context.cancelJobGroup(<jobId>) is used. When we delete a context <https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/main/scala/spark/jobserver/JobManagerActor.scala#L148>, internally context.stop(true, true) is executed. In both cases, even after we delete the job/context, the application on the Spark cluster (sometimes) keeps running and some jobs continue executing within Spark.

Here are the logs of one such scenario. The job context was stopped, but the job kept on running and became a zombie:

2018-02-28 15:36:50,931 INFO ForkJoinPool-3-worker-13 org.apache.kafka.common.utils.AppInfoParser []: Kafka version : 0.11.0.1-SNAPSHOT
2018-02-28 15:36:50,931 INFO ForkJoinPool-3-worker-13 org.apache.kafka.common.utils.AppInfoParser []: Kafka commitId : de8225b66d494cd
2018-02-28 15:36:51,144 INFO dispatcher-event-loop-5 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint []: Registered executor NettyRpcEndpointRef(null) (10.10.10.15:46224) with ID 1
2018-02-28 15:38:58,254 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:41:05,485 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:42:07,074 WARN JobServer-akka.actor.default-dispatcher-3 akka.cluster.ClusterCoreDaemon []: Cluster Node [akka.tcp://JobServer@127.0.0.1:43319] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://JobServer@127.0.0.1:37343, status = Up)]. Node roles [manager]

Later at some point, we see the following logs.
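For reference, the two deletion paths above boil down to roughly the following (a simplified sketch of our understanding, not jobserver's actual code; the jobId and the two boolean flags on the streaming stop mirror the linked lines):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext

// Sketch of what happens on job deletion: jobs started under a group id
// are cancelled as a group. Running tasks are only interrupted if they
// were submitted with interruptOnCancel = true.
def deleteJob(sc: SparkContext, jobId: String): Unit =
  sc.cancelJobGroup(jobId)

// Sketch of what happens on context deletion for a streaming context:
// stop(stopSparkContext = true, stopGracefully = true) waits for the
// processing of already-received data to finish before shutting down.
def deleteContext(ssc: StreamingContext): Unit =
  ssc.stop(true, true)
```

Our concern is that neither path seems to forcibly terminate tasks that are blocked inside external clients (e.g. a Kafka consumer retrying unreachable brokers).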
It seems that none of the Kafka nodes were reachable from the Spark job. The job kept on retrying and became a zombie:

2018-02-28 15:43:12,717 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:45:19,949 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:47:27,180 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:49:34,412 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:51:41,644 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:53:48,877 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:55:56,109 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:58:03,340 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 16:00:10,572 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 16:02:17,804 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.

We have a similar scenario with zombie contexts. The logs are in the gist below.
https://gist.github.com/bsikander/697d85e2352a650437a922752328a90f

In the gist, you can see that the topic was never created, yet the job tried to use it. When we then tried to delete the job, it became a zombie and kept logging:

"Block rdd_13011_0 already exists on this machine; not re-adding it"

So my question is: what is the right way to kill the jobs running within a context, or the context/application itself, without leaving these zombies behind?

Regards,
Behroz
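One thing we are unsure about: our understanding is that cancelJobGroup only interrupts running task threads if the jobs were submitted with interruptOnCancel = true; otherwise a task blocked in I/O (e.g. a Kafka poll against unreachable brokers) may never notice the cancellation. A minimal sketch of what we mean (group id and app name are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("cancel-demo").setMaster("local[2]"))

// interruptOnCancel = true asks Spark to Thread.interrupt() the task
// threads when the group is cancelled; without it, cancellation only
// marks the tasks as killed and blocked tasks can keep running.
sc.setJobGroup("my-group", "cancellable work", interruptOnCancel = true)

// ... submit jobs under this group here ...

sc.cancelJobGroup("my-group")
sc.stop()
```

Is this flag (or something like spark.task.reaper) the intended mechanism for this situation, or is there a better approach?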