Hi everyone,
I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
and sometimes while running these jobs a pretty bad thing happens: the
driver (in cluster mode) gets scheduled on Kubernetes and launches many
executor pods.
So far so good, but the k8s "Service" associated with the driver does not
seem to be propagated in terms of DNS resolution, so all the executors fail
with a "spark-application-......cluster.svc.local" does not exist error.

With all the executors failing, the driver should fail too, but instead it
considers this a "pending" initial allocation and stays stuck forever in a
loop of "Initial job has not accepted any resources, please check Cluster UI".

Has anyone else observed this kind of behaviour?
We had it on 2.3.1, and I upgraded to 2.4.1, but the issue still seems to
exist even after the "big refactoring" of the Kubernetes cluster scheduler
backend.

I can work on a fix / workaround, but I'd like to check with you on the
proper way forward:

   - Some processes (like the Airflow Helm recipe) rely on a "sleep 30s"
   before launching the dependent pods (that could be added to
   /opt/entrypoint.sh used in the Kubernetes packaging)
   - We can add a simple step to the init container trying to do the DNS
   resolution and failing after 60s if it did not work (see the sketch
   below)

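For the init-container option, a minimal sketch of the DNS wait could look
like the following. This assumes the init container already knows the
driver Service hostname (passed here as an illustrative DRIVER_SVC_HOST
variable) and that getent is available in the base image; nslookup would
work the same way:

   # Wait for the driver Service DNS entry to become resolvable, giving up
   # after ~60s (30 attempts x 2s) so the pod fails fast with a clear error.
   wait_for_dns() {
     host="$1"
     attempts=0
     until getent hosts "$host" > /dev/null 2>&1; do
       attempts=$((attempts + 1))
       if [ "$attempts" -ge 30 ]; then
         echo "DNS resolution of $host failed after ~60s" >&2
         return 1
       fi
       sleep 2
     done
   }

   wait_for_dns "$DRIVER_SVC_HOST" || exit 1

The same loop could also be prepended to /opt/entrypoint.sh for the
executor case instead of a blind "sleep 30s".
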
But these steps won't change the fact that the driver stays stuck,
thinking we're still within the initial allocation delay.

Thoughts?

-- 
*Olivier Girardot*
o.girar...@lateral-thoughts.com
