Hi Prudhvi, not really, but we took a drastic approach to mitigate this, modifying the bundled launch script to be more resilient. In kubernetes/dockerfiles/spark/entrypoint.sh, in the executor case, we added something like this:
  executor)
    DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 1)
    DRIVER_PORT=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 2)
    for i in $(seq 1 20); do
      nc -zvw1 $DRIVER_HOST $DRIVER_PORT
      status=$?
      if [ $status -eq 0 ]
      then
        echo "Driver is accessible, let's rock'n'roll."
        break
      else
        echo "Driver not accessible :-| napping for a while..."
        sleep 3
      fi
    done
    CMD=(
      ${JAVA_HOME}/bin/java
      ....

That way the executor will not start before the driver is actually reachable. It's kind of a hack, but we have not experienced the issue since, so I guess I'll keep it for now.

Regards,
Olivier.

On Tue, Jun 11, 2019 at 18:23, Prudhvi Chennuru (CONT) <prudhvi.chenn...@capitalone.com> wrote:

> Hey Olivier,
>
> I am also facing the same issue on my kubernetes cluster (v1.11.5) on AWS with
> spark version 2.3.3, any luck in figuring out the root cause?
>
> On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
>> Hi,
>> I did not try on another vendor, so I can't say if it's only related to GKE,
>> and no, I did not notice anything on the kubelet or kube-dns processes...
>>
>> Regards
>>
>> On Fri, May 3, 2019 at 03:05, Li Gao <ligao...@gmail.com> wrote:
>>
>>> hi Olivier,
>>>
>>> This seems like a GKE-specific issue? Have you tried other vendors? Also,
>>> on the kubelet nodes, did you notice any pressure on the DNS side?
>>>
>>> Li
>>>
>>>
>>> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Hi everyone,
>>>> I have ~300 spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
>>>> and sometimes while running these jobs a pretty bad thing happens: the
>>>> driver (in cluster mode) gets scheduled on Kubernetes and launches many
>>>> executor pods.
>>>> So far so good, but the k8s "Service" associated with the driver does not
>>>> seem to be propagated in terms of DNS resolution, so all the executors fail
>>>> with a "spark-application-......cluster.svc.local" does not exist.
>>>>
>>>> With all executors failing, the driver should fail too, but it considers
>>>> that it's a "pending" initial allocation and stays stuck forever in a loop
>>>> of "Initial job has not accepted any resources, please check Cluster UI".
>>>>
>>>> Has anyone else observed this kind of behaviour?
>>>> We had it on 2.3.1, and I upgraded to 2.4.1, but this issue still seems to
>>>> exist even after the "big refactoring" in the kubernetes cluster scheduler
>>>> backend.
>>>>
>>>> I can work on a fix / workaround, but I'd like to check with you the proper
>>>> way forward:
>>>>
>>>> - Some processes (like the airflow helm recipe) rely on a "sleep 30s"
>>>>   before launching the dependent pods (that could be added to
>>>>   /opt/entrypoint.sh used in the kubernetes packaging)
>>>> - We can add a simple step to the init container trying to do the DNS
>>>>   resolution and failing after 60s if it did not work
>>>>
>>>> But these steps won't change the fact that the driver will stay stuck
>>>> thinking we're still in the case of the initial allocation delay.
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> *Olivier Girardot*
>>>> o.girar...@lateral-thoughts.com
>>>
>
> --
> *Thanks,*
> *Prudhvi Chennuru.*
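For what it's worth, the init-container alternative mentioned in the quoted thread could be sketched roughly like the snippet below. This is only a minimal sketch, not something we run: the SPARK_DRIVER_SVC variable and the use of nslookup (available in busybox-style images) are assumptions, and the 20 x 3s loop is just one way to approximate the 60s budget.

  # Hypothetical init-container command: wait up to ~60s for the driver
  # Service DNS record to resolve, then fail the pod if it never does.
  # SPARK_DRIVER_SVC is an assumed env var holding the driver Service name.
  for i in $(seq 1 20); do
    if nslookup "$SPARK_DRIVER_SVC" > /dev/null 2>&1; then
      echo "Driver service resolvable, letting the executor start."
      exit 0
    fi
    echo "Driver service not resolvable yet, retrying..."
    sleep 3
  done
  echo "Driver service never became resolvable, giving up." >&2
  exit 1

Failing fast like this at least surfaces the DNS propagation problem instead of letting the executor crash on an unresolvable driver URL, but as noted in the thread it does not help the driver get out of its "initial allocation" loop.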
--
*Olivier Girardot* | Partner
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94