Hey Olivier, I am also facing the same issue on my Kubernetes cluster (v1.11.5) on AWS with Spark version 2.3.3. Any luck in figuring out the root cause?
On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> Hi,
> I did not try on another vendor, so I can't say if it's only related to
> GKE, and no, I did not notice anything on the kubelet or kube-dns
> processes...
>
> Regards
>
> On Fri, May 3, 2019 at 03:05, Li Gao <ligao...@gmail.com> wrote:
>
>> Hi Olivier,
>>
>> This seems like a GKE-specific issue? Have you tried other vendors? Also,
>> on the kubelet nodes, did you notice any pressure on the DNS side?
>>
>> Li
>>
>> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
>>> and sometimes while running these jobs a pretty bad thing happens: the
>>> driver (in cluster mode) gets scheduled on Kubernetes and launches many
>>> executor pods.
>>> So far so good, but the k8s "Service" associated with the driver does not
>>> seem to be propagated in terms of DNS resolution, so all the executors
>>> fail with a "spark-application-......cluster.svc.local" does not exist
>>> error.
>>>
>>> With all executors failing, the driver should fail too, but it considers
>>> this a "pending" initial allocation and stays stuck forever in a loop of
>>> "Initial job has not accepted any resources, please check Cluster UI".
>>>
>>> Has anyone else observed this kind of behaviour?
>>> We had it on 2.3.1, and I upgraded to 2.4.1, but this issue still seems
>>> to exist even after the "big refactoring" in the Kubernetes cluster
>>> scheduler backend.
>>>
>>> I can work on a fix / workaround, but I'd like to check with you on the
>>> proper way forward:
>>>
>>> - Some processes (like the airflow helm recipe) rely on a "sleep 30s"
>>> before launching the dependent pods (that could be added to the
>>> /opt/entrypoint.sh used in the Kubernetes packaging).
>>> - We can add a simple step to the init container trying to do the DNS
>>> resolution and failing after 60s if it did not work.
>>>
>>> But these steps won't change the fact that the driver will stay stuck,
>>> thinking we're still in the case of the initial allocation delay.
>>>
>>> Thoughts?
>>>
>>> --
>>> *Olivier Girardot*
>>> o.girar...@lateral-thoughts.com
>>>

--
*Thanks,*
*Prudhvi Chennuru.*
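
P.S. In case it helps anyone hitting this: the second workaround Olivier describes (having the executor wait for the driver Service DNS record before starting, and giving up after 60s) can be sketched roughly as below. This is just a sketch assuming a POSIX shell with `getent` in the executor image; `wait_for_dns` and the driver-host variable are illustrative names, not part of Spark's actual entrypoint.

```shell
#!/bin/sh
# Sketch: block until a hostname becomes resolvable, or fail after a timeout.
# Intended to run before the executor JVM starts (e.g. from the entrypoint
# or an init container), so the executor doesn't crash on an unpropagated
# driver Service DNS record.
wait_for_dns() {
  host="$1"
  timeout="${2:-60}"   # seconds to wait before giving up (default 60s)
  waited=0
  # getent consults the container's resolver (kube-dns/CoreDNS via
  # /etc/resolv.conf), same path the JVM would use.
  until getent hosts "$host" > /dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "DNS for $host still not resolvable after ${timeout}s" >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
  return 0
}

# Hypothetical usage before handing off to the executor command;
# DRIVER_SVC_HOST would need to be derived from the driver URL Spark
# passes to the executor:
# wait_for_dns "$DRIVER_SVC_HOST" 60 || exit 1
```

As noted above, though, this only turns a crash-loop into a cleaner failure; it doesn't fix the driver sitting in the "Initial job has not accepted any resources" loop.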