Hi Prudhvi,
not really, but we took a drastic approach to mitigating this by modifying
the bundled launch script to be more resilient.
In kubernetes/dockerfiles/spark/entrypoint.sh, in the executor case, we
added something like this:

  executor)
    # Parse the driver host and port out of SPARK_DRIVER_URL (...@<host>:<port>)
    DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 1)
    DRIVER_PORT=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 2)

    # Wait up to ~60s (20 attempts, 3s apart) for the driver port to be reachable
    for i in $(seq 1 20);
    do
      nc -zvw1 $DRIVER_HOST $DRIVER_PORT
      status=$?
      if [ $status -eq 0 ]
      then
        echo "Driver is accessible, let's rock'n'roll."
        break
      else
        echo "Driver not accessible :-| napping for a while..."
        sleep 3
      fi
    done

    CMD=(
      ${JAVA_HOME}/bin/java
    ....


That way the executor will not start before the driver is actually
reachable.
It's kind of a hack, but we have not experienced the issue since, so I
guess I'll keep it for now.
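
If you would rather have the executor fail fast when the driver never
becomes reachable (instead of falling through to the java command after the
20 attempts), a small variant of the loop could look like this (untested
sketch, same DRIVER_HOST / DRIVER_PORT variables as above, plus a local
"connected" flag):

    # Untested sketch: exit non-zero after ~60s instead of falling through
    # and letting the executor die later on the unreachable driver.
    connected=0
    for i in $(seq 1 20);
    do
      if nc -zvw1 $DRIVER_HOST $DRIVER_PORT
      then
        echo "Driver is accessible, let's rock'n'roll."
        connected=1
        break
      fi
      echo "Driver not accessible :-| napping for a while..."
      sleep 3
    done
    if [ $connected -eq 0 ]
    then
      echo "Driver still not accessible after 20 attempts, giving up." >&2
      exit 1
    fi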

Regards,

Olivier.

On Tue, Jun 11, 2019 at 6:23 PM Prudhvi Chennuru (CONT) <
prudhvi.chenn...@capitalone.com> wrote:

> Hey Olivier,
>
>                      I am also facing the same issue on my Kubernetes
> cluster (v1.11.5) on AWS with Spark version 2.3.3. Any luck in figuring out
> the root cause?
>
> On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Hi,
>> I did not try on another vendor, so I can't say if it's only related to
>> GKE, and no, I did not notice anything on the kubelet or kube-dns
>> processes...
>>
>> Regards
>>
>>> On Fri, May 3, 2019 at 3:05 AM Li Gao <ligao...@gmail.com> wrote:
>>
>>> hi Olivier,
>>>
>>> This seems like a GKE-specific issue? Have you tried on other vendors? Also,
>>> on the kubelet nodes, did you notice any pressure on the DNS side?
>>>
>>> Li
>>>
>>>
>>> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Hi everyone,
>>>> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster
>>>> auto-scaler, and sometimes while running these jobs a pretty bad thing
>>>> happens: the driver (in cluster mode) gets scheduled on Kubernetes and
>>>> launches many executor pods.
>>>> So far so good, but the k8s "Service" associated with the driver does not
>>>> seem to be propagated in terms of DNS resolution, so all the executors fail
>>>> with a "spark-application-......cluster.svc.local" does not exist error.
>>>>
>>>> With all executors failing, the driver should fail too, but it
>>>> considers this a "pending" initial allocation and stays stuck forever
>>>> in a loop of "Initial job has not accepted any resources, please check
>>>> Cluster UI"
>>>>
>>>> Has anyone else observed this kind of behaviour?
>>>> We had it on 2.3.1, and I upgraded to 2.4.1 but this issue still seems
>>>> to exist even after the "big refactoring" in the kubernetes cluster
>>>> scheduler backend.
>>>>
>>>> I can work on a fix / workaround, but I'd like to check with you on the
>>>> proper way forward:
>>>>
>>>>    - Some processes (like the airflow helm recipe) rely on a "sleep
>>>>    30s" before launching the dependent pods (that could be added to
>>>>    /opt/entrypoint.sh used in the kubernetes packaging)
>>>>    - We can add a simple step to the init container trying to do the
>>>>    DNS resolution and failing after 60s if it did not work (see the
>>>>    rough sketch after this list)
>>>>
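>>>> For the init container option, a rough, untested sketch of that step
>>>> could look like this (assuming the init container can derive the driver
>>>> host the same way the entrypoint does, from SPARK_DRIVER_URL):
>>>>
>>>>     # rough sketch: retry DNS resolution of the driver service for up to
>>>>     # ~60s (20 attempts, 3s apart) and fail the pod if it never resolves
>>>>     DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 1)
>>>>     for i in $(seq 1 20);
>>>>     do
>>>>       if nslookup $DRIVER_HOST
>>>>       then
>>>>         exit 0
>>>>       fi
>>>>       sleep 3
>>>>     done
>>>>     exit 1
>>>>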
>>>> But these steps won't change the fact that the driver will stay stuck,
>>>> thinking we're still within the initial allocation delay.
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> *Olivier Girardot*
>>>> o.girar...@lateral-thoughts.com
>>>>
>>>
>
> --
> *Thanks,*
> *Prudhvi Chennuru.*
>


-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
