Hi Yang & Till,

Thanks for your prompt reply!

Yang, regarding your question, I am actually not using a k8s Job, as I put
my app.jar and its dependencies under Flink's lib directory. I have one k8s
Deployment for the job manager, one k8s Deployment for the task manager,
and one k8s Service for the job manager.
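
For reference, my job manager Deployment looks roughly like the sketch
below (the names, labels, and image are simplified placeholders, not my
exact manifest; the task manager Deployment and the Service are analogous):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          # Custom image with app.jar and its dependencies under /flink/lib.
          image: my-flink-job:1.8.2
          # "job-cluster" entrypoint; job-specific arguments omitted here.
          args: ["job-cluster"]

Note that Deployment pods always run with restartPolicy: Always, so the
container is restarted regardless of its exit code.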

As you mentioned above, if the Flink job is marked as FAILED, the job
manager pod will be restarted, which is not the ideal behavior.

Do you suggest that I change the deployment strategy from a k8s Deployment
to a k8s Job? That way, in case the Flink program exits with a non-zero
code (e.g. after exhausting the configured number of restarts), the Job can
be marked as finished and k8s does not restart the pod again?
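
Concretely, something like the sketch below is what I have in mind (names
and image are placeholders), applying the restartPolicy and backoffLimit
settings you suggested:

apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager
spec:
  # Do not create replacement pods after a failure.
  backoffLimit: 0
  template:
    spec:
      # Do not restart the container in place either.
      restartPolicy: Never
      containers:
        - name: jobmanager
          # Same custom image with app.jar under /flink/lib.
          image: my-flink-job:1.8.2
          args: ["job-cluster"]

If I understand the docs correctly, a non-zero exit would then mark the Job
as Failed and a zero exit as Complete, and in neither case would k8s start
a replacement pod.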

Thanks a lot!
Eleanore

On Tue, Aug 4, 2020 at 2:49 AM Yang Wang <danrtsey...@gmail.com> wrote:

> @Till Rohrmann <trohrm...@apache.org> In native mode, when a Flink
> application terminates in the FAILED state, all of its resources are
> cleaned up.
>
> However, in standalone mode, I agree with you that we need to rethink the
> exit code of Flink. When a job exhausts its restart strategy, we should
> terminate the pod and not restart it again. After googling, it seems that
> we cannot specify the restartPolicy based on the exit code [1]. So maybe
> we need to return a zero exit code to avoid the restart by K8s.
>
> [1].
> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
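>
> In the meantime, a possible workaround on the user side could be to wrap
> the container entrypoint so that the "restart strategy exhausted" exit
> code is mapped to zero. A rough sketch (the entrypoint path and the 1443
> value for ApplicationStatus.FAILED are assumptions, please verify them
> against your Flink version):
>
> containers:
>   - name: jobmanager
>     image: my-flink-job:1.8.2   # placeholder image name
>     command: ["/bin/sh", "-c"]
>     args:
>       - |
>         # Run the normal Flink entrypoint and capture its exit code.
>         /docker-entrypoint.sh job-cluster
>         code=$?
>         # Treat the application-level FAILED code (assumed to be 1443 from
>         # ApplicationStatus.FAILED) as a clean exit, so K8s sees a zero
>         # exit code and does not restart the pod.
>         if [ "$code" -eq 1443 ]; then exit 0; fi
>         exit "$code"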
>
> Best,
> Yang
>
> On Tue, Aug 4, 2020 at 3:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> @Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the
>> exit codes of Flink. In general you want K8s to restart a Flink process
>> that crashed. An application which terminates in state FAILED, however,
>> has reached a valid termination state and should therefore not return a
>> non-zero exit code.
>>
>> Cheers,
>> Till
>>
>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>
>>> Hi Eleanore,
>>>
>>> I think you are using the K8s resource "Job" to deploy the jobmanager.
>>> Please set .spec.template.spec.restartPolicy = "Never" and
>>> .spec.backoffLimit = 0.
>>> Refer here [1] for more information.
>>>
>>> Then, when the jobmanager fails for any reason, the K8s Job will be
>>> marked as failed, and K8s will not restart it again.
>>>
>>> [1].
>>> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> On Tue, Aug 4, 2020 at 12:05 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>
>>>> Hi Till,
>>>>
>>>> Thanks for the reply!
>>>>
>>>> I deploy manually in per-job mode [1] and I am using Flink 1.8.2.
>>>> Specifically, I build a custom docker image into which I copied the app
>>>> jar (not an uber jar) and all its dependencies under /flink/lib.
>>>>
>>>> So my question is: in this case, if the job is marked as FAILED, k8s
>>>> will restart the pod, which does not seem to help at all. What are the
>>>> suggestions for such a scenario?
>>>>
>>>> Thanks a lot!
>>>> Eleanore
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>>>
>>>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Eleanore,
>>>>>
>>>>> how are you deploying Flink exactly? Are you using the application
>>>>> mode with native K8s support to deploy a cluster [1] or are you manually
>>>>> deploying a per-job mode [2]?
>>>>>
>>>>> I believe the problem might be that we terminate the Flink process
>>>>> with a non-zero exit code if the job reaches ApplicationStatus.FAILED
>>>>> [3].
>>>>>
>>>>> cc Yang Wang: have you observed similar behavior when running Flink
>>>>> in per-job mode on K8s?
>>>>>
>>>>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>>>> [2]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>>>> [3]
>>>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>>>
>>>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Experts,
>>>>>>
>>>>>> I have a Flink cluster (per-job mode) running on Kubernetes. The job
>>>>>> is configured with the following restart strategy:
>>>>>>
>>>>>> restart-strategy.fixed-delay.attempts: 3
>>>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>>>
>>>>>>
>>>>>> So after 3 retries, the job will be marked as FAILED and the pods
>>>>>> stop running. However, Kubernetes will then restart the job again, as
>>>>>> the number of available replicas does not match the desired one.
>>>>>>
>>>>>> I wonder what are the suggestions for such a scenario? How should I
>>>>>> configure the flink job running on k8s?
>>>>>>
>>>>>> Thanks a lot!
>>>>>> Eleanore
>>>>>>
>>>>>
