I see. Thanks for sharing the logs. This is related to FLINK-9097 [1]: to
avoid the job being cleaned up entirely after a failure during job
submission, the JobManager fails fatally, resulting in a failover. That's
what you're experiencing.

One solution is to fix the permission issue so that the job can recover
without problems. If that's not what you want to do, you can delete the
entry with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
JobGraphStore ConfigMap (based on your logs it should
be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-cluster-config-map). This will
prevent the JobManager from recovering this specific job. Keep in mind that
you will have to clean up any job-related data yourself in that case.
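
As a minimal sketch with kubectl, assuming you have access to the
namespace the cluster runs in (the <your-namespace> placeholder is an
assumption; the ConfigMap name and key are taken from your logs):

  # list the keys currently stored in the HA ConfigMap
  kubectl get configmap flink-972ac3d8028e45fcafa9b8b7b7f1dafb-cluster-config-map \
    -n <your-namespace> -o jsonpath='{.data}'

  # remove the jobGraph entry with a JSON patch
  kubectl patch configmap flink-972ac3d8028e45fcafa9b8b7b7f1dafb-cluster-config-map \
    -n <your-namespace> --type=json \
    -p='[{"op":"remove","path":"/data/jobGraph-04ae99777ee2ed34c13fe8120e68436e"}]'

Alternatively, running 'kubectl edit' on the same ConfigMap and deleting
that entry manually should work as well.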

I hope that helps.
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-9097

On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> I got some logs and stack traces from our backend storage. This is not the
> entire log, though. Can this be useful? With this set of log messages the
> JobManager kept restarting.
>
> Regards
> Ram
>
> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <
> ramvasu.fl...@gmail.com> wrote:
>
>> Thank you very much for the reply. I lost the k8s cluster in this case
>> before I could capture the logs. I will try to repro this and get back
>> to you.
>>
>> Regards
>> Ram
>>
>> On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl <matthias.p...@aiven.io>
>> wrote:
>>
>>> Hi Ramkrishna,
>>> thanks for reaching out to the Flink community. Could you share the
>>> JobManager logs to get a better understanding of what's going on? I'm
>>> wondering why the JobManager is failing when the actual problem is that the
>>> job is struggling to access a folder. It sounds like there are multiple
>>> problems here.
>>>
>>> Best,
>>> Matthias
>>>
>>> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
>>> ramvasu.fl...@gmail.com> wrote:
>>>
>>>> Hi all
>>>>
>>>> I have a simple job that reads from a given path in cloud storage to
>>>> watch for new files in a given folder. While I was setting up the job,
>>>> there was a permission issue on the folder. The job is a STREAMING job.
>>>> The cluster is set up in session mode and is running on Kubernetes.
>>>> The JobManager has since been failing to come back up, and every time it
>>>> fails with the permission issue. The question is how I should recover my
>>>> cluster in this case. Since the JM is not up, the UI is not working
>>>> either, so how do I remove the bad job from the JM?
>>>>
>>>> Regards
>>>> Ram
>>>>
>>>
