The root cause might be that your APIServer is overloaded or not running
normally. As a result, the pod events for
taskmanager-1-9 and taskmanager-1-10 were not delivered to the watch in
FlinkResourceManager.
So the two TaskManagers were not recognized by the ResourceManager and
their registrations were rejected.

The ResourceManager also did not receive the terminated pod events. That's
why it does not allocate new TaskManager pods.
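
To illustrate the reasoning above, here is a toy model in Python. It is not
Flink's actual code: the class ToyResourceManager and all of its method names
are made up for this sketch, and it only assumes that a pod counts as known
once the watch has delivered its "added" event.

```python
class ToyResourceManager:
    """Toy model only. Hypothetical names, not Flink's ResourceManager."""

    def __init__(self, desired):
        self.desired = desired      # number of TaskManagers the job needs
        self.pending = set()        # pods created, no "added" event seen yet
        self.recognized = set()     # pods confirmed by a watch event

    def request_pod(self, name):
        # The pod is created through the APIServer and then waits for the watch.
        self.pending.add(name)

    def on_pod_added(self, name):
        # Runs only if the watch actually delivers the "added" event.
        if name in self.pending:
            self.pending.remove(name)
            self.recognized.add(name)

    def on_pod_terminated(self, name):
        # Runs only if the watch actually delivers the "terminated" event.
        self.pending.discard(name)
        self.recognized.discard(name)

    def register(self, name):
        # A TaskManager whose pod was never reported by the watch is rejected.
        return name in self.recognized

    def missing_pods(self):
        # New pods are requested only for workers known to be gone.
        return self.desired - len(self.pending) - len(self.recognized)


rm = ToyResourceManager(desired=2)
rm.request_pod("taskmanager-1-9")
rm.request_pod("taskmanager-1-10")

# No watch events arrive: registration is rejected, and the manager still
# believes both pods are on their way, so it requests no replacements.
print(rm.register("taskmanager-1-9"), rm.missing_pods())
```

In this model, as long as neither the "added" nor the "terminated" events
arrive, register() keeps returning False and missing_pods() stays at 0, which
matches the stuck state described in the timeline.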

All in all, I believe you need to check the K8s APIServer status.
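
One possible starting point, as a sketch only: the APIServer exposes a
readiness endpoint, and `kubectl get --raw='/readyz?verbose'` prints one line
per check, prefixed with [+] for passing and [-] for failing checks. The
helper below (failed_checks is a name made up here, and the sample text is
illustrative, not taken from your cluster) lists the failing checks from such
output. Readiness is only one signal; an overloaded but "ready" APIServer
would show up in its latency and throttling metrics instead.

```python
from typing import List


def failed_checks(verbose_output: str) -> List[str]:
    """Return the names of checks marked [-] in /readyz?verbose output."""
    failed = []
    for line in verbose_output.splitlines():
        line = line.strip()
        if line.startswith("[-]"):
            # Assumed line shape: "[-]<check-name> failed: <reason>"
            failed.append(line[3:].split(" ", 1)[0])
    return failed


# Illustrative sample, not real output from the cluster in this thread.
sample = """\
[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
readyz check failed
"""

print(failed_checks(sample))
```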

Best,
Yang

Zheng, Chenyu <chenyu.zh...@disneystreaming.com> wrote on Fri, Apr 22, 2022 at 12:54:

> Hi developers!
>
>
>
> I got a strange bug during failure recovery of Flink. It seems the
> JobManager doesn't bring up new TaskManagers during failure recovery. Some
> logs and information of the Flink job are pasted below. Can you take a look
> and give me some guidance? Thank you so much!
>
>
>
> Flink version: 1.13.2
>
> Deploy mode: K8s native
>
> Timeline of the bug:
>
>    1. The Flink job started running with 8 TaskManagers.
>    2. At *2022-04-17 00:28:15,286*, this job got an error and JobManager
>    decided to restart 2 tasks (pod
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1,
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7)
>    3. The two old pods were stopped and the JobManager created 2 new pods (pods
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9,
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at *2022-04-17
>    00:33:15,376*
>    4. The JobManager discarded the two new pods’ registrations at *2022-04-17
>    00:33:32,393*
>    5. These new pods exited at *2022-04-17 00:33:32,396*, due to the
>    rejection of registration.
>    6. The JobManager didn’t bring up new pods and printed the error “Slot request
>    bulk is not fulfillable! Could not allocate the required slot within slot
>    request timeout” over and over
>
> Flink logs:
>
> 1.      JobManager:
> https://drive.google.com/file/d/1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ/view?usp=sharing
>
> 2.      TaskManager:
> https://drive.google.com/file/d/1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn/view?usp=sharing
>
>
>
>
>
> BRs,
>
> Chenyu
>
