Job stuck in CREATED state with scheduling failures

2023-01-21 Thread Gyula Fóra
Hi Devs! We noticed a very strange failure scenario a few times recently with the Native Kubernetes integration. The issue is triggered by a heartbeat timeout (a temporary network problem). We observe the following behaviour: === 3 pods (1 JM, 2 TMs), Flink 1.15 (

Re: Kubernetes JobManager and TaskManager minimum/maximum resources

2023-01-21 Thread Gyula Fóra
But of course the actual memory requirement will largely depend on the type of job, state backend, number of task slots, etc. Production TMs/JMs usually have far more resources allocated than 2 GB/1 CPU, as you never want to run out :) Gyula On Sat, 21 Jan 2023 at 11:17, Gyula Fóra wrote: > H

Re: Kubernetes JobManager and TaskManager minimum/maximum resources

2023-01-21 Thread Gyula Fóra
Hi! I think the examples allocate too many resources by default and we should reduce them in the YAMLs. 1 GB of memory and 0.5 CPU should be more than enough; we could probably get away with even less for example purposes. Would you have time to try this out and maybe contribute this improvement? :

Re: DuplicateJobSubmissionException on restart after taskmanagers crash

2023-01-21 Thread Gyula Fóra
Hi Javier, I will try to look into this, as I have not personally seen this problem while using the operator. It would be great if you could reach out to me on Slack or by email directly so we can discuss the issue and get to the bottom of it. Cheers, Gyula On Fri, 20 Jan 2023 at 23:53, Javier Vegas