Re: After job cancel, leftover ZK state prevents job manager startup

2018-12-12 Thread Micah Wylde
We have a single JobManager in an HA setup. From the logs and metrics, it appears that before the issue occurred there was a long (15s) GC pause on the JobManager, which then caused a leadership election. Because there is only one JobManager, the same one became leader again after it

Re: After job cancel, leftover ZK state prevents job manager startup

2018-12-11 Thread Till Rohrmann
Hi Micah, the problem does indeed look similar to FLINK-10255. Could you tell me your exact cluster setup (HA with standby JobManagers?). Moreover, the logs of all JobManagers on DEBUG level would be helpful for further debugging. Cheers, Till On Tue, Dec 11, 2018 at 10:09 AM Stefan Richter

Re: After job cancel, leftover ZK state prevents job manager startup

2018-12-11 Thread Stefan Richter
Hi, thanks for reporting the problem. I think the exception trace does indeed look very similar to the traces in the discussion for FLINK-10184. I will pull in Till, who worked on the fix, to hear his opinion. Maybe the current fix only made the problem less likely to appear but is not yet complete?