We have a single JobManager in an HA setup. From the logs and metrics, it
appears that before the issue occurred there was a long (15s) GC pause on
the JobManager, which then triggered a leader election. Because there is
only one JobManager, the same one became leader again
after it
Hi Micah,
The problem does indeed look similar to FLINK-10255. Could you tell me your
exact cluster setup (HA with standby JobManagers?). Moreover, the logs of
all JobManagers at DEBUG level would be helpful for further debugging.
Cheers,
Till
On Tue, Dec 11, 2018 at 10:09 AM Stefan Richter wrote:
Hi,
Thanks for reporting the problem. I think the exception trace does indeed look
very similar to the traces in the discussion for FLINK-10184. I will pull in
Till, who worked on the fix, to hear his opinion. Maybe the current fix only
made the problem less likely to appear but is not yet complete?