Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-08-29 Thread Till Rohrmann
Thanks for the update Gerard. Fixing the resource cleanup in the case of standby Dispatchers/JobMasters has a high priority. We will hopefully fix the problem with the next bug fix release. Until then, the JobGraph entry must be removed from ZooKeeper manually. Cheers, Till On Wed, Aug 29, 2018
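A minimal sketch of the manual cleanup mentioned above, for illustration only. It assumes a ZooKeeper client object that offers `exists(path)` and `delete(path, recursive=True)`, as kazoo's `KazooClient` does. The function name `remove_job_graph_entry` is hypothetical. The root path is the one quoted elsewhere in this thread and depends on the HA configuration.

```python
import re

# Path quoted elsewhere in this thread; the real value depends on the
# cluster's HA configuration (assumption).
JOBGRAPHS_PATH = "/flink/default/jobgraphs"


def remove_job_graph_entry(zk, job_id, root=JOBGRAPHS_PATH):
    """Delete the job graph node for one job.

    `zk` is any client exposing exists(path) and delete(path, recursive=...).
    Returns True if a node was removed, False if there was nothing to remove.
    """
    # Job ids in this thread are 32 lowercase hex characters; refuse anything
    # else so a typo cannot turn into deleting an unrelated path.
    if not re.fullmatch(r"[0-9a-f]{32}", job_id):
        raise ValueError(f"not a Flink job id: {job_id!r}")
    path = f"{root}/{job_id}"
    if zk.exists(path) is None:
        return False
    # Recursive in case the node has children (assumption about the layout).
    zk.delete(path, recursive=True)
    return True
```

With kazoo this might be called as `remove_job_graph_entry(client, "75e16686cb4fe0d33ead8e29af131d09")` after `client.start()`, and only for jobs that are known to be cancelled.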

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-08-29 Thread Gerard Garcia
Hi Till, Sorry for the late reply. I was waiting to update to Flink 1.6.0 to see if the problem got fixed, but I still experience the first issue (jobgraph not deleted from ZooKeeper when a task is canceled). The second issue (taskmanagers unable to register to the newly elected jobmanager) was

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-24 Thread Till Rohrmann
Hi Gerard, the first log snippet from the client does not show anything suspicious. The warning just says that you cannot use the Yarn CLI because it lacks the Hadoop dependencies in the classpath. The second snippet is indeed more interesting. If the TaskExecutors are not notified about the

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-23 Thread Gerard Garcia
We have just started experiencing a different problem that could be related; maybe it helps to diagnose the issue. In the last 24h the jobmanager lost its connection to ZooKeeper a couple of times. Each time, a new jobmanager (on a different node) was elected leader correctly but the taskmanagers

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-23 Thread Gerard Garcia
Hi Till, I can't post the full log (as there is internal info in it) but I've found this. Is that what you are looking for? 11:29:17.351 [main] INFO org.apache.flink.client.cli.CliFrontend - 11:29:17.372 [main]

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-19 Thread vino yang
Hi Till, You are right, we also saw the problem you described. Curator removes the specific job graph path asynchronously. But it's the only gist when recovering, right? Is there any plan to enhance this point? Thanks, vino. 2018-07-19 21:58 GMT+08:00 Till Rohrmann : > Hi Gerard, > > the logging

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-19 Thread Till Rohrmann
Hi Gerard, the logging statement `Removed job graph ... from ZooKeeper` is actually not 100% accurate. The actual deletion is executed as an asynchronous background task and the log statement is not printed in the callback (which it should be). Therefore, the deletion could still have failed. In
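To illustrate the point about logging from the callback, here is a small self-contained sketch. It is not Flink or Curator code: `FakeStore` and `remove_job_graph` are hypothetical stand-ins, and a thread pool plays the role of the asynchronous background task. The success message is written only once the deletion has actually completed, and a failure is reported instead of being hidden.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeStore:
    """In-memory stand-in for the job graph store (hypothetical)."""

    def __init__(self, nodes, fail_on=()):
        self.nodes = set(nodes)
        self.fail_on = set(fail_on)

    def delete(self, job_id):
        if job_id in self.fail_on:
            raise RuntimeError(f"could not delete {job_id}")
        self.nodes.discard(job_id)


def remove_job_graph(store, executor, job_id, log):
    """Schedule the deletion and log its outcome from the completion callback."""
    future = executor.submit(store.delete, job_id)

    def on_done(done_future):
        # Runs only after the background deletion has finished or failed.
        error = done_future.exception()
        if error is None:
            log.append(f"Removed job graph {job_id} from ZooKeeper")
        else:
            log.append(f"Failed to remove job graph {job_id}: {error}")

    future.add_done_callback(on_done)
    return future
```

Logging right after `executor.submit(...)` instead would print the "Removed" line even when the deletion later fails, which is the inaccuracy described above.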

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-19 Thread Gerard Garcia
Thanks Andrey, That is the log from the jobmanager just after it has finished cancelling the task: 11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-18 Thread Andrey Zagrebin
Hi Gerard, An issue was recently fixed for 1.5.2 and 1.6.0: https://issues.apache.org/jira/browse/FLINK-9575 It might have caused your problem. Can you please provide the log from the JobManager/entry point for further investigation? Cheers, Andrey

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-18 Thread Gerard Garcia
Hi vino, It seems that job IDs stay in /jobgraphs when we cancel the jobs manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in ZooKeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/. That is the
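The mismatch described here can be checked mechanically. This is a rough sketch, assuming the job ids have already been collected from the ZooKeeper path and from the blob directory by some other means. The function name `orphaned_job_graphs` is hypothetical, and how the ids are extracted from either location is left open.

```python
def orphaned_job_graphs(zk_job_ids, blob_job_ids):
    """Return job ids that still have a job graph entry but no blob data.

    Both arguments are plain iterables of job id strings. An id that appears
    only in `zk_job_ids` is a candidate for a stale entry that a newly
    elected jobmanager would try, and fail, to recover.
    """
    return sorted(set(zk_job_ids) - set(blob_job_ids))
```

An id that shows up in the result is only a candidate; it is worth confirming that the job was really cancelled before touching its entry.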

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-18 Thread vino yang
Hi Gerard, From the information you provided, do you mean that the ZooKeeper path "/jobgraphs" contains more jobs than you submitted, and that they cannot be restarted because the blob files cannot be found? Can you provide more details, such as the stack trace, the log, and which version of Flink you are using? Normally, the jobgraph can

When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-16 Thread gerardg
Hi, Our deployment consists of a standalone HA cluster of 8 machines with an external Zookeeper cluster. We have observed several times that when a jobmanager fails and a new one is elected, the new one tries to restart more jobs than the ones that were running and since it can't find some