Hi Gerard, There is an issue recently fixed for 1.5.2, 1.6.0: https://issues.apache.org/jira/browse/FLINK-9575 <https://issues.apache.org/jira/browse/FLINK-9575> It might have caused your problem.
Can you please provide log from JobManager/Entry point for further investigation? Cheers, Andrey > On 18 Jul 2018, at 10:16, Gerard Garcia <ger...@talaia.io> wrote: > > Hi vino, > > Seems that jobs id stay in /jobgraphs when we cancel them manually. For > example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 > the entry is still in zookeeper's path /flink/default/jobgraphs, but the job > disappeared from /home/nas/flink/ha/default/blob/. > > That is the client log: > > 09:20:58.492 [main] INFO org.apache.flink.client.cli.CliFrontend - > Cancelling job 75e16686cb4fe0d33ead8e29af131d09. > 09:20:58.503 [main] INFO org.apache.flink.runtime.blob.FileSystemBlobStore > - Creating highly available BLOB storage directory at > file:///home/nas/flink/ha//default/blob > 09:20:58.505 [main] INFO org.apache.flink.runtime.util.ZooKeeperUtils - > Enforcing default ACL for ZK connections > 09:20:58.505 [main] INFO org.apache.flink.runtime.util.ZooKeeperUtils - > Using '/flink-eur/default' as Zookeeper namespace. > 09:20:58.539 [main] INFO > o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl - Starting > 09:20:58.543 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:zookeeper.version= > 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT > 09:20:58.543 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:host.name <http://host.name/>=flink-eur-production1 > 09:20:58.543 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.version=1.8.0_131 > 09:20:58.544 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.vendor=Oracle Corporation > 09:20:58.546 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.home=/opt/jdk/jdk1.8.0_131/jre > 09:20:58.546 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar::: > 09:20:58.546 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib > 09:20:58.546 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.io.tmpdir=/tmp > 09:20:58.546 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:java.compiler=<NA> > 09:20:58.547 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:os.name <http://os.name/>=Linux > 09:20:58.547 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:os.arch=amd64 > 09:20:58.547 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:os.version=4.9.87-xxxx-std-ipv6-64 > 09:20:58.547 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.name <http://user.name/>=root > 09:20:58.547 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.home=/root > 09:20:58.547 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client > environment:user.dir=/opt/flink/flink-1.5.0/bin > 09:20:58.548 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating > client connection, connectString=10.1.1.5:2181 > <http://10.1.1.5:2181/>,10.1.1.6:2181 <http://10.1.1.6:2181/>,10.1.1.7:2181 > <http://10.1.1.7:2181/> sessionTimeout=60000 > watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe > 09:20:58.555 [main-SendThread(10.1.1.5:2181 <http://10.1.1.5:2181/>)] WARN > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL > configuration failed: javax.security.auth.login.LoginException: No JAAS > configuration section named 'Client' was found in specified JAAS > configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue > connection to Zookeeper server without SASL authentication, if Zookeeper > server allows it. > 09:20:58.556 [main-SendThread(10.1.1.5:2181 <http://10.1.1.5:2181/>)] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server 10.1.1.5/10.1.1.5:2181 <http://10.1.1.5/10.1.1.5:2181> > 09:20:58.556 [main-EventThread] ERROR > o.a.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication > failed > 09:20:58.569 [main-SendThread(10.1.1.5:2181 <http://10.1.1.5:2181/>)] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket > connection established to 10.1.1.5/10.1.1.5:2181 > <http://10.1.1.5/10.1.1.5:2181>, initiating session > 09:20:58.592 [main-SendThread(10.1.1.5:2181 <http://10.1.1.5:2181/>)] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session > establishment complete on server 10.1.1.5/10.1.1.5:2181 > <http://10.1.1.5/10.1.1.5:2181>, sessionid = 0x100571bda1903b7, negotiated > timeout = 40000 > 09:20:58.593 [main-EventThread] INFO > o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager - State change: > CONNECTED > 09:20:58.711 [main] INFO org.apache.flink.runtime.rest.RestClient - Rest > client endpoint started. > 09:20:58.713 [main] INFO > o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting > ZooKeeperLeaderRetrievalService /leader/rest_server_lock. > 09:20:58.755 [main] INFO > o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting > ZooKeeperLeaderRetrievalService /leader/dispatcher_lock. > 09:20:58.946 [main] INFO org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 09:20:58.946 [main] INFO org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 09:20:58.947 [main] INFO > o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping > ZooKeeperLeaderRetrievalService /leader/rest_server_lock. > 09:20:58.948 [main] INFO > o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping > ZooKeeperLeaderRetrievalService /leader/dispatcher_lock. > 09:20:58.949 [Curator-Framework-0] INFO > o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl - > backgroundOperationsLoop exiting > 09:20:58.968 [main] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: > 0x100571bda1903b7 closed > 09:20:58.968 [main] INFO org.apache.flink.client.cli.CliFrontend - > Cancelled job 75e16686cb4fe0d33ead8e29af131d09. > 09:20:58.969 [main-EventThread] INFO > o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread > shut down for session: 0x100571bda1903b7 > > I'm assuming that in /jobgraphs there should only be the job ids that are > currently running (at least it seemed that when the jobmanager restarted it > tried to restart the jobs ids stored there). Is that correct? > > Gerard > > On Wed, Jul 18, 2018 at 9:17 AM vino yang <yanghua1...@gmail.com > <mailto:yanghua1...@gmail.com>> wrote: > Hi Gerard, > > From you provide information, you mean the path in Zookeeper "/jobgraphs" > exists more jobs than you submitted? > And can not be restarted because blob files can not be find? > > Can you provide more details, about the stack trace, log and which version of > Flink? Normally, the jobgraph can not be added to Zookeeper except submit job > manually. > > Thanks, vino. > > 2018-07-16 21:19 GMT+08:00 gerardg <ger...@talaia.io > <mailto:ger...@talaia.io>>: > Hi, > > Our deployment consists of a standalone HA cluster of 8 machines with an > external Zookeeper cluster. We have observed several times that when a > jobmanager fails and a new one is elected, the new one tries to restart > more jobs than the ones that were running and since it can't find some > files, it fails and gets stuck in a restart loop. That is the error that we > see in the logs: > > > > These are the contents of /home/nas/flink/ha/default/blob/: > > > > We've checked zookeeper and there are actually a lot of jobgraphs in > /flink/default/jobgraphs > > > > There were only three jobs running so neither zookeeper nor the flink 'ha' > folder seems to have the correct number of jobgraphs stored. > > The only way we have to solve this is to remove everything at path /flink in > zookeeper and the 'ha' flink folder and restart the jobs manually. > > I'll try to monitor if some action (e.g. we have been canceling and > restoring jobs from savepoints quite often lately) leaves an entry in > zookeepers path /flink/default/jobgraphs of a job that is not running but > maybe someone can't point us to some configuration problem that could cause > this behavior. > > Thanks, > > Gerard > > > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ > <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> >