Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

Andrey Zagrebin Wed, 18 Jul 2018 07:25:00 -0700

Hi Gerard,

An issue that might have caused your problem was recently fixed for 1.5.2 and 1.6.0:
https://issues.apache.org/jira/browse/FLINK-9575


Can you please provide the log from the JobManager/entry point for further 
investigation?
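
In the meantime, here is a minimal sketch (plain Python, not part of Flink) of how one could check whether /jobgraphs holds ids of jobs that are no longer active. The function name stale_job_graphs is hypothetical. It assumes you have already listed the children of the jobgraphs path with a ZooKeeper client of your choice, and that the REST endpoint /jobs returns JSON shaped like {"jobs": [{"id": ..., "status": ...}]}; please verify both against your setup before relying on it.

```python
import json

# Statuses treated as "no longer active" (assumption for this sketch).
TERMINAL_STATUSES = {"FINISHED", "CANCELED", "FAILED"}


def stale_job_graphs(zk_children, jobs_overview_json):
    """Return job ids found under /jobgraphs that are not active on the cluster.

    zk_children: iterable of child node names under the jobgraphs path.
    jobs_overview_json: JSON text assumed to look like
        {"jobs": [{"id": "...", "status": "RUNNING"}, ...]}
    """
    overview = json.loads(jobs_overview_json)
    active = {
        job["id"]
        for job in overview.get("jobs", [])
        if job.get("status") not in TERMINAL_STATUSES
    }
    # Anything stored in ZooKeeper but not active is a candidate stale entry.
    return sorted(set(zk_children) - active)
```

Any id it returns is only a candidate for a leftover entry; it does not prove that the entry is what breaks recovery.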

Cheers,
Andrey

> On 18 Jul 2018, at 10:16, Gerard Garcia <ger...@talaia.io> wrote:
> 
> Hi vino,
> 
> It seems that job ids stay in /jobgraphs when we cancel jobs manually. For 
> example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 
> the entry is still in zookeeper's path /flink/default/jobgraphs, but the job 
> disappeared from /home/nas/flink/ha/default/blob/.
> 
> This is the client log:
> 
> 09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - 
> Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
> 09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  
> - Creating highly available BLOB storage directory at 
> file:///home/nas/flink/ha//default/blob
> 09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - 
> Enforcing default ACL for ZK connections
> 09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - 
> Using '/flink-eur/default' as Zookeeper namespace.
> 09:20:58.539 [main] INFO  
> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
> 09:20:58.543 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:zookeeper.version=
> 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
> 09:20:58.543 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:host.name=flink-eur-production1
> 09:20:58.543 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:java.version=1.8.0_131
> 09:20:58.544 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:java.vendor=Oracle Corporation
> 09:20:58.546 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:java.home=/opt/jdk/jdk1.8.0_131/jre
> 09:20:58.546 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
> 09:20:58.546 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 09:20:58.546 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:java.io.tmpdir=/tmp
> 09:20:58.546 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:java.compiler=<NA>
> 09:20:58.547 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:os.name=Linux
> 09:20:58.547 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:os.arch=amd64
> 09:20:58.547 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:os.version=4.9.87-xxxx-std-ipv6-64
> 09:20:58.547 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:user.name=root
> 09:20:58.547 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:user.home=/root
> 09:20:58.547 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client 
> environment:user.dir=/opt/flink/flink-1.5.0/bin
> 09:20:58.548 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
> client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 
> sessionTimeout=60000 
> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
> 09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
> configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue 
> connection to Zookeeper server without SASL authentication, if Zookeeper 
> server allows it.
> 09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket 
> connection to server 10.1.1.5/10.1.1.5:2181
> 09:20:58.556 [main-EventThread] ERROR 
> o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication 
> failed
> 09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
> connection established to 10.1.1.5/10.1.1.5:2181, initiating session
> 09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
> establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 
> 0x100571bda1903b7, negotiated timeout = 40000
> 09:20:58.593 [main-EventThread] INFO  
> o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: 
> CONNECTED
> 09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest 
> client endpoint started.
> 09:20:58.713 [main] INFO  
> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting 
> ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
> 09:20:58.755 [main] INFO  
> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting 
> ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - 
> Shutting down rest endpoint.
> 09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest 
> endpoint shutdown complete.
> 09:20:58.947 [main] INFO  
> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping 
> ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
> 09:20:58.948 [main] INFO  
> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping 
> ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 09:20:58.949 [Curator-Framework-0] INFO  
> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - 
> backgroundOperationsLoop exiting
> 09:20:58.968 [main] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 
> 0x100571bda1903b7 closed
> 09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - 
> Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
> 09:20:58.969 [main-EventThread] INFO  
> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread 
> shut down for session: 0x100571bda1903b7
> 
> I'm assuming that /jobgraphs should contain only the ids of jobs that are 
> currently running (at least, it seemed that when the jobmanager restarted it 
> tried to restart the job ids stored there). Is that correct?
> 
> Gerard 
> 
> On Wed, Jul 18, 2018 at 9:17 AM vino yang <yanghua1...@gmail.com> wrote:
> Hi Gerard,
> 
> From the information you provided, do you mean that the Zookeeper path 
> "/jobgraphs" contains more jobs than you submitted, and that they cannot be 
> restarted because the blob files cannot be found?
> 
> Can you provide more details: the stack trace, the log, and which version of 
> Flink you are running? Normally, a jobgraph is not added to Zookeeper unless 
> a job is submitted manually. 
> 
> Thanks, vino.
> 
> 2018-07-16 21:19 GMT+08:00 gerardg <ger...@talaia.io>:
> Hi,
> 
> Our deployment consists of a standalone HA cluster of 8 machines with an
> external Zookeeper cluster. We have observed several times that when a
> jobmanager fails and a new one is elected, the new one tries to restart 
> more jobs than the ones that were running and, since it can't find some
> files, it fails and gets stuck in a restart loop. This is the error that we
> see in the logs:
> 
> 
> 
> These are the contents of /home/nas/flink/ha/default/blob/:
> 
> 
> 
> We've checked zookeeper and there are actually a lot of jobgraphs in
> /flink/default/jobgraphs
> 
> 
> 
> There were only three jobs running so neither zookeeper nor the flink 'ha'
> folder seems to have the correct number of jobgraphs stored.
> 
> The only way we have to solve this is to remove everything at path /flink in
> zookeeper and the 'ha' flink folder and restart the jobs manually.
> 
> I'll try to monitor whether some action (e.g. we have been canceling and
> restoring jobs from savepoints quite often lately) leaves an entry in
> zookeeper's path /flink/default/jobgraphs for a job that is not running, but
> maybe someone can point us to some configuration problem that could cause
> this behavior.
> 
> Thanks,
> 
> Gerard
> 
> 
> 
> 
> 
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
