Hi Till,

I can't post the full logs (they contain internal info), but I've found
this. Is that what you are looking for?

11:29:17.351 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
--------------------------------------------------------------------------------
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Starting Command Line Client (Version: 1.5-SNAPSHOT, Rev:a4fc4c6,
Date:05.06.2018 @ 10:22:30 CEST)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  OS
current user: (...)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Current Hadoop/Kerberos user: <no hadoop dependency found>
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM:
Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Maximum heap size: 14254 MiBytes
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
JAVA_HOME: (not set)
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  No
Hadoop Dependency available
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM
Options:
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
 -Dlog.file=/opt/flink/flink-1.5.0/log/flink-root-client-(...).log
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
 -Dlog4j.configuration=file:/opt/flink/flink-1.5.0/conf/log4j-cli.properties
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
 -Dlogback.configurationFile=file:/opt/flink/flink-1.5.0/conf/logback.xml
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Program Arguments:
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
 cancel
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
 e403893e5208ca47ace886a77e405291
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Classpath:
/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.375 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
--------------------------------------------------------------------------------
11:29:17.380 [main] WARN  org.apache.flink.client.cli.CliFrontend  - Could
not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at
org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1204)
        at
org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1160)
        at
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1086)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 common frames omitted
11:29:17.385 [main] INFO  org.apache.flink.core.fs.FileSystem  - Hadoop is
not in the classpath/dependencies. The extended set of supported File
Systems via Hadoop is not available.
11:29:17.479 [main] INFO
o.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot
create Hadoop Security Module because Hadoop cannot be found in the
Classpath.
11:29:17.489 [main] INFO  org.apache.flink.runtime.security.SecurityUtils
- Cannot install HadoopSecurityContext because Hadoop cannot be found in
the Classpath.
11:29:17.518 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Running 'cancel' command.
11:29:17.523 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Cancelling job e403893e5208ca47ace886a77e405291.
11:29:17.537 [main] INFO
org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly
available BLOB storage directory at file:///home/nas/flink/ha//default/blob
11:29:17.538 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  -
Enforcing default ACL for ZK connections
11:29:17.539 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  -
Using '/flink/default' as Zookeeper namespace.
11:29:17.574 [main] INFO
o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
11:29:17.577 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f,
built on 03/23/2017 10:13 GMT
11:29:17.577 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:host.name=(...)
11:29:17.578 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.version=1.8.0_131
11:29:17.578 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.vendor=Oracle Corporation
11:29:17.579 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.home=/opt/jdk/jdk1.8.0_131/jre
11:29:17.580 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.580 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
11:29:17.580 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.io.tmpdir=/tmp
11:29:17.580 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.compiler=<NA>
11:29:17.580 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.name=Linux
11:29:17.580 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.arch=amd64
11:29:17.580 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.version=4.9.87-xxxx-std-ipv6-64
11:29:17.581 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.name=(...)
11:29:17.581 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.home=(...)
11:29:17.581 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.dir=/opt/flink/flink-1.5.0/bin
11:29:17.581 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating
client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181
sessionTimeout=60000
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
11:29:17.589 [main-SendThread(10.1.1.5:2181)] WARN
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-3807415919448894740.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
11:29:17.590 [main-SendThread(10.1.1.5:2181)] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening
socket connection to server 10.1.1.5/10.1.1.5:2181
11:29:17.590 [main-EventThread] ERROR
o.a.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
11:29:17.603 [main-SendThread(10.1.1.5:2181)] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
connection established to 10.1.1.5/10.1.1.5:2181, initiating session
11:29:17.625 [main-SendThread(10.1.1.5:2181)] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session
establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid =
0x100571bda1903c3, negotiated timeout = 40000
11:29:17.626 [main-EventThread] INFO
o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State
change: CONNECTED
11:29:17.764 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest
client endpoint started.
11:29:17.766 [main] INFO
o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:17.812 [main] INFO
o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.007 [main] INFO  org.apache.flink.runtime.rest.RestClient  -
Shutting down rest endpoint.
11:29:18.008 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest
endpoint shutdown complete.
11:29:18.008 [main] INFO
o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:18.009 [main] INFO
o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.010 [Curator-Framework-0] INFO
o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  -
backgroundOperationsLoop exiting
11:29:18.030 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session:
0x100571bda1903c3 closed
11:29:18.030 [main-EventThread] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread
shut down for session: 0x100571bda1903c3
11:29:18.030 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Cancelled job e403893e5208ca47ace886a77e405291.
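
In case it helps to reproduce the check, this is roughly what I do by hand
after each cancel: compare the children of /flink/default/jobgraphs with the
jobs that are actually running. A minimal Python sketch, assuming both ID
lists are pasted in manually (stale_job_graphs is a hypothetical helper and
the two lists are sample input for illustration, not a full dump of our
cluster):

```python
def stale_job_graphs(zk_job_ids, running_job_ids):
    """Return job IDs that have a job graph node but are not running."""
    running = set(running_job_ids)
    return sorted(job_id for job_id in zk_job_ids if job_id not in running)


# Sample input: IDs as listed by `ls /flink/default/jobgraphs` in zkCli.
zk_job_ids = [
    "e403893e5208ca47ace886a77e405291",  # the job cancelled in the log above
    "5d8c376b10d358b9c9470b3e70113626",  # assumed to be a job that is still running
]
# Sample input: IDs of the jobs the cluster reports as running.
running_job_ids = ["5d8c376b10d358b9c9470b3e70113626"]

# Anything printed here is an entry that a new jobmanager would try to recover.
print(stale_job_graphs(zk_job_ids, running_job_ids))
# → ['e403893e5208ca47ace886a77e405291']
```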

Gerard

On Fri, Jul 20, 2018 at 5:14 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Till,
>
> You are right, we also saw the problem you described. Curator removes the
> specific job graph path asynchronously. But that path is the only thing used
> when recovering, right? Is there any plan to improve this?
>
> Thanks, vino.
>
> 2018-07-19 21:58 GMT+08:00 Till Rohrmann <trohrm...@apache.org>:
>
>> Hi Gerard,
>>
>> the logging statement `Removed job graph ... from ZooKeeper` is actually
>> not 100% accurate. The actual deletion is executed as an asynchronous
>> background task and the log statement is not printed in the callback (as it
>> should be). Therefore, the deletion could still have failed. In order to
>> see this, the full jobmanager/cluster entry point logs would be
>> tremendously helpful.
>>
>> Cheers,
>> Till
>>
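
If I read this correctly, the difference is between logging right after the
delete has been scheduled and logging from its completion callback. A rough
Python sketch of the latter, not Flink's actual code (remove_job_graph,
failing_delete and the error message are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor


def remove_job_graph(executor, delete_node, job_id, log):
    """Schedule delete_node(job_id) in the background and log its real outcome."""
    future = executor.submit(delete_node, job_id)

    def on_done(done):
        # Runs only once the background delete has finished, so the
        # message reflects what actually happened.
        error = done.exception()
        if error is None:
            log.append(f"Removed job graph {job_id} from ZooKeeper.")
        else:
            log.append(f"Could not remove job graph {job_id}: {error}")

    future.add_done_callback(on_done)
    # Logging "Removed ..." at this point instead would happen before the
    # outcome of the delete is known.
    return future


def failing_delete(job_id):
    # Hypothetical stand-in for a background delete that fails.
    raise RuntimeError("node not empty")


log = []
with ThreadPoolExecutor(max_workers=1) as executor:
    remove_job_graph(executor, failing_delete, "e403893e5208ca47ace886a77e405291", log)
# Leaving the `with` block waits for the worker, so the callback has run by now.
print(log)
```

With the message logged from the callback, a failed delete shows up as a
failure in the log instead of as "Removed job graph ...".
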
>> On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <ger...@talaia.io> wrote:
>>
>>> Thanks Andrey,
>>>
>>> This is the log from the jobmanager just after it has finished
>>> cancelling the task:
>>>
>>> 11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping
>>> checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
>>> 11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO
>>> o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
>>> 11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO
>>> o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing
>>> /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from
>>> ZooKeeper
>>> 11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO
>>> o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
>>> 11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO
>>> o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing
>>> /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
>>> 11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job
>>> e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
>>> 11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for
>>> job (...)(e403893e5208ca47ace886a77e405291).
>>> 11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO
>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>>> ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager
>>> connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO
>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO
>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO
>>> o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect
>>> job manager 
>>> 9cf221e2340597629fb932c03aa14...@akka.tcp://flink@(...):33827/user/jobmanager_9
>>> for job e403893e5208ca47ace886a77e405291 from the resource manager.
>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO
>>> o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping
>>> ZooKeeperLeaderElectionService
>>> ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
>>> 11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
>>> checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in
>>> 411 ms).
>>> 11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO
>>> o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job
>>> graph e403893e5208ca47ace886a77e405291 from ZooKeeper.
>>>
>>>
>>> At the end it says it removed job graph e403893e5208ca47ace886a77e405291
>>> from ZooKeeper, but I can still see it at /flink/default/jobgraphs:
>>>
>>> [zk: localhost:2181(CONNECTED) 14] ls
>>> /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
>>> [3fe9c3c8-5bec-404e-a720-75f9b188124f,
>>> 36208299-0f6d-462c-bae4-2e3d53f50e8c]
>>>
>>> Gerard
>>>
>>> On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <
>>> and...@data-artisans.com> wrote:
>>>
>>>> Hi Gerard,
>>>>
>>>> There is an issue recently fixed for 1.5.2, 1.6.0:
>>>> https://issues.apache.org/jira/browse/FLINK-9575
>>>> It might have caused your problem.
>>>>
>>>> Can you please provide the log from the JobManager/entry point for further
>>>> investigation?
>>>>
>>>> Cheers,
>>>> Andrey
>>>>
>>>> On 18 Jul 2018, at 10:16, Gerard Garcia <ger...@talaia.io> wrote:
>>>>
>>>> Hi vino,
>>>>
>>>> It seems that job IDs stay in /jobgraphs when we cancel jobs manually. For
>>>> example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09
>>>> the entry is still in ZooKeeper's path /flink/default/jobgraphs, but the
>>>> job disappeared from /home/nas/flink/ha/default/blob/.
>>>>
>>>> This is the client log:
>>>>
>>>> 09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>> Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
>>>> 09:20:58.503 [main] INFO
>>>> org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly
>>>> available BLOB storage directory at
>>>> file:///home/nas/flink/ha//default/blob
>>>> 09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils
>>>> - Enforcing default ACL for ZK connections
>>>> 09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils
>>>> - Using '/flink-eur/default' as Zookeeper namespace.
>>>> 09:20:58.539 [main] INFO
>>>> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
>>>> 09:20:58.543 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:zookeeper.version=
>>>> 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017
>>>> 10:13 GMT
>>>> 09:20:58.543 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:host.name=flink-eur-production1
>>>> 09:20:58.543 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:java.version=1.8.0_131
>>>> 09:20:58.544 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:java.vendor=Oracle Corporation
>>>> 09:20:58.546 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:java.home=/opt/jdk/jdk1.8.0_131/jre
>>>> 09:20:58.546 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
>>>> 09:20:58.546 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>>>> 09:20:58.546 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:java.io.tmpdir=/tmp
>>>> 09:20:58.546 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:java.compiler=<NA>
>>>> 09:20:58.547 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:os.name=Linux
>>>> 09:20:58.547 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:os.arch=amd64
>>>> 09:20:58.547 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:os.version=4.9.87-xxxx-std-ipv6-64
>>>> 09:20:58.547 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:user.name=root
>>>> 09:20:58.547 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:user.home=/root
>>>> 09:20:58.547 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>> environment:user.dir=/opt/flink/flink-1.5.0/bin
>>>> 09:20:58.548 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating
>>>> client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,
>>>> 10.1.1.7:2181 sessionTimeout=60000
>>>> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
>>>> 09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>>>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>>>> configuration section named 'Client' was found in specified JAAS
>>>> configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue
>>>> connection to Zookeeper server without SASL authentication, if Zookeeper
>>>> server allows it.
>>>> 09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening
>>>> socket connection to server 10.1.1.5/10.1.1.5:2181
>>>> 09:20:58.556 [main-EventThread] ERROR 
>>>> o.a.flink.shaded.curator.org.apache.curator.ConnectionState
>>>> - Authentication failed
>>>> 09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
>>>> connection established to 10.1.1.5/10.1.1.5:2181, initiating session
>>>> 09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session
>>>> establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid =
>>>> 0x100571bda1903b7, negotiated timeout = 40000
>>>> 09:20:58.593 [main-EventThread] INFO
>>>> o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State
>>>> change: CONNECTED
>>>> 09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  -
>>>> Rest client endpoint started.
>>>> 09:20:58.713 [main] INFO
>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
>>>> ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
>>>> 09:20:58.755 [main] INFO
>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
>>>> ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>>>> 09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  -
>>>> Shutting down rest endpoint.
>>>> 09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  -
>>>> Rest endpoint shutdown complete.
>>>> 09:20:58.947 [main] INFO
>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>>>> ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
>>>> 09:20:58.948 [main] INFO
>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>>>> ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>>>> 09:20:58.949 [Curator-Framework-0] INFO
>>>> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  -
>>>> backgroundOperationsLoop exiting
>>>> 09:20:58.968 [main] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session:
>>>> 0x100571bda1903b7 closed
>>>> 09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>> Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
>>>> 09:20:58.969 [main-EventThread] INFO
>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread
>>>> shut down for session: 0x100571bda1903b7
>>>>
>>>> I'm assuming that /jobgraphs should only contain the IDs of jobs that
>>>> are currently running (at least it seemed that when the jobmanager
>>>> restarted it tried to restart the job IDs stored there). Is that correct?
>>>>
>>>> Gerard
>>>>
>>>> On Wed, Jul 18, 2018 at 9:17 AM vino yang <yanghua1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Gerard,
>>>>>
>>>>> From the information you provided, do you mean that the ZooKeeper path
>>>>> "/jobgraphs" contains more jobs than you submitted, and that they
>>>>> cannot be restarted because their blob files cannot be found?
>>>>>
>>>>> Can you provide more details: the stack trace, the log, and which
>>>>> version of Flink you are using? Normally, a job graph is only added to
>>>>> ZooKeeper when a job is submitted manually.
>>>>>
>>>>> Thanks, vino.
>>>>>
>>>>> 2018-07-16 21:19 GMT+08:00 gerardg <ger...@talaia.io>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Our deployment consists of a standalone HA cluster of 8 machines with
>>>>>> an external ZooKeeper cluster. We have observed several times that
>>>>>> when a jobmanager fails and a new one is elected, the new one tries to
>>>>>> restart more jobs than the ones that were running and, since it can't
>>>>>> find some files, it fails and gets stuck in a restart loop. This is the
>>>>>> error that we see in the logs:
>>>>>>
>>>>>>
>>>>>>
>>>>>> These are the contents of /home/nas/flink/ha/default/blob/:
>>>>>>
>>>>>>
>>>>>>
>>>>>> We've checked zookeeper and there are actually a lot of jobgraphs in
>>>>>> /flink/default/jobgraphs
>>>>>>
>>>>>>
>>>>>>
>>>>>> There were only three jobs running, so neither ZooKeeper nor the Flink
>>>>>> 'ha' folder seems to have the correct number of jobgraphs stored.
>>>>>>
>>>>>> The only way we have to solve this is to remove everything at path
>>>>>> /flink in ZooKeeper and the Flink 'ha' folder and restart the jobs
>>>>>> manually.
>>>>>>
>>>>>> I'll try to monitor whether some action (e.g. we have been canceling
>>>>>> and restoring jobs from savepoints quite often lately) leaves an entry
>>>>>> in ZooKeeper's path /flink/default/jobgraphs for a job that is not
>>>>>> running, but maybe someone can point us to some configuration problem
>>>>>> that could cause this behavior.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Gerard
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>
