Fritz Budiyanto created FLINK-17853:
---------------------------------------

             Summary: JobGraph is not getting deleted after Job cancelation
                 Key: FLINK-17853
                 URL: https://issues.apache.org/jira/browse/FLINK-17853
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.9.2
         Environment: Flink 1.9.2, Zookeeper from AWS MSK
            Reporter: Fritz Budiyanto
         Attachments: flinkissue.txt

I have seen this issue several times: JobGraphs are not cleaned up properly 
after a job is deleted. Job deletion is performed using the "flink stop" 
command. As a result, the JobGraph node lingers in ZK. When the Flink cluster 
is restarted, it attempts HA restoration from a non-existent checkpoint, which 
prevents the Flink cluster from coming up.




2020-05-19 19:56:21,471 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor 
- Un-registering task and sending final execution state FINISHED to JobManager 
for task Source: kafkaConsumer[update_server] -> 
(DetectedUpdateMessageConverter -> Sink: update_server.detected_updates, 
DrivenCoordinatesMessageConverter -> Sink: update_server.driven_coordinates) 
588902a8096f49845b09fa1f595d6065.
2020-05-19 19:56:21,622 INFO 
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot 
TaskSlot(index:0, state:ACTIVE, resource profile: 
ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, 
directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, 
networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId: 
29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
2020-05-19 19:56:21,622 INFO 
org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job 
86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
2020-05-19 19:56:21,622 INFO 
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - 
Stopping ZooKeeperLeaderRetrievalService 
/leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
2020-05-19 19:56:21,623 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor 
- Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor 
- Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO 
org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to 
job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.


...
Zookeeper CLI:

ls /flink/cluster_update/jobgraphs
[86a028b3f7aada8ffe59859ca71d6385]
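
The same check can be scripted. The following is only a minimal sketch, not 
part of Flink: it lists the children of the jobgraphs node and reports IDs 
that are not in a set of jobs known to be running. The function name 
lingering_job_graphs and the FakeZk class are hypothetical and exist only for 
illustration. The sketch assumes a client object exposing 
get_children(path), as kazoo's KazooClient does, and it only reads; it never 
deletes anything.

```python
from typing import Dict, Iterable, List, Protocol


class ChildLister(Protocol):
    """Any client that can list znode children (kazoo's KazooClient does)."""

    def get_children(self, path: str) -> List[str]: ...


def lingering_job_graphs(
    zk: ChildLister, jobgraphs_path: str, live_job_ids: Iterable[str]
) -> List[str]:
    """Return job IDs that still have a job graph znode but are not live.

    Hypothetical helper: read-only, it only compares two sets of IDs.
    """
    live = set(live_job_ids)
    return sorted(j for j in zk.get_children(jobgraphs_path) if j not in live)


class FakeZk:
    """In-memory stand-in for a ZooKeeper client, for illustration only."""

    def __init__(self, tree: Dict[str, List[str]]) -> None:
        self._tree = tree

    def get_children(self, path: str) -> List[str]:
        return list(self._tree.get(path, []))


if __name__ == "__main__":
    # Path and job ID are the ones from this report; no job is running.
    path = "/flink/cluster_update/jobgraphs"
    zk = FakeZk({path: ["86a028b3f7aada8ffe59859ca71d6385"]})
    print(lingering_job_graphs(zk, path, live_job_ids=[]))
```

Against a real cluster, FakeZk would be replaced by a connected ZooKeeper 
client, and live_job_ids would come from whatever source lists the running 
jobs.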

 

Attached are the Flink logs, in reverse order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
