Forgot to mention: the Flink version is 1.9.2.

On May 19, 2020 at 6:22 PM, Fritz Budiyanto <[email protected]> wrote:


Hi All,


I have seen this issue several times: JobGraphs are not cleaned up 
properly. As a result, when the Flink cluster is restarted, it attempts an 
HA restore from a checkpoint that no longer exists, and the restarted 
cluster eventually gives up and stays down.

The workaround is to clean up the JobGraph manually from ZooKeeper. Is this a 
known issue?


2020-05-19 19:56:21,471 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FINISHED to JobManager for task Source: kafkaConsumer[update_server] -> (DetectedUpdateMessageConverter -> Sink: update_server.detected_updates, DrivenCoordinatesMessageConverter -> Sink: update_server.driven_coordinates) 588902a8096f49845b09fa1f595d6065.
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId: 29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job 86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
2020-05-19 19:56:21,623 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.


...

ZooKeeper CLI:


ls /flink/cluster_update/jobgraphs
[86a028b3f7aada8ffe59859ca71d6385]
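
For reference, the manual cleanup can be sketched roughly as follows. This is 
only an illustration, not an official procedure: the helper name 
cleanup_commands is made up, it only prints ZooKeeper CLI commands rather than 
running them, and it assumes the HA root and cluster id seen in the listing 
above (/flink and cluster_update). The optional checkpoint paths assume Flink's 
default HA sub-paths (/checkpoints, /checkpoint-counter); removing them goes 
beyond the jobgraph cleanup described here, so verify with ls first.

```python
# Hypothetical helper: builds ZooKeeper CLI commands that remove a stale
# job's HA znodes. It does not talk to ZooKeeper; paste the output into zkCli.


def cleanup_commands(root, cluster_id, job_id,
                     delete_cmd="deleteall", include_checkpoints=False):
    """Return the zkCli delete commands for one job.

    delete_cmd: "deleteall" on ZooKeeper 3.5+ CLIs, "rmr" on older ones.
    include_checkpoints: also target the checkpoint znodes (assumes the
    default sub-paths "checkpoints" and "checkpoint-counter").
    """
    # Normalise slashes so "/flink/" and "flink" give the same base path.
    base = "/" + "/".join(part.strip("/") for part in (root, cluster_id))
    subpaths = ["jobgraphs"]
    if include_checkpoints:
        subpaths += ["checkpoints", "checkpoint-counter"]
    return [f"{delete_cmd} {base}/{sub}/{job_id}" for sub in subpaths]


if __name__ == "__main__":
    job = "86a028b3f7aada8ffe59859ca71d6385"
    for cmd in cleanup_commands("/flink", "cluster_update", job):
        print(cmd)
```

With the values from the listing, this prints a single deleteall command for 
/flink/cluster_update/jobgraphs/86a028b3f7aada8ffe59859ca71d6385.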

Thanks,
Fritz
