Forgot to mention: the Flink version is 1.9.2.
On May 19, 2020 at 6:22 PM, Fritz Budiyanto <[email protected]> wrote:
Hi All,
I have seen this issue several times: JobGraphs are not cleaned up properly.
As a result, when the Flink cluster is restarted, it attempts an HA restore
from a checkpoint that no longer exists, and the restarted cluster eventually
gives up and stays down.
The workaround is to clean up the job graph manually from ZooKeeper. Is this a
known issue?
2020-05-19 19:56:21,471 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor -
Un-registering task and sending final execution state FINISHED to JobManager for task
Source: kafkaConsumer[update_server] -> (DetectedUpdateMessageConverter -> Sink:
update_server.detected_updates, DrivenCoordinatesMessageConverter -> Sink:
update_server.driven_coordinates) 588902a8096f49845b09fa1f595d6065.
2020-05-19 19:56:21,622 INFO
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot
TaskSlot(index:0, state:ACTIVE, resource profile:
ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647,
directMemoryInMB=2147483647, nativeMemoryInMB=2147483647,
networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId:
29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
2020-05-19 19:56:21,622 INFO
org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job
86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
2020-05-19 19:56:21,622 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
Stopping ZooKeeperLeaderRetrievalService
/leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
2020-05-19 19:56:21,623 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
- Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
- Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO
org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to
job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.
...
ZooKeeper CLI, showing the job graph still registered after the job finished:
ls /flink/cluster_update/jobgraphs
[86a028b3f7aada8ffe59859ca71d6385]
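The manual cleanup mentioned above could be scripted roughly as follows. This is only a sketch under assumptions: the helper name find_stale_jobgraphs is hypothetical, the client is assumed to expose get_children(path) and delete(path, recursive=True) as kazoo's KazooClient does, and the caller has to supply the list of job IDs that should still exist. By default it only reports stale entries and deletes nothing.

```python
def find_stale_jobgraphs(zk, jobgraphs_path, live_job_ids, delete=False):
    """Return job IDs found under jobgraphs_path that are not in live_job_ids.

    zk is any client object with get_children(path) and
    delete(path, recursive=True), for example kazoo's KazooClient
    (an assumption; any ZooKeeper client with equivalent calls would do).
    Nodes are removed only when delete=True.
    """
    registered = zk.get_children(jobgraphs_path)
    stale = sorted(set(registered) - set(live_job_ids))
    if delete:
        for job_id in stale:
            # Remove the leftover job graph node and anything below it.
            zk.delete(f"{jobgraphs_path}/{job_id}", recursive=True)
    return stale


# Possible usage with kazoo (the host address is a placeholder):
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="localhost:2181")
#   zk.start()
#   print(find_stale_jobgraphs(zk, "/flink/cluster_update/jobgraphs", []))
#   zk.stop()
```

Running it first without delete=True gives a chance to compare the reported IDs against the jobs that are actually expected to be recovered before anything is removed.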
Thanks,
Fritz