[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17270260#comment-17270260 ]

Till Rohrmann commented on FLINK-21008:
---------------------------------------

Wouldn't it work if {{ClusterEntrypoint.closeAsync()}} calls

{code}
shutDownAsync(
    ApplicationStatus.UNKNOWN,
    "Cluster entrypoint has been closed externally.",
    false)
{code}

meaning that a SIGTERM won't clean up the HA data? If the cluster entrypoint wants to shut down on its own (e.g. because the job of a per-job cluster has finished), it calls {{shutDownAsync(.., .., true)}}, which will clean up the HA data.

> ClusterEntrypoint#shutDownAsync may not be fully executed
> ---------------------------------------------------------
>
>                 Key: FLINK-21008
>                 URL: https://issues.apache.org/jira/browse/FLINK-21008
>             Project: Flink
>          Issue Type: Bug
>      Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.1
>            Reporter: Yang Wang
>            Assignee: Yang Wang
>            Priority: Critical
>             Fix For: 1.13.0
>
> Recently, in our internal use case for the native K8s integration with K8s HA enabled, we found that the leader-related ConfigMaps could be left behind in some corner situations.
> After some investigation, I think this is possibly caused by an inappropriate shutdown process.
> In {{ClusterEntrypoint#shutDownAsync}}, we first call {{closeClusterComponent}}, which also deregisters the Flink application from the cluster management system (e.g. Yarn, K8s). Then we call {{stopClusterServices}} and {{cleanupDirectories}}. If the cluster management system processes the deregistration very quickly, the JobManager process may receive SIGNAL 15 (SIGTERM) before or while executing {{stopClusterServices}} and {{cleanupDirectories}}. The JVM process will then exit directly, so the two methods may not be executed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
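The split proposed above can be sketched as follows. This is a hypothetical model, not Flink's actual implementation: the class name and the boolean fields are made up purely to illustrate the effect of the {{cleanupHaData}} flag.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical model of the proposal (not Flink's actual code):
// an externally triggered close keeps HA data, an internal shutdown removes it.
class EntrypointSketch {
    boolean haDataCleaned = false;
    boolean servicesStopped = false;

    // Called on SIGTERM / external close: keep HA data so a restarted
    // JobManager can still recover leadership and job state.
    CompletableFuture<Void> closeAsync() {
        return shutDownAsync(false);
    }

    // Called when the cluster is done for good (e.g. the per-job cluster's
    // job has finished): HA data is no longer needed and can be removed.
    CompletableFuture<Void> shutDownInternally() {
        return shutDownAsync(true);
    }

    CompletableFuture<Void> shutDownAsync(boolean cleanupHaData) {
        servicesStopped = true; // stands in for stopClusterServices()
        if (cleanupHaData) {
            haDataCleaned = true; // stands in for haServices.closeAndCleanupAllData()
        }
        return CompletableFuture.completedFuture(null);
    }
}
```

With this shape, a SIGTERM path only ever reaches {{closeAsync()}}, so HA data survives an external kill, while the cluster's own terminal states go through the cleanup path.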
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17270224#comment-17270224 ]

Yang Wang commented on FLINK-21008:
-----------------------------------

After more consideration, I think it might be wrong to call {{ClusterEntrypoint.closeAsync()}} when receiving the SIGTERM. Imagine the JobManager entrypoint is running normally and we send a SIGTERM: the HA-related data would then be cleaned up, which is not right. The problem only occurs when we call {{shutDownAsync}} with {{cleanupHaData = true}} but the SIGTERM is received before {{haServices.closeAndCleanupAllData()}} has been executed. In that situation we are left with residual HA-related ConfigMaps and ZooKeeper nodes. So maybe using a shutdown hook to run {{haServices.closeAndCleanupAllData()}} could solve the problem; then it would not matter whether {{ClusterEntrypoint#shutDownAsync}} is fully executed.
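The shutdown-hook idea could look roughly like this. A minimal sketch under assumptions: the class, the flags, and the hook body are hypothetical stand-ins (the real hook would invoke {{haServices.closeAndCleanupAllData()}}). The relevant JVM property is that shutdown hooks also run when the process exits because of a SIGTERM, which is exactly the case that currently skips the cleanup.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a JVM shutdown hook that performs HA cleanup only when
// it was explicitly requested. A SIGTERM that interrupts a shutdown with
// cleanupHaData = true can then no longer skip the cleanup, while a SIGTERM
// against a healthy JobManager leaves the HA data alone.
class HaCleanupHookSketch {
    final AtomicBoolean cleanupRequested = new AtomicBoolean(false);
    final AtomicBoolean haDataCleaned = new AtomicBoolean(false);

    Thread registerHook() {
        Thread hook = new Thread(() -> {
            if (cleanupRequested.get()) {
                // Stands in for haServices.closeAndCleanupAllData().
                haDataCleaned.set(true);
            }
        });
        // Shutdown hooks run on normal exit and on SIGTERM, but not on SIGKILL.
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

The entrypoint would set {{cleanupRequested}} at the start of a shutdown that is supposed to remove HA data; the hook then guarantees the cleanup happens regardless of where the SIGTERM lands.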
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268502#comment-17268502 ]

Yang Wang commented on FLINK-21008:
-----------------------------------

Yes. I will try to get this done by letting {{SignalHandler}} trigger {{ClusterEntrypoint.closeAsync()}} on SIGTERM.
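That approach might be sketched as below. This is an illustrative sketch, not the eventual Flink change: it assumes {{sun.misc.Signal}} is available (Flink's own {{SignalHandler}} utility builds on it), and the class name and {{closed}} flag are hypothetical; a real handler would also pick an exit code and call {{System.exit()}} afterwards.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

import sun.misc.Signal;
import sun.misc.SignalHandler;

// Hypothetical sketch: replace the default SIGTERM behavior (immediate JVM
// exit) with a handler that triggers the orderly shutdown and waits for it.
class TermHandlerSketch {
    final AtomicBoolean closed = new AtomicBoolean(false);

    CompletableFuture<Void> closeAsync() {
        closed.set(true); // stands in for shutting down the cluster components
        return CompletableFuture.completedFuture(null);
    }

    SignalHandler installTermHandler() {
        SignalHandler handler = signal -> {
            // Block until the orderly shutdown has finished, so the process
            // is not allowed to die mid-cleanup.
            closeAsync().join();
        };
        Signal.handle(new Signal("TERM"), handler);
        return handler;
    }
}
```

Waiting on the future inside the handler is what closes the race: the JVM cannot proceed with its default termination while the shutdown is still in flight.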
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268453#comment-17268453 ]

Till Rohrmann commented on FLINK-21008:
---------------------------------------

Do you wanna take a stab at the problem, [~fly_in_gis]?
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268349#comment-17268349 ]

Yang Wang commented on FLINK-21008:
-----------------------------------

Triggering {{ClusterEntrypoint.closeAsync()}} on SIGTERM is the better solution, since ignoring the SIGTERM could cause other issues. For example, it would make it impossible to delete a pod gracefully.
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268028#comment-17268028 ]

Till Rohrmann commented on FLINK-21008:
---------------------------------------

I think I would be in favour of triggering {{ClusterEntrypoint.closeAsync()}} when we see a SIGTERM and then waiting for its completion. That way we also properly cover the case where someone sends us a SIGTERM while the system wasn't shutting down. What do you think?
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267983#comment-17267983 ]

Yang Wang commented on FLINK-21008:
-----------------------------------

We could implement an empty {{SignalHandler}} and register it in the {{KubernetesApplicationClusterEntrypoint}} and {{KubernetesSessionClusterEntrypoint}}. The same could be done for the Yarn-related entrypoints.
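A no-op handler along these lines would do it. A sketch under the assumption that {{sun.misc.Signal}} is used, as in Flink's existing {{SignalHandler}}; the class name is hypothetical. Installing any user handler for TERM replaces the JVM's default action, which is to terminate the process.

```java
import sun.misc.Signal;

// Hypothetical sketch: an "empty" SIGTERM handler. Once installed, a SIGTERM
// no longer kills the JVM, so the process keeps running until the entrypoint
// itself calls System.exit() at the end of its own shutdown.
final class IgnoreTermSketch {
    static void install() {
        Signal.handle(new Signal("TERM"), signal -> {
            // Deliberately empty: shutdown stays under the entrypoint's control.
        });
    }
}
```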
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267854#comment-17267854 ]

Till Rohrmann commented on FLINK-21008:
---------------------------------------

Is it possible to ignore the SIGTERM signal in the JVM?
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267825#comment-17267825 ]

Yang Wang commented on FLINK-21008:
-----------------------------------

Actually, maybe we do not need to respond to the SIGTERM in Yarn/K8s deployments, since the cluster entrypoint will call {{System.exit()}} eventually.
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267821#comment-17267821 ]

Till Rohrmann commented on FLINK-21008:
---------------------------------------

I see. Then an alternative solution would be to signal the external system to shut down only after the whole Flink clean-up has been done. The problem here is that the communication logic with the external system is encapsulated in the {{ResourceManager}}, which at this point has already been shut down.
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267804#comment-17267804 ]

Yang Wang commented on FLINK-21008:
-----------------------------------

You are right. Deregistering the application from K8s (i.e. deleting the JobManager deployment) causes the kubelet to send a SIGTERM to the JobManager process. Yarn has the same behavior, though. The reason we do not run into this issue when deploying a Flink application on Yarn is that the SIGTERM is sent a little later: the Yarn ResourceManager tells the NodeManager to kill the JobManager (SIGTERM followed by SIGKILL) via the heartbeat, whose interval is 3 seconds by default. On Kubernetes, however, the kubelet is informed via a watch, so there is no delay. If the cluster entrypoint takes more than 3 seconds for the internal clean-up ({{stopClusterServices}} and {{cleanupDirectories}}), we would run into the same situation on a Yarn deployment.
[jira] [Commented] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed
[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267793#comment-17267793 ]

Till Rohrmann commented on FLINK-21008:
---------------------------------------

Is the problem that deregistering the application from K8s will trigger K8s to send a SIGTERM to the JobManager process? I guess this behaves a bit differently from Yarn then and needs to be changed. Is there a way to let the process terminate properly while still deleting the K8s resource (e.g. the deployment)? Maybe we need to register a shutdown hook which waits for the {{ClusterEntrypoint}} to complete its shutdown.