[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631859#comment-17631859 ]

Sriram Ganesh commented on FLINK-29609:
---

I made changes in the reconciliation logic and testing, but you can pick this up.

> Clean up jobmanager deployment on suspend after recording savepoint info
>
> Key: FLINK-29609
> URL: https://issues.apache.org/jira/browse/FLINK-29609
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Gyula Fora
> Assignee: Sriram Ganesh
> Priority: Major
> Fix For: kubernetes-operator-1.3.0
>
> Currently, when suspending with a savepoint, the jobmanager pod will linger
> forever after the job is cancelled.
> This is currently used to ensure consistency in case the
> operator/cancel-with-savepoint operation fails.
> Once we are sure, however, that the savepoint has been recorded and the job is
> shut down, we should clean up all the resources. Optionally, we can make this
> configurable.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631822#comment-17631822 ]

Gyula Fora commented on FLINK-29609:

[~sriramgr] I would like to take over this ticket unless you have made some progress. This has become time critical for us :)
[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626372#comment-17626372 ]

Hector Miuler Malpica Gallegos commented on FLINK-29609:

[~sriramgr] In my opinion, this should only happen in application mode; in session mode the jobmanager should continue to exist, waiting for a new job.
[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626053#comment-17626053 ]

Sriram Ganesh commented on FLINK-29609:
---

Please don't reassign this to others for some time unless it is urgent. I am spending time on this to find a better solution.
[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625561#comment-17625561 ]

Sriram Ganesh commented on FLINK-29609:
---

I found the place where we are not removing the JM pod:
https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L350

But we can't simply remove the JM pod as things stand, because pod upgrades and rollbacks would also be affected. Could we add a condition so that the pod is removed only when no further action is pending? Any better suggestions will be appreciated. Thanks in advance.
[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624703#comment-17624703 ]

Gyula Fora commented on FLINK-29609:

This feature is only enabled for Application clusters. The current behaviour is fully intentional: it allows job manager access after job completion, shutdown, or failure so that the operator can better track what's going on.

The improvement we are looking for here is that once the operator has actually recorded the final state in the CR status, we can shut down the resources, as we don't need to check anymore.
[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624665#comment-17624665 ]

Sriram Ganesh commented on FLINK-29609:
---

Sure, [~Miuler]. I have started exploring the issue. In case someone can help me with this information: is this issue happening in both application and session mode?
[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624585#comment-17624585 ]

Hector Miuler Malpica Gallegos commented on FLINK-29609:

Please take into account stateless batch processes, which should clean up all their resources once they have finished processing.
[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info
[ https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623554#comment-17623554 ]

Sriram Ganesh commented on FLINK-29609:
---

[~gyfora] - Please assign this to me.