[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-11-10 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631859#comment-17631859
 ] 

Sriram Ganesh commented on FLINK-29609:
---

I made changes in the reconciliation logic and testing, but you can pick this 
up.

> Clean up jobmanager deployment on suspend after recording savepoint info
> 
>
> Key: FLINK-29609
> URL: https://issues.apache.org/jira/browse/FLINK-29609
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Gyula Fora
> Assignee: Sriram Ganesh
> Priority: Major
> Fix For: kubernetes-operator-1.3.0
>
>
> Currently, when suspending with a savepoint, the jobmanager pod lingers 
> forever after the job is cancelled.
> This is currently used to ensure consistency in case the 
> operator/cancel-with-savepoint operation fails.
> However, once we are sure that the savepoint has been recorded and the job 
> is shut down, we should clean up all the resources. Optionally, we can make 
> this configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-11-10 Thread Gyula Fora (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631822#comment-17631822
 ] 

Gyula Fora commented on FLINK-29609:


[~sriramgr] I would like to take over this ticket unless you have made some 
progress.
This has become time-critical for us :) 



[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-10-30 Thread Hector Miuler Malpica Gallegos (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626372#comment-17626372
 ] 

Hector Miuler Malpica Gallegos commented on FLINK-29609:


[~sriramgr] In my opinion, this should only happen in application mode; in 
session mode the jobmanager should continue to exist, waiting for a new job.



[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-10-29 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626053#comment-17626053
 ] 

Sriram Ganesh commented on FLINK-29609:
---

Please don't reassign this to anyone else for some time unless it is urgent. I 
am spending time on this to find a better solution. 



[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-10-28 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625561#comment-17625561
 ] 

Sriram Ganesh commented on FLINK-29609:
---

I found the place where we are not removing the JM pod:
https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L350

But we can't simply remove the JM pod as it is, because pod upgrades and 
rollbacks would also be affected.

Can we add a condition, such as removing the pod only once no further action 
is pending? Any better suggestions would be appreciated. Thanks in advance.



[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-10-26 Thread Gyula Fora (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624703#comment-17624703
 ] 

Gyula Fora commented on FLINK-29609:


This feature is only enabled for Application clusters. The current behaviour is 
fully intentional: it allows job manager access after job completion, shutdown, 
or failure so that the operator can better track what's going on.

The improvement we are looking for here is that once the operator has actually 
recorded the final state in the CR status, we could shut down the resources, 
as we no longer need to check.
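
To make the condition above concrete, here is a minimal Java sketch of the decision, not the operator's actual code. The class name, the Mode enum, the method, and the cleanupEnabled flag are all hypothetical; the flag stands in for the "optionally configurable" part of the issue description.

```java
/** Hypothetical sketch; none of these names come from the operator code base. */
final class SuspendCleanupSketch {

    /** The two deployment modes discussed in this thread. */
    enum Mode { APPLICATION, SESSION }

    /**
     * Returns true only when it would be safe to delete the jobmanager deployment:
     * cleanup is enabled (assumed config flag), the cluster is an Application
     * cluster, the job has been shut down, and the final state (including the
     * savepoint) has already been recorded in the CR status.
     */
    static boolean shouldDeleteJobManager(
            Mode mode,
            boolean jobShutDown,
            boolean finalStateRecordedInStatus,
            boolean cleanupEnabled) {
        if (!cleanupEnabled) {
            return false; // user opted out, keep the current lingering behaviour
        }
        if (mode != Mode.APPLICATION) {
            return false; // a session cluster keeps running and waits for new jobs
        }
        // Until both facts hold, the jobmanager is still needed for tracking.
        return jobShutDown && finalStateRecordedInStatus;
    }

    public static void main(String[] args) {
        // Suspended application job whose savepoint is already in the status.
        System.out.println(shouldDeleteJobManager(Mode.APPLICATION, true, true, true));
        // Same job, but the status has not been updated yet.
        System.out.println(shouldDeleteJobManager(Mode.APPLICATION, true, false, true));
    }
}
```

Keeping the check as a pure function of the recorded state means the jobmanager stays available for as long as the operator still needs to query it.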



[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-10-26 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624665#comment-17624665
 ] 

Sriram Ganesh commented on FLINK-29609:
---

Sure, [~Miuler]. I have started exploring the issue. Could someone help me 
with one piece of information: does this issue happen in both application and 
session mode?



[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-10-26 Thread Hector Miuler Malpica Gallegos (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624585#comment-17624585
 ] 

Hector Miuler Malpica Gallegos commented on FLINK-29609:


Please take into account stateless batch jobs, which should clean up all 
resources once they have finished processing.



[jira] [Commented] (FLINK-29609) Clean up jobmanager deployment on suspend after recording savepoint info

2022-10-24 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623554#comment-17623554
 ] 

Sriram Ganesh commented on FLINK-29609:
---

[~gyfora] - Please assign this to me.
