[jira] [Created] (FLINK-27612) Generate waring events when deleting the session cluster

2022-05-14 Thread Aitozi (Jira)
Aitozi created FLINK-27612:
--

 Summary: Generate waring events when deleting the session cluster
 Key: FLINK-27612
 URL: https://issues.apache.org/jira/browse/FLINK-27612
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Aitozi
 Fix For: kubernetes-operator-1.0.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27576) Flink will request new pod when jm pod is delete, but will remove when TaskExecutor exceeded the idle timeout

2022-05-11 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534934#comment-17534934
 ] 

Aitozi edited comment on FLINK-27576 at 5/11/22 2:37 PM:
-

Hi [~zhisheng] there is a ticket tracking this 
https://issues.apache.org/jira/browse/FLINK-24713 I will open a PR for this 
soon 


was (Author: aitozi):
Hi [~zhisheng] there is a ticket tracking this 
[https://issues.apache.org/jira/projects/FLINK/issues/FLINK-24713] I will open 
a PR for this soon 

> Flink will request new pod when jm pod is delete, but will remove when 
> TaskExecutor exceeded the idle timeout 
> --
>
> Key: FLINK-27576
> URL: https://issues.apache.org/jira/browse/FLINK-27576
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.0
>Reporter: zhisheng
>Priority: Major
> Attachments: image-2022-05-11-20-06-58-955.png, 
> image-2022-05-11-20-08-01-739.png, jobmanager_log.txt
>
>
> flink 1.12.0 enable the ha(zk) and checkpoint, when i use kubectl delete the 
> jm pod, the job will  request new jm pod failover from the last checkpoint , 
> it is ok.  But it will request new tm pod again, but not use actually, the 
> new tm pod will closed when TaskExecutor exceeded the idle timeout . actually 
> it will use the old tm, why need to request for new tm pod? whether the job 
> will fail if the cluster has no resource for the new tm?Can we optimize and 
> reuse the old tm directly?
>  
> [^jobmanager_log.txt]
> ^!image-2022-05-11-20-06-58-955.png!^
> ^!image-2022-05-11-20-08-01-739.png!^



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27576) Flink will request new pod when jm pod is delete, but will remove when TaskExecutor exceeded the idle timeout

2022-05-11 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534934#comment-17534934
 ] 

Aitozi commented on FLINK-27576:


Hi [~zhisheng] there is a ticket tracking this 
[https://issues.apache.org/jira/projects/FLINK/issues/FLINK-24713] I will open 
a PR for this soon 

> Flink will request new pod when jm pod is delete, but will remove when 
> TaskExecutor exceeded the idle timeout 
> --
>
> Key: FLINK-27576
> URL: https://issues.apache.org/jira/browse/FLINK-27576
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.12.0
>Reporter: zhisheng
>Priority: Major
> Attachments: image-2022-05-11-20-06-58-955.png, 
> image-2022-05-11-20-08-01-739.png, jobmanager_log.txt
>
>
> flink 1.12.0 enable the ha(zk) and checkpoint, when i use kubectl delete the 
> jm pod, the job will  request new jm pod failover from the last checkpoint , 
> it is ok.  But it will request new tm pod again, but not use actually, the 
> new tm pod will closed when TaskExecutor exceeded the idle timeout . actually 
> it will use the old tm, why need to request for new tm pod? whether the job 
> will fail if the cluster has no resource for the new tm?Can we optimize and 
> reuse the old tm directly?
>  
> [^jobmanager_log.txt]
> ^!image-2022-05-11-20-06-58-955.png!^
> ^!image-2022-05-11-20-08-01-739.png!^



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-11 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534696#comment-17534696
 ] 

Aitozi commented on FLINK-27483:


Yes, I have not seen you have an initial PR just now, you can continue your 
work [~Fuyao Li] (y)

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: kubernetes-operator-1.0.0
>
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-10 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534645#comment-17534645
 ] 

Aitozi commented on FLINK-27483:


+1 for this solution, I will open a PR for it this week

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (FLINK-26915) Extend the Reconciler and Observer interface

2022-05-08 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi closed FLINK-26915.
--
Resolution: Won't Fix

> Extend the Reconciler and Observer interface
> 
>
> Key: FLINK-26915
> URL: https://issues.apache.org/jira/browse/FLINK-26915
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Assignee: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> As discussed in 
> [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111],
>  I proposed make two changes to the Reconciler and Observer
>  # directly return the UpdateControl from the reconciler, because the 
> reconciler can in charge of the Update behavior, By this, we dont have to 
> infer the update control in the controller
>  # Make the params generic and extends from the ReconcilerContext and 
> ObserverContext. which will be easy for different controller to ship their 
> own objects for reconcile and observer. For example, in the FlinkSessionJob 
> case, we need to get the effective config from the FlinkDeployment first and 
> also pass the FlinkDeployment to the reconciler. 
> After the change, the reconciler will look like this:
> {code:java}
> public interface Reconciler> {     
> UpdateControl reconcile(CR cr, CTX context) throws Exception;     
> DeleteControl cleanup(CR cr, CTX ctx);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-26915) Extend the Reconciler and Observer interface

2022-05-08 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533425#comment-17533425
 ] 

Aitozi commented on FLINK-26915:


I have the same sense not to do this now, The main reason is that we have 
registered the event source for the {{{}FlinkDeployment{}}}, So the related CR 
can be obtained easily so it do not have to passed around the reconcile 
interface.

Closing it now.

> Extend the Reconciler and Observer interface
> 
>
> Key: FLINK-26915
> URL: https://issues.apache.org/jira/browse/FLINK-26915
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Assignee: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> As discussed in 
> [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111],
>  I proposed make two changes to the Reconciler and Observer
>  # directly return the UpdateControl from the reconciler, because the 
> reconciler can in charge of the Update behavior, By this, we dont have to 
> infer the update control in the controller
>  # Make the params generic and extends from the ReconcilerContext and 
> ObserverContext. which will be easy for different controller to ship their 
> own objects for reconcile and observer. For example, in the FlinkSessionJob 
> case, we need to get the effective config from the FlinkDeployment first and 
> also pass the FlinkDeployment to the reconciler. 
> After the change, the reconciler will look like this:
> {code:java}
> public interface Reconciler> {     
> UpdateControl reconcile(CR cr, CTX context) throws Exception;     
> DeleteControl cleanup(CR cr, CTX ctx);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-08 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533420#comment-17533420
 ] 

Aitozi commented on FLINK-27483:


[~wangyang0918] I check the code again I found I mess them up, also sorry for 
misleading [~Fuyao Li] :(.

All the deployments share a same (dynamically) {{operatorConfiguration. }}

Then come back to how to solve this problem, I proposal to config the http 
headers in the {{ConfigManager#defaultConfig}} And use this in the 
{{HttpArtifactFetcher}}.  Is this ok ? [~Fuyao Li] [~wangyang0918]

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27497) Track terminal job states in the observer

2022-05-08 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533416#comment-17533416
 ] 

Aitozi commented on FLINK-27497:


I think it will be very useful improvements, I would like to give a try on this

> Track terminal job states in the observer
> -
>
> Key: FLINK-27497
> URL: https://issues.apache.org/jira/browse/FLINK-27497
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.0.0
>Reporter: Gyula Fora
>Priority: Critical
>
> With the improvements in FLINK-27468 Flink 1.15 app clusters will not be shut 
> down in case of terminal job states (failed, finished) etc.
> It is important to properly handle these states and let the user know about 
> it.
> We should always trigger events, and for terminally failed jobs record the 
> error information in the FlinkDeployment status.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27497) Track terminal job states in the observer

2022-05-08 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533416#comment-17533416
 ] 

Aitozi edited comment on FLINK-27497 at 5/8/22 6:39 AM:


I think it will be a very good improvement, I would like to give a try on this


was (Author: aitozi):
I think it will be very useful improvements, I would like to give a try on this

> Track terminal job states in the observer
> -
>
> Key: FLINK-27497
> URL: https://issues.apache.org/jira/browse/FLINK-27497
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.0.0
>Reporter: Gyula Fora
>Priority: Critical
>
> With the improvements in FLINK-27468 Flink 1.15 app clusters will not be shut 
> down in case of terminal job states (failed, finished) etc.
> It is important to properly handle these states and let the user know about 
> it.
> We should always trigger events, and for terminally failed jobs record the 
> error information in the FlinkDeployment status.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-24633) JobManager pod may stuck in containerCreating status during failover

2022-05-07 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533411#comment-17533411
 ] 

Aitozi commented on FLINK-24633:


I figured out that it's caused by our internal k8s cluster behavior, Closing it 
as Not a Problem

> JobManager pod may stuck in containerCreating status during failover
> 
>
> Key: FLINK-24633
> URL: https://issues.apache.org/jira/browse/FLINK-24633
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.14.0
>Reporter: Aitozi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (FLINK-24633) JobManager pod may stuck in containerCreating status during failover

2022-05-07 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-24633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi closed FLINK-24633.
--
Resolution: Not A Problem

> JobManager pod may stuck in containerCreating status during failover
> 
>
> Key: FLINK-24633
> URL: https://issues.apache.org/jira/browse/FLINK-24633
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.14.0
>Reporter: Aitozi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-17232) Rethink the implicit behavior to use the Service externalIP as the address of the Endpoint

2022-05-07 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495844#comment-17495844
 ] 

Aitozi edited comment on FLINK-17232 at 5/8/22 4:00 AM:


[~wangyang0918] Could you help review 
[PR|https://github.com/apache/flink/pull/18762] ?  I want to move forward to 
finish the another part of this work which rely on the current PR 


was (Author: aitozi):
[~wangyang0918] Could you help review 
[PR|https://github.com/apache/flink/pull/18762] ?  I want to move forward to 
finish the another part of this work which reply on the current PR 

> Rethink the implicit behavior to use the Service externalIP as the address of 
> the Endpoint
> --
>
> Key: FLINK-17232
> URL: https://issues.apache.org/jira/browse/FLINK-17232
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Canbin Zheng
>Assignee: Aitozi
>Priority: Major
>  Labels: auto-unassigned, stale-assigned
>
> Currently, for the LB/NodePort type Service, if we found that the 
> {{LoadBalancer}} in the {{Service}} is null, we would use the externalIPs 
> configured in the external Service as the address of the Endpoint. Again, 
> this is another implicit toleration and may confuse the users.
> This ticket proposes to rethink the implicit toleration behaviour.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-05 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532585#comment-17532585
 ] 

Aitozi commented on FLINK-27483:


The difference is whether we need to support the authentication configuration 
for per job level

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-05 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532234#comment-17532234
 ] 

Aitozi commented on FLINK-27483:


I think the reconciliation interval, timeout can be set per FlinkDeployment 
now, There is a discussion here 
[https://lists.apache.org/thread/pnf2gk9dgqv3qrtszqbfcdxf32t2gr3x]
https://issues.apache.org/jira/browse/FLINK-27023

So I think the flinkConfiguration in the FlinkDeployment can be used to control 
the reconcile options.

 

The filesystem's configuration are initialize at the entrypoint of the 
Operator. To keep it simple, I think we could do the same thing for the  
HttpArtifactFetcher by directly using the config from the configManager and 
extract http headers from the config, what do you think [~wangyang0918] ? 

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-04 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531528#comment-17531528
 ] 

Aitozi commented on FLINK-27483:


The config type can be listed as below:
 * The session cluster/application's config are decided by the 
flinkDeployment's flinkConfiguration mixed with the default config.
 * The operator's reconcile configs like interval, client timeout, cancel job 
timeout's config are built from the operator config and the related 
FlinkDeployment config.

I think for the session job we can built the reconcile configs from the session 
job's {{configuration + flink deployment config + default config}}.

If the config field is added, I think we could add an option like 
{{kubernetes.operator.user.artifacts.http.header: k1:v1,k2:v2}} and the 
HttpArtifactFetcher will decorate the http request based on the header config.

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-04 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531506#comment-17531506
 ] 

Aitozi edited comment on FLINK-27483 at 5/4/22 6:29 AM:


I think supporting the custom headers is useful, but I'm not tend add a 
dedicated field {{artifactoryJarHeader}} for it, because it only works when 
jarURI is based on http, besides it's not much extendable.

I'd like to add a {{configuration}} filed in {{FlinkSessionJobSpec}} . The 
field will used to control the reconcile logic like the reconcile interval, 
delay and also can pass the headers for the http url. WDYT [~wangyang0918] 
[~Fuyao Li] ?


was (Author: aitozi):
I think the custom headers is reasonable, but I'm not tend add a dedicated 
field {{artifactoryJarHeader}} for it, because it only works when jarURI is 
based on http, besides it's not much extendable.

I'd like to add a {{configuration}} filed in {{FlinkSessionJobSpec}} . The 
field will used to control the reconcile logic like the reconcile interval, 
delay and also can pass the headers for the http url. WDYT [~wangyang0918] 
[~Fuyao Li] ?

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-04 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531506#comment-17531506
 ] 

Aitozi commented on FLINK-27483:


I think the custom headers is reasonable, but I'm not tend add a dedicated 
field {{artifactoryJarHeader}} for it, because it only works when jarURI is 
based on http, besides it's not much extendable.

I'd like to add a {{configuration}} filed in {{FlinkSessionJobSpec}} . The 
field will used to control the reconcile logic like the reconcile interval, 
delay and also can pass the headers for the http url. WDYT [~wangyang0918] ?

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch

2022-05-04 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531506#comment-17531506
 ] 

Aitozi edited comment on FLINK-27483 at 5/4/22 6:28 AM:


I think the custom headers is reasonable, but I'm not tend add a dedicated 
field {{artifactoryJarHeader}} for it, because it only works when jarURI is 
based on http, besides it's not much extendable.

I'd like to add a {{configuration}} filed in {{FlinkSessionJobSpec}} . The 
field will used to control the reconcile logic like the reconcile interval, 
delay and also can pass the headers for the http url. WDYT [~wangyang0918] 
[~Fuyao Li] ?


was (Author: aitozi):
I think the custom headers is reasonable, but I'm not tend add a dedicated 
field {{artifactoryJarHeader}} for it, because it only works when jarURI is 
based on http, besides it's not much extendable.

I'd like to add a {{configuration}} filed in {{FlinkSessionJobSpec}} . The 
field will used to control the reconcile logic like the reconcile interval, 
delay and also can pass the headers for the http url. WDYT [~wangyang0918] ?

> Support adding custom HTTP header for HTTP based Jar fetch
> --
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Fuyao Li
>Priority: Major
>
> Hello Team,
> I noticed this https://issues.apache.org/jira/browse/FLINK-27161, which could 
> enable users to specify a URL to fetch jars.
> In many cases, user might want to add some custom headers to fetch jars from 
> remote private artifactory (like Oauth tokens, api-keys, etc).
> adding a field called artifactoryJarHeader will be very helpful.
> This field can be of array type.
> [key1, value1, key2, value2]
> It seems the current ArtifactManager doesn't support specify a custom header. 
> Please correct me if I am wrong.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-25865) Support to set restart policy of TaskManager pod for native K8s integration

2022-05-04 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531499#comment-17531499
 ] 

Aitozi commented on FLINK-25865:


Hi [~wangyang0918] are you working on this now ? If not, I would like to work 
on this.

> Support to set restart policy of TaskManager pod for native K8s integration
> ---
>
> Key: FLINK-25865
> URL: https://issues.apache.org/jira/browse/FLINK-25865
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Reporter: Yang Wang
>Priority: Major
>
> After FLIP-201, Flink's TaskManagers will be able to be restarted without 
> losing its local state. So it is reasonable to make the restart policy[1] of 
> TaskManager pod could be configured.
> The current restart policy is {{{}Never{}}}. Flink will always delete the 
> failed TaskManager pod directly and create a new one instead. This ticket 
> could help to decrease the recovery time of TaskManager failure.
>  
> Please note that the working directory needs to be located in the 
> emptyDir[1], which is retained in different restarts.
>  
> [1]. 
> https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
> [2]. https://kubernetes.io/docs/concepts/storage/volumes/#emptydir



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-26647) Can not add extra config files on native Kubernetes

2022-05-03 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531498#comment-17531498
 ] 

Aitozi commented on FLINK-26647:


I think it will bring convenience to add a way to add config files to the 
configMap. Because we can not predict what config files users may need.

Meanwhile, users are responsible for the files to be shipped, they should pay 
attention to avoid to add too much files to the configMap. If it's necessary to 
guard the shipped config files not too large, we can check the size before 
create the configMap, Right? But I have not come up with a suitable size 
threshold.

 I lean to make it easy to ship config files first, and add the safe guard if 
necessary, what do you think [~wangyang0918] ?

> Can not add extra config files on native Kubernetes 
> 
>
> Key: FLINK-26647
> URL: https://issues.apache.org/jira/browse/FLINK-26647
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.5
>Reporter: Zhe Wang
>Priority: Critical
>
> When using native Kubernetes mode (both session and application), predefine 
> FLINK_CONF_DIR environment with config files in. Only two files( 
> *flink-conf.yaml and log4j-console.properties* ) are populated to configmap 
> which means missing of other config files(like sql-client-defaults.yaml, 
> zoo.cfg etc.)
> Tried these, neither worked out:
> 1) After native Kubernetes startup, change both configmap and deployment:
>     1. add all my config files to configmap.
>     2. add config file to deployment.spec.template.spec.volumes[]
>     3. Flink job pod startups fail(log: lost leadership )
>  
> 2) Using a *pod-template-file.taskmanager* file:
>     1. add config files to created confimap.
>     2. add my config files to template(others can be merged by Flink as guide 
> says)
>     3. Flink task pod startup fail, log: Duplicated volume name



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27461) Add a convenient way to set userAgent for the kubeclient

2022-04-30 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530390#comment-17530390
 ] 

Aitozi commented on FLINK-27461:


cc [~wangyang0918] WDYT ? 

> Add a convenient way to set userAgent for the kubeclient
> 
>
> Key: FLINK-27461
> URL: https://issues.apache.org/jira/browse/FLINK-27461
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Reporter: Aitozi
>Priority: Major
>
> Currently, we construct the kubeclient from the kubeconfig file or from the 
> context. However, If we have to set the user agent for the okttp client it 
> will be a little hard. 
> We use the kubernetes cluster with different team, we need to distinguish the 
> request for k8s guys to monitor the apiserver requests. We usually set the 
> user agent for different groups. The fabric client expose a way to extract 
> config from the property or enviroments. But the key for the user agent is 
> bundled with the client version, it's not so convenient. Can we support to 
> set the user agent by a dedicated flink option.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27461) Add a convenient way to set userAgent for the kubeclient

2022-04-30 Thread Aitozi (Jira)
Aitozi created FLINK-27461:
--

 Summary: Add a convenient way to set userAgent for the kubeclient
 Key: FLINK-27461
 URL: https://issues.apache.org/jira/browse/FLINK-27461
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Aitozi


Currently, we construct the kubeclient from the kubeconfig file or from the 
context. However, If we have to set the user agent for the okttp client it will 
be a little hard. 

We use the kubernetes cluster with different team, we need to distinguish the 
request for k8s guys to monitor the apiserver requests. We usually set the user 
agent for different groups. The fabric client expose a way to extract config 
from the property or enviroments. But the key for the user agent is bundled 
with the client version, it's not so convenient. Can we support to set the user 
agent by a dedicated flink option.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27337) Prevent session cluster to be deleted when there are running jobs

2022-04-30 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530384#comment-17530384
 ] 

Aitozi commented on FLINK-27337:


OK, I will implement the deletions logic first, and will think about the update 
part again. If necessary, I will open a discussion on mailing list. 

> Prevent session cluster to be deleted when there are running jobs
> -
>
> Key: FLINK-27337
> URL: https://issues.apache.org/jira/browse/FLINK-27337
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> We should prevent the session cluster to be deleted when there are running 
> jobs. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27458) Expose allowNonRestoredState flag in JobSpec

2022-04-30 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530382#comment-17530382
 ] 

Aitozi commented on FLINK-27458:


I am willing to work on this

> Expose allowNonRestoredState flag in JobSpec
> 
>
> Key: FLINK-27458
> URL: https://issues.apache.org/jira/browse/FLINK-27458
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> We should probably expose this option as a top level spec field otherwise it 
> is impossible to set this on a per job level for SessionJobs.
> What do you think [~aitozi] [~wangyang0918] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27458) Expose allowNonRestoredState flag in JobSpec

2022-04-30 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530381#comment-17530381
 ] 

Aitozi commented on FLINK-27458:


The application job can set {{execution.savepoint.ignore-unclaimed-state: 
true}} to skip the unclaimed state. But the session job can not touch the 
cluster flink conf, So I think exposing allowNonRestoredState flag in JobSpec 
is reasonable.

BTW, If this value set to true, I think we should also set the 
{{execution.savepoint.ignore-unclaimed-state}} to true for application mode, 
WDYT ? 

> Expose allowNonRestoredState flag in JobSpec
> 
>
> Key: FLINK-27458
> URL: https://issues.apache.org/jira/browse/FLINK-27458
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> We should probably expose this option as a top level spec field otherwise it 
> is impossible to set this on a per job level for SessionJobs.
> What do you think [~aitozi] [~wangyang0918] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27337) Prevent session cluster to be deleted when there are running jobs

2022-04-29 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529879#comment-17529879
 ] 

Aitozi edited comment on FLINK-27337 at 4/29/22 9:13 AM:
-

I want to solve this in the following way
 # In webhook we can prevent the deletion/upgrade if there is session jobs in 
the session cluster.
 # In SessionReconciler cleanup and reconcile we should also check whether 
there is session job and decide to upgrade/delete session cluster or postpone 
to do that until there is no session jobs. (Because user may not enable the 
webhook)
 # Make the check configurable, users can choose to manually suspend the job 
before upgrade/delete the session cluster or directly done by the operator 
(propagating the delete or upgrade)

what do you think [~wangyang0918] [~gyfora] 


was (Author: aitozi):
 

I want to solve this in the following way
 # In webhook we can prevent the deletion/upgrade if there is session jobs in 
the session cluster.
 # In SessionReconciler cleanup and reconcile we should also check whether 
there is session job and decide to upgrade/delete session cluster or postpone 
to do that until there is no session jobs. (Because user may not enable the 
webhook)
 # Make the check configurable, users can choose to manually suspend the job 
before upgrade/delete the session cluster or directly done by the operator 
(propagating the delete or upgrade)

what do you think [~wangyang0918] [~gyfora] 

> Prevent session cluster to be deleted when there are running jobs
> -
>
> Key: FLINK-27337
> URL: https://issues.apache.org/jira/browse/FLINK-27337
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> We should prevent the session cluster to be deleted when there are running 
> jobs. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27337) Prevent session cluster to be deleted when there are running jobs

2022-04-29 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529879#comment-17529879
 ] 

Aitozi commented on FLINK-27337:


 

I want to solve this in the following way
 # In webhook we can prevent the deletion/upgrade if there is session jobs in 
the session cluster.
 # In SessionReconciler cleanup and reconcile we should also check whether 
there is session job and decide to upgrade/delete session cluster or postpone 
to do that until there is no session jobs. (Because user may not enable the 
webhook)
 # Make the check configurable, users can choose to manually suspend the job 
before upgrade/delete the session cluster or directly done by the operator 
(propagating the delete or upgrade)

what do you think [~wangyang0918] [~gyfora] 

> Prevent session cluster to be deleted when there are running jobs
> -
>
> Key: FLINK-27337
> URL: https://issues.apache.org/jira/browse/FLINK-27337
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> We should prevent the session cluster to be deleted when there are running 
> jobs. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27451) Enable the validator plugin in webhook

2022-04-29 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529852#comment-17529852
 ] 

Aitozi commented on FLINK-27451:


I will give a quick fix for this.

> Enable the validator plugin in webhook
> --
>
> Key: FLINK-27451
> URL: https://issues.apache.org/jira/browse/FLINK-27451
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.0.0
>Reporter: Aitozi
>Priority: Major
>
> Currently the validator plugin is only enable in the operator. I think it 
> should also be enabled in webhook. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27451) Enable the validator plugin in webhook

2022-04-29 Thread Aitozi (Jira)
Aitozi created FLINK-27451:
--

 Summary: Enable the validator plugin in webhook
 Key: FLINK-27451
 URL: https://issues.apache.org/jira/browse/FLINK-27451
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.0.0
Reporter: Aitozi


Currently the validator plugin is only enable in the operator. I think it 
should also be enabled in webhook. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (FLINK-27262) Enrich validator for FlinkSessionJob

2022-04-27 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-27262:
---
Summary: Enrich validator for FlinkSessionJob  (was: Enrich validator for 
FlinkSesionJob)

> Enrich validator for FlinkSessionJob
> 
>
> Key: FLINK-27262
> URL: https://issues.apache.org/jira/browse/FLINK-27262
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Assignee: Aitozi
>Priority: Major
>
> We need to enrich the validator to cover FlinkSesionJob.
> At least we could have the following rules.
>  * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be 
> configured in session FlinkDeployment



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27370) Add a new SessionJobState - Failed

2022-04-26 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528217#comment-17528217
 ] 

Aitozi commented on FLINK-27370:


Get it, I think we can add the {{success}} field here as a direct indication of 
whether reconcile succeed. cc [~gyfora] do you have some inputs for this ? 

> Add a new SessionJobState - Failed
> --
>
> Key: FLINK-27370
> URL: https://issues.apache.org/jira/browse/FLINK-27370
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Xin Hao
>Priority: Minor
>
> It will be nice if we can add a new SessionJobState Failed to indicate there 
> is an error for the session job.
> {code:java}
> status:
>   error: 'The error message'
>   jobStatus:
> savepointInfo: {}
>   reconciliationStatus:
> reconciliationTimestamp: 0
> state: DEPLOYED {code}
> Reason:
> 1. It will be easier for monitoring
> 2. I have a personal controller to submit session jobs, it will be cleaner to 
> check the state by a single field and get the details by the error field.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27370) Add a new SessionJobState - Failed

2022-04-26 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528104#comment-17528104
 ] 

Aitozi commented on FLINK-27370:


This field will only reflect whether last reconcile succeed not the session job 
state. If you want to know the session job state, maybe you can refer to the 
{{status.jobStatus.state}} 

> Add a new SessionJobState - Failed
> --
>
> Key: FLINK-27370
> URL: https://issues.apache.org/jira/browse/FLINK-27370
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Xin Hao
>Priority: Minor
>
> It will be nice if we can add a new SessionJobState Failed to indicate there 
> is an error for the session job.
> {code:java}
> status:
>   error: 'The error message'
>   jobStatus:
> savepointInfo: {}
>   reconciliationStatus:
> reconciliationTimestamp: 0
> state: DEPLOYED {code}
> Reason:
> 1. It will be easier for monitoring
> 2. I have a personal controller to submit session jobs, it will be cleaner to 
> check the state by a single field and get the details by the error field.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27262) Enrich validator for FlinkSesionJob

2022-04-26 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527928#comment-17527928
 ] 

Aitozi commented on FLINK-27262:


I have one thing to discuss here: do we need to support user create a session 
job CR when flinkDeployment CR have not been created. If not, we can always 
validate both together and will make the validate work easy I think. cc 
[~wangyang0918] [~gyfora] 

> Enrich validator for FlinkSesionJob
> ---
>
> Key: FLINK-27262
> URL: https://issues.apache.org/jira/browse/FLINK-27262
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Assignee: Aitozi
>Priority: Major
>
> We need to enrich the validator to cover FlinkSesionJob.
> At least we could have the following rules.
>  * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be 
> configured in session FlinkDeployment



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27370) Add a new SessionJobState - Failed

2022-04-26 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527908#comment-17527908
 ] 

Aitozi commented on FLINK-27370:


It previously have a {{success=true/false}} field do you mean that ? 

> Add a new SessionJobState - Failed
> --
>
> Key: FLINK-27370
> URL: https://issues.apache.org/jira/browse/FLINK-27370
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Xin Hao
>Priority: Minor
>
> It will be nice if we can add a new SessionJobState Failed to indicate there 
> is an error for the session job.
> {code:java}
> status:
>   error: 'The error message'
>   jobStatus:
> savepointInfo: {}
>   reconciliationStatus:
> reconciliationTimestamp: 0
> state: DEPLOYED {code}
> Reason:
> 1. It will be easier for monitoring
> 2. I have a personal controller to submit session jobs, it will be cleaner to 
> check the state by a single field and get the details by the error field.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27397) Improve the CrdReferenceDoclet generator to handle the abstract class

2022-04-25 Thread Aitozi (Jira)
Aitozi created FLINK-27397:
--

 Summary: Improve the CrdReferenceDoclet generator to handle the 
abstract class
 Key: FLINK-27397
 URL: https://issues.apache.org/jira/browse/FLINK-27397
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.0.0
Reporter: Aitozi


As discussed 
[here|https://github.com/apache/flink-kubernetes-operator/pull/176#discussion_r856945232]
 . We should improve the abstract class handle in the {{CrdReferenceDoclet}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27358) Kubernetes operator throws NPE when testing with Flink 1.15

2022-04-24 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527147#comment-17527147
 ] 

Aitozi commented on FLINK-27358:


Do we need to link the related upstream ticket here ? 

> Kubernetes operator throws NPE when testing with Flink 1.15
> ---
>
> Key: FLINK-27358
> URL: https://issues.apache.org/jira/browse/FLINK-27358
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: kubernetes-operator-1.0.0
>
>
> {code:java}
> 2022-04-22 10:19:18,307 o.a.f.k.o.c.FlinkDeploymentController [WARN 
> ][default/flink-example-statemachine] Attempt count: 5, last attempt: true
> 2022-04-22 10:19:18,329 i.j.o.p.e.ReconciliationDispatcher 
> [ERROR][default/flink-example-statemachine] Error during event processing 
> ExecutionScope{ resource id: 
> CustomResourceID{name='flink-example-statemachine', namespace='default'}, 
> version: 4979543} failed.
> org.apache.flink.kubernetes.operator.exception.ReconciliationException: 
> java.lang.NullPointerException
>     at 
> org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:110)
>     at 
> org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:53)
>     at 
> io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:101)
>     at 
> io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:76)
>     at 
> io.javaoperatorsdk.operator.api.monitoring.Metrics.timeControllerExecution(Metrics.java:34)
>     at 
> io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:75)
>     at 
> io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:143)
>     at 
> io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:109)
>     at 
> io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:74)
>     at 
> io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:50)
>     at 
> io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:349)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.NullPointerException
>     at 
> org.apache.flink.kubernetes.operator.utils.FlinkUtils.lambda$deleteJobGraphInKubernetesHA$0(FlinkUtils.java:253)
>     at java.base/java.util.ArrayList.forEach(Unknown Source)
>     at 
> org.apache.flink.kubernetes.operator.utils.FlinkUtils.deleteJobGraphInKubernetesHA(FlinkUtils.java:248)
>     at 
> org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:130)
>     at 
> org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deployFlinkJob(ApplicationReconciler.java:205)
>     at 
> org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.restoreFromLastSavepoint(ApplicationReconciler.java:218)
>     at 
> org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.reconcile(ApplicationReconciler.java:117)
>     at 
> org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.reconcile(ApplicationReconciler.java:56)
>     at 
> org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:106)
>     ... 13 more {code}
> The root cause is that the Kubernetes HA implementation has changed from 
> 1.15. When the job is cancelled, the data of leader ConfigMap will be 
> cleared. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27362) Support restartNonce semantics in session job

2022-04-24 Thread Aitozi (Jira)
Aitozi created FLINK-27362:
--

 Summary: Support restartNonce semantics in session job
 Key: FLINK-27362
 URL: https://issues.apache.org/jira/browse/FLINK-27362
 Project: Flink
  Issue Type: Sub-task
Reporter: Aitozi






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27262) Enrich validator for FlinkSesionJob

2022-04-24 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526985#comment-17526985
 ] 

Aitozi commented on FLINK-27262:


FYI, I'm working on this now.

> Enrich validator for FlinkSesionJob
> ---
>
> Key: FLINK-27262
> URL: https://issues.apache.org/jira/browse/FLINK-27262
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
>
> We need to enrich the validator to cover FlinkSesionJob.
> At least we could have the following rules.
>  * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be 
> configured in session FlinkDeployment



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27360) Rename clusterId field of FlinkSessionJobSpec to deploymentName

2022-04-23 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526951#comment-17526951
 ] 

Aitozi commented on FLINK-27360:


Make sense to me, It's a minor change, I will open a PR for it.

> Rename clusterId field of FlinkSessionJobSpec to deploymentName
> ---
>
> Key: FLINK-27360
> URL: https://issues.apache.org/jira/browse/FLINK-27360
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> Since the sessionjob logic only works together with a FlinkDeployment, and 
> the clusterId always has to match the name, it would be better to explicitly 
> call it deploymentName so that it is more intuitive for the users.
> What do you think?
> cc [~aitozi] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27337) Prevent session cluster to be deleted when there are running jobs

2022-04-21 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525579#comment-17525579
 ] 

Aitozi commented on FLINK-27337:


As discussed in the [mailing 
list|https://lists.apache.org/thread/9xvdwtf5tbr28lonj21p58tn5gdndns5], the 
deletion of all the session job may be a critical operation. So if there is an 
upgrade or cleanup event for the session cluster, we can postpone the real 
cleanup/stop until all the session job suspended by user. We can avoid to 
remove the finializer and re-schedule to check whether all the job are stopped. 
 This way is more safe but may involve more step when user want to delete or 
upgrade the session cluster, what about make it both supported. 

> Prevent session cluster to be deleted when there are running jobs
> -
>
> Key: FLINK-27337
> URL: https://issues.apache.org/jira/browse/FLINK-27337
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> We should prevent the session cluster to be deleted when there are running 
> jobs. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27334) Support auto generate the doc for the KubernetesOperatorConfigOptions

2022-04-20 Thread Aitozi (Jira)
Aitozi created FLINK-27334:
--

 Summary: Support auto generate the doc for the 
KubernetesOperatorConfigOptions
 Key: FLINK-27334
 URL: https://issues.apache.org/jira/browse/FLINK-27334
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Aitozi
 Fix For: kubernetes-operator-1.0.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (FLINK-27270) Add document of session job operations

2022-04-20 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-27270:
---
Description: 
# Basic operations
 # How to support different source jars

> Add document of session job operations
> --
>
> Key: FLINK-27270
> URL: https://issues.apache.org/jira/browse/FLINK-27270
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> # Basic operations
>  # How to support different source jars



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27309) Allow to load default flink configs in the k8s operator dynamically

2022-04-19 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524250#comment-17524250
 ] 

Aitozi commented on FLINK-27309:


I think it's a very useful feature in production. But do we need to outline 
what config can dynamically take effect. Currently, most option are controlling 
the reconciler interval or check behavior can take effect at the next reconcile 
turn. But still with option like {{operator.reconciler.max.parallelism}} which 
may can't take effect directly, Right ?

> Allow to load default flink configs in the k8s operator dynamically
> ---
>
> Key: FLINK-27309
> URL: https://issues.apache.org/jira/browse/FLINK-27309
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> Current default configs used by the k8s operator will be saved in the 
> /opt/flink/conf dir in the k8s operator pod and will be loaded only once when 
> the operator is created.
> Since the flink k8s operator could be a long running service and users may 
> want to modify the default configs(e.g the metric reporter sampling interval) 
> for newly created deployments, it may better to load the default configs 
> dynamically(i.e. parsing the latest /opt/flink/conf/flink-conf.yaml) in the 
> {{ReconcilerFactory}} and {{ObserverFactory}}, instead of redeploying the 
> operator.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27279) Extract common status interfaces

2022-04-19 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524248#comment-17524248
 ] 

Aitozi commented on FLINK-27279:


[~gyfora] FYI, I'm working on this now. 

> Extract common status interfaces
> 
>
> Key: FLINK-27279
> URL: https://issues.apache.org/jira/browse/FLINK-27279
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> FlinkDeploymentStatus - FlinkSessionJobStatus
> and
> ReconciliationStatus - FlinkSessionJobReconciiationStatus
> share most of their content and extracting the shared parts into interfaces 
> would allow us to unify status update logic and remove some code duplicaiton



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27279) Extract common status interfaces

2022-04-18 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523590#comment-17523590
 ] 

Aitozi commented on FLINK-27279:


Get it

> Extract common status interfaces
> 
>
> Key: FLINK-27279
> URL: https://issues.apache.org/jira/browse/FLINK-27279
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> FlinkDeploymentStatus - FlinkSessionJobStatus
> and
> ReconciliationStatus - FlinkSessionJobReconciiationStatus
> share most of their content and extracting the shared parts into interfaces 
> would allow us to unify status update logic and remove some code duplicaiton



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27279) Extract common status interfaces

2022-04-18 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523573#comment-17523573
 ] 

Aitozi commented on FLINK-27279:


hah, I'm willing to take this ticket. Does this also means that we can offload 
some methods in the {{ReconciliationUtils}} to the common related status 
objects as discussed 
[here|https://github.com/apache/flink-kubernetes-operator/pull/165#discussion_r847997670]

> Extract common status interfaces
> 
>
> Key: FLINK-27279
> URL: https://issues.apache.org/jira/browse/FLINK-27279
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> FlinkDeploymentStatus - FlinkSessionJobStatus
> and
> ReconciliationStatus - FlinkSessionJobReconciiationStatus
> share most of their content and extracting the shared parts into interfaces 
> would allow us to unify status update logic and remove some code duplicaiton



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27261) Disable web.cancel.enable for session cluster

2022-04-16 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523273#comment-17523273
 ] 

Aitozi commented on FLINK-27261:


There is a bit different between the session job and application cluster. If 
user cancel the session job from web ui, I think the session job will sync the 
state to cancelled. 

But I still think we'd better disable it by default to make the job operation 
are all managed by the CR

> Disable web.cancel.enable for session cluster
> -
>
> Key: FLINK-27261
> URL: https://issues.apache.org/jira/browse/FLINK-27261
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
>  Labels: starter
>
> In FLINK-27154, we disable {{web.cancel.enable}} for application cluster. We 
> should also do this for session cluster.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-27261) Disable web.cancel.enable for session cluster

2022-04-16 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523273#comment-17523273
 ] 

Aitozi edited comment on FLINK-27261 at 4/17/22 4:03 AM:
-

There is a bit difference between the session job and application cluster. If 
user cancel the session job from web ui, I think the session job will sync the 
state to cancelled. 

But I still think we'd better disable it by default to make the job operation 
are all managed by the CR


was (Author: aitozi):
There is a bit different between the session job and application cluster. If 
user cancel the session job from web ui, I think the session job will sync the 
state to cancelled. 

But I still think we'd better disable it by default to make the job operation 
are all managed by the CR

> Disable web.cancel.enable for session cluster
> -
>
> Key: FLINK-27261
> URL: https://issues.apache.org/jira/browse/FLINK-27261
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
>  Labels: starter
>
> In FLINK-27154, we disable {{web.cancel.enable}} for application cluster. We 
> should also do this for session cluster.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27273) Support both configMap and zookeeper based HA data clean up

2022-04-16 Thread Aitozi (Jira)
Aitozi created FLINK-27273:
--

 Summary: Support both configMap and zookeeper based HA data clean 
up
 Key: FLINK-27273
 URL: https://issues.apache.org/jira/browse/FLINK-27273
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Aitozi
 Fix For: kubernetes-operator-1.0.0


As discussed in 
[comments|https://github.com/apache/flink-kubernetes-operator/pull/28#discussion_r815695041]

We only support clean up the ha data based configMap. Considering that 
zookeeper is still widely used as ha service when deploy on the kubernetes, I 
think we should still take it into account, otherwise, It will come up with 
some unexpected behavior when play with zk ha jobs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27160) Add e2e tests for FlinkSessionJob type

2022-04-16 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523110#comment-17523110
 ] 

Aitozi commented on FLINK-27160:


I'm working on this now

> Add e2e tests for FlinkSessionJob type
> --
>
> Key: FLINK-27160
> URL: https://issues.apache.org/jira/browse/FLINK-27160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27262) Enrich validator for FlinkSesionJob

2022-04-16 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523109#comment-17523109
 ] 

Aitozi commented on FLINK-27262:


BTW, I think the webhook have not handle the validation of the sessionjob. I 
thinks this functionality will also should be included in this ticket.

> Enrich validator for FlinkSesionJob
> ---
>
> Key: FLINK-27262
> URL: https://issues.apache.org/jira/browse/FLINK-27262
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
>
> We need to enrich the validator to cover FlinkSesionJob.
> At least we could have the following rules.
>  * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be 
> configured in session FlinkDeployment



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27257) Flink kubernetes operator triggers savepoint failed because of not all tasks running

2022-04-16 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523105#comment-17523105
 ] 

Aitozi commented on FLINK-27257:


I also encountered this problem, I think it is caused by the 
{{ApplicationReconciler#isJobRunning}} results is not exact 

> Flink kubernetes operator triggers savepoint failed because of not all tasks 
> running
> 
>
> Key: FLINK-27257
> URL: https://issues.apache.org/jira/browse/FLINK-27257
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
>
> {code:java}
> 2022-04-15 02:38:56,551 o.a.f.k.o.s.FlinkService       [INFO 
> ][default/flink-example-statemachine] Fetching savepoint result with 
> triggerId: 182d7f176496856d7b33fe2f3767da18
> 2022-04-15 02:38:56,690 o.a.f.k.o.s.FlinkService       
> [ERROR][default/flink-example-statemachine] Savepoint error
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> triggering task Source: Custom Source (1/2) of job 
>  is not being executed at the moment. 
> Aborting checkpoint. Failure reason: Not all required tasks are currently 
> running.
>     at 
> org.apache.flink.runtime.checkpoint.DefaultCheckpointPlanCalculator.checkTasksStarted(DefaultCheckpointPlanCalculator.java:143)
>     at 
> org.apache.flink.runtime.checkpoint.DefaultCheckpointPlanCalculator.lambda$calculateCheckpointPlan$1(DefaultCheckpointPlanCalculator.java:105)
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:455)
>     at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:455)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213)
>     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>     at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>     at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>     at akka.actor.Actor.aroundReceive(Actor.scala:537)
>     at akka.actor.Actor.aroundReceive$(Actor.scala:535)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:548)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:231)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
>     at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
>     at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> 2022-04-15 02:38:56,693 o.a.f.k.o.o.SavepointObserver  
> [ERROR][default/flink-example-statemachine] Checkpoint triggering task 
> Source: Custom Source (1/2) of job  is not 
> being executed at the moment. Aborting checkpoint. Failure reason: Not all 
> required tasks are currently running. {code}
> How to reproduce?
> Update arbitrary fields(e.g. parallelism) along with 
> {{{}savepointTriggerNonce{}}}.
>  
> The root cause might be the running state return by 
> {{ClusterClient#listJobs()}} does not mean all the tasks are running.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26870) Implement session job observer

2022-04-16 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523099#comment-17523099
 ] 

Aitozi commented on FLINK-26870:


I think this is already include in the 
[https://github.com/apache/flink-kubernetes-operator/pull/164]. Closing this 
ticket now

> Implement session job observer
> --
>
> Key: FLINK-26870
> URL: https://issues.apache.org/jira/browse/FLINK-26870
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-26870) Implement session job observer

2022-04-16 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi closed FLINK-26870.
--
Resolution: Fixed

> Implement session job observer
> --
>
> Key: FLINK-26870
> URL: https://issues.apache.org/jira/browse/FLINK-26870
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27270) Add document of session job operations

2022-04-16 Thread Aitozi (Jira)
Aitozi created FLINK-27270:
--

 Summary: Add document of session job operations
 Key: FLINK-27270
 URL: https://issues.apache.org/jira/browse/FLINK-27270
 Project: Flink
  Issue Type: Sub-task
  Components: Kubernetes Operator
Reporter: Aitozi
 Fix For: kubernetes-operator-1.0.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27269) Clean up the jar file after submitting the job

2022-04-16 Thread Aitozi (Jira)
Aitozi created FLINK-27269:
--

 Summary: Clean up the jar file after submitting the job 
 Key: FLINK-27269
 URL: https://issues.apache.org/jira/browse/FLINK-27269
 Project: Flink
  Issue Type: Sub-task
Reporter: Aitozi


When testing, I found that the jar files will exploded after several submits, I 
think we should clean up the jars after submitting



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27262) Enrich validator for FlinkSesionJob

2022-04-16 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523078#comment-17523078
 ] 

Aitozi commented on FLINK-27262:


Agree, the validator for session job is not complete now. I will work on this, 
Please help assign the ticket 

> Enrich validator for FlinkSesionJob
> ---
>
> Key: FLINK-27262
> URL: https://issues.apache.org/jira/browse/FLINK-27262
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
>
> We need to enrich the validator to cover FlinkSesionJob.
> At least we could have the following rules.
>  * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be 
> configured in session FlinkDeployment



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27203) Add supports for uploading jars from object storage (s3, gcs, oss)

2022-04-14 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522186#comment-17522186
 ] 

Aitozi commented on FLINK-27203:


Hi [~haoxin] , do you mean for uploading the session jobs jars ? There is a 
ticket tracking this 

> Add supports for uploading jars from object storage (s3, gcs, oss)
> --
>
> Key: FLINK-27203
> URL: https://issues.apache.org/jira/browse/FLINK-27203
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Xin Hao
>Priority: Major
>
> I think it will be efficient if we can read jars from object storage.
>  
>  *  We can detect the object storage provider by path scheme. (Such as: 
> `gs://` for GCS)
>  * Because the most cloud providers' SDK support native permission 
> integration with K8s, for example, the GCP, the user can bind the permission 
> by service accounts, and the code running in the K8s cluster will be simple 
> `{{{}storage = StorageOptions.getDefaultInstance().service{}}}`
>  * In this way, the users only need to bind the service accounts permissions 
> and don't need to handle volumes mount and jars initialize (download jars 
> from object storage into the volumes by themselves).
> I think we can define the interface first and let the community help 
> contribute to the implementation for the different cloud providers.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27161) Support user jar from different type of filesystems

2022-04-11 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17520316#comment-17520316
 ] 

Aitozi commented on FLINK-27161:


I'm working on this now.

> Support user jar from different type of filesystems
> ---
>
> Key: FLINK-27161
> URL: https://issues.apache.org/jira/browse/FLINK-27161
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27161) Support user jar from different type of filesystems

2022-04-10 Thread Aitozi (Jira)
Aitozi created FLINK-27161:
--

 Summary: Support user jar from different type of filesystems
 Key: FLINK-27161
 URL: https://issues.apache.org/jira/browse/FLINK-27161
 Project: Flink
  Issue Type: Sub-task
  Components: Kubernetes Operator
Reporter: Aitozi
 Fix For: kubernetes-operator-1.0.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27160) Add e2e tests for FlinkSessionJob type

2022-04-10 Thread Aitozi (Jira)
Aitozi created FLINK-27160:
--

 Summary: Add e2e tests for FlinkSessionJob type
 Key: FLINK-27160
 URL: https://issues.apache.org/jira/browse/FLINK-27160
 Project: Flink
  Issue Type: Sub-task
  Components: Kubernetes Operator
Reporter: Aitozi
 Fix For: kubernetes-operator-1.0.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-27001) Support to specify the resource of the operator

2022-04-07 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi closed FLINK-27001.
--
Resolution: Fixed

> Support to specify the resource of the operator 
> 
>
> Key: FLINK-27001
> URL: https://issues.apache.org/jira/browse/FLINK-27001
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>
> Supporting to specify the operator resource requirements and limits



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27001) Support to specify the resource of the operator

2022-04-07 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519298#comment-17519298
 ] 

Aitozi commented on FLINK-27001:


Yes, closing it.

> Support to specify the resource of the operator 
> 
>
> Key: FLINK-27001
> URL: https://issues.apache.org/jira/browse/FLINK-27001
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>
> Supporting to specify the operator resource requirements and limits



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27028) Support to upload jar and run jar in RestClusterClient

2022-04-02 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516448#comment-17516448
 ] 

Aitozi commented on FLINK-27028:


cc [~wangyang0918]  [~chesnay]   If no objection, I'm willing to open a pull 
request for this.

> Support to upload jar and run jar in RestClusterClient
> --
>
> Key: FLINK-27028
> URL: https://issues.apache.org/jira/browse/FLINK-27028
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission
>Reporter: Aitozi
>Priority: Major
>
> The {{flink-kubernetes-operator}} is using the JarUpload + JarRun to support 
> the session job submission. However, currently the RestClusterClient do not 
> expose a way to upload the user jar to session cluster and trigger the jar 
> run api. So a naked RestClient is used to achieve this, but it lacks the 
> common retry logic.
> Can we expose these two api the the rest cluster client to make it more 
> convenient to use in the operator 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-27028) Support to upload jar and run jar in RestClusterClient

2022-04-02 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-27028:
---
Description: 
The {{flink-kubernetes-operator}} is using the JarUpload + JarRun to support 
the session job submission. However, currently the RestClusterClient do not 
expose a way to upload the user jar to session cluster and trigger the jar run 
api. So a naked RestClient is used to achieve this, but it lacks the common 
retry logic.

Can we expose these two api the the rest cluster client to make it more 
convenient to use in the operator 

  was:
The flink-kubernetes-operator is using the JarUpload + JarRun to support the 
session job management. However, currently the RestClusterClient do not expose 
a way to upload the user jar to session cluster and trigger the jar run api. So 
I used to naked RestClient to achieve this. 

Can we expose these two api the the rest cluster client to make it more 
convenient to use in the operator


> Support to upload jar and run jar in RestClusterClient
> --
>
> Key: FLINK-27028
> URL: https://issues.apache.org/jira/browse/FLINK-27028
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission
>Reporter: Aitozi
>Priority: Major
>
> The {{flink-kubernetes-operator}} is using the JarUpload + JarRun to support 
> the session job submission. However, currently the RestClusterClient do not 
> expose a way to upload the user jar to session cluster and trigger the jar 
> run api. So a naked RestClient is used to achieve this, but it lacks the 
> common retry logic.
> Can we expose these two api the the rest cluster client to make it more 
> convenient to use in the operator 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27028) Support to upload jar and run jar in RestClusterClient

2022-04-02 Thread Aitozi (Jira)
Aitozi created FLINK-27028:
--

 Summary: Support to upload jar and run jar in RestClusterClient
 Key: FLINK-27028
 URL: https://issues.apache.org/jira/browse/FLINK-27028
 Project: Flink
  Issue Type: Improvement
  Components: Client / Job Submission
Reporter: Aitozi


The flink-kubernetes-operator is using the JarUpload + JarRun to support the 
session job management. However, currently the RestClusterClient do not expose 
a way to upload the user jar to session cluster and trigger the jar run api. So 
I used to naked RestClient to achieve this. 

Can we expose these two api the the rest cluster client to make it more 
convenient to use in the operator



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27001) Support to specify the resource of the operator

2022-04-02 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516445#comment-17516445
 ] 

Aitozi commented on FLINK-27001:


It seems this can also implement by 
https://issues.apache.org/jira/browse/FLINK-26663 So I will not work on this 
right now, I will keep an eye on this.

> Support to specify the resource of the operator 
> 
>
> Key: FLINK-27001
> URL: https://issues.apache.org/jira/browse/FLINK-27001
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>
> Supporting to specify the operator resource requirements and limits



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27001) Support to specify the resource of the operator

2022-04-01 Thread Aitozi (Jira)
Aitozi created FLINK-27001:
--

 Summary: Support to specify the resource of the operator 
 Key: FLINK-27001
 URL: https://issues.apache.org/jira/browse/FLINK-27001
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Aitozi


Supporting to specify the operator resource requirements and limits



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27000) Support to set JVM args for operator

2022-04-01 Thread Aitozi (Jira)
Aitozi created FLINK-27000:
--

 Summary: Support to set JVM args for operator
 Key: FLINK-27000
 URL: https://issues.apache.org/jira/browse/FLINK-27000
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Aitozi


In production we often need to set the JVM option to operator



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-20808) Remove redundant checkstyle rules

2022-04-01 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-20808:
---
Attachment: (was: image-2022-04-02-12-46-11-005.png)

> Remove redundant checkstyle rules
> -
>
> Key: FLINK-20808
> URL: https://issues.apache.org/jira/browse/FLINK-20808
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System
>Reporter: Chesnay Schepler
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor
> Attachments: image-2022-04-02-12-46-28-065.png
>
>
> There are probably a few checkstyle rules that are now enforced by spotless, 
> and we could remove these to clarify the responsibilities of each tool.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-20808) Remove redundant checkstyle rules

2022-04-01 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516216#comment-17516216
 ] 

Aitozi commented on FLINK-20808:


No need to answer, after some search, I found it can be solved by set the IDE 
line separator to {{LF :)}}

!image-2022-04-02-12-46-28-065.png|width=354,height=85!

> Remove redundant checkstyle rules
> -
>
> Key: FLINK-20808
> URL: https://issues.apache.org/jira/browse/FLINK-20808
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System
>Reporter: Chesnay Schepler
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor
> Attachments: image-2022-04-02-12-46-11-005.png, 
> image-2022-04-02-12-46-28-065.png
>
>
> There are probably a few checkstyle rules that are now enforced by spotless, 
> and we could remove these to clarify the responsibilities of each tool.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-20808) Remove redundant checkstyle rules

2022-04-01 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-20808:
---
Attachment: image-2022-04-02-12-46-28-065.png

> Remove redundant checkstyle rules
> -
>
> Key: FLINK-20808
> URL: https://issues.apache.org/jira/browse/FLINK-20808
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System
>Reporter: Chesnay Schepler
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor
> Attachments: image-2022-04-02-12-46-11-005.png, 
> image-2022-04-02-12-46-28-065.png
>
>
> There are probably a few checkstyle rules that are now enforced by spotless, 
> and we could remove these to clarify the responsibilities of each tool.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-20808) Remove redundant checkstyle rules

2022-04-01 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-20808:
---
Attachment: image-2022-04-02-12-46-11-005.png

> Remove redundant checkstyle rules
> -
>
> Key: FLINK-20808
> URL: https://issues.apache.org/jira/browse/FLINK-20808
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System
>Reporter: Chesnay Schepler
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor
> Attachments: image-2022-04-02-12-46-11-005.png, 
> image-2022-04-02-12-46-28-065.png
>
>
> There are probably a few checkstyle rules that are now enforced by spotless, 
> and we could remove these to clarify the responsibilities of each tool.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-20808) Remove redundant checkstyle rules

2022-04-01 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516211#comment-17516211
 ] 

Aitozi commented on FLINK-20808:


Hi [~chesnay]  sorry to bother you here, I run into a case that: I follow the 
development doc to use the save action and google-java-format to automatically 
format the code. But the formatted code can not pass the checksytyle rule 
[NewlineAtEndOfFile] But I check the unpassed code,It has the end line 
actually. Is there some bug for checkstyle or I miss some configuration for 
development?

> Remove redundant checkstyle rules
> -
>
> Key: FLINK-20808
> URL: https://issues.apache.org/jira/browse/FLINK-20808
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System
>Reporter: Chesnay Schepler
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor
>
> There are probably a few checkstyle rules that are now enforced by spotless, 
> and we could remove these to clarify the responsibilities of each tool.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26996) Break the reconcile after first create session cluster

2022-04-01 Thread Aitozi (Jira)
Aitozi created FLINK-26996:
--

 Summary: Break the reconcile after first create session cluster
 Key: FLINK-26996
 URL: https://issues.apache.org/jira/browse/FLINK-26996
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Aitozi


When I test session cluster, I found that it will always start twice for the 
session cluster. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-26989) Fix the potential lost of secondary resource when watching a namespace

2022-04-01 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515886#comment-17515886
 ] 

Aitozi edited comment on FLINK-26989 at 4/1/22 12:03 PM:
-

If there's one namespace, the eventsource do not duplicated for the same type, 
so not a bug, closing~


was (Author: aitozi):
If there's a namespace, the eventsource do not duplicated for the same type, so 
not a bug, closing~

> Fix the potential lost of secondary resource when watching a namespace
> --
>
> Key: FLINK-26989
> URL: https://issues.apache.org/jira/browse/FLINK-26989
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>  Labels: pull-request-available
>
> The event source is register under the name of namespace when watching 
> namespace is not empty, But when we get from the context, we use the 
> different condition, It may make it miss the target event source



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-26989) Fix the potential lost of secondary resource when watching a namespace

2022-04-01 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi closed FLINK-26989.
--
Resolution: Not A Problem

If there's a namespace, the eventsource do not duplicated for the same type, so 
not a bug, closing~

> Fix the potential lost of secondary resource when watching a namespace
> --
>
> Key: FLINK-26989
> URL: https://issues.apache.org/jira/browse/FLINK-26989
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>  Labels: pull-request-available
>
> The event source is register under the name of namespace when watching 
> namespace is not empty, But when we get from the context, we use the 
> different condition, It may make it miss the target event source



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26989) Fix the potential lost of secondary resource when watching a namespace

2022-04-01 Thread Aitozi (Jira)
Aitozi created FLINK-26989:
--

 Summary: Fix the potential lost of secondary resource when 
watching a namespace
 Key: FLINK-26989
 URL: https://issues.apache.org/jira/browse/FLINK-26989
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Aitozi


The event source is register under the name of namespace when watching 
namespace is not empty, But when we get from the context, we use the different 
condition, It may make it miss the target event source



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26915) Extend the Reconciler and Observer interface

2022-03-29 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514154#comment-17514154
 ] 

Aitozi commented on FLINK-26915:


cc [~gyfora] [~wangyang0918] 

> Extend the Reconciler and Observer interface
> 
>
> Key: FLINK-26915
> URL: https://issues.apache.org/jira/browse/FLINK-26915
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> As discussed in 
> [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111],
>  I proposed make two changes to the Reconciler and Observer
>  # directly return the UpdateControl from the reconciler, because the 
> reconciler can in charge of the Update behavior, By this, we dont have to 
> infer the update control in the controller
>  # Make the params generic and extends from the ReconcilerContext and 
> ObserverContext. which will be easy for different controller to ship their 
> own objects for reconcile and observer. For example, in the FlinkSessionJob 
> case, we need to get the effective config from the FlinkDeployment first and 
> also pass the FlinkDeployment to the reconciler. 
> After the change, the reconciler will look like this:
> {code:java}
> public interface Reconciler> {     
> UpdateControl reconcile(CR cr, CTX context) throws Exception;     
> DeleteControl cleanup(CR cr, CTX ctx);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26915) Extend the Reconciler and Observer interface

2022-03-29 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-26915:
---
Parent: FLINK-26784
Issue Type: Sub-task  (was: Improvement)

> Extend the Reconciler and Observer interface
> 
>
> Key: FLINK-26915
> URL: https://issues.apache.org/jira/browse/FLINK-26915
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> As discussed in 
> [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111],
>  I proposed make two changes to the Reconciler and Observer
>  # directly return the UpdateControl from the reconciler, because the 
> reconciler can in charge of the Update behavior, By this, we dont have to 
> infer the update control in the controller
>  # Make the params generic and extends from the ReconcilerContext and 
> ObserverContext. which will be easy for different controller to ship their 
> own objects for reconcile and observer. For example, in the FlinkSessionJob 
> case, we need to get the effective config from the FlinkDeployment first and 
> also pass the FlinkDeployment to the reconciler. 
> After the change, the reconciler will look like this:
> {code:java}
> public interface Reconciler> {     
> UpdateControl reconcile(CR cr, CTX context) throws Exception;     
> DeleteControl cleanup(CR cr, CTX ctx);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26915) Extend the Reconciler and Observer interface

2022-03-29 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-26915:
---
Description: 
As discussed in 
[comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111],
 I proposed make two changes to the Reconciler and Observer
 # directly return the UpdateControl from the reconciler, because the 
reconciler can in charge of the Update behavior, By this, we dont have to infer 
the update control in the controller
 # Make the params generic and extends from the ReconcilerContext and 
ObserverContext. which will be easy for different controller to ship their own 
objects for reconcile and observer. For example, in the FlinkSessionJob case, 
we need to get the effective config from the FlinkDeployment first and also 
pass the FlinkDeployment to the reconciler. 

After the change, the reconciler will look like this:
{code:java}
public interface Reconciler> {     
UpdateControl reconcile(CR cr, CTX context) throws Exception;     
DeleteControl cleanup(CR cr, CTX ctx);
}{code}
 

 

  was:
As discussed in 
[comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111],
 I proposed make two changes to the Reconciler and Observer
 # directly return the UpdateControl from the reconciler, because the 
reconciler can in charge of the Update behavior, By this, we dont have to infer 
the update control in the controller
 # Make the params generic and extends from the ReconcilerContext and 
ObserverContext. which will be easy for different controller to ship their own 
objects for reconcile and observer. For example, in the FlinkSessionJob case, 
we need to get the effective config from the FlinkDeployment first and also 
pass the FlinkDeployment to the reconciler. 

After the change, the reconciler will look like this:

{{}}
{code:java}

{code}
{{public interface Reconciler> \{     
UpdateControl reconcile(CR cr, CTX context) throws Exception;     
DeleteControl cleanup(CR cr, CTX ctx);
} }}

{{}}

 


> Extend the Reconciler and Observer interface
> 
>
> Key: FLINK-26915
> URL: https://issues.apache.org/jira/browse/FLINK-26915
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> As discussed in 
> [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111],
>  I proposed make two changes to the Reconciler and Observer
>  # directly return the UpdateControl from the reconciler, because the 
> reconciler can in charge of the Update behavior, By this, we dont have to 
> infer the update control in the controller
>  # Make the params generic and extends from the ReconcilerContext and 
> ObserverContext. which will be easy for different controller to ship their 
> own objects for reconcile and observer. For example, in the FlinkSessionJob 
> case, we need to get the effective config from the FlinkDeployment first and 
> also pass the FlinkDeployment to the reconciler. 
> After the change, the reconciler will look like this:
> {code:java}
> public interface Reconciler> {     
> UpdateControl reconcile(CR cr, CTX context) throws Exception;     
> DeleteControl cleanup(CR cr, CTX ctx);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26915) Extend the Reconciler and Observer interface

2022-03-29 Thread Aitozi (Jira)
Aitozi created FLINK-26915:
--

 Summary: Extend the Reconciler and Observer interface
 Key: FLINK-26915
 URL: https://issues.apache.org/jira/browse/FLINK-26915
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Aitozi
 Fix For: kubernetes-operator-1.0.0


As discussed in 
[comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111],
 I proposed make two changes to the Reconciler and Observer
 # directly return the UpdateControl from the reconciler, because the 
reconciler can in charge of the Update behavior, By this, we dont have to infer 
the update control in the controller
 # Make the params generic and extends from the ReconcilerContext and 
ObserverContext. which will be easy for different controller to ship their own 
objects for reconcile and observer. For example, in the FlinkSessionJob case, 
we need to get the effective config from the FlinkDeployment first and also 
pass the FlinkDeployment to the reconciler. 

After the change, the reconciler will look like this:

{{}}
{code:java}

{code}
{{public interface Reconciler> \{     
UpdateControl reconcile(CR cr, CTX context) throws Exception;     
DeleteControl cleanup(CR cr, CTX ctx);
} }}

{{}}

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-18356) flink-table-planner Exit code 137 returned from process

2022-03-28 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513799#comment-17513799
 ] 

Aitozi commented on FLINK-18356:


[~martijnvisser] Thanks for your information, I have addressed the OOM error by 
apply the latest patch from release-1.14, not sure which commit solved it. 
Anyway, I can run the pipeline now.

> flink-table-planner Exit code 137 returned from process
> ---
>
> Key: FLINK-18356
> URL: https://issues.apache.org/jira/browse/FLINK-18356
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines, Tests
>Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0
>Reporter: Piotr Nowojski
>Assignee: Martijn Visser
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.15.0
>
> Attachments: 1234.jpg, app-profiling_4.gif
>
>
> {noformat}
> = test session starts 
> ==
> platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
> cachedir: .tox/py37-cython/.pytest_cache
> rootdir: /__w/3/s/flink-python
> collected 568 items
> pyflink/common/tests/test_configuration.py ..[  
> 1%]
> pyflink/common/tests/test_execution_config.py ...[  
> 5%]
> pyflink/dataset/tests/test_execution_environment.py .
> ##[error]Exit code 137 returned from process: file name '/bin/docker', 
> arguments 'exec -i -u 1002 
> 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - python
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729=logs=9cada3cb-c1d3-5621-16da-0f718fb86602=8d78fe4f-d658-5c70-12f8-4921589024c3



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26889) Eliminate the duplicated construct for FlinkOperatorConfiguration in test

2022-03-28 Thread Aitozi (Jira)
Aitozi created FLINK-26889:
--

 Summary: Eliminate the duplicated construct for 
FlinkOperatorConfiguration in test
 Key: FLINK-26889
 URL: https://issues.apache.org/jira/browse/FLINK-26889
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Aitozi


A minor improvement to reduce the boilerplate code



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-18356) flink-table-planner Exit code 137 returned from process

2022-03-28 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513286#comment-17513286
 ] 

Aitozi commented on FLINK-18356:


Hi, I'm running testing CI against release-1.14 locally and still meet the 
problem with exit code 137, Is release-1.14 miss some fix for it ?

> flink-table-planner Exit code 137 returned from process
> ---
>
> Key: FLINK-18356
> URL: https://issues.apache.org/jira/browse/FLINK-18356
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines, Tests
>Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0
>Reporter: Piotr Nowojski
>Assignee: Martijn Visser
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.15.0
>
> Attachments: 1234.jpg, app-profiling_4.gif
>
>
> {noformat}
> = test session starts 
> ==
> platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
> cachedir: .tox/py37-cython/.pytest_cache
> rootdir: /__w/3/s/flink-python
> collected 568 items
> pyflink/common/tests/test_configuration.py ..[  
> 1%]
> pyflink/common/tests/test_execution_config.py ...[  
> 5%]
> pyflink/dataset/tests/test_execution_environment.py .
> ##[error]Exit code 137 returned from process: file name '/bin/docker', 
> arguments 'exec -i -u 1002 
> 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - python
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729=logs=9cada3cb-c1d3-5621-16da-0f718fb86602=8d78fe4f-d658-5c70-12f8-4921589024c3



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26676) Set ClusterIP service type when watching specific namespaces

2022-03-26 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512832#comment-17512832
 ] 

Aitozi commented on FLINK-26676:


Hi, what error we will get if we not set to ClusterIP? 

> Set ClusterIP service type when watching specific namespaces
> 
>
> Key: FLINK-26676
> URL: https://issues.apache.org/jira/browse/FLINK-26676
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Assignee: Gyula Fora
>Priority: Major
>  Labels: pull-request-available
>
> As noted in this PR
> [https://github.com/apache/flink-kubernetes-operator/pull/42#issue-1159776739]
> Users get service account related error messages unless we set:
> {noformat}
> kubernetes.rest-service.exposed.type: ClusterIP{noformat}
> In cases where we are watching specific namespaces.
> We should configure this automatically unless override by the user in the 
> flinkConfiguration for these cases.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26873) Align the helm chart version with the flink operator

2022-03-26 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-26873:
---
Component/s: Kubernetes Operator

> Align the helm chart version with the flink operator
> 
>
> Key: FLINK-26873
> URL: https://issues.apache.org/jira/browse/FLINK-26873
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>
> Now the flink-operator helm chart version is 1.0.13. I think it should be 
> aligned to the flink-operator version during release 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26873) Align the helm chart version with the flink operator

2022-03-26 Thread Aitozi (Jira)
Aitozi created FLINK-26873:
--

 Summary: Align the helm chart version with the flink operator
 Key: FLINK-26873
 URL: https://issues.apache.org/jira/browse/FLINK-26873
 Project: Flink
  Issue Type: Sub-task
Reporter: Aitozi


Now the flink-operator helm chart version is 1.0.13. I think it should be 
aligned to the flink-operator version during release 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26873) Align the helm chart version with the flink operator

2022-03-26 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512748#comment-17512748
 ] 

Aitozi commented on FLINK-26873:


cc [~gyfora] 

> Align the helm chart version with the flink operator
> 
>
> Key: FLINK-26873
> URL: https://issues.apache.org/jira/browse/FLINK-26873
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>
> Now the flink-operator helm chart version is 1.0.13. I think it should be 
> aligned to the flink-operator version during release 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26871) Handle Session job spec change

2022-03-26 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512747#comment-17512747
 ] 

Aitozi commented on FLINK-26871:


I will work on this 

> Handle Session job spec change 
> ---
>
> Key: FLINK-26871
> URL: https://issues.apache.org/jira/browse/FLINK-26871
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Aitozi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26871) Handle Session job spec change

2022-03-26 Thread Aitozi (Jira)
Aitozi created FLINK-26871:
--

 Summary: Handle Session job spec change 
 Key: FLINK-26871
 URL: https://issues.apache.org/jira/browse/FLINK-26871
 Project: Flink
  Issue Type: Sub-task
Reporter: Aitozi






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26870) Implement session job observer

2022-03-26 Thread Aitozi (Jira)
Aitozi created FLINK-26870:
--

 Summary: Implement session job observer
 Key: FLINK-26870
 URL: https://issues.apache.org/jira/browse/FLINK-26870
 Project: Flink
  Issue Type: Sub-task
Reporter: Aitozi






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-26807) The batch job not work well with Operator

2022-03-22 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi closed FLINK-26807.
--
Resolution: Duplicate

> The batch job not work well with Operator
> -
>
> Key: FLINK-26807
> URL: https://issues.apache.org/jira/browse/FLINK-26807
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>
> When I test the batch job or finite streaming job, the flinkdep will be an 
> orphaned resource and keep listing job after job finished. Because the 
> JobManagerDeploymentStatus will not be sync again.
> I think we should sync the global terminated status from the application job, 
> and do the clean up work for the flinkdep resource



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26807) The batch job not work well with Operator

2022-03-22 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510991#comment-17510991
 ] 

Aitozi commented on FLINK-26807:


Get it, I will look over that discussion, Closing this one as duplicated.

> The batch job not work well with Operator
> -
>
> Key: FLINK-26807
> URL: https://issues.apache.org/jira/browse/FLINK-26807
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>
> When I test the batch job or finite streaming job, the flinkdep will be an 
> orphaned resource and keep listing job after job finished. Because the 
> JobManagerDeploymentStatus will not be sync again.
> I think we should sync the global terminated status from the application job, 
> and do the clean up work for the flinkdep resource



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26807) The batch job not work well with Operator

2022-03-22 Thread Aitozi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aitozi updated FLINK-26807:
---
Description: 
When I test the batch job or finite streaming job, the flinkdep will be an 
orphaned resource and keep listing job after job finished. Because the 
JobManagerDeploymentStatus will not be sync again.

I think we should sync the global terminated status from the application job, 
and do the clean up work for the flinkdep resource

  was:
When I test the batch job or finite streaming job, the flinkdep will be an 
orphaned resource and keep listing job. Because the JobManagerDeploymentStatus 
will not be sync again.

I think we should sync the global terminated status from the application job, 
and do the clean up work for the flinkdep resource


> The batch job not work well with Operator
> -
>
> Key: FLINK-26807
> URL: https://issues.apache.org/jira/browse/FLINK-26807
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Aitozi
>Priority: Major
>
> When I test the batch job or finite streaming job, the flinkdep will be an 
> orphaned resource and keep listing job after job finished. Because the 
> JobManagerDeploymentStatus will not be sync again.
> I think we should sync the global terminated status from the application job, 
> and do the clean up work for the flinkdep resource



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26807) The batch job not work well with Operator

2022-03-22 Thread Aitozi (Jira)
Aitozi created FLINK-26807:
--

 Summary: The batch job not work well with Operator
 Key: FLINK-26807
 URL: https://issues.apache.org/jira/browse/FLINK-26807
 Project: Flink
  Issue Type: Sub-task
  Components: Kubernetes Operator
Reporter: Aitozi


When I test the batch job or finite streaming job, the flinkdep will be an 
orphaned resource and keep listing job. Because the JobManagerDeploymentStatus 
will not be sync again.

I think we should sync the global terminated status from the application job, 
and do the clean up work for the flinkdep resource



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-25480) Create dashboard/monitoring to see resource usage per E2E test

2022-03-22 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510448#comment-17510448
 ] 

Aitozi edited comment on FLINK-25480 at 3/22/22, 12:16 PM:
---

FYI, I encounter the same problem with 1.14.4 when running test in container. I 
test in 16C 32G container, and {{mvn -Dflink.forkCount=2 
-Dflink.forkCountTestPackage=2 -Dmaven.test.failure.ignore=true verify}} 
command exit 137 finally.
At the meantime, I opened another screen to run  {{vsar --cpu --mem -l}} to 
monitor the memory usage. But I still not catch the memory stroke. 
Hope to be helpful to you guys. I'm curious about the root cause, because it 
stopped me from building our stable CI pipeline.


was (Author: aitozi):
FYI, I encounter the same problem with 1.14.4 when running test in container. I 
test in 16C 32G container, and {{mvn -Dflink.forkCount=2 
-Dflink.forkCountTestPackage=2 -Dmaven.test.failure.ignore=true verify}} 
command exit 137 finally. At the meantime, I opened another screen to run  
{{vsar --cpu --mem -l}} to monitor the memory usage. But I still not catch the 
memory stroke. Hope to be helpful to your guys. I'm curious about it, because 
it stop me from building our stable CI pipeline.

> Create dashboard/monitoring to see resource usage per E2E test
> --
>
> Key: FLINK-25480
> URL: https://issues.apache.org/jira/browse/FLINK-25480
> Project: Flink
>  Issue Type: Improvement
>  Components: Test Infrastructure
>Affects Versions: 1.15.0, 1.13.6, 1.14.3
>Reporter: Martijn Visser
>Priority: Critical
>  Labels: test-stability
>
> Over the past couple of weeks, we've encountered multiple problems with tests 
> failing due to out-of-memory errors and/or exit code 137 happening. These are 
> happening both on Alibaba CI machines, as well as Azure hosted agents. For 
> the Alibaba CI machines, we've mitigated the problem by reducing the number 
> of workers per CI machine from 7 to 5. These workers can spin up multiple 
> Docker containers, especially with Testcontainers getting used more and more. 
> If we can get insights in the resource usage per end-to-end test, it will 
> also help in debugging test infrastructure problems more quickly. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-25480) Create dashboard/monitoring to see resource usage per E2E test

2022-03-22 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510448#comment-17510448
 ] 

Aitozi edited comment on FLINK-25480 at 3/22/22, 12:15 PM:
---

FYI, I encounter the same problem with 1.14.4 when running test in container. I 
test in 16C 32G container, and {{mvn -Dflink.forkCount=2 
-Dflink.forkCountTestPackage=2 -Dmaven.test.failure.ignore=true verify}} 
command exit 137 finally. At the meantime, I opened another screen to run  
{{vsar --cpu --mem -l}} to monitor the memory usage. But I still not catch the 
memory stroke. Hope to be helpful to your guys. I'm curious about it, because 
it stop me from building our stable CI pipeline.


was (Author: aitozi):
FYI, I encounter the same problem with 1.14.4 when running test in container. I 
test in 16C 32G container, and {{mvn verify}} command exit 137 finally. At the 
meantime, I opened another screen to run  {{vsar --cpu --mem -l}} to monitor 
the memory usage. But I still not catch the memory stroke. Hope to be helpful 
to your guys. I'm curious about it, because it stop me from building our stable 
CI pipeline.

> Create dashboard/monitoring to see resource usage per E2E test
> --
>
> Key: FLINK-25480
> URL: https://issues.apache.org/jira/browse/FLINK-25480
> Project: Flink
>  Issue Type: Improvement
>  Components: Test Infrastructure
>Affects Versions: 1.15.0, 1.13.6, 1.14.3
>Reporter: Martijn Visser
>Priority: Critical
>  Labels: test-stability
>
> Over the past couple of weeks, we've encountered multiple problems with tests 
> failing due to out-of-memory errors and/or exit code 137 happening. These are 
> happening both on Alibaba CI machines, as well as Azure hosted agents. For 
> the Alibaba CI machines, we've mitigated the problem by reducing the number 
> of workers per CI machine from 7 to 5. These workers can spin up multiple 
> Docker containers, especially with Testcontainers getting used more and more. 
> If we can get insights in the resource usage per end-to-end test, it will 
> also help in debugging test infrastructure problems more quickly. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-25480) Create dashboard/monitoring to see resource usage per E2E test

2022-03-22 Thread Aitozi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510448#comment-17510448
 ] 

Aitozi edited comment on FLINK-25480 at 3/22/22, 12:14 PM:
---

FYI, I encounter the same problem with 1.14.4 when running test in container. I 
test in 16C 32G container, and {{mvn verify}} command exit 137 finally. At the 
meantime, I opened another screen to run  {{vsar --cpu --mem -l}} to monitor 
the memory usage. But I still not catch the memory stroke. Hope to be helpful 
to your guys. I'm curious about it, because it stop me from building our stable 
CI pipeline.


was (Author: aitozi):
FYI, I encounter the same problem with 1.14.4 when running test in container. I 
test in 16C 32G container, and {{mvn verify}} command exit 137 finally. Then I 
open another screen to run  {{vsar --cpu --mem -l}} . But I still not catch the 
memory stroke. I'm curious about it.

> Create dashboard/monitoring to see resource usage per E2E test
> --
>
> Key: FLINK-25480
> URL: https://issues.apache.org/jira/browse/FLINK-25480
> Project: Flink
>  Issue Type: Improvement
>  Components: Test Infrastructure
>Affects Versions: 1.15.0, 1.13.6, 1.14.3
>Reporter: Martijn Visser
>Priority: Critical
>  Labels: test-stability
>
> Over the past couple of weeks, we've encountered multiple problems with tests 
> failing due to out-of-memory errors and/or exit code 137 happening. These are 
> happening both on Alibaba CI machines, as well as Azure hosted agents. For 
> the Alibaba CI machines, we've mitigated the problem by reducing the number 
> of workers per CI machine from 7 to 5. These workers can spin up multiple 
> Docker containers, especially with Testcontainers getting used more and more. 
> If we can get insights in the resource usage per end-to-end test, it will 
> also help in debugging test infrastructure problems more quickly. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


<    1   2   3   4   5   6   7   8   9   >