[jira] [Commented] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18

2024-02-02 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813604#comment-17813604
 ] 

Yang Wang commented on FLINK-34333:
---

+1 for what Chesnay said. We already shade the fabric8 k8s client in the 
kubernetes-client module. It should not be used directly in other projects.

> Fix FLINK-34007 LeaderElector bug in 1.18
> -
>
> Key: FLINK-34333
> URL: https://issues.apache.org/jira/browse/FLINK-34333
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
>
> FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since 
> Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which 
> required an update of the k8s client to v6.9.0.
> This Jira issue is about finding a solution in Flink 1.18 for the very same 
> problem FLINK-34007 covered. It's a dedicated Jira issue because we want to 
> unblock the release of 1.19 by resolving FLINK-34007.
> Just to summarize why the upgrade to v6.9.0 is desired: There's a bug in 
> v6.6.2 which might prevent the leadership lost event being forwarded to the 
> client ([#5463|https://github.com/fabric8io/kubernetes-client/issues/5463]). 
> An initial proposal where the release call was handled in Flink's 
> {{KubernetesLeaderElector}} didn't work due to the leadership lost event 
> being triggered twice (see [FLINK-34007 PR 
> comment|https://github.com/apache/flink/pull/24132#discussion_r1467175902])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18

2024-02-02 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813603#comment-17813603
 ] 

Yang Wang commented on FLINK-34333:
---

I am not aware of other potential issues and am in favor of upgrading the k8s 
client version. 

> Fix FLINK-34007 LeaderElector bug in 1.18
> -
>
> Key: FLINK-34333
> URL: https://issues.apache.org/jira/browse/FLINK-34333
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
>
> FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since 
> Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which 
> required an update of the k8s client to v6.9.0.
> This Jira issue is about finding a solution in Flink 1.18 for the very same 
> problem FLINK-34007 covered. It's a dedicated Jira issue because we want to 
> unblock the release of 1.19 by resolving FLINK-34007.
> Just to summarize why the upgrade to v6.9.0 is desired: There's a bug in 
> v6.6.2 which might prevent the leadership lost event being forwarded to the 
> client ([#5463|https://github.com/fabric8io/kubernetes-client/issues/5463]). 
> An initial proposal where the release call was handled in Flink's 
> {{KubernetesLeaderElector}} didn't work due to the leadership lost event 
> being triggered twice (see [FLINK-34007 PR 
> comment|https://github.com/apache/flink/pull/24132#discussion_r1467175902])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-23 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809858#comment-17809858
 ] 

Yang Wang commented on FLINK-34007:
---

It seems that the fabric8 K8s client community leans towards making the 
{{LeaderElector}} non-restartable. Using the same lock identity should also 
work for us when creating a new leader-elector instance, so your PR looks good 
to me.

 

About the test:

We just need a test guarding that a new leader-elector instance is created 
instead of reusing the existing one, right?
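
For reference, a minimal sketch of the discussed approach (using the plain fabric8 API rather than Flink's {{KubernetesLeaderElector}} wrapper; the ConfigMap name, namespace and identity are placeholders):

{code:java}
// Hedged sketch: build a fresh LeaderElector for every election attempt with the
// same lock identity, instead of calling run() again on a finished instance.
import java.time.Duration;

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock;

public class LeaderElectorSketch {

    static LeaderElector newElector(KubernetesClient client, String lockIdentity) {
        return client.leaderElector()
                .withConfig(new LeaderElectionConfigBuilder()
                        .withName("example-cluster-config-map")   // placeholder lock name
                        .withLock(new ConfigMapLock("default", "example-cluster-config-map", lockIdentity))
                        .withLeaseDuration(Duration.ofSeconds(15))
                        .withRenewDeadline(Duration.ofSeconds(10))
                        .withRetryPeriod(Duration.ofSeconds(2))
                        .withLeaderCallbacks(new LeaderCallbacks(
                                () -> System.out.println("leadership granted"),
                                () -> System.out.println("leadership revoked"),
                                newLeader -> System.out.println("new leader: " + newLeader)))
                        .build())
                .build();
    }

    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            String lockIdentity = "jobmanager-pod-1";  // stays the same across attempts
            newElector(client, lockIdentity).run();    // blocks until leadership is lost
            // After leadership loss, discard the old elector and build a new one
            // with the same lock identity rather than re-running the old instance.
            newElector(client, lockIdentity).run();
        }
    }
}
{code}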

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-18 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808441#comment-17808441
 ] 

Yang Wang commented on FLINK-34007:
---

I also remember that the fabric8 kubernetes-client community is very 
responsive. If the {{LeaderElector}} is designed to only run once (though I do 
not think that is reasonable behavior), then we need to create a new 
{{LeaderElector}} after losing leadership.

For option #3, it might be unnecessary because the {{LeaderElector}} could work 
as expected when a new instance is created with the same lock identity. Such a 
refactoring would be a larger effort without additional benefits.

 

BTW, maybe I am missing some background. [~gyfora] Could you please share why 
we need to change the thread pool size to 3 in {{KubernetesLeaderElector}}?

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34135) A number of ci failures with Access to the path '.../_work/_temp/containerHandlerInvoker.js' is denied.

2024-01-18 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808420#comment-17808420
 ] 

Yang Wang commented on FLINK-34135:
---

The CI should work now since the {{containerHandlerInvoker.js}} and other files 
are recreated after the Azure agents are restarted.

> A number of ci failures with Access to the path 
> '.../_work/_temp/containerHandlerInvoker.js' is denied.
> ---
>
> Key: FLINK-34135
> URL: https://issues.apache.org/jira/browse/FLINK-34135
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Reporter: Sergey Nuyanzin
>Assignee: Jeyhun Karimov
>Priority: Blocker
>  Labels: test-stability
>
> There is a number of builds failing with something like 
> {noformat}
> ##[error]Access to the path 
> '/home/agent03/myagent/_work/_temp/containerHandlerInvoker.js' is denied.
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56490=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=fb588352-ef18-568d-b447-699986250ccb
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56481=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=554d7c3f-d38e-55f4-96b4-ada3a9cb7d6f=9
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56481=logs=fa307d6d-91b1-5ab6-d460-ef50f552b1fe=1798d435-832b-51fe-a9ad-efb9abf4ab04=9
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56481=logs=a1ac4ce4-9a4f-5fdb-3290-7e163fba19dc=e4c57254-ec06-5788-3f8e-5ad5dffb418e=9
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56481=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=56881383-f398-5091-6b3b-22a7eeb7cfa8=9
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56481=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=2d9c27d0-8dbb-5be9-7271-453f74f48ab3=9
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56481=logs=162f98f7-8967-5f47-2782-a1e178ec2ad3=c9934c56-710d-5f85-d2b8-28ec1fd700ed=9



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-17 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807700#comment-17807700
 ] 

Yang Wang commented on FLINK-34007:
---

I agree with you that we could enable {{isReleaseOnCancel}}, which will set 
the holder identity to empty.

 

If this issue also existed in 1.17 and earlier, then we might need to get a 
jstack of the JobManager to see where the {{LeaderElector}} is stuck.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-16 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807538#comment-17807538
 ] 

Yang Wang commented on FLINK-34007:
---

I think you are right that the {{stopped}} flag in {{LeaderElector}} is never 
reset after losing leadership. However, I am afraid that even if we recreate a 
new LeaderElector, it still will not work unless we have a different 
{{HolderIdentity}}. From 
[LeaderElector:232|https://github.com/fabric8io/kubernetes-client/blob/v6.6.2/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L232],
 {{onStartLeading}} only happens when the holder identity changes.
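
For illustration, a paraphrased sketch of that check (simplified; not the actual fabric8 source):

{code:java}
// Paraphrased illustration only; the real fabric8 LeaderElector code is more involved.
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;

final class LeaderChangeCheckSketch {

    // Callbacks fire only when the recorded holder identity actually changes.
    static void maybeFireCallbacks(
            String previousLeader, String currentLeader, String lockIdentity, LeaderCallbacks callbacks) {
        if (currentLeader != null && !currentLeader.equals(previousLeader)) {
            if (currentLeader.equals(lockIdentity)) {
                callbacks.onStartLeading();           // we just became the leader
            } else {
                callbacks.onNewLeader(currentLeader); // somebody else did
            }
        }
        // If a re-created elector reuses the same holder identity as the previous
        // run, no identity change is observed and onStartLeading is skipped.
    }
}
{code}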

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807009#comment-17807009
 ] 

Yang Wang commented on FLINK-34007:
---

{quote}At least based on the reports of this Jira issue, there must have been 
an incident (which caused the lease to not be renewed)
{quote}
I am afraid we cannot draw this conclusion before we have the K8s APIServer 
audit logs to verify that the lease annotation did not get renewed, because it 
could also happen that the lease annotation is renewed normally while the 
onStartLeading callback is somehow not executed. 
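
As a lighter-weight check than the audit log (a sketch assuming the default namespace and an example HA ConfigMap name, which differ per deployment), the current leader record can be read directly from the annotation:

{code:java}
// Hedged sketch: read the leader record (holderIdentity, renewTime, ...) from the
// HA ConfigMap annotation to see whether the lease is still being renewed.
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class LeaderRecordCheck {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            ConfigMap cm = client.configMaps()
                    .inNamespace("default")                          // assumed namespace
                    .withName("example-cluster-cluster-config-map")  // assumed HA ConfigMap name
                    .get();
            String leaderRecord = cm.getMetadata().getAnnotations()
                    .get("control-plane.alpha.kubernetes.io/leader");
            System.out.println(leaderRecord);
        }
    }
}
{code}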

 
{quote}Therefore, the issue should exist in the entire version range [5.12.3, 
6.6.2].
{quote}
If this issue only happens in Flink 1.18, then it should be related to the 
fabric8 K8s client 6.6.2 behavior change. Otherwise, we still have not found 
the root cause.

 

You are right. The slight difference in the revocation protocol introduced by 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box],
 namely the change around clearing the leader information in the ConfigMap, is 
not related to this issue.

 

BTW, if we know how to reproduce this issue, it will be easier to find the root 
cause, because we might also need the K8s APIServer audit log to do some deeper 
analysis.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-24332) Support to mount a dynamically-created persistent volume claim per TaskManager

2024-01-15 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806763#comment-17806763
 ] 

Yang Wang commented on FLINK-24332:
---

We do not need this feature now since Kubernetes already supports ephemeral 
volumes[1].

 

[1]. [https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/]

 

 
{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: taskmanager-pod-template
spec:
  containers:
    # Do not change the main container name
    - name: flink-main-container
      volumeMounts:
        - mountPath: /opt/flink/volumes/ephemeral
          name: ephemeral-volume
        - mountPath: /opt/flink/log
          name: flink-logs
  volumes:
    - name: flink-logs
      emptyDir: { }
    - name: ephemeral-volume
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: [ "ReadWriteOnce" ]
            storageClassName: ""
            resources:
              requests:
                storage: 1Gi{code}
 

 

> Support to mount a dynamically-created persistent volume claim per TaskManager
> --
>
> Key: FLINK-24332
> URL: https://issues.apache.org/jira/browse/FLINK-24332
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Reporter: Yang Wang
>Priority: Major
>
> Pod template could be used to mount a same PVC for all the TaskManagers. 
> However, in many cases, users need to mount dynamically-created persistent 
> volume claim for each TaskManager.
>  
> Refer to 
> [https://lists.apache.org/thread.html/r08ed40ee541c69a74c6d48cc315671198a1910dbd34fd731fe77da37%40%3Cuser.flink.apache.org%3E]
>  for more information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806714#comment-17806714
 ] 

Yang Wang commented on FLINK-34007:
---

Maybe this issue is related to a fabric8 K8s client behavior change. In 
v5.12.4[1], the {{onStartLeading}} callback is always executed when the 
leadership is acquired. But in v6.6.2[2], the {{onStartLeading}} callback is 
executed only when the holder identity changes. In this case, the JobManager 
does not crash, so the holder identity stays the same.

 

[1]. 
[https://github.com/fabric8io/kubernetes-client/blob/v5.12.4/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L71]

[2]. 
[https://github.com/fabric8io/kubernetes-client/blob/v6.6.2/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L232]

 

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-14 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806600#comment-17806600
 ] 

Yang Wang commented on FLINK-34007:
---

Maybe I did not make myself clear. I mean the old leader JM should try to 
remove the {{control-plane.alpha.kubernetes.io/leader}} annotation from the HA 
ConfigMap when it loses leadership. From the fabric8 K8s client impl[1], the 
{{isLeader}} callback will be executed only when the holder identity changes.

 

[1]. 
[https://github.com/fabric8io/kubernetes-client/blob/v6.6.2/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L232]
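
A minimal sketch of that clean-up (assuming the default namespace and an example HA ConfigMap name) with the fabric8 client:

{code:java}
// Hedged sketch: drop the leader annotation from the HA ConfigMap on leadership
// loss so that a subsequent elector run observes a holder identity change.
import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class RemoveLeaderAnnotation {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.configMaps()
                    .inNamespace("default")                          // assumed namespace
                    .withName("example-cluster-cluster-config-map")  // assumed HA ConfigMap name
                    .edit(cm -> new ConfigMapBuilder(cm)
                            .editMetadata()
                            .removeFromAnnotations("control-plane.alpha.kubernetes.io/leader")
                            .endMetadata()
                            .build());
        }
    }
}
{code}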

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail

2024-01-14 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806592#comment-17806592
 ] 

Yang Wang commented on FLINK-33728:
---

It is not only the {{KubernetesResourceManagerDriver}} that creates a new watch 
when receiving {{TooOldResourceVersion}}; the fabric8 K8s client has similar 
logic in {{Reflector.java}}[1], which we use for the Flink Kubernetes HA 
implementation.

 

In my opinion, the K8s APIServer should be able to protect itself by using flow 
control[2]. It will then reject some requests when it cannot keep up, and Flink 
will retry creating a new watch when the previous one fails. What Flink could 
additionally do is use an {{ExponentialBackoffDelayRetryStrategy}} instead of 
the current continuous retry strategy (see the sketch after the references 
below).

 

[1]. 
[https://github.com/fabric8io/kubernetes-client/blob/v6.6.2/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/informers/impl/cache/Reflector.java#L288]

[2]. [https://kubernetes.io/docs/concepts/cluster-administration/flow-control/]
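
For illustration, a minimal, generic backoff sketch (the class name is illustrative and not an existing Flink class):

{code:java}
// Hedged sketch: exponential backoff with an upper bound for watch re-creation,
// as opposed to retrying immediately in a tight loop.
import java.time.Duration;

public final class ExponentialBackoffSketch {
    private final Duration initialDelay;
    private final Duration maxDelay;
    private Duration currentDelay;

    public ExponentialBackoffSketch(Duration initialDelay, Duration maxDelay) {
        this.initialDelay = initialDelay;
        this.maxDelay = maxDelay;
        this.currentDelay = initialDelay;
    }

    /** Delay to wait before the next re-watch attempt; doubles up to the cap. */
    public Duration nextDelay() {
        Duration delay = currentDelay;
        Duration doubled = currentDelay.multipliedBy(2);
        currentDelay = doubled.compareTo(maxDelay) > 0 ? maxDelay : doubled;
        return delay;
    }

    /** Reset once a watch has been re-established successfully. */
    public void reset() {
        currentDelay = initialDelay;
    }
}
{code}

Each failed re-watch attempt would sleep for {{nextDelay()}} before retrying, and {{reset()}} would be called once the watch is healthy again, so thousands of jobs do not hammer a recovering APIServer at the same time.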

 

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Reporter: xiaogang zhou
>Priority: Major
>  Labels: pull-request-available
>
> I met massive production problem when kubernetes ETCD slow responding happen. 
> After Kube recoverd after 1 hour, Thousands of Flink jobs using 
> kubernetesResourceManagerDriver rewatched when recieving 
> ResourceVersionTooOld,  which caused great pressure on API Server and made 
> API server failed again... 
>  
> I am not sure is it necessary to
> getResourceEventHandler().onError(throwable)
> in  PodCallbackHandlerImpl# handleError method?
>  
> We can just neglect the disconnection of watching process. and try to rewatch 
> once new requestResource called. And we can leverage on the akka heartbeat 
> timeout to discover the TM failure, just like YARN mode do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-11 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805871#comment-17805871
 ] 

Yang Wang commented on FLINK-34007:
---

[~mapohl] Could you please confirm whether "multi-component leader election" 
will clean up the leader annotation on the ConfigMap when leadership is lost?

It seems that the fabric8 Kubernetes client leader elector will not work 
properly when {{run()}} is called more than once if we do not clean up the 
leader annotation.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-11 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805487#comment-17805487
 ] 

Yang Wang commented on FLINK-34007:
---

Do you mean the {{KubernetesLeaderElector}} could not obtain the leadership due 
to continuous resource conflicts? I am not sure about this because you only 
shared a single line of the DEBUG log.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33155) Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled

2023-09-25 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768679#comment-17768679
 ] 

Yang Wang commented on FLINK-33155:
---

Thanks for your comments.

 

> Changing the default behavior from file to UGI can be a breaking change to 
> users which are depending on that some way

What I mean is to get the delegation token from the UGI instead of reading it 
from a file, just like we already do in the {{YarnClusterDescriptor}}[1]. I am 
not sure why this would be a breaking change because the tokens in the 
{{ContainerLaunchContext}} are just the same (a sketch follows the reference 
below).

 

[1]. 
[https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L1334]
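
For illustration, a hedged sketch (standard Hadoop/YARN APIs, not the exact Flink code path) of attaching the tokens currently held by the UGI to a container launch context:

{code:java}
// Hedged sketch: serialize the UGI's current (renewed) tokens and attach them to
// the TaskManager's ContainerLaunchContext, instead of re-reading the token file.
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

public final class UgiTokenSketch {

    static void attachCurrentTokens(ContainerLaunchContext ctx) throws IOException {
        // Tokens held by the current UGI are kept up to date by the delegation
        // token framework, unlike the launch-time token file on local disk.
        Credentials credentials = UserGroupInformation.getCurrentUser().getCredentials();
        DataOutputBuffer dob = new DataOutputBuffer();
        credentials.writeTokenStorageToStream(dob);
        ctx.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
    }
}
{code}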

> Flink ResourceManager continuously fails to start TM container on YARN when 
> Kerberos enabled
> 
>
> Key: FLINK-33155
> URL: https://issues.apache.org/jira/browse/FLINK-33155
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Reporter: Yang Wang
>Priority: Major
>
> When Kerberos enabled(with key tab) and after one day(the container token 
> expired), Flink fails to create the TaskManager container on YARN due to the 
> following exception.
>  
> {code:java}
> 2023-09-25 16:48:50,030 INFO  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
> Worker container_1695106898104_0003_01_69 is terminated. Diagnostics: 
> Container container_1695106898104_0003_01_69 was invalid. Diagnostics: 
> [2023-09-25 16:48:45.710]token (token for hadoop: HDFS_DELEGATION_TOKEN 
> owner=hadoop/master-1-1.c-5ee7bdc598b6e1cc.cn-beijing.emr.aliyuncs@emr.c-5ee7bdc598b6e1cc.com,
>  renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, 
> sequenceNumber=12, masterKeyId=3) can't be found in cache
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for hadoop: HDFS_DELEGATION_TOKEN owner=, renewer=, 
> realUser=, issueDate=1695196431487, maxDate=1695801231487, sequenceNumber=12, 
> masterKeyId=3) can't be found in cache
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1545)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1491)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1388)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>     at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:907)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>     at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1666)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
>     at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1588)
>     at 
> org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
>     at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
>     at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
>     at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
>     at 
> 

[jira] [Updated] (FLINK-33155) Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled

2023-09-25 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-33155:
--
Description: 
When Kerberos is enabled (with a keytab) and after one day (once the container 
token has expired), Flink fails to create the TaskManager container on YARN due 
to the following exception.

 
{code:java}
2023-09-25 16:48:50,030 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Worker container_1695106898104_0003_01_69 is terminated. Diagnostics: 
Container container_1695106898104_0003_01_69 was invalid. Diagnostics: 
[2023-09-25 16:48:45.710]token (token for hadoop: HDFS_DELEGATION_TOKEN 
owner=hadoop/master-1-1.c-5ee7bdc598b6e1cc.cn-beijing.emr.aliyuncs@emr.c-5ee7bdc598b6e1cc.com,
 renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, 
sequenceNumber=12, masterKeyId=3) can't be found in cache
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (token for hadoop: HDFS_DELEGATION_TOKEN owner=, renewer=, 
realUser=, issueDate=1695196431487, maxDate=1695801231487, sequenceNumber=12, 
masterKeyId=3) can't be found in cache
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1545)
    at org.apache.hadoop.ipc.Client.call(Client.java:1491)
    at org.apache.hadoop.ipc.Client.call(Client.java:1388)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
    at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:907)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
    at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1666)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
    at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1588)
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750) {code}
The root cause might be that we are reading the delegation token from the JM 
local file[1]. It will expire after one day. When the old TaskManager container 
crashes and the ResourceManager tries to create a new one, the YARN NodeManager 
will use the expired token to localize the resources for the TaskManager and 
then fail.

Instead, we could read the latest valid token from 

[jira] [Updated] (FLINK-33155) Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled

2023-09-25 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-33155:
--
Description: 
When Kerberos is enabled (with a keytab) and after one day (once the container 
token has expired), Flink fails to create the TaskManager container on YARN due 
to the following exception.

 
{code:java}
2023-09-25 16:48:50,030 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Worker container_1695106898104_0003_01_69 is terminated. Diagnostics: 
Container container_1695106898104_0003_01_69 was invalid. Diagnostics: 
[2023-09-25 16:48:45.710]token (token for hadoop: HDFS_DELEGATION_TOKEN 
owner=hadoop/master-1-1.c-5ee7bdc598b6e1cc.cn-beijing.emr.aliyuncs@emr.c-5ee7bdc598b6e1cc.com,
 renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, 
sequenceNumber=12, masterKeyId=3) can't be found in cache
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (token for hadoop: HDFS_DELEGATION_TOKEN owner=, renewer=, 
realUser=, issueDate=1695196431487, maxDate=1695801231487, sequenceNumber=12, 
masterKeyId=3) can't be found in cache
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1545)
    at org.apache.hadoop.ipc.Client.call(Client.java:1491)
    at org.apache.hadoop.ipc.Client.call(Client.java:1388)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
    at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:907)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
    at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1666)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
    at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1588)
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
    at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750) {code}
The root cause might be that we are reading the delegation token from the JM 
local file[1]. It will expire after one day. When the old TaskManager container 
crashes and the ResourceManager tries to create a new one, the YARN NodeManager 
will use the expired token to localize the resources for the TaskManager and 
then fail.

 

[1]. 

[jira] [Updated] (FLINK-33155) Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled

2023-09-25 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-33155:
--
Component/s: Deployment / YARN

> Flink ResourceManager continuously fails to start TM container on YARN when 
> Kerberos enabled
> 
>
> Key: FLINK-33155
> URL: https://issues.apache.org/jira/browse/FLINK-33155
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Reporter: Yang Wang
>Priority: Major
>
> When Kerberos enabled(with key tab) and after one day(the container token 
> expired), Flink fails to create the TaskManager container on YARN due to the 
> following exception.
>  
> {code:java}
> 2023-09-25 16:48:50,030 INFO  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
> Worker container_1695106898104_0003_01_69 is terminated. Diagnostics: 
> Container container_1695106898104_0003_01_69 was invalid. Diagnostics: 
> [2023-09-25 16:48:45.710]token (token for hadoop: HDFS_DELEGATION_TOKEN 
> owner=, renewer=, realUser=, issueDate=1695196431487, 
> maxDate=1695801231487, sequenceNumber=12, masterKeyId=3) can't be found in 
> cacheorg.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for hadoop: HDFS_DELEGATION_TOKEN 
> owner=hadoop/master-1-1.c-5ee7bdc598b6e1cc.cn-beijing.emr.aliyuncs@emr.c-5ee7bdc598b6e1cc.com,
>  renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, 
> sequenceNumber=12, masterKeyId=3) can't be found in cacheat 
> org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1545)at 
> org.apache.hadoop.ipc.Client.call(Client.java:1491)at 
> org.apache.hadoop.ipc.Client.call(Client.java:1388)at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:907)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
>at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
> at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1666)at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1588)
> at 
> org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
> at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)at 
> org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)at 
> java.security.AccessController.doPrivileged(Native Method)at 
> javax.security.auth.Subject.doAs(Subject.java:422)at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at 
> 

[jira] [Created] (FLINK-33155) Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled

2023-09-25 Thread Yang Wang (Jira)
Yang Wang created FLINK-33155:
-

 Summary: Flink ResourceManager continuously fails to start TM 
container on YARN when Kerberos enabled
 Key: FLINK-33155
 URL: https://issues.apache.org/jira/browse/FLINK-33155
 Project: Flink
  Issue Type: Bug
Reporter: Yang Wang


When Kerberos is enabled (with a keytab) and after one day (once the container 
token has expired), Flink fails to create the TaskManager container on YARN due 
to the following exception.

 
{code:java}
2023-09-25 16:48:50,030 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Worker container_1695106898104_0003_01_69 is terminated. Diagnostics: 
Container container_1695106898104_0003_01_69 was invalid. Diagnostics: 
[2023-09-25 16:48:45.710]token (token for hadoop: HDFS_DELEGATION_TOKEN 
owner=, renewer=, realUser=, issueDate=1695196431487, 
maxDate=1695801231487, sequenceNumber=12, masterKeyId=3) can't be found in 
cacheorg.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (token for hadoop: HDFS_DELEGATION_TOKEN 
owner=hadoop/master-1-1.c-5ee7bdc598b6e1cc.cn-beijing.emr.aliyuncs@emr.c-5ee7bdc598b6e1cc.com,
 renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, 
sequenceNumber=12, masterKeyId=3) can't be found in cacheat 
org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1545)at 
org.apache.hadoop.ipc.Client.call(Client.java:1491)at 
org.apache.hadoop.ipc.Client.call(Client.java:1388)at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:907)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)at 
org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1666)at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1588)
at 
org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)at 
org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)at 
org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)at 
org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)at 
java.security.AccessController.doPrivileged(Native Method)at 
javax.security.auth.Subject.doAs(Subject.java:422)at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
   at java.lang.Thread.run(Thread.java:750) {code}
The root cause might be that we are reading the delegation token from the JM 
local file[1]. It will expire after one day. When the old TaskManager container 
crashes and the ResourceManager tries to create a new one, 

[jira] [Commented] (FLINK-32775) Support yarn.provided.lib.dirs to add parent directory to classpath

2023-09-13 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764937#comment-17764937
 ] 

Yang Wang commented on FLINK-32775:
---

Thanks [~argoyal] for your contribution.

> Support yarn.provided.lib.dirs to add parent directory to classpath
> ---
>
> Key: FLINK-32775
> URL: https://issues.apache.org/jira/browse/FLINK-32775
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> Currently with {*}yarn.provided.lib.dirs{*}, Flink libs can be copied to HDFS 
> location in each cluster and when set Flink tries to reuse the same jars 
> avoiding uploading it every time and YARN also caches it in the nodes.
>  
> This works fine with jars but if we try to add the xml file parent directory 
> to path, Flink job submission fails. If I add the parent directory of the xml 
> to the 
> {noformat}
> yarn.ship-files{noformat}
>  Flink job is submitted successfully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-32775) Support yarn.provided.lib.dirs to add parent directory to classpath

2023-09-13 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang closed FLINK-32775.
-
Resolution: Fixed

> Support yarn.provided.lib.dirs to add parent directory to classpath
> ---
>
> Key: FLINK-32775
> URL: https://issues.apache.org/jira/browse/FLINK-32775
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> Currently with {*}yarn.provided.lib.dirs{*}, Flink libs can be copied to HDFS 
> location in each cluster and when set Flink tries to reuse the same jars 
> avoiding uploading it every time and YARN also caches it in the nodes.
>  
> This works fine with jars but if we try to add the xml file parent directory 
> to path, Flink job submission fails. If I add the parent directory of the xml 
> to the 
> {noformat}
> yarn.ship-files{noformat}
>  Flink job is submitted successfully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32775) Support yarn.provided.lib.dirs to add parent directory to classpath

2023-09-13 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32775:
--
Fix Version/s: 1.19.0

> Support yarn.provided.lib.dirs to add parent directory to classpath
> ---
>
> Key: FLINK-32775
> URL: https://issues.apache.org/jira/browse/FLINK-32775
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.19.0
>
>
> Currently with {*}yarn.provided.lib.dirs{*}, Flink libs can be copied to HDFS 
> location in each cluster and when set Flink tries to reuse the same jars 
> avoiding uploading it every time and YARN also caches it in the nodes.
>  
> This works fine with jars but if we try to add the xml file parent directory 
> to path, Flink job submission fails. If I add the parent directory of the xml 
> to the 
> {noformat}
> yarn.ship-files{noformat}
>  Flink job is submitted successfully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32775) Support yarn.provided.lib.dirs to add parent directory to classpath

2023-09-13 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764936#comment-17764936
 ] 

Yang Wang commented on FLINK-32775:
---

Merged in master via 1f9621806451411af26ccbab5c5342ef3308e219.

> Support yarn.provided.lib.dirs to add parent directory to classpath
> ---
>
> Key: FLINK-32775
> URL: https://issues.apache.org/jira/browse/FLINK-32775
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Minor
>  Labels: pull-request-available
>
> Currently with {*}yarn.provided.lib.dirs{*}, Flink libs can be copied to HDFS 
> location in each cluster and when set Flink tries to reuse the same jars 
> avoiding uploading it every time and YARN also caches it in the nodes.
>  
> This works fine with jars but if we try to add the xml file parent directory 
> to path, Flink job submission fails. If I add the parent directory of the xml 
> to the 
> {noformat}
> yarn.ship-files{noformat}
>  Flink job is submitted successfully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-29 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760194#comment-17760194
 ] 

Yang Wang commented on FLINK-32678:
---

I just remembered that we have a {{toString}} in the 
{{KubernetesLeaderElectionDriver}} which prints the leader ConfigMap name. 
Anyway, given that we only have one ConfigMap for the leader election, using 
the object id also makes sense to me.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg, 
> qps-configmap-get-118.png, qps-configmap-patch-118.png, 
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-29 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759570#comment-17759570
 ] 

Yang Wang edited comment on FLINK-32678 at 8/29/23 6:13 AM:


*Stress Test*
Run 1000 Flink jobs with 1 JM and 1 TM each.
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
The Flink JobManager has 4 leader electors (RestServer, ResourceManager, 
Dispatcher, JobManager) that periodically update the K8s ConfigMap. So the QPS 
of {{PUT ConfigMap}} for 1000 jobs will be roughly 800 req/s ≈ 4 (leader 
electors) * 1000 (Flink JobManager pods) / 5 (renew interval). The QPS of 
{{GET ConfigMap}} is twice that of {{PUT}}.

2. Flink version 1.17.1 (same as 1.15.4 with 
{{high-availability.use-old-ha-services=false}})
Flink only has one shared leader elector. So the QPS of {{PUT ConfigMap}} for 
1000 jobs will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink 
JobManager pods) / 5 (renew interval). The QPS of {{GET ConfigMap}} is twice 
that of {{PUT}}.
 
3. Flink version 1.18-snapshot
Flink only has one shared leader elector. So the QPS of {{PATCH ConfigMap}} for 
1000 jobs will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink 
JobManager pods) / 5 (renew interval). The QPS of {{GET ConfigMap}} is the same 
as {{PATCH}}.
 
!qps-configmap-put-115.png|width=694,height=176!
!qps-configmap-put-117.jpg|width=694,height=176!
!qps-configmap-patch-118.png|width=694,height=176!

From the above pictures, we can verify that the new leader elector in 1.18 only 
sends a quarter of the write requests of the old one in 1.15 to the K8s 
APIServer. This significantly reduces the stress on the K8s APIServer.

 

!qps-configmap-get-115.png|width=694,height=176!
!qps-configmap-get-117.jpg|width=694,height=176!
!qps-configmap-get-118.png|width=694,height=176!

We can also see that the read requests in 1.18 are half of those in 1.17. The 
root cause is that fabric8 6.6.2 (FLINK-31997) introduced the PATCH HTTP method 
for updating the leader annotation, which saves a GET request for each update.

 
||Flink Version||PUT/PATCH QPS||GET QPS||
|1.15.4 with old HA|800|1600|
|1.17.1|200|400|
|1.18.0|200|200|

 

All in all, Flink 1.18 puts less stress on the K8s APIServer while all 1000 Flink jobs
run normally as before.
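
For reference, here is a minimal Java sketch of the arithmetic behind the table above. The class, method names, and constants are purely illustrative and describe this test setup only; this is not Flink or fabric8 code.
{code:java}
// Back-of-the-envelope estimate of the ConfigMap request rate on the K8s APIServer,
// mirroring the numbers above (1000 JobManager pods, 5 s renew interval).
public class LeaderElectionQpsEstimate {

    // Each leader elector issues one write (PUT or PATCH) per renew interval.
    static double writeQps(int electorsPerJobManager, int jobManagerPods, int renewIntervalSeconds) {
        return (double) electorsPerJobManager * jobManagerPods / renewIntervalSeconds;
    }

    public static void main(String[] args) {
        int pods = 1000;       // 1000 jobs, one JobManager each
        int renewInterval = 5; // seconds

        double put115 = writeQps(4, pods, renewInterval);   // old HA services: 4 electors per JM
        double put117 = writeQps(1, pods, renewInterval);   // single shared leader elector
        double patch118 = writeQps(1, pods, renewInterval); // single shared elector, PATCH-based

        // GET rates as observed in the test: roughly twice the write rate for the
        // replace-based updates (1.15/1.17), and equal to the PATCH rate in 1.18
        // because the PATCH path saves a GET per update.
        System.out.printf("1.15.4 (old HA): PUT   %.0f req/s, GET %.0f req/s%n", put115, 2 * put115);
        System.out.printf("1.17.1         : PUT   %.0f req/s, GET %.0f req/s%n", put117, 2 * put117);
        System.out.printf("1.18.0         : PATCH %.0f req/s, GET %.0f req/s%n", patch118, patch118);
    }
}
{code}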


was (Author: fly_in_gis):
*Stress Test*
Run 1000 Flink Jobs with 1 JM and 1 TM for each
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
Flink JobManager has 4 leader electors(RestServer, ResourceManager, Dispatcher, 
JobManager) to periodically update the K8s ConfigMap. So the QPS of {{PUT 
ConfigMap}}  for 1000 jobs will be roughly 800 req/s ≈ 4(leader elector) * 
1000(Flink JobManager pods) / 5(renew interval). The the QPS of {{GET 
ConfigMap}} is twice as much as {{{}PUT{}}}.

2. Flink version 1.17.1(same as 1.15.4 with 
{{{}high-availability.use-old-ha-services=false{}}})
Flink will only have one shared leader elector. So the QPS of {{PUT ConfigMap}} 
 for 1000 jobs will be roughly 200 req/s ≈ 1(leader elector) * 1000(Flink 
JobManager pods) / 5(renew interval). The the QPS of {{GET ConfigMap}} is twice 
as much as {{{}PUT{}}}.
 
3. Flink version 1.18-snapshot
Flink will only have one shared leader elector. So the QPS of {{PATCH 
ConfigMap}}  for 1000 jobs will be roughly 200 req/s ≈ 1(leader elector) * 
1000(Flink JobManager pods) / 5(renew interval). The the QPS of {{GET 
ConfigMap}} is same as {{{}PATCH{}}}.
 
!qps-configmap-put-115.png|width=694,height=176!
!qps-configmap-put-117.jpg|width=694,height=176!
!qps-configmap-patch-118.png|width=694,height=176!

>From the above two pictures, we could verify that the new leader elector in 
>1.18 only sends a quarter of the write requests of the old one in 1.15 on the 
>K8s APIServer. It will significantly reduce the stress on the K8s APIServer.

 

!qps-configmap-get-115.png|width=694,height=176!
!qps-configmap-get-117.jpg|width=694,height=176!
!qps-configmap-get-118.png|width=694,height=176!

We could also find that the read requests of 1.18 are half of the 1.17. The 
root cause is fabric8 6.6.2(FLINK-31997) has introduced the PATCH http method 
for updating the leader annotation. It will save a GET request for each update.

 
||Flink Version||PUT/PATCH QPS||GET QPS||
|1.15.4 with old HA|800|1600|
|1.17.1|200|400|
|1.18.0|200|200|

 

All in all, the Flink 1.18 takes less stress on the K8s APIServer while all the 
1000 Flink jobs run normally as before.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0

[jira] [Comment Edited] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759570#comment-17759570
 ] 

Yang Wang edited comment on FLINK-32678 at 8/29/23 5:20 AM:


*Stress Test*
Run 1000 Flink Jobs with 1 JM and 1 TM for each
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
The Flink JobManager has 4 leader electors (RestServer, ResourceManager, Dispatcher,
JobManager) that periodically update the K8s ConfigMap. So the QPS of {{PUT ConfigMap}}
for 1000 jobs will be roughly 800 req/s ≈ 4 (leader electors) * 1000 (Flink JobManager
pods) / 5 (renew interval in seconds). The QPS of {{GET ConfigMap}} is twice that of
{{PUT}}.

2. Flink version 1.17.1 (same as 1.15.4 with
{{high-availability.use-old-ha-services=false}})
Flink only has one shared leader elector. So the QPS of {{PUT ConfigMap}} for 1000 jobs
will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 (renew
interval in seconds). The QPS of {{GET ConfigMap}} is twice that of {{PUT}}.

3. Flink version 1.18-snapshot
Flink only has one shared leader elector. So the QPS of {{PATCH ConfigMap}} for 1000 jobs
will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 (renew
interval in seconds). The QPS of {{GET ConfigMap}} is the same as that of {{PATCH}}.
 
!qps-configmap-put-115.png|width=694,height=176!
!qps-configmap-put-117.jpg|width=694,height=176!
!qps-configmap-patch-118.png|width=694,height=176!

From the above pictures, we can verify that the new leader elector in 1.18 sends only a
quarter of the write requests of the old one in 1.15 to the K8s APIServer. This
significantly reduces the stress on the K8s APIServer.

 

!qps-configmap-get-115.png|width=694,height=176!
!qps-configmap-get-117.jpg|width=694,height=176!
!qps-configmap-get-118.png|width=694,height=176!

We can also see that the read requests in 1.18 are half of those in 1.17. The root cause
is that fabric8 6.6.2 (FLINK-31997) introduced the PATCH HTTP method for updating the
leader annotation, which saves a GET request per update.

 
||Flink Version||PUT/PATCH QPS||GET QPS||
|1.15.4 with old HA|800|1600|
|1.17.1|200|400|
|1.18.0|200|200|

 

All in all, Flink 1.18 puts less stress on the K8s APIServer while all 1000 Flink jobs
run normally as before.


was (Author: fly_in_gis):
*Stress Test*
Run 1000 Flink Jobs with 1 JM and 1 TM for each
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
Flink JobManager has 4 leader electors(RestServer, ResourceManager, Dispatcher, 
JobManager) to periodically update the K8s ConfigMap. So the QPS of {{PUT 
ConfigMap}}  for 1000 jobs will be roughly 800 req/s ≈ 4(leader elector) * 
1000(Flink JobManager pods) / 5(renew interval). The the QPS of {{GET 
ConfigMap}} is twice as much as {{{}PUT{}}}.

2. Flink version 1.17.1(same as 1.15.4 with 
{{{}high-availability.use-old-ha-services=false{}}})
Flink will only have one shared leader elector. So the QPS of {{PUT ConfigMap}} 
 for 1000 jobs will be roughly 200 req/s ≈ 1(leader elector) * 1000(Flink 
JobManager pods) / 5(renew interval). The the QPS of {{GET ConfigMap}} is twice 
as much as {{{}PUT{}}}.
 
3. Flink version 1.18-snapshot
Flink will only have one shared leader elector. So the QPS of {{PATCH 
ConfigMap}}  for 1000 jobs will be roughly 200 req/s ≈ 1(leader elector) * 
1000(Flink JobManager pods) / 5(renew interval). The the QPS of {{GET 
ConfigMap}} is same as {{{}PATCH{}}}.
 
!qps-configmap-put-115.png|width=694,height=176!
!qps-configmap-put-117.jpg|width=694,height=176!
!qps-configmap-patch-118.png|width=694,height=176!

>From the above two pictures, we could verify that the new leader elector in 
>1.18 only sends a quarter of the write requests of the old one in 1.15 on the 
>K8s APIServer. It will significantly reduce the stress on the K8s APIServer.

 

!qps-configmap-get-115.png|width=694,height=176!
!qps-configmap-get-117.jpg|width=694,height=176!
!qps-configmap-get-118.png|width=694,height=176!

We could also find that the read requests are half of the 1.17. The root cause 
is fabric8 6.6.2(FLINK-31997) has introduced the PATCH http method for updating 
the leader annotation. It will save a GET request for each update.

 
||Flink Version||PUT/PATCH QPS||GET QPS||
|1.15.4 with old HA|800|1600|
|1.17.1|200|400|
|1.18.0|200|200|

 

All in all, the Flink 1.18 takes less stress on the K8s APIServer while all the 
1000 Flink jobs run normally as before.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>  

[jira] [Comment Edited] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759570#comment-17759570
 ] 

Yang Wang edited comment on FLINK-32678 at 8/29/23 5:20 AM:


*Stress Test*
Run 1000 Flink Jobs with 1 JM and 1 TM for each
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
The Flink JobManager has 4 leader electors (RestServer, ResourceManager, Dispatcher,
JobManager) that periodically update the K8s ConfigMap. So the QPS of {{PUT ConfigMap}}
for 1000 jobs will be roughly 800 req/s ≈ 4 (leader electors) * 1000 (Flink JobManager
pods) / 5 (renew interval in seconds). The QPS of {{GET ConfigMap}} is twice that of
{{PUT}}.

2. Flink version 1.17.1 (same as 1.15.4 with
{{high-availability.use-old-ha-services=false}})
Flink only has one shared leader elector. So the QPS of {{PUT ConfigMap}} for 1000 jobs
will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 (renew
interval in seconds). The QPS of {{GET ConfigMap}} is twice that of {{PUT}}.

3. Flink version 1.18-snapshot
Flink only has one shared leader elector. So the QPS of {{PATCH ConfigMap}} for 1000 jobs
will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 (renew
interval in seconds). The QPS of {{GET ConfigMap}} is the same as that of {{PATCH}}.
 
!qps-configmap-put-115.png|width=694,height=176!
!qps-configmap-put-117.jpg|width=694,height=176!
!qps-configmap-patch-118.png|width=694,height=176!

From the above pictures, we can verify that the new leader elector in 1.18 sends only a
quarter of the write requests of the old one in 1.15 to the K8s APIServer. This
significantly reduces the stress on the K8s APIServer.

 

!qps-configmap-get-115.png|width=694,height=176!
!qps-configmap-get-117.jpg|width=694,height=176!
!qps-configmap-get-118.png|width=694,height=176!

We could also find that the read requests are half of the 1.17. The root cause 
is fabric8 6.6.2(FLINK-31997) has introduced the PATCH http method for updating 
the leader annotation. It will save a GET request for each update.

 
||Flink Version||PUT/PATCH QPS||GET QPS||
|1.15.4 with old HA|800|1600|
|1.17.1|200|400|
|1.18.0|200|200|

 

All in all, Flink 1.18 puts less stress on the K8s APIServer while all 1000 Flink jobs
run normally as before.


was (Author: fly_in_gis):
*Stress Test*
Run 1000 Flink Jobs with 1 JM and 1 TM for each
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
Flink JobManager has 4 leader electors(RestServer, ResourceManager, Dispatcher, 
JobManager) to periodically update the K8s ConfigMap. So the QPS of {{PUT 
ConfigMap}}  for 1000 jobs will be roughly 800 req/s ≈ 4(leader elector) * 
1000(Flink JobManager pods) / 5(renew interval). The the QPS of {{GET 
ConfigMap}} is twice as much as {{PUT}}.

2. Flink version 1.17.1(same as 1.15.4 with 
{{high-availability.use-old-ha-services=false}})
Flink will only have one shared leader elector. So the QPS of {{PUT ConfigMap}} 
 for 1000 jobs will be roughly 200 req/s ≈ 1(leader elector) * 1000(Flink 
JobManager pods) / 5(renew interval). The the QPS of {{GET ConfigMap}} is twice 
as much as {{PUT}}.
 
3. Flink version 1.18-snapshot
Flink will only have one shared leader elector. So the QPS of {{PATCH 
ConfigMap}}  for 1000 jobs will be roughly 200 req/s ≈ 1(leader elector) * 
1000(Flink JobManager pods) / 5(renew interval). The the QPS of {{GET 
ConfigMap}} is same as {{PATCH}}.
 
!qps-configmap-put-115.png|width=694,height=176!
!qps-configmap-put-117.png|width=694,height=176!
!qps-configmap-patch-118.png|width=694,height=176!

>From the above two pictures, we could verify that the new leader elector in 
>1.18 only sends a quarter of the write requests of the old one in 1.15 on the 
>K8s APIServer. It will significantly reduce the stress on the K8s APIServer.

 

!qps-configmap-get-115.png|width=694,height=176!
!qps-configmap-get-117.png|width=694,height=176!
!qps-configmap-get-118.png|width=694,height=176!

We could also find that the read requests are half of the 1.17. The root cause 
is fabric8 6.6.2(FLINK-31997) has introduced the PATCH http method for updating 
the leader annotation. It will save a GET request for each update.

 

All in all, the Flink 1.18 takes less stress on the K8s APIServer while all the 
1000 Flink jobs run normally as before.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> 

[jira] [Comment Edited] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759570#comment-17759570
 ] 

Yang Wang edited comment on FLINK-32678 at 8/29/23 5:15 AM:


*Stress Test*
Run 1000 Flink Jobs with 1 JM and 1 TM for each
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
The Flink JobManager has 4 leader electors (RestServer, ResourceManager, Dispatcher,
JobManager) that periodically update the K8s ConfigMap. So the QPS of {{PUT ConfigMap}}
for 1000 jobs will be roughly 800 req/s ≈ 4 (leader electors) * 1000 (Flink JobManager
pods) / 5 (renew interval in seconds). The QPS of {{GET ConfigMap}} is twice that of
{{PUT}}.

2. Flink version 1.17.1 (same as 1.15.4 with
{{high-availability.use-old-ha-services=false}})
Flink only has one shared leader elector. So the QPS of {{PUT ConfigMap}} for 1000 jobs
will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 (renew
interval in seconds). The QPS of {{GET ConfigMap}} is twice that of {{PUT}}.

3. Flink version 1.18-snapshot
Flink only has one shared leader elector. So the QPS of {{PATCH ConfigMap}} for 1000 jobs
will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 (renew
interval in seconds). The QPS of {{GET ConfigMap}} is the same as that of {{PATCH}}.
 
!qps-configmap-put-115.png|width=694,height=176!
!qps-configmap-put-117.png|width=694,height=176!
!qps-configmap-patch-118.png|width=694,height=176!

From the above pictures, we can verify that the new leader elector in 1.18 sends only a
quarter of the write requests of the old one in 1.15 to the K8s APIServer. This
significantly reduces the stress on the K8s APIServer.

 

!qps-configmap-get-115.png|width=694,height=176!
!qps-configmap-get-117.png|width=694,height=176!
!qps-configmap-get-118.png|width=694,height=176!

We can also see that the read requests are half of those in 1.17. The root cause is that
fabric8 6.6.2 (FLINK-31997) introduced the PATCH HTTP method for updating the leader
annotation, which saves a GET request per update.

 

All in all, Flink 1.18 puts less stress on the K8s APIServer while all 1000 Flink jobs
run normally as before.


was (Author: fly_in_gis):
*Stress Test*
Run 1000 Flink Jobs with 1 JM and 1 TM for each
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
Flink JobManager has 4 leader electors(RestServer, ResourceManager, Dispatcher, 
JobManager) to periodically update the K8s ConfigMap. So the QPS of {{PUT 
ConfigMap}}  for 1000 jobs will be roughly 800 req/s ≈ 4(leader elector) * 
1000(Flink JobManager pods) / 5(renew interval).
 
2. Flink version 1.18-snapshot
Flink will only have one shared leader elector. So the QPS of {{PATCH 
ConfigMap}}  for 1000 jobs will be roughly 200 req/s ≈ 1(leader elector) * 
1000(Flink JobManager pods) / 5(renew interval).
 
!qos-configmap-put-115.png|width=694,height=176!
!qos-configmap-patch-118.png|width=694,height=176!

>From the above two pictures, we could verify that the new leader elector in 
>1.18 only sends a quarter of the write requests of the old one in 1.15 on the 
>K8s APIServer. It will significantly reduce the stress on the K8s APIServer.

 

!qos-configmap-get-115.png|width=694,height=176!

!qos-configmap-get-118.png|width=694,height=176!

We also find that the read requests are only 1/8 of the old one. The root cause 
is fabric8 6.6.2(FLINK-31997) has introduced the PATCH http method for updating 
the leader annotation. It will save a GET request for each update.

 

All in all, the Flink 1.18 takes less stress on the K8s APIServer while all the 
1000 Flink jobs run normally as before.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg, 
> qps-configmap-get-118.png, qps-configmap-patch-118.png, 
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * 

[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: qos-configmap-get-115.png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg, 
> qps-configmap-get-118.png, qps-configmap-patch-118.png, 
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: qos-configmap-put-115.png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg, 
> qps-configmap-get-118.png, qps-configmap-patch-118.png, 
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: qps-configmap-get-118.png
qps-configmap-get-115.png
qps-configmap-put-115.png
qps-configmap-get-117.jpg
qps-configmap-put-117.jpg
qps-configmap-patch-118.png

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg, 
> qps-configmap-get-118.png, qps-configmap-patch-118.png, 
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: qos-configmap-patch-118.png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg, 
> qps-configmap-get-118.png, qps-configmap-patch-118.png, 
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: qos-configmap-get-118.png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg, 
> qps-configmap-get-118.png, qps-configmap-patch-118.png, 
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759571#comment-17759571
 ] 

Yang Wang commented on FLINK-32678:
---

[~mapohl] Please ping me if you believe we still need more other tests on this 
ticket.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115.png, qos-configmap-get-118.png, 
> qos-configmap-patch-118.png, qos-configmap-put-115.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759570#comment-17759570
 ] 

Yang Wang commented on FLINK-32678:
---

*Stress Test*
Run 1000 Flink Jobs with 1 JM and 1 TM for each
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
The Flink JobManager has 4 leader electors (RestServer, ResourceManager, Dispatcher,
JobManager) that periodically update the K8s ConfigMap. So the QPS of {{PUT ConfigMap}}
for 1000 jobs will be roughly 800 req/s ≈ 4 (leader electors) * 1000 (Flink JobManager
pods) / 5 (renew interval in seconds).

2. Flink version 1.18-snapshot
Flink only has one shared leader elector. So the QPS of {{PATCH ConfigMap}} for 1000 jobs
will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 (renew
interval in seconds).
 
!qos-configmap-put-115.png|width=694,height=176!
!qos-configmap-patch-118.png|width=694,height=176!

From the above two pictures, we can verify that the new leader elector in 1.18 sends only
a quarter of the write requests of the old one in 1.15 to the K8s APIServer. This
significantly reduces the stress on the K8s APIServer.

 

!qos-configmap-get-115.png|width=694,height=176!

!qos-configmap-get-118.png|width=694,height=176!

We also find that the read requests are only 1/8 of the old one. The root cause is that
fabric8 6.6.2 (FLINK-31997) introduced the PATCH HTTP method for updating the leader
annotation, which saves a GET request per update.

 

All in all, Flink 1.18 puts less stress on the K8s APIServer while all 1000 Flink jobs
run normally as before.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115.png, qos-configmap-get-118.png, 
> qos-configmap-patch-118.png, qos-configmap-put-115.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: qos-configmap-put-115.png

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115.png, qos-configmap-get-118.png, 
> qos-configmap-patch-118.png, qos-configmap-put-115.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: qos-configmap-get-118.png

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115.png, qos-configmap-get-118.png, 
> qos-configmap-patch-118.png, qos-configmap-put-115.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: qos-configmap-get-115-1.png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115.png, qos-configmap-patch-118.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: qos-configmap-get-115-1.png

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115-1.png, qos-configmap-get-115.png, 
> qos-configmap-patch-118.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: qos-configmap-patch-118.png

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115.png, qos-configmap-patch-118.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: qos-configmap-get-115.png

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: image (1).png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: image (2).png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: image.png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: (was: image (3).png)

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-32678:
--
Attachment: image.png
image (1).png
image (2).png
image (3).png

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: image (1).png, image (2).png, image (3).png, image.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-27 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759429#comment-17759429
 ] 

Yang Wang edited comment on FLINK-32678 at 8/28/23 4:04 AM:


*Functionality Test*
1. [SUCCEED] Build the docker image with release-1.18 branch
2. [SUCCEED] Use the flink-k8s-operator to start a Flink app with HA enabled, 
check the logs, UI
3. [SUCCEED] Check HA ConfigMaps, one for leader election and one for the job 
checkpoint
4. [SUCCEED] Check the thread dump of the JobManager and verify that only one leader
elector is running (the value is 4 in 1.15 with the old HA services)
5. [SUCCEED] Use the command {{kubectl exec flink-example-statemachine-897cb6d4f-bzdv5
-- /bin/sh -c 'kill 1'}} to kill the JobManager and verify that no additional
TaskManager is created (Flink should reuse the existing TaskManagers before the idle
timeout).
6. [SUCCEED] Verify that the Flink job recovers from the latest checkpoint and keeps
running
2023-08-28 03:40:29,167 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 
596bdc6b7ac5bcefb611c3df08d64520 from Checkpoint 101 @ 1693194000259 for 
596bdc6b7ac5bcefb611c3df08d64520 located at 
oss://flink-test/flink-k8s-ha-stress-test/flink-cp/596bdc6b7ac5bcefb611c3df08d64520/chk-101.
 
 
Everything works well after the refactoring of leader election, Akka, and flink-shaded.
I just found a log line that could be improved by replacing the object id with a more
meaningful name.
2023-08-28 03:40:18,258 INFO 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
LeaderContender has been registered under component 'resourcemanager' for 
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver@2a19a0fe.
 

 

I am still working on the stress test and will share the result later today.
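
As an illustration of the log-naming suggestion above, here is a hypothetical sketch; the classes below are made up for the example and are not the actual Flink types.
{code:java}
// Sketch of the suggested improvement: override toString() on the leader-election
// driver so log lines show a readable identity (e.g. the leader ConfigMap name)
// instead of the default "ClassName@2a19a0fe" object id.
public class DriverToStringSketch {

    static final class KubernetesDriver {
        private final String leaderConfigMapName;

        KubernetesDriver(String leaderConfigMapName) {
            this.leaderConfigMapName = leaderConfigMapName;
        }

        @Override
        public String toString() {
            // A meaningful name for log messages instead of the object id.
            return "KubernetesDriver{leaderConfigMap='" + leaderConfigMapName + "'}";
        }
    }

    public static void main(String[] args) {
        KubernetesDriver driver = new KubernetesDriver("flink-example-statemachine-cluster-config-map");
        // Logged as: ... registered under component 'resourcemanager' for
        // KubernetesDriver{leaderConfigMap='flink-example-statemachine-cluster-config-map'}
        System.out.println(
                "LeaderContender has been registered under component 'resourcemanager' for " + driver);
    }
}
{code}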


was (Author: fly_in_gis):
# Functionality Test
1. [SUCCEED] Build the docker image with release-1.18 branch
2. [SUCCEED] Use the flink-k8s-operator to start a Flink app with HA enabled, 
check the logs, UI
3. [SUCCEED] Check HA ConfigMaps, one for leader election and one for the job 
checkpoint
4. [SUCCEED] Check the thread dump of the JobManager and verify only one leader 
elector is running(the value is 4 before 1.15 with old HA)
5. [SUCCEED] Use the command {{kubectl exec 
flink-example-statemachine-897cb6d4f-bzdv5 -- /bin/sh -c 'kill 1'}}  to kill 
the JobManager and verify no more TaskManager is created(Flink should reuse the 
existing TaskManager before idle timeout).
6. [SUCCEED] Verify the Flink job recover from the latest checkpoint and keep 
running
2023-08-28 03:40:29,167 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Restoring job 
596bdc6b7ac5bcefb611c3df08d64520 from Checkpoint 101 @ 1693194000259 for 
596bdc6b7ac5bcefb611c3df08d64520 located at 
oss://flink-test/flink-k8s-ha-stress-test/flink-cp/596bdc6b7ac5bcefb611c3df08d64520/chk-101.
 
 
All the things work well after refactoring of leader-election, akka, and 
flink-shaded. I just find a log that could be improved by replacing the object 
id with some more meaningful name.
2023-08-28 03:40:18,258 INFO  
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
LeaderContender has been registered under component 'resourcemanager' for 
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver@2a19a0fe.
 

 

I am still working on the stress test and will share the result later today.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

2023-08-27 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759429#comment-17759429
 ] 

Yang Wang commented on FLINK-32678:
---

# Functionality Test
1. [SUCCEED] Build the docker image with release-1.18 branch
2. [SUCCEED] Use the flink-k8s-operator to start a Flink app with HA enabled, 
check the logs, UI
3. [SUCCEED] Check HA ConfigMaps, one for leader election and one for the job 
checkpoint
4. [SUCCEED] Check the thread dump of the JobManager and verify that only one leader
elector is running (the value is 4 in 1.15 with the old HA services)
5. [SUCCEED] Use the command {{kubectl exec flink-example-statemachine-897cb6d4f-bzdv5
-- /bin/sh -c 'kill 1'}} to kill the JobManager and verify that no additional
TaskManager is created (Flink should reuse the existing TaskManagers before the idle
timeout).
6. [SUCCEED] Verify that the Flink job recovers from the latest checkpoint and keeps
running
2023-08-28 03:40:29,167 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Restoring job 
596bdc6b7ac5bcefb611c3df08d64520 from Checkpoint 101 @ 1693194000259 for 
596bdc6b7ac5bcefb611c3df08d64520 located at 
oss://flink-test/flink-k8s-ha-stress-test/flink-cp/596bdc6b7ac5bcefb611c3df08d64520/chk-101.
 
 
Everything works well after the refactoring of leader election, Akka, and flink-shaded.
I just found a log line that could be improved by replacing the object id with a more
meaningful name.
2023-08-28 03:40:18,258 INFO  
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
LeaderContender has been registered under component 'resourcemanager' for 
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver@2a19a0fe.
 

 

I am still working on the stress test and will share the result later today.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Major
>  Labels: release-testing
> Fix For: 1.18.0
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a bigger amount of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32678) Test FLIP-285 LeaderElection

2023-07-31 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17749096#comment-17749096
 ] 

Yang Wang commented on FLINK-32678:
---

I believe it is reasonable. I would also keep an eye on the Netty- and 
Pekko-related logs during the stress test.

> Test FLIP-285 LeaderElection
> 
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Coordination
>Affects Versions: 1.18.0
>Reporter: Matthias Pohl
>Assignee: Yang Wang
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.18.0
>
>
> We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-31947) Enable stdout redirect in flink-console.sh

2023-06-29 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang closed FLINK-31947.
-
Resolution: Won't Fix

> Enable stdout redirect in flink-console.sh
> --
>
> Key: FLINK-31947
> URL: https://issues.apache.org/jira/browse/FLINK-31947
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-docker
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available
>
>  flink-console.sh is used by the Flink Kubernetes scripts to start containers. But 
> there is no stdout redirect as in flink-daemon.sh. As a result, when the user 
> wants to access stdout from the web UI, no file is found.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31947) Enable stdout redirect in flink-console.sh

2023-06-29 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738539#comment-17738539
 ] 

Yang Wang commented on FLINK-31947:
---

Already covered by FLINK-31234.

> Enable stdout redirect in flink-console.sh
> --
>
> Key: FLINK-31947
> URL: https://issues.apache.org/jira/browse/FLINK-31947
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-docker
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available
>
>  flink-console.sh is used by the Flink Kubernetes scripts to start containers. But 
> there is no stdout redirect as in flink-daemon.sh. As a result, when the user 
> wants to access stdout from the web UI, no file is found.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31234) Add an option to redirect stdout/stderr for flink on kubernetes

2023-06-29 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738537#comment-17738537
 ] 

Yang Wang commented on FLINK-31234:
---

Thanks [~huwh] for your contribution.

> Add an option to redirect stdout/stderr for flink on kubernetes
> ---
>
> Key: FLINK-31234
> URL: https://issues.apache.org/jira/browse/FLINK-31234
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Affects Versions: 1.17.0
>Reporter: Weihua Hu
>Assignee: Weihua Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> Flink on Kubernetes does not support redirecting stdout/stderr to files. This 
> is to allow users to get logs via "kubectl logs".
> But for our internal scenario, we use a kubernetes user to submit all jobs to 
> the k8s cluster and provide a platform for users to submit jobs. Users can't 
> access Kubernetes directly, so we need to display logs/stdout in the Flink web UI.
> Because the web ui retrieves the stdout file by filename, which has the same 
> prefix as \{taskmanager}.log (such as 
> flink--kubernetes-taskmanager-0-my-first-flink-cluster-taskmanager-1-4.log). 
> We can't support this with a simple custom image.
> IMO, we should add an option to redirect stdout/stderr to files. When this 
> setting is enabled.
> 1. flink-console.sh will redirect stdout/err to files.
> 2. Avoid duplicate logs in the log file and the stdout file. We could do this by 
> using ThresholdFilter 
> ([log4j|https://logging.apache.org/log4j/2.x/manual/filters.html#ThresholdFilter]
>  and [logback|https://logback.qos.ch/manual/filters.html#thresholdFilter]) 
> with a system property.
> Of course, this option is false by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31234) Add an option to redirect stdout/stderr for flink on kubernetes

2023-06-29 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738532#comment-17738532
 ] 

Yang Wang commented on FLINK-31234:
---

Fixed via:

master: dfe6bdad0b9493eb35b98ac8326a097a945badf2

> Add an option to redirect stdout/stderr for flink on kubernetes
> ---
>
> Key: FLINK-31234
> URL: https://issues.apache.org/jira/browse/FLINK-31234
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Affects Versions: 1.17.0
>Reporter: Weihua Hu
>Assignee: Weihua Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> Flink on Kubernetes does not support redirecting stdout/stderr to files. This 
> is to allow users to get logs via "kubectl logs".
> But for our internal scenario, we use a kubernetes user to submit all jobs to 
> the k8s cluster and provide a platform for users to submit jobs. Users can't 
> access Kubernetes directly, so we need to display logs/stdout in the Flink web UI.
> Because the web ui retrieves the stdout file by filename, which has the same 
> prefix as \{taskmanager}.log (such as 
> flink--kubernetes-taskmanager-0-my-first-flink-cluster-taskmanager-1-4.log). 
> We can't support this with a simple custom image.
> IMO, we should add an option to redirect stdout/stderr to files. When this 
> setting is enabled.
> 1. flink-console.sh will redirect stdout/err to files.
> 2. Avoid duplicate logs in the log file and the stdout file. We could do this by 
> using ThresholdFilter 
> ([log4j|https://logging.apache.org/log4j/2.x/manual/filters.html#ThresholdFilter]
>  and [logback|https://logback.qos.ch/manual/filters.html#thresholdFilter]) 
> with a system property.
> Of course, this option is false by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-31234) Add an option to redirect stdout/stderr for flink on kubernetes

2023-06-29 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang closed FLINK-31234.
-
Resolution: Fixed

> Add an option to redirect stdout/stderr for flink on kubernetes
> ---
>
> Key: FLINK-31234
> URL: https://issues.apache.org/jira/browse/FLINK-31234
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Affects Versions: 1.17.0
>Reporter: Weihua Hu
>Assignee: Weihua Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> Flink on Kubernetes does not support redirecting stdout/stderr to files. This 
> is to allow users to get logs via "kubectl logs".
> But for our internal scenario, we use a kubernetes user to submit all jobs to 
> the k8s cluster and provide a platform for users to submit jobs. Users can't 
> access Kubernetes directly, so we need to display logs/stdout in the Flink web UI.
> Because the web ui retrieves the stdout file by filename, which has the same 
> prefix as \{taskmanager}.log (such as 
> flink--kubernetes-taskmanager-0-my-first-flink-cluster-taskmanager-1-4.log). 
> We can't support this with a simple custom image.
> IMO, we should add an option to redirect stdout/stderr to files. When this 
> setting is enabled.
> 1. flink-console.sh will redirect stdout/err to files.
> 2. Avoid duplicate logs in the log file and the stdout file. We could do this by 
> using ThresholdFilter 
> ([log4j|https://logging.apache.org/log4j/2.x/manual/filters.html#ThresholdFilter]
>  and [logback|https://logback.qos.ch/manual/filters.html#thresholdFilter]) 
> with a system property.
> Of course, this option is false by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-31983) Add yarn acls capability to flink containers

2023-06-27 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang closed FLINK-31983.
-
Resolution: Fixed

> Add yarn acls capability to flink containers
> 
>
> Key: FLINK-31983
> URL: https://issues.apache.org/jira/browse/FLINK-31983
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> YARN provides an application ACLs mechanism to grant specific rights to users 
> other than the one running the job (view logs through the 
> ResourceManager/job history, kill the application).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31983) Add yarn acls capability to flink containers

2023-06-27 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737929#comment-17737929
 ] 

Yang Wang commented on FLINK-31983:
---

Fixed via:

master: c5acd8dd800dfcd2c8873c569d0028fc7d991b1c

 

Thanks [~argoyal] for your contribution.

> Add yarn acls capability to flink containers
> 
>
> Key: FLINK-31983
> URL: https://issues.apache.org/jira/browse/FLINK-31983
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> YARN provides an application ACLs mechanism to grant specific rights to users 
> other than the one running the job (view logs through the 
> ResourceManager/job history, kill the application).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-31983) Add yarn acls capability to flink containers

2023-06-27 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-31983:
--
Fix Version/s: 1.18.0

> Add yarn acls capability to flink containers
> 
>
> Key: FLINK-31983
> URL: https://issues.apache.org/jira/browse/FLINK-31983
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> YARN provides an application ACLs mechanism to grant specific rights to users 
> other than the one running the job (view logs through the 
> ResourceManager/job history, kill the application).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-31983) Add yarn acls capability to flink containers

2023-06-27 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang reassigned FLINK-31983:
-

Assignee: Archit Goyal

> Add yarn acls capability to flink containers
> 
>
> Key: FLINK-31983
> URL: https://issues.apache.org/jira/browse/FLINK-31983
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Minor
>  Labels: pull-request-available
>
> YARN provides an application ACLs mechanism to grant specific rights to users 
> other than the one running the job (view logs through the 
> ResourceManager/job history, kill the application).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] (FLINK-31673) Add E2E tests for flink jdbc driver

2023-06-27 Thread Yang Wang (Jira)


[ https://issues.apache.org/jira/browse/FLINK-31673 ]


Yang Wang deleted comment on FLINK-31673:
---

was (Author: fly_in_gis):
The newly introduced e2e test is failing. Could you please have a look?

 

https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/50483/logs/100

> Add E2E tests for flink jdbc driver
> ---
>
> Key: FLINK-31673
> URL: https://issues.apache.org/jira/browse/FLINK-31673
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / JDBC, Tests
>Reporter: Benchao Li
>Assignee: Fang Yong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> Since the JDBC driver will be used by third-party projects, and we've introduced 
> a bundled jar in flink-sql-jdbc-driver-bundle, we'd better have e2e tests 
> to verify that it works correctly (especially with regard to dependency management).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31673) Add E2E tests for flink jdbc driver

2023-06-27 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737560#comment-17737560
 ] 

Yang Wang commented on FLINK-31673:
---

The newly introduced e2e test is failing. Could you please have a look?

 

https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/50483/logs/100

> Add E2E tests for flink jdbc driver
> ---
>
> Key: FLINK-31673
> URL: https://issues.apache.org/jira/browse/FLINK-31673
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / JDBC, Tests
>Reporter: Benchao Li
>Assignee: Fang Yong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> Since the JDBC driver will be used by third-party projects, and we've introduced 
> a bundled jar in flink-sql-jdbc-driver-bundle, we'd better have e2e tests 
> to verify that it works correctly (especially with regard to dependency management).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31234) Add an option to redirect stdout/stderr for flink on kubernetes

2023-06-20 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735182#comment-17735182
 ] 

Yang Wang commented on FLINK-31234:
---

Given that the YARN and standalone modes can access the stdout/stderr logs in the 
Flink web UI, it certainly makes sense to have this in the K8s deployment as well. 
And I think the limitation that stdout/stderr files cannot be rolled based on 
file size is acceptable.
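
For reference, the ThresholdFilter approach described in the issue below could look 
roughly like this in a log4j2 properties file (a sketch only; the appender name, 
pattern, and the {{console.log.level}} system property are hypothetical):
{code}
# Console appender: only events at or above the level given by the system property
# are written to stdout, so they are not duplicated in the rolling log file.
appender.console.name = ConsoleAppender
appender.console.type = CONSOLE
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
appender.console.filter.threshold.type = ThresholdFilter
appender.console.filter.threshold.level = ${sys:console.log.level:-ALL}
{code}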

> Add an option to redirect stdout/stderr for flink on kubernetes
> ---
>
> Key: FLINK-31234
> URL: https://issues.apache.org/jira/browse/FLINK-31234
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Affects Versions: 1.17.0
>Reporter: Weihua Hu
>Assignee: Weihua Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> Flink on Kubernetes does not support redirecting stdout/stderr to files. This 
> is to allow users to get logs via "kubectl logs".
> But for our internal scenario, we use a kubernetes user to submit all jobs to 
> the k8s cluster and provide a platform for users to submit jobs. Users can't 
> access Kubernetes directly, so we need to display logs/stdout in the Flink web UI.
> Because the web ui retrieves the stdout file by filename, which has the same 
> prefix as \{taskmanager}.log (such as 
> flink--kubernetes-taskmanager-0-my-first-flink-cluster-taskmanager-1-4.log). 
> We can't support this with a simple custom image.
> IMO, we should add an option to redirect stdout/stderr to files. When this 
> setting is enabled.
> 1. flink-console.sh will redirect stdout/err to files.
> 2. Avoid duplicate logs in the log file and the stdout file. We could do this by 
> using ThresholdFilter 
> ([log4j|https://logging.apache.org/log4j/2.x/manual/filters.html#ThresholdFilter]
>  and [logback|https://logback.qos.ch/manual/filters.html#thresholdFilter]) 
> with a system property.
> Of course, this option is false by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-31234) Add an option to redirect stdout/stderr for flink on kubernetes

2023-06-20 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang reassigned FLINK-31234:
-

Assignee: Weihua Hu

> Add an option to redirect stdout/stderr for flink on kubernetes
> ---
>
> Key: FLINK-31234
> URL: https://issues.apache.org/jira/browse/FLINK-31234
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Affects Versions: 1.17.0
>Reporter: Weihua Hu
>Assignee: Weihua Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> Flink on Kubernetes does not support redirecting stdout/stderr to files. This 
> is to allow users to get logs via "kubectl logs".
> But for our internal scenario, we use a kubernetes user to submit all jobs to 
> the k8s cluster and provide a platform for users to submit jobs. Users can't 
> access Kubernetes directly, so we need to display logs/stdout in the Flink web UI.
> Because the web ui retrieves the stdout file by filename, which has the same 
> prefix as \{taskmanager}.log (such as 
> flink--kubernetes-taskmanager-0-my-first-flink-cluster-taskmanager-1-4.log). 
> We can't support this with a simple custom image.
> IMO, we should add an option to redirect stdout/stderr to files. When this 
> setting is enabled.
> 1. flink-console.sh will redirect stdout/err to files.
> 2. Avoid duplicate logs in the log file and the stdout file. We could do this by 
> using ThresholdFilter 
> ([log4j|https://logging.apache.org/log4j/2.x/manual/filters.html#ThresholdFilter]
>  and [logback|https://logback.qos.ch/manual/filters.html#thresholdFilter]) 
> with a system property.
> Of course, this option is false by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-28915) Make application mode could support remote DFS schema(e.g. S3, OSS, HDFS, etc.)

2023-06-11 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-28915:
--
Summary: Make application mode could support remote DFS schema(e.g. S3, 
OSS, HDFS, etc.)  (was: Flink Native k8s mode jar localtion support s3 schema )

> Make application mode could support remote DFS schema(e.g. S3, OSS, HDFS, 
> etc.)
> ---
>
> Key: FLINK-28915
> URL: https://issues.apache.org/jira/browse/FLINK-28915
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes, flink-contrib
>Reporter: hjw
>Assignee: hjw
>Priority: Major
>  Labels: pull-request-available
>
> As the Flink documentation shows, local is the only supported scheme in the 
> native K8s deployment.
> Is there a plan to support the s3 filesystem? Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-31730) Support Ephemeral Storage in KubernetesConfigOptions

2023-04-11 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710855#comment-17710855
 ] 

Yang Wang edited comment on FLINK-31730 at 4/11/23 9:55 AM:


Do you think using the pod template[1] to configure the ephemeral storage is 
enough?

 

[1]. 
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/resource-providers/native_kubernetes/#pod-template


was (Author: fly_in_gis):
Do you think using the pod template to configure the ephemeral storage is 
enough?
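
For reference, a pod template along these lines can set the ephemeral-storage 
request/limit for the main container (an illustrative sketch; the template name and 
sizes are made up, and the container is named flink-main-container following the 
pod-template convention so that Flink merges it with the generated spec):
{code}
apiVersion: v1
kind: Pod
metadata:
  name: jobmanager-pod-template
spec:
  containers:
    - name: flink-main-container
      resources:
        requests:
          ephemeral-storage: 2Gi
        limits:
          ephemeral-storage: 2Gi
{code}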

> Support Ephemeral Storage in KubernetesConfigOptions
> 
>
> Key: FLINK-31730
> URL: https://issues.apache.org/jira/browse/FLINK-31730
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Affects Versions: 1.18.0, 1.17.1
>Reporter: Zhenqiu Huang
>Priority: Minor
>
> There is a common need to configure the Flink main container with an ephemeral 
> storage size. It would be more user-friendly to support it as a Flink config option.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31730) Support Ephemeral Storage in KubernetesConfigOptions

2023-04-11 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710855#comment-17710855
 ] 

Yang Wang commented on FLINK-31730:
---

Do you think using the pod template to configure the ephemeral storage is 
enough?

> Support Ephemeral Storage in KubernetesConfigOptions
> 
>
> Key: FLINK-31730
> URL: https://issues.apache.org/jira/browse/FLINK-31730
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Affects Versions: 1.18.0, 1.17.1
>Reporter: Zhenqiu Huang
>Priority: Minor
>
> There is a common need to configure the Flink main container with an ephemeral 
> storage size. It would be more user-friendly to support it as a Flink config option.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31188) Expose kubernetes scheduler configOption when running flink on kubernetes

2023-02-22 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17692454#comment-17692454
 ] 

Yang Wang commented on FLINK-31188:
---

This might be a duplicate of FLINK-28825.

> Expose kubernetes scheduler configOption when running flink on kubernetes
> -
>
> Key: FLINK-31188
> URL: https://issues.apache.org/jira/browse/FLINK-31188
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Reporter: Kelu Tao
>Priority: Major
>
> Now, when we deploy a Flink job on Kubernetes, the default Kubernetes 
> scheduler is used. But users sometimes need to set a custom Kubernetes 
> scheduler.
>  
> So can we add a config option for the Kubernetes scheduler setting? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30908) Fatal error in ResourceManager caused YARNSessionFIFOSecuredITCase.testDetachedMode to fail

2023-02-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685045#comment-17685045
 ] 

Yang Wang commented on FLINK-30908:
---

+1 for Xintong's analysis and proposal.

YARN-5999 introduced a side effect whereby {{CallbackHandler#onError}} may be 
executed when stopping the AMRMClientAsync.

> Fatal error in ResourceManager caused 
> YARNSessionFIFOSecuredITCase.testDetachedMode to fail
> ---
>
> Key: FLINK-30908
> URL: https://issues.apache.org/jira/browse/FLINK-30908
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Coordination
>Affects Versions: 1.17.0
>Reporter: Matthias Pohl
>Assignee: Xintong Song
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Attachments: mvn-1.FLINK-30908.log
>
>
> There's a build failure in {{YARNSessionFIFOSecuredITCase.testDetachedMode}} 
> which is caused by a fatal error in the ResourceManager:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=45720=logs=245e1f2e-ba5b-5570-d689-25ae21e5302f=d04c9862-880c-52f5-574b-a7a79fef8e0f=29869
> {code}
> Feb 05 02:41:58 java.io.InterruptedIOException: Interrupted waiting to send 
> RPC request to server
> Feb 05 02:41:58 java.io.InterruptedIOException: Interrupted waiting to send 
> RPC request to server
> Feb 05 02:41:58   at org.apache.hadoop.ipc.Client.call(Client.java:1480) 
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at org.apache.hadoop.ipc.Client.call(Client.java:1422) 
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at com.sun.proxy.$Proxy31.allocate(Unknown Source) 
> ~[?:?]
> Feb 05 02:41:58   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>  ~[hadoop-yarn-common-3.2.3.jar:?]
> Feb 05 02:41:58   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) ~[?:1.8.0_292]
> Feb 05 02:41:58   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_292]
> Feb 05 02:41:58   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_292]
> Feb 05 02:41:58   at java.lang.reflect.Method.invoke(Method.java:498) 
> ~[?:1.8.0_292]
> Feb 05 02:41:58   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at com.sun.proxy.$Proxy32.allocate(Unknown Source) 
> ~[?:?]
> Feb 05 02:41:58   at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:325)
>  ~[hadoop-yarn-client-3.2.3.jar:?]
> Feb 05 02:41:58   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:311)
>  [hadoop-yarn-client-3.2.3.jar:?]
> Feb 05 02:41:58 Caused by: java.lang.InterruptedException
> Feb 05 02:41:58   at 
> java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404) ~[?:1.8.0_292]
> Feb 05 02:41:58   at 
> java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:1.8.0_292]
> Feb 05 02:41:58   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1180) 
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   at org.apache.hadoop.ipc.Client.call(Client.java:1475) 
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58   ... 17 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29671) Kubernetes e2e test fails during test initialization

2023-02-02 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683325#comment-17683325
 ] 

Yang Wang commented on FLINK-29671:
---

It seems that the minikube version in the Azure pipeline was upgraded from v1.28.0 
to v1.29.0, so it might be incompatible with crictl v1.24.2.

> Kubernetes e2e test fails during test initialization
> 
>
> Key: FLINK-29671
> URL: https://issues.apache.org/jira/browse/FLINK-29671
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Test Infrastructure
>Affects Versions: 1.16.0, 1.17.0, 1.15.3
>Reporter: Matthias Pohl
>Priority: Critical
>  Labels: starter, test-stability
> Attachments: kubernetes_test_failure.log
>
>
> There are two build failures ([branch based on 
> release-1.16|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=42038=logs=bea52777-eaf8-5663-8482-18fbc3630e81=b2642e3a-5b86-574d-4c8a-f7e2842bfb14=5377]
>  and [a release-1.15 
> build|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=42073=logs=bea52777-eaf8-5663-8482-18fbc3630e81=b2642e3a-5b86-574d-4c8a-f7e2842bfb14=4684])
>  that are (not exclusively) caused by the e2e test {{Kubernetes test}} 
> failing in the initialization phase (i.e. when installing all artifacts):
> See the attached logs from the {{release-1.15}} build's CI output. No logs 
> are present as artifacts.
> *UPDATE*: The error we're seeing is actually FLINK-28269 because we failed to 
> download {{crictl}}. Looks like we reached some download limit:
> {code}
> 2022-10-16T05:58:33.0821044Z --2022-10-16 05:58:33--  
> https://objects.githubusercontent.com/github-production-release-asset-2e65be/80172100/7186c302-3766-4ed5-920a-f85c9d6334ac?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIAIWNJYAX4CSVEH53A%2F20221016%2Fus-east-1%2Fs3%2Faws4
> _request=20221016T055833Z=300=5c24f88d9891e793000a904879be5e8913d643cbff1bfdfb8d0cf3c6e7a7908b=host_id=0_id=0_id=80172100=attachment%3B%20filename%3Dcrictl-v1.24.2-linux-amd64.t
> ar.gz=application%2Foctet-stream
> 2022-10-16T05:58:33.1021704Z Resolving objects.githubusercontent.com 
> (objects.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 
> 185.199.110.133, ...
> 2022-10-16T05:58:33.1128247Z Connecting to objects.githubusercontent.com 
> (objects.githubusercontent.com)|185.199.108.133|:443... connected.
> 2022-10-16T05:58:38.5605126Z HTTP request sent, awaiting response... 503 
> Egress is over the account limit.
> 2022-10-16T05:58:38.5606342Z 2022-10-16 05:58:38 ERROR 503: Egress is over 
> the account limit..
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30813) Residual zk data when using the kubernetes session mode

2023-01-30 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682372#comment-17682372
 ] 

Yang Wang commented on FLINK-30813:
---

This might be a duplicate of FLINK-20219.

> Residual zk data when using the kubernetes session mode
> ---
>
> Key: FLINK-30813
> URL: https://issues.apache.org/jira/browse/FLINK-30813
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Reporter: liuzhuo
>Priority: Minor
>
>     If we use kubernetes session mode and use Zookeeper(ZK) as the HA 
> service, the HA data on ZK is not cleared after the session stops.
>     This is because, when deleting a session, only this method is called:
> {code:java}
> kubernetesClusterDescriptor.killCluster(clusterId);
> {code}
>     However, this method only deletes the deployment associated with the 
> clusterId. If ZK is used as the HA service, the data on ZK is left over when the 
> cluster stops, resulting in more and more data accumulating on ZK.
>  
> Maybe we need to add
> {code:java}
> ClusterClient#shutDownCluster(){code}
> or
> {code:java}
> HighAvailabilityServices#closeAndCleanupAllData(){code}
> When using session mode (native kubernetes)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30513) HA storage dir leaks on cluster termination

2023-01-12 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675967#comment-17675967
 ] 

Yang Wang commented on FLINK-30513:
---

Given that the cluster-id is configured to k8s-app-ha-1-116, we will have the 
following files in the HA storage directory. The completedCheckpoint and job 
graph files are deleted once the job reaches a globally terminal state (e.g. 
CANCELED, FAILED, FINISHED). Currently we only clean up the blob directory in 
{{HighAvailabilityServices#closeAndCleanupAllData}}, so the parent directory is 
left behind.

It is not a serious problem when using cloud object storage (S3, OSS, etc.). 
However, it is critical for HDFS, since leaked directories consume too much 
NameNode memory.

I am +1 for deleting the parent directory in 
{{HighAvailabilityServices#closeAndCleanupAllData}}.

 
{code:java}
wangyang-pc:scripts danrtsey.wy$ ossutil ls oss://flink-test-yiqi/flink-ha/
LastModifiedTime                   Size(B)  StorageClass   ETAG                 
                 ObjectName
2023-01-12 19:42:32 +0800 CST            0      Standard   
D41D8CD98F00B204E9800998ECF8427E      
oss://flink-test-yiqi/flink-ha/job-result-store/k8s-app-ha-1-116/
2023-01-12 19:44:16 +0800 CST            0      Standard   
D41D8CD98F00B204E9800998ECF8427E      
oss://flink-test-yiqi/flink-ha/k8s-app-ha-1-116/
2023-01-12 19:42:32 +0800 CST            0      Standard   
D41D8CD98F00B204E9800998ECF8427E      
oss://flink-test-yiqi/flink-ha/k8s-app-ha-1-116/blob/
2023-01-12 19:42:44 +0800 CST      5426525      Standard   
EF151FCD3D1F91C3EB512118F05D2E20      
oss://flink-test-yiqi/flink-ha/k8s-app-ha-1-116/blob/job_e0700459/blob_p-a28c3727cd8d501e6b024187013e4311079499be-b327aba6b7fee1f6b0fb4fe66a88a637
2023-01-12 19:44:16 +0800 CST        14288      Standard   
0983C3E02FBC7A904207A12295A3AF28      
oss://flink-test-yiqi/flink-ha/k8s-app-ha-1-116/completedCheckpoint4679f587e15a
2023-01-12 19:42:44 +0800 CST        34080      Standard   
B37673CA07B30B8A0FC76939BBDFA9F7      
oss://flink-test-yiqi/flink-ha/k8s-app-ha-1-116/submittedJobGraph427419c216b4 
{code}
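
The proposal essentially amounts to removing the per-cluster parent directory as 
well, along the lines of the following sketch (illustrative only; it reuses the 
path from the listing above and does not show how this would be wired into 
{{HighAvailabilityServices#closeAndCleanupAllData}}):
{code:java}
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class HaStorageCleanupSketch {
    public static void main(String[] args) throws Exception {
        // Per-cluster HA storage directory: <high-availability.storageDir>/<cluster-id>
        Path clusterHaDir = new Path("oss://flink-test-yiqi/flink-ha/k8s-app-ha-1-116");
        FileSystem fs = clusterHaDir.getFileSystem();
        // Delete the directory recursively so no empty parent is left behind.
        if (fs.exists(clusterHaDir)) {
            fs.delete(clusterHaDir, true);
        }
    }
}
{code}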

> HA storage dir leaks on cluster termination 
> 
>
> Key: FLINK-30513
> URL: https://issues.apache.org/jira/browse/FLINK-30513
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0, 1.16.0
>Reporter: Zhanghao Chen
>Priority: Major
> Attachments: image-2022-12-27-21-32-17-510.png
>
>
> *Problem*
> We found that HA storage dir leaks on cluster termination for a Flink job 
> with HA enabled. The following picture shows the HA storage dir (here on 
> HDFS) of the cluster czh-flink-test-offline (of application mode) after 
> cancelling the job with flink-cancel. We are left with an empty dir, and too 
> many empty dirs will greatly hurt the stability of HDFS NameNode!
> !image-2022-12-27-21-32-17-510.png|width=582,height=158!
>  
> Furthermore, in case the user choose to retain the checkpoints on job 
> termination, we will have the completedCheckpoints leaked as well. Note that 
> we no longer need the completedCheckpoints files as we'll directly recover 
> retained CPs from the CP data dir.
> *Root Cause*
> When we run AbstractHaServices#closeAndCleanupAllData(), we cleaned up blob 
> store, but didn't clean the HA storage dir.
> *Proposal*
> Clean up the HA storage dir after cleaning up blob store in 
> AbstractHaServices#closeAndCleanupAllData().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-29705) Document the least access with RBAC setting for native K8s integration

2023-01-12 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang reassigned FLINK-29705:
-

Assignee: ouyangwulin

> Document the least access with RBAC setting for native K8s integration
> --
>
> Key: FLINK-29705
> URL: https://issues.apache.org/jira/browse/FLINK-29705
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes, Documentation
>Reporter: Yang Wang
>Assignee: ouyangwulin
>Priority: Major
>
> We should document the least access with RBAC settings[1]. And the operator 
> docs could be taken as a reference[2].
>  
> [1]. 
> [https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#rbac]
> [2]. 
> [https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/rbac/]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30518) [flink-operator] Kubernetes HA Service not working with standalone mode

2023-01-12 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675907#comment-17675907
 ] 

Yang Wang commented on FLINK-30518:
---

Really sorry for the late response. For the native K8s implementation with HA 
enabled, we always override jobmanager.rpc.address with the pod IP. So for 
standalone mode with HA, the operator also needs to do this.

You can find similar logic in the example YAML for standalone mode.
{code:java}
env:
- name: POD_IP
  valueFrom:
fieldRef:
  apiVersion: v1
  fieldPath: status.podIP
# The following args overwrite the value of jobmanager.rpc.address 
configured in the configuration config map to POD_IP.
args: ["jobmanager", "$(POD_IP)"] {code}

> [flink-operator] Kubernetes HA Service not working with standalone mode
> ---
>
> Key: FLINK-30518
> URL: https://issues.apache.org/jira/browse/FLINK-30518
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.3.0
>Reporter: Binh-Nguyen Tran
>Priority: Major
> Attachments: flink-configmap.png, screenshot-1.png
>
>
> -Since flink-conf.yaml is mounted as read-only configmap, the 
> /docker-entrypoint.sh script is not able to inject correct Pod IP to 
> `jobmanager.rpc.address`. This leads to same address (e.g flink.ns-ext) being 
> set for all Job Manager pods. This causes:-
> Setting up FlinkDeployment in Standalone mode with Kubernetes HA Service. 
> Problems:
> (1) flink-cluster-config-map always contains wrong address for all 3 
> component leaders (see screenshot, should be pod IP instead of clusterIP 
> service name)
> (2) Accessing Web UI when jobmanager.replicas > 1 is not possible with error
> {code:java}
> {"errors":["Service temporarily unavailable due to an ongoing leader 
> election. Please refresh."]} {code}
>  
> ~ flinkdeployment.yaml ~
> {code:java}
> spec:
>   mode: standalone
>   flinkConfiguration:
>     high-availability: kubernetes
>     high-availability.storageDir: "file:///opt/flink/storage"
>     ...
>   jobManager:
>     replicas: 3
>   ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30108) ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled times out

2023-01-11 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1767#comment-1767
 ] 

Yang Wang commented on FLINK-30108:
---

It is a little strange that the LeaderLatch did not retry creating a new ephemeral 
node when the state changed to RECONNECTED. Otherwise, a leader would eventually be 
elected and this test would not block at 
{{TestingContender.awaitGrantLeadership}}.

> ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled
>  times out
> -
>
> Key: FLINK-30108
> URL: https://issues.apache.org/jira/browse/FLINK-30108
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.17.0
>Reporter: Leonard Xu
>Priority: Major
>  Labels: test-stability
> Attachments: FLINK-30108.tar.gz, zookeeper-server.FLINK-30108.log
>
>
> {noformat}
> Nov 18 01:02:58 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, 
> Time elapsed: 109.22 s - in 
> org.apache.flink.runtime.operators.hash.InPlaceMutableHashTableTest
> Nov 18 01:18:09 
> ==
> Nov 18 01:18:09 Process produced no output for 900 seconds.
> Nov 18 01:18:09 
> ==
> Nov 18 01:18:09 
> ==
> Nov 18 01:18:09 The following Java processes are running (JPS)
> Nov 18 01:18:09 
> ==
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 924 Launcher
> Nov 18 01:18:09 23421 surefirebooter1178962604207099497.jar
> Nov 18 01:18:09 11885 Jps
> Nov 18 01:18:09 
> ==
> Nov 18 01:18:09 Printing stack trace of Java process 924
> Nov 18 01:18:09 
> ==
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 2022-11-18 01:18:09
> Nov 18 01:18:09 Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
> mode):
> ...
> ...
> ...
> Nov 18 01:18:09 
> ==
> Nov 18 01:18:09 Printing stack trace of Java process 11885
> Nov 18 01:18:09 
> ==
> 11885: No such process
> Nov 18 01:18:09 Killing process with pid=923 and all descendants
> /__w/2/s/tools/ci/watchdog.sh: line 113:   923 Terminated  $cmd
> Nov 18 01:18:10 Process exited with EXIT CODE: 143.
> Nov 18 01:18:10 Trying to KILL watchdog (919).
> Nov 18 01:18:10 Searching for .dump, .dumpstream and related files in 
> '/__w/2/s'
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dumpstream'
>  to target directory ('/__w/_temp/debug_files')
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dump'
>  to target directory ('/__w/_temp/debug_files')
> The STDIO streams did not close within 10 seconds of the exit event from 
> process '/bin/bash'. This may indicate a child process inherited the STDIO 
> streams and has not yet exited.
> ##[error]Bash exited with code '143'.
> Finishing: Test - core
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277=logs=0e7be18f-84f2-53f0-a32d-4a5e4a174679=7c1d86e3-35bd-5fd5-3b7c-30c126a78702



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30540) DataSinkTaskTest failed with a 143 exit code

2023-01-11 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675408#comment-17675408
 ] 

Yang Wang commented on FLINK-30540:
---

[~renqs] Could you please help with this? I think you should also have the login 
credentials for the Alibaba CI machines.

> DataSinkTaskTest failed with a 143 exit code
> 
>
> Key: FLINK-30540
> URL: https://issues.apache.org/jira/browse/FLINK-30540
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task, Tests
>Affects Versions: 1.17.0
>Reporter: Matthias Pohl
>Priority: Critical
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=44366=logs=a549b384-c55a-52c0-c451-00e0477ab6db=eef5922c-08d9-5ba3-7299-8393476594e7=8480
> We experienced a 143 exit code when finalizing {{DataSinkTaskTest}}:
> {code}
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.cli.MavenCli.main(MavenCli.java:160)
> Jan 01 00:58:47 [ERROR] at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Jan 01 00:58:47 [ERROR] at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jan 01 00:58:47 [ERROR] at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jan 01 00:58:47 [ERROR] at java.lang.reflect.Method.invoke(Method.java:498)
> Jan 01 00:58:47 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> Jan 01 00:58:47 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> Jan 01 00:58:47 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> Jan 01 00:58:47 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Jan 01 00:58:47 [ERROR] Caused by: 
> org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?
> Jan 01 00:58:47 [ERROR] Command was /bin/sh -c cd /__w/3/s/flink-runtime && 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m 
> -jar 
> /__w/3/s/flink-runtime/target/surefire/surefirebooter7079560860955914030.jar 
> /__w/3/s/flink-runtime/target/surefire 2023-01-01T00-54-17_721-jvmRun4 
> surefire118217553075734742tmp surefire_1430697542098749596tmp
> Jan 01 00:58:47 [ERROR] Error occurred in starting fork, check output in log
> Jan 01 00:58:47 [ERROR] Process Exit Code: 134
> Jan 01 00:58:47 [ERROR] Crashed tests:
> Jan 01 00:58:47 [ERROR] org.apache.flink.runtime.operators.DataSinkTaskTest
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:748)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$700(ForkStarter.java:121)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:393)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:370)
> Jan 01 00:58:47 [ERROR] at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Jan 01 00:58:47 [ERROR] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> Jan 01 00:58:47 [ERROR] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> Jan 01 00:58:47 [ERROR] at java.lang.Thread.run(Thread.java:748)
> Jan 01 00:58:47 [ERROR] -> [Help 1]
> Jan 01 00:58:47 [ERROR] 
> Jan 01 00:58:47 [ERROR] To see the full stack trace of the errors, re-run 
> Maven with the -e switch.
> Jan 01 00:58:47 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> Jan 01 00:58:47 [ERROR] 
> Jan 01 00:58:47 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> Jan 01 00:58:47 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30540) DataSinkTaskTest failed with a 143 exit code

2023-01-11 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17674724#comment-17674724
 ] 

Yang Wang commented on FLINK-30540:
---

[~mapohl] So do you want me to get the heap dump from the core dump file on the 
Alibaba CI machine?

> DataSinkTaskTest failed with a 143 exit code
> 
>
> Key: FLINK-30540
> URL: https://issues.apache.org/jira/browse/FLINK-30540
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task, Tests
>Affects Versions: 1.17.0
>Reporter: Matthias Pohl
>Priority: Critical
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=44366=logs=a549b384-c55a-52c0-c451-00e0477ab6db=eef5922c-08d9-5ba3-7299-8393476594e7=8480
> We experienced a 143 exit code when finalizing {{DataSinkTaskTest}}:
> {code}
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.cli.MavenCli.main(MavenCli.java:160)
> Jan 01 00:58:47 [ERROR] at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Jan 01 00:58:47 [ERROR] at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jan 01 00:58:47 [ERROR] at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jan 01 00:58:47 [ERROR] at java.lang.reflect.Method.invoke(Method.java:498)
> Jan 01 00:58:47 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> Jan 01 00:58:47 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> Jan 01 00:58:47 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> Jan 01 00:58:47 [ERROR] at 
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Jan 01 00:58:47 [ERROR] Caused by: 
> org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?
> Jan 01 00:58:47 [ERROR] Command was /bin/sh -c cd /__w/3/s/flink-runtime && 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m 
> -jar 
> /__w/3/s/flink-runtime/target/surefire/surefirebooter7079560860955914030.jar 
> /__w/3/s/flink-runtime/target/surefire 2023-01-01T00-54-17_721-jvmRun4 
> surefire118217553075734742tmp surefire_1430697542098749596tmp
> Jan 01 00:58:47 [ERROR] Error occurred in starting fork, check output in log
> Jan 01 00:58:47 [ERROR] Process Exit Code: 134
> Jan 01 00:58:47 [ERROR] Crashed tests:
> Jan 01 00:58:47 [ERROR] org.apache.flink.runtime.operators.DataSinkTaskTest
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:748)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$700(ForkStarter.java:121)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:393)
> Jan 01 00:58:47 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:370)
> Jan 01 00:58:47 [ERROR] at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Jan 01 00:58:47 [ERROR] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> Jan 01 00:58:47 [ERROR] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> Jan 01 00:58:47 [ERROR] at java.lang.Thread.run(Thread.java:748)
> Jan 01 00:58:47 [ERROR] -> [Help 1]
> Jan 01 00:58:47 [ERROR] 
> Jan 01 00:58:47 [ERROR] To see the full stack trace of the errors, re-run 
> Maven with the -e switch.
> Jan 01 00:58:47 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> Jan 01 00:58:47 [ERROR] 
> Jan 01 00:58:47 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> Jan 01 00:58:47 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28554) Kubernetes-Operator allow readOnlyRootFilesystem via operatorSecurityContext

2022-12-20 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649699#comment-17649699
 ] 

Yang Wang commented on FLINK-28554:
---

After this PR is merged, it seems that we can no longer update the operator config 
dynamically via {{kubectl edit cm flink-operator-config}}.

> Kubernetes-Operator allow readOnlyRootFilesystem via operatorSecurityContext
> 
>
> Key: FLINK-28554
> URL: https://issues.apache.org/jira/browse/FLINK-28554
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.0.1
>Reporter: Tim
>Assignee: Tim
>Priority: Minor
>  Labels: operator, pull-request-available
> Fix For: kubernetes-operator-1.2.0
>
>
> It would be nice if the operator would support using the 
> "readOnlyRootFilesystem" setting via the operatorSecurityContext. When using 
> the default operator template the operator won't be able to start when using 
> this setting because the config files mounted in `/opt/flink/conf` are now 
> (of course) also read-only.
> It would be nice if the default template would be written in such a way that 
> it allows adding emptyDir volumes to /opt/flink/conf via the values.yaml. 
> Which is not possible right now. Then the config files can remain editable by 
> the operator while keeping the root filesystem read-only.
> I have successfully tried that in my branch (see: 
> https://github.com/apache/flink-kubernetes-operator/compare/main...timsn:flink-kubernetes-operator:mount-single-flink-conf-files)
>  which prepares the operator template.
> After this small change to the template it is possible to add emptyDir volumes 
> for the conf and tmp dirs and, in a second step, to enable the 
> readOnlyRootFilesystem setting via the values.yaml
> values.yaml
> {code:java}
> [...]
> operatorVolumeMounts:
>   create: true
>   data:
>     - name: flink-conf
>       mountPath: /opt/flink/conf
>       subPath: conf
>     - name: flink-tmp
>       mountPath: /tmp
> operatorVolumes:
>   create: true
>   data:
>     - name: flink-conf
>       emptyDir: {}
>     - name: flink-tmp
>       emptyDir: {}
> operatorSecurityContext:
>   readOnlyRootFilesystem: true
> [...]{code}
> I think this could be a viable way to allow this security setting and I could 
> turn this into a pull request if desired. What do you think about it? Or is 
> there even a better way to achieve this that I didn't think about yet?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-27925) Avoid to create watcher without the resourceVersion

2022-12-19 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang reassigned FLINK-27925:
-

Assignee: ouyangwulin

> Avoid to create watcher without the resourceVersion
> ---
>
> Key: FLINK-27925
> URL: https://issues.apache.org/jira/browse/FLINK-27925
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Reporter: Aitozi
>Assignee: ouyangwulin
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-12-19-20-19-41-303.png
>
>
> Currently, we create the watcher in KubernetesResourceManager without passing 
> the resourceVersion parameter, which triggers a request to etcd. This puts a 
> burden on etcd in large-scale clusters (as we have seen in our internal k8s 
> cluster). More detail can be found 
> [here|https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter]
>  
> I think we could use the informer to improve it (which will spawn a 
> list-watch and maintain the resourceVersion internally)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-27925) Avoid to create watcher without the resourceVersion

2022-12-19 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649317#comment-17649317
 ] 

Yang Wang commented on FLINK-27925:
---

Thanks [~ouyangwuli] for sharing the investigation. +1 for adding 
resourceVersion=0 when listing pods.
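
For context, a list request with resourceVersion=0 is served from the apiserver 
watch cache rather than from etcd; at the HTTP level it is roughly the following 
(illustrative; the namespace and label selector are hypothetical):
{code}
# List TaskManager pods from the apiserver cache instead of forcing a quorum read from etcd
kubectl get --raw "/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-flink-cluster&resourceVersion=0"
{code}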

> Avoid to create watcher without the resourceVersion
> ---
>
> Key: FLINK-27925
> URL: https://issues.apache.org/jira/browse/FLINK-27925
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Reporter: Aitozi
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-12-19-20-19-41-303.png
>
>
> Currently, we create the watcher in KubernetesResourceManager without passing 
> the resourceVersion parameter, which triggers a request to etcd. This puts a 
> burden on etcd in large-scale clusters (as we have seen in our internal k8s 
> cluster). More detail can be found 
> [here|https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter]
>  
> I think we could use the informer to improve it (which will spawn a 
> list-watch and maintain the resourceVersion internally)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-24 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638310#comment-17638310
 ] 

Yang Wang commented on FLINK-30184:
---

I lean towards having this done outside of Flink.

> Save TM/JM thread stack periodically
> 
>
> Key: FLINK-30184
> URL: https://issues.apache.org/jira/browse/FLINK-30184
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Reporter: Rui Fan
>Priority: Major
> Fix For: 1.17.0
>
>
> After FLINK-14816, FLINK-25398 and FLINK-25372, Flink users can view the 
> thread stack of the TM/JM in the Flink WebUI. 
> It helps users find out why a Flink job is stuck or why processing is slow, 
> which is very useful for troubleshooting.
> However, sometimes Flink tasks get stuck or process slowly, but by the time 
> the user troubleshoots the problem, the job has already resumed. It is then 
> difficult to find out what happened to the job at that time and why it was slow.
>  
> So, could we periodically save the thread stack of the TM or JM in the TM log 
> directory?
> Define some configurations:
> cluster.thread-dump.interval=1min
> cluster.thread-dump.cleanup-time=48 hours



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30036) Force delete pod when k8s node is not ready

2022-11-16 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635064#comment-17635064
 ] 

Yang Wang commented on FLINK-30036:
---

After more investigation, it seems that terminating pods are counted towards 
the used quota, so I think this ticket is a valid issue. We may need a config 
option to enable force-delete when a pod might get stuck in the terminating 
state (e.g. node not ready).

One remaining concern: a node being not ready does not always mean the pod 
will get stuck in the terminating state. A force delete sends a SIGKILL to the 
pod, so the TM will not have the chance to clean up.
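
For illustration only, an opt-in force delete with the fabric8 client could look roughly like the following; the forceDelete flag is a hypothetical config option, not an existing Flink setting, and this is not the actual FlinkKubeClient code:

{code:java}
import io.fabric8.kubernetes.client.KubernetesClient;

public final class ForceDeletePodSketch {

    /**
     * Deletes a pod; when forceDelete is enabled (hypothetical flag), a grace period of 0 is used,
     * which roughly corresponds to `kubectl delete pod --grace-period=0 --force` and kills the
     * container immediately, so the TM gets no chance to clean up.
     */
    static void deletePod(KubernetesClient client, String namespace, String podName, boolean forceDelete) {
        if (forceDelete) {
            client.pods().inNamespace(namespace).withName(podName).withGracePeriod(0).delete();
        } else {
            client.pods().inNamespace(namespace).withName(podName).delete();
        }
    }
}
{code}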

> Force delete pod when  k8s node is not ready
> 
>
> Key: FLINK-30036
> URL: https://issues.apache.org/jira/browse/FLINK-30036
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Reporter: Peng Yuan
>Priority: Major
>  Labels: pull-request-available
>
> When a K8s node is in the NotReady state, a taskmanager pod scheduled on it 
> stays in the terminating state. When the Flink cluster has a strict quota, 
> the terminating pod holds its resources indefinitely. As a result, the new 
> taskmanager pod cannot acquire resources and cannot be started.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30036) Force delete pod when k8s node is not ready

2022-11-16 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634766#comment-17634766
 ] 

Yang Wang commented on FLINK-30036:
---

I would expect the terminating pod not to be counted towards the quota, given 
that the terminating pod is cleaned up automatically once the node turns ready 
again. So I lean towards not introducing such magic logic in 
{{FlinkKubeClient#stopPod}}.

> Force delete pod when  k8s node is not ready
> 
>
> Key: FLINK-30036
> URL: https://issues.apache.org/jira/browse/FLINK-30036
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / Kubernetes
>Reporter: Peng Yuan
>Priority: Major
>  Labels: pull-request-available
>
> When a K8s node is in the NotReady state, a taskmanager pod scheduled on it 
> stays in the terminating state. When the Flink cluster has a strict quota, 
> the terminating pod holds its resources indefinitely. As a result, the new 
> taskmanager pod cannot acquire resources and cannot be started.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-22262) Flink on Kubernetes ConfigMaps are created without OwnerReference

2022-11-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629585#comment-17629585
 ] 

Yang Wang edited comment on FLINK-22262 at 11/7/22 7:25 AM:


Thanks [~yunta] for sharing your problem. I think the root cause is that the 
finished job was submitted again unexpectedly.

If you want to completely avoid this issue, I suggest using the job result 
store[1].

FYI: The Flink Kubernetes operator also uses the JRS to avoid duplicate 
submission.

 

[1]. 
[https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore]


was (Author: fly_in_gis):
Thanks [~yunta] for sharing your problem. I think the root cause might be that 
the kubelet learns about the deployment deletion too late, so it simply starts 
the JobManager again right after it exits.

If you want to completely avoid this issue, I suggest using the job result 
store[1].

FYI: The Flink Kubernetes operator also uses the JRS to avoid duplicate 
submission.

 

[1]. 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore

> Flink on Kubernetes ConfigMaps are created without OwnerReference
> -
>
> Key: FLINK-22262
> URL: https://issues.apache.org/jira/browse/FLINK-22262
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.0
>Reporter: Andrea Peruffo
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor
> Attachments: jm.log
>
>
> According to the documentation:
> [https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#manual-resource-cleanup]
> The ConfigMaps created along with the Flink deployment are supposed to have an 
> OwnerReference pointing to the Deployment itself. Unfortunately, this doesn't 
> happen and causes all sorts of issues when the classpath and the jars of the 
> job are updated.
> i.e.:
> Without manually removing the ConfigMap of the Job I cannot update the Jars 
> of the Job.
> Can you please give guidance if there are additional caveats on manually 
> removing the ConfigMap? Any other workaround that can be used?
> Thanks in advance.
> Example ConfigMap:
> {{apiVersion: v1}}
> {{data:}}
> {{ address: akka.tcp://flink@10.0.2.13:6123/user/rpc/jobmanager_2}}
> {{ checkpointID-049: 
> rO0ABXNyADtvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuc3RhdGUuUmV0cmlldmFibGVTdHJlYW1TdGF0ZUhhbmRsZQABHhjxVZcrAgABTAAYd3JhcHBlZFN0cmVhbVN0YXRlSGFuZGxldAAyTG9yZy9hcGFjaGUvZmxpbmsvcnVudGltZS9zdGF0ZS9TdHJlYW1TdGF0ZUhhbmRsZTt4cHNyADlvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuc3RhdGUuZmlsZXN5c3RlbS5GaWxlU3RhdGVIYW5kbGUE3HXYYr0bswIAAkoACXN0YXRlU2l6ZUwACGZpbGVQYXRodAAfTG9yZy9hcGFjaGUvZmxpbmsvY29yZS9mcy9QYXRoO3hwAAABOEtzcgAdb3JnLmFwYWNoZS5mbGluay5jb3JlLmZzLlBhdGgAAQIAAUwAA3VyaXQADkxqYXZhL25ldC9VUkk7eHBzcgAMamF2YS5uZXQuVVJJrAF4LkOeSasDAAFMAAZzdHJpbmd0ABJMamF2YS9sYW5nL1N0cmluZzt4cHQAUC9tbnQvZmxpbmsvc3RvcmFnZS9rc2hhL3RheGktcmlkZS1mYXJlLXByb2Nlc3Nvci9jb21wbGV0ZWRDaGVja3BvaW50MDQ0YTc2OWRkNDgxeA==}}
> {{ counter: "50"}}
> {{ sessionId: 0c2b69ee-6b41-48d3-b7fd-1bf2eda94f0f}}
> {{kind: ConfigMap}}
> {{metadata:}}
> {{ annotations:}}
> {{ control-plane.alpha.kubernetes.io/leader: 
> '\{"holderIdentity":"0f25a2cc-e212-46b0-8ba9-faac0732a316","leaseDuration":15.0,"acquireTime":"2021-04-13T14:30:51.439000Z","renewTime":"2021-04-13T14:39:32.011000Z","leaderTransitions":105}'}}
> {{ creationTimestamp: "2021-04-13T14:30:51Z"}}
> {{ labels:}}
> {{ app: taxi-ride-fare-processor}}
> {{ configmap-type: high-availability}}
> {{ type: flink-native-kubernetes}}
> {{ name: 
> taxi-ride-fare-processor--jobmanager-leader}}
> {{ namespace: taxi-ride-fare}}
> {{ resourceVersion: "64100"}}
> {{ selfLink: 
> /api/v1/namespaces/taxi-ride-fare/configmaps/taxi-ride-fare-processor--jobmanager-leader}}
> {{ uid: 9f912495-382a-45de-a789-fd5ad2a2459d}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-14937) Flink WebUI could not display

2022-11-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629593#comment-17629593
 ] 

Yang Wang commented on FLINK-14937:
---

I am not aware of any issues with the Flink WebUI. Maybe you need to check 
whether the npm-related packages have been installed successfully in your local 
build.

> Flink WebUI could not display
> -
>
> Key: FLINK-14937
> URL: https://issues.apache.org/jira/browse/FLINK-14937
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Web Frontend
>Affects Versions: 1.10.0
>Reporter: Yang Wang
>Priority: Major
> Attachments: image-2019-11-25-16-21-51-727.png, 
> image-2019-11-25-16-23-10-618.png
>
>
> Neither the standalone nor the YARN Flink WebUI could be displayed.
> !image-2019-11-25-16-21-51-727.png!
> However, the api server seems display normally.
> !image-2019-11-25-16-23-10-618.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29315) HDFSTest#testBlobServerRecovery fails on CI

2022-11-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629591#comment-17629591
 ] 

Yang Wang commented on FLINK-29315:
---

Maybe I could upgrade the kernel version of all the CI machines to 
5.4.220-1.el7.elrepo.x86_64. The current kernel-3.10.0 is too old.

> HDFSTest#testBlobServerRecovery fails on CI
> ---
>
> Key: FLINK-29315
> URL: https://issues.apache.org/jira/browse/FLINK-29315
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / FileSystem, Test Infrastructure, Tests
>Affects Versions: 1.16.0, 1.17.0, 1.15.2
>Reporter: Chesnay Schepler
>Priority: Blocker
>  Labels: pull-request-available
>
> The test started failing 2 days ago on different branches. I suspect 
> something's wrong with the CI infrastructure.
> {code:java}
> Sep 15 09:11:22 [ERROR] Failures: 
> Sep 15 09:11:22 [ERROR]   HDFSTest.testBlobServerRecovery Multiple Failures 
> (2 failures)
> Sep 15 09:11:22   java.lang.AssertionError: Test failed Error while 
> running command to get file permissions : java.io.IOException: Cannot run 
> program "ls": error=1, Operation not permitted
> Sep 15 09:11:22   at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:913)
> Sep 15 09:11:22   at org.apache.hadoop.util.Shell.run(Shell.java:869)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.execCommand(Shell.java:1264)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.execCommand(Shell.java:1246)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1089)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:697)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:233)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.checkDirInternal(DiskChecker.java:141)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:116)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2580)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2622)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2604)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2497)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1501)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:851)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:485)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:444)
> Sep 15 09:11:22   at 
> org.apache.flink.hdfstests.HDFSTest.createHDFS(HDFSTest.java:93)
> Sep 15 09:11:22   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Sep 15 09:11:22   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Sep 15 09:11:22   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Sep 15 09:11:22   at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> Sep 15 09:11:22   ... 67 more
> Sep 15 09:11:22 
> Sep 15 09:11:22   java.lang.NullPointerException: 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-22262) Flink on Kubernetes ConfigMaps are created without OwnerReference

2022-11-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629585#comment-17629585
 ] 

Yang Wang commented on FLINK-22262:
---

Thanks [~yunta] for sharing your problem. I think the root cause might be that 
the kubelet learns about the deployment deletion too late, so it simply starts 
the JobManager again right after it exits.

If you want to completely avoid this issue, I suggest using the job result 
store[1].

FYI: The Flink Kubernetes operator also uses the JRS to avoid duplicate 
submission.

 

[1]. 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore
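
For reference, a minimal flink-conf.yaml sketch for the job result store; the storage path is a placeholder and any durable filesystem supported by Flink should work:

{code}
# Keep job results in durable storage so a restarted JobManager can recognize
# an already-finished job and skip re-submission.
job-result-store.storage-path: s3://my-bucket/flink/job-result-store
# Keep the result entry instead of deleting it right after it is committed.
job-result-store.delete-on-commit: false
{code}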

> Flink on Kubernetes ConfigMaps are created without OwnerReference
> -
>
> Key: FLINK-22262
> URL: https://issues.apache.org/jira/browse/FLINK-22262
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.0
>Reporter: Andrea Peruffo
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor
> Attachments: jm.log
>
>
> According to the documentation:
> [https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#manual-resource-cleanup]
> The ConfigMaps created along with the Flink deployment are supposed to have an 
> OwnerReference pointing to the Deployment itself. Unfortunately, this doesn't 
> happen and causes all sorts of issues when the classpath and the jars of the 
> job are updated.
> i.e.:
> Without manually removing the ConfigMap of the Job I cannot update the Jars 
> of the Job.
> Can you please give guidance if there are additional caveats on manually 
> removing the ConfigMap? Any other workaround that can be used?
> Thanks in advance.
> Example ConfigMap:
> {{apiVersion: v1}}
> {{data:}}
> {{ address: akka.tcp://flink@10.0.2.13:6123/user/rpc/jobmanager_2}}
> {{ checkpointID-049: 
> rO0ABXNyADtvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuc3RhdGUuUmV0cmlldmFibGVTdHJlYW1TdGF0ZUhhbmRsZQABHhjxVZcrAgABTAAYd3JhcHBlZFN0cmVhbVN0YXRlSGFuZGxldAAyTG9yZy9hcGFjaGUvZmxpbmsvcnVudGltZS9zdGF0ZS9TdHJlYW1TdGF0ZUhhbmRsZTt4cHNyADlvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuc3RhdGUuZmlsZXN5c3RlbS5GaWxlU3RhdGVIYW5kbGUE3HXYYr0bswIAAkoACXN0YXRlU2l6ZUwACGZpbGVQYXRodAAfTG9yZy9hcGFjaGUvZmxpbmsvY29yZS9mcy9QYXRoO3hwAAABOEtzcgAdb3JnLmFwYWNoZS5mbGluay5jb3JlLmZzLlBhdGgAAQIAAUwAA3VyaXQADkxqYXZhL25ldC9VUkk7eHBzcgAMamF2YS5uZXQuVVJJrAF4LkOeSasDAAFMAAZzdHJpbmd0ABJMamF2YS9sYW5nL1N0cmluZzt4cHQAUC9tbnQvZmxpbmsvc3RvcmFnZS9rc2hhL3RheGktcmlkZS1mYXJlLXByb2Nlc3Nvci9jb21wbGV0ZWRDaGVja3BvaW50MDQ0YTc2OWRkNDgxeA==}}
> {{ counter: "50"}}
> {{ sessionId: 0c2b69ee-6b41-48d3-b7fd-1bf2eda94f0f}}
> {{kind: ConfigMap}}
> {{metadata:}}
> {{ annotations:}}
> {{ control-plane.alpha.kubernetes.io/leader: 
> '\{"holderIdentity":"0f25a2cc-e212-46b0-8ba9-faac0732a316","leaseDuration":15.0,"acquireTime":"2021-04-13T14:30:51.439000Z","renewTime":"2021-04-13T14:39:32.011000Z","leaderTransitions":105}'}}
> {{ creationTimestamp: "2021-04-13T14:30:51Z"}}
> {{ labels:}}
> {{ app: taxi-ride-fare-processor}}
> {{ configmap-type: high-availability}}
> {{ type: flink-native-kubernetes}}
> {{ name: 
> taxi-ride-fare-processor--jobmanager-leader}}
> {{ namespace: taxi-ride-fare}}
> {{ resourceVersion: "64100"}}
> {{ selfLink: 
> /api/v1/namespaces/taxi-ride-fare/configmaps/taxi-ride-fare-processor--jobmanager-leader}}
> {{ uid: 9f912495-382a-45de-a789-fd5ad2a2459d}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29871) Upgrade operator Flink version and examples to 1.16

2022-11-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629581#comment-17629581
 ] 

Yang Wang commented on FLINK-29871:
---

+1 for using X.Y.1 versions in the operator for stability.

> Upgrade operator Flink version and examples to 1.16
> ---
>
> Key: FLINK-29871
> URL: https://issues.apache.org/jira/browse/FLINK-29871
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Assignee: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-1.3.0
>
>
> We should update our Flink dependency and the default example version to 1.16



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29892) flink-conf.yaml does not accept hash (#) in the env.java.opts property

2022-11-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629577#comment-17629577
 ] 

Yang Wang commented on FLINK-29892:
---

This is a known limitation of the current Flink options parser. Similar to FLINK-15358.
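
For illustration only (this is not a copy of Flink's actual parser), the effect is roughly that of a line-based comment strip that ignores quoting:

{code:java}
public final class HashTruncationDemo {
    public static void main(String[] args) {
        String line = "env.java.opts: \"-Djavax.net.ssl.trustStorePassword=my#pwd\"";
        // Everything after '#' is treated as a comment, regardless of surrounding
        // quotes, so the value is cut off at the hash.
        String effective = line.split("#", 2)[0].trim();
        System.out.println(effective);
        // prints: env.java.opts: "-Djavax.net.ssl.trustStorePassword=my
    }
}
{code}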

> flink-conf.yaml does not accept hash (#) in the env.java.opts property
> --
>
> Key: FLINK-29892
> URL: https://issues.apache.org/jira/browse/FLINK-29892
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.15.2
>Reporter: Sergio Sainz
>Priority: Major
>
> When adding a string with a hash (#) character to env.java.opts in 
> flink-conf.yaml, the string will be truncated from the # onwards, even when 
> the value is surrounded by single or double quotes.
> example:
> (in flink-conf.yaml):
> env.java.opts: "-Djavax.net.ssl.trustStorePassword=my#pwd"
>  
> the value shown on the flink taskmanagers or job managers is :
> env.java.opts: -Djavax.net.ssl.trustStorePassword=my
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29898) ClusterEntrypointTest.testCloseAsyncShouldBeExecutedInShutdownHook fails on CI

2022-11-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629575#comment-17629575
 ] 

Yang Wang commented on FLINK-29898:
---

From the current implementation, this issue could only happen when starting a 
new JVM process takes too long (30s+). I assume something might be wrong with 
the virtual machine or the disk.

Let's hold off and see whether more pipelines fail in the same way.

> ClusterEntrypointTest.testCloseAsyncShouldBeExecutedInShutdownHook fails on CI
> --
>
> Key: FLINK-29898
> URL: https://issues.apache.org/jira/browse/FLINK-29898
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.17.0
>Reporter: Qingsheng Ren
>Priority: Major
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=42848=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=8345]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29539) dnsPolicy in FlinkPod is not overridable

2022-10-28 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625565#comment-17625565
 ] 

Yang Wang commented on FLINK-29539:
---

Thanks [~carloscastro] for your contribution.

> dnsPolicy in FlinkPod is not overridable 
> -
>
> Key: FLINK-29539
> URL: https://issues.apache.org/jira/browse/FLINK-29539
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Reporter: Carlos Castro
>Assignee: Carlos Castro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.15.3, 1.16.1
>
>
> With PR [https://github.com/apache/flink/pull/18119|https://github.com/apache/flink/pull/18119] 
> it stopped being possible to override the dnsPolicy in the FlinkPod spec.
> To fix it, we should first check whether the dnsPolicy is null before applying 
> the default.
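
As an illustration of the null check described above, a minimal sketch; the method and value names are illustrative, not the exact decorator code from the Flink repository:

{code:java}
import io.fabric8.kubernetes.api.model.PodSpec;

public final class DnsPolicyDefaulting {

    /** Applies the default dnsPolicy only when the user has not set one in the pod template. */
    static void applyDefaultDnsPolicy(PodSpec podSpec, boolean hostNetworkEnabled) {
        String userDefinedDnsPolicy = podSpec.getDnsPolicy(); // null when not set in the pod template
        if (userDefinedDnsPolicy == null || userDefinedDnsPolicy.isEmpty()) {
            podSpec.setDnsPolicy(hostNetworkEnabled ? "ClusterFirstWithHostNet" : "ClusterFirst");
        }
    }
}
{code}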



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-29539) dnsPolicy in FlinkPod is not overridable

2022-10-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang closed FLINK-29539.
-
Resolution: Fixed

Fixed via:

master: bb9f2525e6e16d00ef2f0739d9cb96c2e47e35e7

release-1.16: 6325adf40a1baa8c3ac82aa06c57425c3c6005c4

release-1.15: d9413d6bc6548ddd4c2e4d6e05db0903da064476

> dnsPolicy in FlinkPod is not overridable 
> -
>
> Key: FLINK-29539
> URL: https://issues.apache.org/jira/browse/FLINK-29539
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Reporter: Carlos Castro
>Assignee: Carlos Castro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.15.3, 1.16.1
>
>
> With PR [https://github.com/apache/flink/pull/18119|https://github.com/apache/flink/pull/18119] 
> it stopped being possible to override the dnsPolicy in the FlinkPod spec.
> To fix it, we should first check whether the dnsPolicy is null before applying 
> the default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-29539) dnsPolicy in FlinkPod is not overridable

2022-10-28 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-29539:
--
Fix Version/s: 1.17.0
   1.15.3
   1.16.1

> dnsPolicy in FlinkPod is not overridable 
> -
>
> Key: FLINK-29539
> URL: https://issues.apache.org/jira/browse/FLINK-29539
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Reporter: Carlos Castro
>Assignee: Carlos Castro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.15.3, 1.16.1
>
>
> With PR [https://github.com/apache/flink/pull/18119|https://github.com/apache/flink/pull/18119] 
> it stopped being possible to override the dnsPolicy in the FlinkPod spec.
> To fix it, we should first check whether the dnsPolicy is null before applying 
> the default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29749) flink info command support dynamic properties

2022-10-27 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625374#comment-17625374
 ] 

Yang Wang commented on FLINK-29749:
---

Thanks [~jackylau] for your contribution.

> flink info command support dynamic properties
> -
>
> Key: FLINK-29749
> URL: https://issues.apache.org/jira/browse/FLINK-29749
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: jackylau
>Assignee: jackylau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.15.3, 1.16.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-29749) flink info command support dynamic properties

2022-10-27 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang closed FLINK-29749.
-
Resolution: Fixed

Fixed via:

master: f8c6a668cd2b887f33a0cf4608de2d6b95c71f03

release-1.16: 38e90428bf7e603fdd353243f1edeba3553af2a3

release-1.15: 1d29f540a0692540a01b951033a8dc04fdb74d4f

> flink info command support dynamic properties
> -
>
> Key: FLINK-29749
> URL: https://issues.apache.org/jira/browse/FLINK-29749
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: jackylau
>Assignee: jackylau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.15.3, 1.16.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29315) HDFSTest#testBlobServerRecovery fails on CI

2022-10-27 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625129#comment-17625129
 ] 

Yang Wang commented on FLINK-29315:
---

Yes. All the CI machines have the same kernel version after the last upgrade. 
It is strange that only Alibaba001 is affected, and I have no idea what the 
root cause of this issue is.

BTW, simply restarting the virtual machine does not help.

> HDFSTest#testBlobServerRecovery fails on CI
> ---
>
> Key: FLINK-29315
> URL: https://issues.apache.org/jira/browse/FLINK-29315
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / FileSystem, Test Infrastructure, Tests
>Affects Versions: 1.16.0, 1.15.2
>Reporter: Chesnay Schepler
>Priority: Blocker
>  Labels: pull-request-available
>
> The test started failing 2 days ago on different branches. I suspect 
> something's wrong with the CI infrastructure.
> {code:java}
> Sep 15 09:11:22 [ERROR] Failures: 
> Sep 15 09:11:22 [ERROR]   HDFSTest.testBlobServerRecovery Multiple Failures 
> (2 failures)
> Sep 15 09:11:22   java.lang.AssertionError: Test failed Error while 
> running command to get file permissions : java.io.IOException: Cannot run 
> program "ls": error=1, Operation not permitted
> Sep 15 09:11:22   at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:913)
> Sep 15 09:11:22   at org.apache.hadoop.util.Shell.run(Shell.java:869)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.execCommand(Shell.java:1264)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.execCommand(Shell.java:1246)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1089)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:697)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:233)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.checkDirInternal(DiskChecker.java:141)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:116)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2580)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2622)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2604)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2497)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1501)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:851)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:485)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:444)
> Sep 15 09:11:22   at 
> org.apache.flink.hdfstests.HDFSTest.createHDFS(HDFSTest.java:93)
> Sep 15 09:11:22   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Sep 15 09:11:22   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Sep 15 09:11:22   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Sep 15 09:11:22   at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> Sep 15 09:11:22   ... 67 more
> Sep 15 09:11:22 
> Sep 15 09:11:22   java.lang.NullPointerException: 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-27579) The param client.timeout can not be set by dynamic properties when stopping the job

2022-10-27 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-27579:
--
Fix Version/s: 1.15.3

> The param client.timeout can not be set by dynamic properties when stopping 
> the job 
> 
>
> Key: FLINK-27579
> URL: https://issues.apache.org/jira/browse/FLINK-27579
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission
>Affects Versions: 1.16.0
>Reporter: Liu
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0, 1.15.3
>
>
> The default client.timeout value is one minute, which may be too short for 
> stop-with-savepoint on jobs with large state.
> When we set client.timeout via dynamic properties (-D, or -yD for YARN) while 
> stopping the job, it does not take effect.
> From the code, we can see that dynamic properties only take effect for the run 
> command. We should support them for the stop command as well.
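
For illustration, once dynamic properties are honoured by the stop command, an invocation along these lines would be expected to work (the savepoint path and job id are placeholders):

{code}
# raise the client-side timeout for a stop-with-savepoint on a large-state job
./bin/flink stop \
    -Dclient.timeout=600s \
    --savepointPath s3://my-bucket/savepoints \
    <jobId>
{code}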



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-27579) The param client.timeout can not be set by dynamic properties when stopping the job

2022-10-27 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang updated FLINK-27579:
--
Issue Type: Bug  (was: Improvement)

> The param client.timeout can not be set by dynamic properties when stopping 
> the job 
> 
>
> Key: FLINK-27579
> URL: https://issues.apache.org/jira/browse/FLINK-27579
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission
>Affects Versions: 1.16.0
>Reporter: Liu
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0, 1.15.3
>
>
> The default client.timeout value is one minute, which may be too short for 
> stop-with-savepoint on jobs with large state.
> When we set client.timeout via dynamic properties (-D, or -yD for YARN) while 
> stopping the job, it does not take effect.
> From the code, we can see that dynamic properties only take effect for the run 
> command. We should support them for the stop command as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29315) HDFSTest#testBlobServerRecovery fails on CI

2022-10-27 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624967#comment-17624967
 ] 

Yang Wang commented on FLINK-29315:
---

I have upgraded the kernel version of Alibaba001 to 
{{5.4.220-1.el7.elrepo.x86_64}}. Let's keep an eye on the CI pipelines.

> HDFSTest#testBlobServerRecovery fails on CI
> ---
>
> Key: FLINK-29315
> URL: https://issues.apache.org/jira/browse/FLINK-29315
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / FileSystem, Tests
>Affects Versions: 1.16.0, 1.15.2
>Reporter: Chesnay Schepler
>Priority: Blocker
>  Labels: pull-request-available
>
> The test started failing 2 days ago on different branches. I suspect 
> something's wrong with the CI infrastructure.
> {code:java}
> Sep 15 09:11:22 [ERROR] Failures: 
> Sep 15 09:11:22 [ERROR]   HDFSTest.testBlobServerRecovery Multiple Failures 
> (2 failures)
> Sep 15 09:11:22   java.lang.AssertionError: Test failed Error while 
> running command to get file permissions : java.io.IOException: Cannot run 
> program "ls": error=1, Operation not permitted
> Sep 15 09:11:22   at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:913)
> Sep 15 09:11:22   at org.apache.hadoop.util.Shell.run(Shell.java:869)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.execCommand(Shell.java:1264)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.execCommand(Shell.java:1246)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1089)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:697)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:233)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.checkDirInternal(DiskChecker.java:141)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:116)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2580)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2622)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2604)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2497)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1501)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:851)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:485)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:444)
> Sep 15 09:11:22   at 
> org.apache.flink.hdfstests.HDFSTest.createHDFS(HDFSTest.java:93)
> Sep 15 09:11:22   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Sep 15 09:11:22   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Sep 15 09:11:22   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Sep 15 09:11:22   at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> Sep 15 09:11:22   ... 67 more
> Sep 15 09:11:22 
> Sep 15 09:11:22   java.lang.NullPointerException: 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29315) HDFSTest#testBlobServerRecovery fails on CI

2022-10-26 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624861#comment-17624861
 ] 

Yang Wang commented on FLINK-29315:
---

I will have a look at the CI machine Alibaba001 today.

> HDFSTest#testBlobServerRecovery fails on CI
> ---
>
> Key: FLINK-29315
> URL: https://issues.apache.org/jira/browse/FLINK-29315
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / FileSystem, Tests
>Affects Versions: 1.16.0, 1.15.2
>Reporter: Chesnay Schepler
>Priority: Blocker
>  Labels: pull-request-available
>
> The test started failing 2 days ago on different branches. I suspect 
> something's wrong with the CI infrastructure.
> {code:java}
> Sep 15 09:11:22 [ERROR] Failures: 
> Sep 15 09:11:22 [ERROR]   HDFSTest.testBlobServerRecovery Multiple Failures 
> (2 failures)
> Sep 15 09:11:22   java.lang.AssertionError: Test failed Error while 
> running command to get file permissions : java.io.IOException: Cannot run 
> program "ls": error=1, Operation not permitted
> Sep 15 09:11:22   at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:913)
> Sep 15 09:11:22   at org.apache.hadoop.util.Shell.run(Shell.java:869)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.execCommand(Shell.java:1264)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.Shell.execCommand(Shell.java:1246)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1089)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:697)
> Sep 15 09:11:22   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:233)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.checkDirInternal(DiskChecker.java:141)
> Sep 15 09:11:22   at 
> org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:116)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2580)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2622)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2604)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2497)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1501)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:851)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:485)
> Sep 15 09:11:22   at 
> org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:444)
> Sep 15 09:11:22   at 
> org.apache.flink.hdfstests.HDFSTest.createHDFS(HDFSTest.java:93)
> Sep 15 09:11:22   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Sep 15 09:11:22   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Sep 15 09:11:22   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Sep 15 09:11:22   at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> Sep 15 09:11:22   ... 67 more
> Sep 15 09:11:22 
> Sep 15 09:11:22   java.lang.NullPointerException: 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-27579) The param client.timeout can not be set by dynamic properties when stopping the job

2022-10-26 Thread Yang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Wang reassigned FLINK-27579:
-

Assignee: Yao Zhang

> The param client.timeout can not be set by dynamic properties when stopping 
> the job 
> 
>
> Key: FLINK-27579
> URL: https://issues.apache.org/jira/browse/FLINK-27579
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission
>Affects Versions: 1.16.0
>Reporter: Liu
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> The default client.timeout value is one minute, which may be too short for 
> stop-with-savepoint on jobs with large state.
> When we set client.timeout via dynamic properties (-D, or -yD for YARN) while 
> stopping the job, it does not take effect.
> From the code, we can see that dynamic properties only take effect for the run 
> command. We should support them for the stop command as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-27579) The param client.timeout can not be set by dynamic properties when stopping the job

2022-10-26 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624827#comment-17624827
 ] 

Yang Wang commented on FLINK-27579:
---

When working on FLINK-29749, I realized that this ticket also needs to be 
backported to release-1.15. 

> The param client.timeout can not be set by dynamic properties when stopping 
> the job 
> 
>
> Key: FLINK-27579
> URL: https://issues.apache.org/jira/browse/FLINK-27579
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission
>Affects Versions: 1.16.0
>Reporter: Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> The default client.timeout value is one minute, which may be too short for 
> stop-with-savepoint on jobs with large state.
> When we set client.timeout via dynamic properties (-D, or -yD for YARN) while 
> stopping the job, it does not take effect.
> From the code, we can see that dynamic properties only take effect for the run 
> command. We should support them for the stop command as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

