[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728363#comment-17728363 ]

Weijie Guo commented on FLINK-31974:

master(1.18) via 3b9f7cf8ffcd357f252f62dee62d26dbc6a76e91.
release-1.17 waiting for CI.

> JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
> ---
>
> Key: FLINK-31974
> URL: https://issues.apache.org/jira/browse/FLINK-31974
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.17.0, 1.18.0
> Reporter: Sergio Sainz
> Assignee: Gyula Fora
> Priority: Major
> Labels: pull-request-available
>
> When resource quota limit is reached JobManager will throw
>
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Failed requesting worker with resource spec WorkerResourceSpec \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
>     at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.16.1.jar:1.16.1]
>     at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.16.1.jar:1.16.1]
>     ... 4 more
> {code}
> But {*}in Flink 1.17.0, Job Manager crashes{*}:
> {code}
> 2023-04-28
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727499#comment-17727499 ]

Gyula Fora commented on FLINK-31974:

No worries, I will assign it to myself and will work on this shortly.

> JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
> ---
>
> Key: FLINK-31974
> URL: https://issues.apache.org/jira/browse/FLINK-31974
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.17.0
> Reporter: Sergio Sainz
> Assignee: Weijie Guo
> Priority: Major
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727382#comment-17727382 ]

Weijie Guo commented on FLINK-31974:

[~gyfora] Sorry, I have been quite busy recently. Feel free to re-assign this ticket if you want to pick it up. :)
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727355#comment-17727355 ]

Gyula Fora commented on FLINK-31974:

[~Weijie Guo] are you working on this ticket?
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719744#comment-17719744 ]

Xintong Song commented on FLINK-31974:
--

Thanks all for the explanation and patience.

It seems there's a common tendency towards the retry-by-default approach. I also consulted a few colleagues from our Kubernetes team about this. They also share the opinion that there might be more error types that can be resolved by retrying than a whitelist could possibly handle. The only concern they mentioned is that continuous retrying may make it harder for the Kubernetes API Server to recover from outages, which I believe can be addressed with backoff and guardrails as [~mbalassi] mentioned.

I'd respect the opinion of the majority, withdraw my proposal, and +1 for [~gyfora]'s proposal.
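The "backoff and guardrails" idea mentioned in the comment above can be illustrated with a minimal sketch. This is not Flink's implementation and not the change that was merged for this ticket; the class name `BackoffRetry`, its method names, and all parameter values are hypothetical and chosen only for illustration. It shows a capped exponential delay plus a maximum attempt count, so that a persistently failing request does not hammer the API server indefinitely.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Flink or fabric8.
class BackoffRetry {

    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis. */
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(attempt, 30);            // keep the shift bounded
        if (baseMillis > (maxMillis >> shift)) {      // base * 2^shift would exceed the cap
            return maxMillis;
        }
        return baseMillis << shift;
    }

    /** Calls op until it succeeds or maxAttempts is used up; rethrows the last failure. */
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts,
                                      long baseMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    // Guardrail: wait longer after each failure instead of retrying immediately.
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated pod request that fails twice (e.g. quota exhausted) and then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("exceeded quota (simulated)");
            }
            return "pod created";
        }, 5, 10, 100);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The cap on the delay and the attempt limit are the two guardrails here; a real resource manager would more likely schedule the retry asynchronously than block a thread with `Thread.sleep`.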
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719727#comment-17719727 ]

Matthias Pohl commented on FLINK-31974:
---

[~sergiosp] I guess it's not necessary to provide the logs anymore. The problem is understood and the discussion has already moved on.

On the discussion about how to handle errors in this part of the code: tbh, initially I leaned towards [~xtsong]'s proposal, where he suggested making the error handling as strict as possible through a whitelist and avoiding yet another configuration parameter, with the idea in mind that Flink's deployment environment should be in a healthy state without any misconfiguration. But as the discussion moved on, I started to acknowledge that it's too strict in quite a few scenarios. I also get [~gyfora]'s point that we're not that restrictive in other places of the code base, either.

One concern I have with the error whitelisting, though, is that the error classification could become "complex". The error [~sergiosp] shared was about hitting quota limits. The error type we're seeing is a Forbidden error (unfortunately, without the error code being logged, but I would assume 403, analogous to the HTTP error code). I could imagine this error type also being returned in other cases (e.g. wrong service account being used). The former error is something we want to retry in certain scenarios, but the latter one (based on my understanding) could be considered a general infrastructure issue and, as a consequence, could be treated as a fatal error. It looks like it would require error message parsing to identify the type of error. How confident are we about the stability of those error messages? It looks like they are derived from the k8s HTTP responses and, therefore, might be stable among different Kubernetes versions. But generally, relying on error messages for deriving Flink's behavior doesn't feel right. Is this a valid concern?

In this sense, I started to favor what was proposed by [~gyfora] in the discussion. I might be wrong here because I'm not that familiar with the k8s API. I wanted to share this, anyway.
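To make the classification concern above concrete, here is a small sketch of what a whitelist would have to do. It assumes, as the comment does, that both the quota failure and a permission failure arrive with status code 403; the class `K8sErrorClassifier`, its enum, and the second sample message are hypothetical and not taken from Flink, fabric8, or a real API server response. The point is only that the status code alone cannot separate the two cases, so the decision falls back to matching message text.

```java
// Hypothetical classifier for illustration only; not Flink or fabric8 code.
class K8sErrorClassifier {

    enum Kind { QUOTA_EXCEEDED, FORBIDDEN_OTHER, OTHER }

    /** Classifies a failed request from its HTTP status code and message text. */
    public static Kind classify(int httpCode, String message) {
        if (httpCode != 403) {
            return Kind.OTHER;
        }
        // Assumption: quota failures can only be told apart from other 403s by their text.
        boolean quota = message != null && message.contains("exceeded quota");
        return quota ? Kind.QUOTA_EXCEEDED : Kind.FORBIDDEN_OTHER;
    }

    public static void main(String[] args) {
        // Message fragment copied from the issue description.
        String quotaMsg = "pods \"my-namespace-flink-cluster-taskmanager-1-2\" is forbidden: "
                + "exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, "
                + "used: limits.cpu=12100m, limited: limits.cpu=13.";
        // Made-up text standing in for a permission problem.
        String otherMsg = "forbidden: (hypothetical permission-related text)";

        System.out.println(classify(403, quotaMsg));
        System.out.println(classify(403, otherMsg));
        System.out.println(classify(500, "internal error"));
    }
}
```

The `contains("exceeded quota")` check is exactly the kind of dependency on message wording that the comment questions: if the wording changed between Kubernetes versions, the classifier would silently treat a retryable error as a fatal one.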
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719580#comment-17719580 ]

Xintong Song commented on FLINK-31974:
--

cc [~wangyang0918]
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719525#comment-17719525 ]

Sergio Sainz commented on FLINK-31974:

Hi [~mapohl] - let me set up a new cluster later on to get the full logs. Below please find the thread dump from the Flink 1.17.0 crash:

{code:java}
2023-04-28 20:50:50,305 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 0a97c80a173b7ebb619c5b030b607520: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
...
2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' produced an uncaught exception. Stopping the process...
java.util.concurrent.CompletionException: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/env-my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-realtime-server-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
    at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
    at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/env-my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-realtime-server-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.17.0.jar:1.17.0]
    at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.17.0.jar:1.17.0]
    ... 4 more
2023-04-28 20:50:50,602 ERROR org.apache.flink.util.FatalExitExceptionHandler [] - Thread dump:
"main" prio=5 Id=1 WAITING on java.util.concurrent.CompletableFuture$Signaller@2897b146
    at java.base@11.0.19/jdk.internal.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.CompletableFuture$Signaller@2897b146
    at java.base@11.0.19/java.util.concurrent.locks.LockSupport.park(Unknown Source)
    at java.base@11.0.19/java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
    at java.base@11.0.19/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
    at java.base@11.0.19/java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
    at
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719355#comment-17719355 ]

Thomas Weise commented on FLINK-31974:

There are many cases where errors are transient. This specific case is quite obvious: resource availability on a large cluster changes constantly. A pod that cannot be scheduled now may be schedulable a few seconds later. Other k8s-related issues can also be transient. For example, a request that failed due to rate limiting will likely succeed soon after, and we would actually make things worse by simply letting the job fail instead of following a backoff/retry strategy.

I'm also leaning towards a retry-by-default strategy, identifying the cases that should be treated as fatal errors.
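To make the backoff/retry idea concrete, here is a minimal Java sketch. It is not Flink code: the class `BackoffRetry`, its method names, and the doubling-with-cap schedule are all assumptions chosen for illustration, and a real resource manager would schedule retries asynchronously rather than sleep on the calling thread.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Flink: retries an action with capped exponential backoff.
final class BackoffRetry {
    private BackoffRetry() {}

    /** Delay before retrying after the given 0-based attempt: doubles each time, capped at maxMillis. */
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = Math.min(initialMillis, maxMillis);
        for (int i = 0; i < attempt; i++) {
            if (delay > maxMillis / 2) {
                return maxMillis; // doubling again would exceed the cap
            }
            delay *= 2;
        }
        return delay;
    }

    /** Runs the action until it succeeds or maxAttempts is used up; rethrows the last failure. */
    static <T> T retry(Supplier<T> action, int maxAttempts, long initialMillis, long maxMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // assumed transient here; a real caller would classify the error first
            }
            if (attempt + 1 < maxAttempts) {
                try {
                    Thread.sleep(delayMillis(attempt, initialMillis, maxMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
        }
        throw last;
    }
}
```

With an initial delay of 100 ms and a cap of 1000 ms, the waits would be 100, 200, 400, 800, then 1000 ms for every later attempt.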
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719331#comment-17719331 ]

Xintong Song commented on FLINK-31974:

[~gyfora],

bq. Flink treats only very few errors fatal. IO errors, connector (source/sink ) errors etc all cause job restarts and in many cases "Flink cannot recover from by itself". You actually expect the error to be temporary and hopefully not get it after the restart. So I think it would be generally inconsistent with the current error handling behaviour if resource manager errors would simply let the job die fatally and not retry in the same way.

I think the difference here is that IO errors and connector errors affect the job but not the Flink cluster / deployment. Thinking of a session cluster, we should not fail the cluster for an error from a single job. But the resource manager interacting with the Kubernetes API server is a cluster behavior, and conceptually we don't distinguish resources for individual jobs until the slots are allocated. Moreover, it's possible that multiple jobs share the same resource (pod). One could argue that in application mode the cluster / deployment is equivalent to the job. However, the cluster mode (session / application) is transparent to the resource manager.

bq. Flink jobs/clusters should be resilient and keep retrying in case of errors and should not give up especially for streaming workloads.

This is different from the feedback that I get from our production. But I can understand if that's what some of the users want. So I guess it may be worth a configuration option, as you suggested.

[~mbalassi],

+1 to what you said about the specific case. I think there's a consensus that reaching the quota limit should not be treated as a fatal error.
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719275#comment-17719275 ]

Márton Balassi commented on FLINK-31974:

In the specific case I much prefer the behaviour exhibited by 1.16.1. Resource quota availability changes dynamically; if the JobManager kept retrying (ideally with a backoff), it is not unreasonable to expect that it would eventually succeed in most real-world scenarios. Adding some guardrails around this (if a minimum parallelism is not satisfied, fail instead; if a max timeout is reached, fail; etc.) to avoid ending up with many small jobs competing for insufficient resources and wasting capacity would be acceptable to me, but outright failing on the first try is more a bug than a feature imho. :)
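The guardrails mentioned above could be sketched roughly as follows. This is only one possible reading, and every name in it (`AllocationGuardrail`, `Decision`, the parameters) is hypothetical rather than an existing Flink API: the sketch keeps retrying while the wait budget lasts, and fails only once the maximum wait has elapsed with the minimum parallelism still unmet.

```java
// Hypothetical decision helper, not part of Flink: combines a max-wait and a min-parallelism guardrail.
final class AllocationGuardrail {
    private AllocationGuardrail() {}

    enum Decision { KEEP_RETRYING, FAIL }

    /**
     * Decides what to do after a failed pod request.
     * Assumption: the job is failed only if the wait budget is exhausted AND
     * fewer slots than the minimum parallelism have been acquired so far.
     */
    static Decision afterFailedRequest(
            long elapsedMillis, long maxWaitMillis, int slotsAcquired, int minParallelism) {
        boolean timedOut = elapsedMillis >= maxWaitMillis;
        boolean belowMinimum = slotsAcquired < minParallelism;
        return (timedOut && belowMinimum) ? Decision.FAIL : Decision.KEEP_RETRYING;
    }
}
```

Other combinations are equally plausible, for example failing on timeout regardless of acquired slots; the comment does not pin down which is intended.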
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719249#comment-17719249 ]

Gyula Fora commented on FLINK-31974:

cc [~mbalassi] [~mxm] [~thw]
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719247#comment-17719247 ]

Gyula Fora commented on FLINK-31974:

Flink treats only very few errors as fatal. IO errors, connector (source/sink) errors etc. all cause job restarts, and in many cases these are errors that "Flink cannot recover from by itself". You actually expect the error to be temporary and hopefully not get it after the restart. So I think it would be generally inconsistent with the current error handling behaviour if resource manager errors simply let the job die fatally and not retry in the same way.

So I am mostly looking at this from the user perspective. Flink jobs/clusters should be resilient and keep retrying in case of errors, and should not give up, especially for streaming workloads. This is how it works now and this is what most users expect, I think.
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719228#comment-17719228 ]

Xintong Song commented on FLINK-31974:

[~gyfora],

IMO, errors that Flink cannot recover from by itself should be considered fatal. E.g., for a permission issue, if Flink is not given the detail that it's due to reaching the quota limit, I don't see how Flink can fix that by itself.

I would be fine with Flink trying to improve how it handles various errors based on an understanding of what the errors mean. However, I'd be hesitant to simply retry on arbitrary errors.

bq. because more often than not these are actually temporary

TBH, my observations are to the contrary. This might be because of differences between our production environments.

bq. At least this should be configurable

Normally, I'd avoid introducing new configuration unless absolutely necessary. In this case, if you believe it is worth the complexity not to trigger a re-deployment upon arbitrary errors, I'd be fine with making it configurable. I'm still trying to understand why re-deployment upon API server outage is a big deal. Is it because outages happen a lot in your production environment?

bq. retry everything based on the restart strategy

I believe the restart strategy only controls behavior upon job failures. An error thrown from the interactions between the resource manager and the Kubernetes cluster would not invoke the restart strategy, unless you mean waiting for the resource allocation timeout.
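One way to picture "handling errors based on an understanding of what they mean" is a small classifier that is fatal by default and marks only recognised cases as transient. The sketch below is illustrative only: `KubeErrorClassifier` and `Kind` are hypothetical names, the status-code rules are assumptions, and matching on the text "exceeded quota" is taken from the log message in this issue rather than from any documented contract.

```java
// Hypothetical classifier, not part of Flink: fatal by default, transient only for understood cases.
final class KubeErrorClassifier {
    private KubeErrorClassifier() {}

    enum Kind { TRANSIENT, FATAL }

    static Kind classify(int httpStatus, String message) {
        String text = (message == null) ? "" : message;
        // Assumption: rate limiting (429) and server-side errors (5xx) are worth retrying.
        if (httpStatus == 429 || httpStatus >= 500) {
            return Kind.TRANSIENT;
        }
        // A 403 caused by a resource quota may clear up once other pods release resources.
        if (httpStatus == 403 && text.contains("exceeded quota")) {
            return Kind.TRANSIENT;
        }
        // Everything else, including a plain permission failure, is treated as fatal.
        return Kind.FATAL;
    }
}
```

Flipping the default to `TRANSIENT` and listing the fatal cases instead would give the retry-by-default behaviour argued for elsewhere in this thread.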
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719213#comment-17719213 ]

Gyula Fora commented on FLINK-31974:

[~xtsong] which errors would you consider actually fatal in the Kubernetes world? From my perspective, I would like to treat almost every Kubernetes error as non-fatal. At the very least this should be configurable because, as you say, some may prefer shutting down the Flink jobs (fatal) while others (us, for instance) would like to retry everything based on the restart strategy.
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719209#comment-17719209 ]

Xintong Song commented on FLINK-31974:

Not sure about never giving fatal exceptions. I personally would lean towards a whitelist approach, where Flink only handles a certain set of errors that are known to be non-fatal, and by default fails on whatever errors it doesn't recognize and doesn't know how to handle.

My concern with retrying by default is that, in a large Kubernetes cluster with lots of applications, this approach would increase the load on the Kubernetes API server and sometimes make a temporary outage even harder to recover from. I've seen that many times in production.
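The whitelist idea above can be sketched roughly as follows. This is only an illustration, not Flink code: `ApiFailure` is a hypothetical stand-in for the Kubernetes client exception, `FailureClassifier` and its rule list are made-up names, and matching on HTTP 403 plus the "exceeded quota" text from the reported message is an assumption about how the quota case might be recognized.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for an API failure; not a Flink or fabric8 class. */
final class ApiFailure extends RuntimeException {
    final int httpStatus;

    ApiFailure(int httpStatus, String message) {
        super(message);
        this.httpStatus = httpStatus;
    }
}

/** Classifies failures with a whitelist: only recognized errors are non-fatal. */
final class FailureClassifier {
    enum Verdict { NON_FATAL, FATAL }

    // Whitelist of conditions assumed to be recoverable. The single rule mirrors the
    // "exceeded quota" text of the reported message; the rule set is illustrative.
    private static final List<Predicate<ApiFailure>> KNOWN_NON_FATAL = List.of(
            f -> f.httpStatus == 403
                    && f.getMessage() != null
                    && f.getMessage().contains("exceeded quota"));

    static Verdict classify(Throwable t) {
        if (t instanceof ApiFailure) {
            ApiFailure failure = (ApiFailure) t;
            for (Predicate<ApiFailure> rule : KNOWN_NON_FATAL) {
                if (rule.test(failure)) {
                    return Verdict.NON_FATAL;
                }
            }
        }
        // Default: anything not on the whitelist stays fatal.
        return Verdict.FATAL;
    }
}
```

With this shape, making more errors non-fatal means adding a rule to the list, while an unrecognized error keeps failing fast, which is the behavior the whitelist proposal asks for.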
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719196#comment-17719196 ]

Gyula Fora commented on FLINK-31974:

Somewhat of a side comment: I think that in the native Kubernetes integration case we should basically never raise these fatal exceptions. Even if there is a missing service account, a missing permission, or a timeout, we should keep retrying, because more often than not these are actually temporary (even if they need some time to be resolved by the operating platform team). A fatal job error requires a complete redeployment, which is not what most users want. For standalone this may be different, but there we will get different errors.
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719187#comment-17719187 ]

Xintong Song commented on FLINK-31974:

[~mapohl], there are two paths for JobMaster to handle the situation where resources are not obtained:
- With the timeout
- {{JobMasterGateway#notifyNotEnoughResourcesAvailable}}

The second path lets the job fail earlier rather than waiting for the timeout, if SlotManager knows that the resource cannot be obtained and it makes no sense to wait, e.g., in a standalone cluster.

I think the question is, after we identify the specific error that indicates an exceeded quota, how do we pass this information all the way from {{KubernetesResourceManagerDriver}} to {{ResourceManager}} and to {{SlotManager}}. I think it shouldn't be complex to complete the missing part of the path.
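The two paths described above can be modeled with a single future, as a minimal sketch. `SlotRequest` is a hypothetical class, not Flink's JobMaster; the method name `notifyNotEnoughResourcesAvailable` is borrowed from the comment only for readability, and its `String` parameter is an assumption. `CompletableFuture#orTimeout` requires Java 9 or later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a job waiting for slots; not Flink's JobMaster. */
final class SlotRequest {
    private final CompletableFuture<String> result = new CompletableFuture<>();

    SlotRequest(long timeoutMillis) {
        // Path 1: if nothing else happens, the request fails with a
        // TimeoutException once the timeout elapses.
        result.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Path 2: fail early when it is known that waiting makes no sense. */
    void notifyNotEnoughResourcesAvailable(String reason) {
        result.completeExceptionally(new IllegalStateException(reason));
    }

    /** Success path: the requested slots arrived before either failure path fired. */
    void slotsAllocated() {
        result.complete("RUNNING");
    }

    CompletableFuture<String> result() {
        return result;
    }
}
```

Whichever path completes the future first wins; later completions are ignored. The open question in the thread is the wiring that would trigger the second path from the Kubernetes driver, which this sketch does not attempt to show.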
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719176#comment-17719176 ]

Matthias Pohl commented on FLINK-31974:

Sounds good to me, too. Just for me to understand: by "we can pass this information to JobMaster", do you mean letting the SlotManager implementations deal with the timeout you mentioned, which occurs when we fail to create a new worker? That way, we only need to identify the {{KubernetesClientException}} in {{ResourceManagerDriver#requestResource}} and print a warning to make the user aware.

I'm asking because I struggled to find a code path between the JobMaster and the ResourceManager that would enable us to inform the JobMaster about this specific error, which is how I understood "passing the information" in the first place.
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719144#comment-17719144 ]

Weijie Guo commented on FLINK-31974:

Thanks Xintong for the analysis and proposal. Relying on JobMaster to decide whether to fail the job makes sense to me. IMO, the exception mentioned in the ticket should not arbitrarily cause the JM to crash, especially for batch workloads. If this is reasonable, I'm willing to fix it.
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719116#comment-17719116 ]

Xintong Song commented on FLINK-31974:

Thanks [~sergiosp] for reporting, and thanks [~mapohl] for looking into this.

bq. The k8s cluster doesn't provide the resources so that the Flink cluster would be able to handle the parallelism of the submitted job.

This is not always true. For streaming workloads in reactive mode, it is expected that not all requested resources can be obtained, and as long as the minimum resource requirements are fulfilled the job can be executed. For batch workloads, too, a job can ideally be executed with a single slot, because tasks don't have to be executed at the same time. Moreover, there's a timeout on the JobMaster side that will fail the job if resources cannot be fulfilled within a certain time, with the execution mode and minimum resource requirements taken into consideration.

In most cases, failing to obtain a resource looks like this: Flink creates the metadata of the desired pod at the K8s API server and then keeps waiting for the K8s cluster to schedule and bring up the pod. In this case, however, the request throws an exception, which the current implementation does not cover.

I think we may identify the specific error and not treat it as a fatal error. Instead, we can pass this information to JobMaster via {{JobMasterGateway#notifyNotEnoughResourcesAvailable}} and rely on JobMaster to decide whether it should fail the job. WDYT?
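The point about execution modes can be made concrete with a small decision helper. This is a hedged sketch: `ResourceDecision`, the `Mode` enum, and the rule that plain streaming needs every requested slot are assumptions made for illustration; only the reactive-mode and batch rules come from the comment above.

```java
/** Hypothetical decision helper; the modes and thresholds are illustrative only. */
final class ResourceDecision {
    enum Mode { STREAMING, REACTIVE, BATCH }

    /**
     * Returns true if the job could still run with the slots obtained so far.
     *
     * @param requestedSlots slots needed for the full requested parallelism
     * @param minSlots minimum slots needed in reactive mode
     * @param acquiredSlots slots actually obtained
     */
    static boolean canRun(Mode mode, int requestedSlots, int minSlots, int acquiredSlots) {
        switch (mode) {
            case REACTIVE:
                // Reactive mode: the minimum resource requirement is enough.
                return acquiredSlots >= minSlots;
            case BATCH:
                // Batch: tasks need not run at the same time, so one slot can suffice.
                return acquiredSlots >= 1;
            default:
                // Assumption: plain streaming needs the full requested parallelism.
                return acquiredSlots >= requestedSlots;
        }
    }
}
```

For example, with 12 requested slots and a quota that only allows 5, `canRun(Mode.REACTIVE, 12, 4, 5)` and `canRun(Mode.BATCH, 12, 1, 5)` both hold, while `canRun(Mode.STREAMING, 12, 12, 5)` does not. That is why a failed pod request alone does not tell the resource manager whether the job must fail.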
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718964#comment-17718964 ] Matthias Pohl commented on FLINK-31974: --- I'm still wondering what the desired behavior in that case is. The k8s cluster doesn't provide the resources so that the Flink cluster would be able to handle the parallelism of the submitted job. In my opinion, it feels like the fatal error is correct. [~xtsong] what's your take on that one?
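To make the two behaviors under discussion concrete (logging the failed worker request and carrying on, as in 1.16.1, versus escalating it to a fatal handler, as in 1.17.0), here is a minimal sketch using plain CompletableFuture. It is an illustration under stated assumptions, not Flink's code: WorkerRequestHandling, QuotaExceededException, and the AtomicReference standing in for a fatal error handler are all hypothetical, and the sketch does not reproduce the actual FLINK-30908 change.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; none of these names exist in Flink.
final class WorkerRequestHandling {

    /** Stand-in for the quota error; not the fabric8 KubernetesClientException. */
    static final class QuotaExceededException extends RuntimeException {
        QuotaExceededException(String message) {
            super(message);
        }
    }

    /** Graceful variant: the failure is observed and turned into a warning. */
    static String graceful(CompletableFuture<String> podRequest) {
        return podRequest
                .handle((pod, error) -> {
                    if (error != null) {
                        // A real system would log and schedule a retry here.
                        return "WARN retry later: " + unwrap(error).getMessage();
                    }
                    return "OK " + pod;
                })
                .join();
    }

    /** Fatal variant: any failure is forwarded to a handler that would stop the process. */
    static void fatal(CompletableFuture<String> podRequest,
                      AtomicReference<Throwable> fatalHandler) {
        podRequest.whenComplete((pod, error) -> {
            if (error != null) {
                fatalHandler.set(unwrap(error));
            }
        });
    }

    /** Strips the CompletionException wrapper that dependent stages may add. */
    static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }

    public static void main(String[] args) {
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new QuotaExceededException("exceeded quota"));

        System.out.println(graceful(failed));

        AtomicReference<Throwable> fatalHandler = new AtomicReference<>();
        fatal(failed, fatalHandler);
        System.out.println("fatal handler received: " + fatalHandler.get());
    }
}
```

The difference is only in where the exception ends up: in the first variant it stays inside the request path and the caller can retry once quota frees up, while in the second it reaches a handler whose job is to terminate the process.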
[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718958#comment-17718958 ] Matthias Pohl commented on FLINK-31974: --- Thanks for reporting. This is caused by the changes that were introduced with FLINK-30908. You should see an additional error message "Error completing resource request" in the logs. > JobManager crashes after KubernetesClientException exception with > FatalExitExceptionHandler > --- > > Key: FLINK-31974 > URL: https://issues.apache.org/jira/browse/FLINK-31974 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.17.0 >Reporter: Sergio Sainz >Priority: Major > > When resource quota limit is reached JobManager will throw > > org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: > Failure executing: POST at: > https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: > Forbidden!Configured service account doesn't have access. Service account may > have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is > forbidden: exceeded quota: my-namespace-resource-quota, requested: > limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13. > > In {*}1.16.1 , this is handled gracefully{*}: > 2023-04-28 22:07:24,631 WARN > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - > Failed requesting worker with resource spec WorkerResourceSpec > \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 > bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb > (241591914 bytes), numSlots=4}, current pending count: 0 > java.util.concurrent.CompletionException: > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: > Forbidden!Configured service account doesn't have access. Service account may > have been revoked. 
pods "my-namespace-flink-cluster-taskmanager-1-138" is > forbidden: exceeded quota: my-namespace-resource-quota, requested: > limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13. > at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown > Source) ~[?:?] > at java.util.concurrent.CompletableFuture.completeThrowable(Unknown > Source) ~[?:?] > at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown > Source) ~[?:?] > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) > ~[?:?] > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > ~[?:?] > at java.lang.Thread.run(Unknown Source) ~[?:?] > Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. pods > "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: > my-namespace-resource-quota, requested: limits.cpu=3, used: > limits.cpu=12100m, limited: limits.cpu=13. 
> at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) > ~[flink-dist-1.16.1.jar:1.16.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) > ~[flink-dist-1.16.1.jar:1.16.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) > ~[flink-dist-1.16.1.jar:1.16.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) > ~[flink-dist-1.16.1.jar:1.16.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) > ~[flink-dist-1.16.1.jar:1.16.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) > ~[flink-dist-1.16.1.jar:1.16.1] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) > ~[flink-dist-1.16.1.jar:1.16.1] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) > ~[flink-dist-1.16.1.jar:1.16.1] > at > io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) > ~[flink-dist-1.16.1.jar:1.16.1] > at > org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) > ~[flink-dist-1.16.1.jar:1.16.1] > ... 4 more > > > But , {*}in Flink 1.17.0 , Job Manager crashes{*}: > 2023-04-28