[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719329#comment-17719329 ] Sriram Ganesh commented on FLINK-31135: --- In my case, it is not a permission issue. I couldn't reproduce it; my hunch is that there was a transient network issue at the time. Thanks, [~Swathi Chandrashekar] [~zhihaochen]. I am closing it. > ConfigMap DataSize went > 1 MB and cluster stopped working > -- > > Key: FLINK-31135 > URL: https://issues.apache.org/jira/browse/FLINK-31135 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator > Affects Versions: kubernetes-operator-1.2.0 > Reporter: Sriram Ganesh > Priority: Major > Attachments: dump_cm.yaml, > flink--kubernetes-application-0-parked-logs-ingestion-644b80-b4bc58747-lc865.log.zip, > image-2023-04-19-09-48-19-089.png, image-2023-05-03-13-47-51-440.png, > image-2023-05-03-13-50-54-783.png, image-2023-05-03-13-51-21-685.png, > jobmanager_log.txt, > parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml > > > I am using the Flink Operator to manage clusters. Flink version: 1.15.2. Flink jobs > failed with the error below. It seems the ConfigMap size went beyond 1 MB > (the default limit). > Since it is managed by the operator and the config maps are not updated by any > manual intervention, I suspect it could be an operator issue. > > {code:java} > Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: PUT at: > https:///api/v1/namespaces//configmaps/-config-map. Message: > ConfigMap "-config-map" is invalid: []: Too long: must have at most > 1048576 bytes. 
Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must > have at most 1048576 bytes, reason=FieldValueTooLong, > additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, > retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, > message=ConfigMap "-config-map" is invalid: []: Too long: must have at > most 1048576 bytes, metadata=ListMeta(_continue=null, > remainingItemCount=null, resourceVersion=null, selfLink=null, > additionalProperties={}), reason=Invalid, status=Failure, > additionalProperties={}). > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130) > ~[flink-dist-1.15.2.jar:1.15.2] > 
at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41) > ~[flink-dist-1.15.2.jar:1.15.2] > at > org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325) > ~[flink-dist-1.15.2.jar:1.15.2] > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) > ~[?:?] > ... 3 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
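The 1048576-byte figure in the error above is the Kubernetes API server's hard cap on a ConfigMap's total serialized size (1 MiB). A minimal sketch of pre-checking whether a ConfigMap's data is approaching that cap; this is plain Python with hypothetical data, not Flink or operator code:

```python
# Kubernetes rejects writes that would make a ConfigMap's serialized size
# exceed 1 MiB (1048576 bytes) -- the "Too long" error in the stack trace.
MAX_CONFIGMAP_BYTES = 1_048_576


def configmap_data_size(data: dict) -> int:
    """Approximate size of a ConfigMap's data section: key plus value bytes."""
    return sum(len(k.encode()) + len(v.encode()) for k, v in data.items())


def would_exceed_limit(data: dict) -> bool:
    return configmap_data_size(data) > MAX_CONFIGMAP_BYTES


# Hypothetical illustration: ~1040 checkpoint records of roughly 1 KB each
# (a count seen later in this thread) is already past the cap.
records = {f"checkpointID-{i:07d}": "x" * 1000 for i in range(1, 1041)}
print(configmap_data_size(records), would_exceed_limit(records))
```

Note that the real limit applies to the whole serialized object, metadata included, so the usable data budget is slightly smaller than 1 MiB.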
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719148#comment-17719148 ] SwathiChandrashekar commented on FLINK-31135: - That's great [~zhihaochen] :) [~zhihaochen], [~sriramgr], [~mxm], I don't have permission to close the issue. Could you please close it if no further action is needed?
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719145#comment-17719145 ] Zhihao Chen commented on FLINK-31135: - It's working as expected now after we fixed our S3 deletion issue. Thanks for your help!
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719102#comment-17719102 ] Zhihao Chen commented on FLINK-31135: - [~Swathi Chandrashekar] thank you for pointing it out! I believe there are some S3 permission issues on our side; I had missed that error information. I'll fix it on our side and let you know once it's all good. Please feel free to close this ticket.
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718815#comment-17718815 ] SwathiChandrashekar commented on FLINK-31135: - [~zhihaochen], thanks for attaching the JM logs from the initial job submission. The earlier JM logs did not help much with root-causing, since they covered a recovered job that already had multiple checkpoints; the checkpoints triggered after that failed due to the >1 MB issue and would have needed manual cleanup had they succeeded. The new JM attachment made the issue easy to deduce because it starts with the job submission. After the initial 5 checkpoints, when cleanup was happening, Flink threw the following error: {code:java} {"@timestamp":"2023-05-02T07:21:34.047Z","ecs.version":"1.2.0","log.level":"WARN","message":"Fail to subsume the old checkpoint.","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointSubsumeHelper","error.type":"java.util.concurrent.ExecutionException","error.message":"java.io.IOException: /parked-logs-ingestion-644b80/ha/parked-logs-ingestion-644b80/completedCheckpointf2bebc94bd32 could not be deleted for unknown reasons.","error.stack_trace":"java.util.concurrent.ExecutionException: java.io.IOException: /parked-logs-ingestion-644b80/ha/parked-logs-ingestion-644b80/completedCheckpointf2bebc94bd32 could not be deleted for unknown reasons.\n\tat java.base/java.util.concurrent.CompletableFuture.reportGet(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.get(Unknown Source)\n\tat org.apache.flink.kubernetes.highavailability.KubernetesStateHandleStore.releaseAndTryRemove(KubernetesStateHandleStore.java:526)\n\tat org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.tryRemove(DefaultCompletedCheckpointStore.java:242)\n\tat org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.tryRemoveCompletedCheckpoint(DefaultCompletedCheckpointStore.java:227)\n\tat 
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.lambda$addCheckpointAndSubsumeOldestOne$0(DefaultCompletedCheckpointStore.java:145)\n\tat org.apache.flink.runtime.checkpoint.CheckpointSubsumeHelper.subsume(CheckpointSubsumeHelper.java:70)\n\tat org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.addCheckpointAndSubsumeOldestOne(DefaultCompletedCheckpointStore.java:141)\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.addCompletedCheckpointToStoreAndSubsumeOldest(CheckpointCoordinator.java:1382)\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1249)\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1134)\n\tat org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)\n\tat org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: java.io.IOException: /parked-logs-ingestion-644b80/ha/parked-logs-ingestion-644b80/completedCheckpointf2bebc94bd32 could not be deleted for unknown reasons.\n\tat org.apache.flink.fs.s3presto.FlinkS3PrestoFileSystem.deleteObject(FlinkS3PrestoFileSystem.java:135)\n\tat org.apache.flink.fs.s3presto.FlinkS3PrestoFileSystem.delete(FlinkS3PrestoFileSystem.java:66)\n\tat org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.delete(PluginFileSystemFactory.java:155)\n\tat org.apache.flink.runtime.state.filesystem.FileStateHandle.discardState(FileStateHandle.java:89)\n\tat 
org.apache.flink.runtime.state.RetrievableStreamStateHandle.discardState(RetrievableStreamStateHandle.java:76)\n\tat org.apache.flink.kubernetes.highavailability.KubernetesStateHandleStore.lambda$releaseAndTryRemove$12(KubernetesStateHandleStore.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)\n\tat org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperation$1(FutureUtils.java:201)\n\tat java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)\n\t... 3 more\n"} {code} Flink is encountering an IOException while trying to delete a file in S3; per the log, the file could not be deleted for unknown reasons.
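The cleanup path in this trace only removes a checkpoint record from the ConfigMap after the underlying state is successfully discarded, so a persistently failing S3 delete makes the records accumulate. A toy simulation of that behaviour; the class and function names here are hypothetical, not Flink's actual implementation:

```python
class ToyCheckpointStore:
    """Toy model of a bounded checkpoint store backed by a key-value map.

    A record is removed from the map only if discarding the underlying
    state (e.g. an S3 delete) succeeds, so persistent delete failures
    make the map grow without bound.
    """

    def __init__(self, max_retained, discard_fn):
        self.max_retained = max_retained
        self.discard_fn = discard_fn   # deletes backing state; may raise
        self.records = {}              # stands in for the HA ConfigMap data

    def add_checkpoint(self, checkpoint_id):
        self.records[f"checkpointID-{checkpoint_id:07d}"] = "handle"
        # Subsume: try to drop the oldest records beyond the retention limit.
        while len(self.records) > self.max_retained:
            oldest = min(self.records)
            try:
                self.discard_fn(oldest)
            except IOError:
                # Delete failed ("could not be deleted for unknown
                # reasons"): keep the record and give up for now.
                return
            del self.records[oldest]


def broken_delete(key):
    raise IOError(f"{key} could not be deleted")


store = ToyCheckpointStore(max_retained=5, discard_fn=broken_delete)
for cid in range(1, 101):
    store.add_checkpoint(cid)
print(len(store.records))  # far beyond max_retained
```

With a working `discard_fn` the map stays at `max_retained` entries; with the failing one, every record is kept.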
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718781#comment-17718781 ] Zhihao Chen commented on FLINK-31135: - hey [~Swathi Chandrashekar], thank you for looking into it. {quote}This error was populated for all the checkpoints due to state inconsistency which resulted in storing lot of checkpoints in S3, which eventually caused the size of the configMap > 1MB ] {quote} I don't think that's the case. Instead, none of the checkpoint records in the CM were ever cleaned up. When the CM reaches the 1MB size limitation, the error "Flink was not able to determine whether the metadata was successfully persisted." starts to happen. I have another Flink job running here as an example. ConfigMap: [^parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml] The checkpoint IDs run consecutively: "checkpointID-001", "checkpointID-002", ... "checkpointID-0001040". Worth noting that the IDs range from "1" to "1040", and the ConfigMap has reached the 1MB size limitation. The "Flink was not able to determine whether the metadata was successfully persisted." error actually started when the CM appended the record "1040". Please see the logs below; the bottom one is the first error log, which complains about record "1041". I think that makes sense, as that record is not in the CM, hence Flink can't determine whether the metadata was successfully persisted. !image-2023-05-03-13-47-51-440.png|width=1579,height=861! The Flink dashboard log also supports this reading. !image-2023-05-03-13-51-21-685.png|width=1473,height=783! My guess is that Flink never cleaned up any of the records in the CM at all in our case. 
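The pattern described above (IDs 1 through 1040 present, first failure complaining about 1041) is what a size-capped store produces: the write that would cross the cap is rejected, so the failing checkpoint never appears in the map. A toy sketch with hypothetical record sizes, not the real ConfigMap contents:

```python
CAP = 1_048_576  # the ConfigMap size limit from the error message


class CappedMap:
    """Rejects a put that would push total key+value bytes past the cap,
    roughly how the Kubernetes API server enforces the ConfigMap limit."""

    def __init__(self):
        self.data = {}

    def size(self):
        return sum(len(k) + len(v) for k, v in self.data.items())

    def put(self, key, value):
        if self.size() + len(key) + len(value) > CAP:
            raise ValueError("Too long: must have at most 1048576 bytes")
        self.data[key] = value


cm = CappedMap()
last_ok = None
for i in range(1, 2000):
    try:
        cm.put(f"checkpointID-{i:07d}", "x" * 1000)  # ~1 KB per record
        last_ok = i
    except ValueError:
        break
# The last recorded ID is last_ok; the first checkpoint whose metadata
# write fails is last_ok + 1, matching the 1040 vs. 1041 observation.
print(last_ok)
```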
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718686#comment-17718686 ] SwathiChandrashekar commented on FLINK-31135: - [~zhihaochen], according to the JM logs, for jobId 07bfdfef145a87c2071965081aaff548 Flink tried to recover the job along with checkpoints 1012 - 1015. The pointers were present in the ConfigMap parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map, which was already at 1 MB, so any checkpoint triggered after that failed with the following error. {"@timestamp":"2023-04-26T23:46:54.190Z","ecs.version":"1.2.0","log.level":"WARN","message":"An error occurred while writing checkpoint 11211 to the underlying metadata store. Flink was not able to determine whether the metadata was successfully persisted. The corresponding state located at 's3://eureka-flink-data-prod/parked-logs-ingestion-16818796-0c2923/checkpoints/07bfdfef145a87c2071965081aaff548/{*}chk-11211' won't be discarded and needs to be cleaned up manually.{*}","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"} {"@timestamp":"2023-04-26T23:46:54.248Z","ecs.version":"1.2.0","log.level":"WARN","message":"{*}Error while processing AcknowledgeCheckpoint{*} message","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.jobmaster.JobMaster","error.type":"org.apache.flink.runtime.checkpoint.CheckpointException","error.message":"Could not complete the pending checkpoint 11211. Failure reason: Failure to finalize checkpoint.","error.stack_trace":"org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 11211. 
Failure reason: Failure to finalize checkpoint.\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.addCompletedCheckpointToStoreAndSubsumeOldest(CheckpointCoordinator.java:1404)\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1249)\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1134)\n\tat org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)\n\tat org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: org.apache.flink.runtime.persistence.PossibleInconsistentStateException: io.fabric8.kubernetes.client.KubernetesClientException: *Failure executing: PUT at: https://10.32.228.1/api/v1/namespaces/parked-logs-ingestion-16818796-0c2923/configmaps/parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map. Message: ConfigMap \"parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map\" is invalid: []: Too long: must have at most 1048576 bytes. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have at most 1048576 bytes, reason=FieldValueTooLong* These were the only relevant logs from the JM; in the attached log we couldn't find entries from when the previous checkpoints were taken. Since in this JM log the job was recovered, it is possible that there was a JM restart and the job was restarted as well. 
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718663#comment-17718663 ] SwathiChandrashekar commented on FLINK-31135: - [~zhihaochen], it looks like Flink was getting an inconsistent-state exception while trying to write the checkpoint to S3. Hence it threw the following error for every checkpoint, asking for manual cleanup, and those checkpoints were therefore never considered for automatic cleanup. {"@timestamp":"2023-04-26T23:53:30.704Z","ecs.version":"1.2.0","log.level":"WARN","message":"An error occurred while writing checkpoint 11217 to the underlying metadata store. Flink was not able to determine whether the metadata was successfully persisted. The corresponding state located at 's3://eureka-flink-data-prod/parked-logs-ingestion-16818796-0c2923/checkpoints/07bfdfef145a87c2071965081aaff548/chk-11217' won't be discarded and needs to be cleaned up manually.","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"} Link : https://github.com/apache/flink/blob/release-1.15/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1389 This error was populated for all the checkpoints due to some state inconsistency, which resulted in storing a lot of checkpoints in S3 and eventually pushed the ConfigMap size beyond 1 MB.
> Since it is managed by the operator and config maps are not updated with any > manual intervention, I suspect it could be an operator issue. > > {code:java} > Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: PUT at: > https:///api/v1/namespaces//configmaps/-config-map. Message: > ConfigMap "-config-map" is invalid: []: Too long: must have at most > 1048576 bytes. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must > have at most 1048576 bytes, reason=FieldValueTooLong, > additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, > retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, > message=ConfigMap "-config-map" is invalid: []: Too long: must have at > most 1048576 bytes, metadata=ListMeta(_continue=null, > remainingItemCount=null, resourceVersion=null, selfLink=null, > additionalProperties={}), reason=Invalid, status=Failure, > additionalProperties={}). 
> at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41) > ~[flink-dist-1.15.2.jar:1.15.2] > at > org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325) > ~[flink-dist-1.15.2.jar:1.15.2] > at >
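The failure mode described in the comment above can be illustrated with a small simulation. This is a hypothetical sketch, not Flink or fabric8 code (the `put_configmap` and `run_checkpoints` helpers and the per-entry size are invented for illustration): checkpoint pointers that are never discarded accumulate in the HA ConfigMap until the Kubernetes 1 MiB limit rejects the update.

```python
# Hypothetical sketch: pointers to checkpoints whose metadata write "failed"
# are never cleaned up, so the job's HA ConfigMap grows with every checkpoint
# until the API server rejects the PUT (HTTP 422, FieldValueTooLong).
import json

CONFIGMAP_LIMIT = 1048576  # bytes; Kubernetes rejects larger ConfigMaps


def put_configmap(data):
    """Mimic the API server's size validation on PUT (invented helper)."""
    return len(json.dumps(data).encode()) <= CONFIGMAP_LIMIT


def run_checkpoints(n, entry_bytes=300):
    """Retain one pointer per checkpoint (never discarded); return the first
    checkpoint number whose ConfigMap update is rejected, or -1 if none."""
    cm = {}
    for chk in range(1, n + 1):
        # the metadata write "failed", so this pointer is never cleaned up
        cm["checkpointID-%019d" % chk] = "x" * entry_bytes
        if not put_configmap(cm):
            return chk
    return -1


first_rejected = run_checkpoints(5000)
```

With a few hundred bytes per pointer, a few thousand undiscarded checkpoints are enough to cross the limit, which matches the checkpoint counts seen in this thread (chk-11217).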
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717014#comment-17717014 ] Zhihao Chen commented on FLINK-31135: - [~Swathi Chandrashekar], please see the attached log from the JM with this issue. I didn't find the error message about discarding completed checkpoints, though. [^jobmanager_log.txt]
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716845#comment-17716845 ] SwathiChandrashekar commented on FLINK-31135: - I missed your previous comment. The configuration you're using to retain the checkpoints seems correct. Can you please check the JM logs for errors while cleaning the checkpoints? [https://github.com/apache/flink/blob/release-1.15/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointsCleaner.java#L85]. In 1.15, irrespective of whether the cleanup was successful or not, the number of checkpoints to clean is always decremented. The JM logs might help explain why the cleanup failed. Or, if you're using custom cleanup logic, that might need to be checked as well.
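The CheckpointsCleaner behaviour mentioned above, where the pending-cleanup counter is decremented whether or not the discard succeeds, can be sketched as follows. This is a simplified Python illustration with invented names, not the actual Flink 1.15 code:

```python
# Simplified illustration of the behaviour described in the comment above:
# the counter of checkpoints pending cleanup is decremented in a finally
# block, so a failing discard never blocks subsequent checkpoints -- but the
# orphaned state (and its ConfigMap pointer) is left behind.
class CheckpointsCleaner:
    def __init__(self):
        self.num_pending = 0
        self.orphaned = []

    def clean_checkpoint(self, checkpoint_id, discard_fn):
        self.num_pending += 1
        try:
            discard_fn(checkpoint_id)
        except Exception:
            # cleanup failed; the checkpoint must be cleaned up manually
            self.orphaned.append(checkpoint_id)
        finally:
            # decremented regardless of success, as described for 1.15
            self.num_pending -= 1


def failing_discard(checkpoint_id):
    raise IOError("inconsistent state while discarding checkpoint")


cleaner = CheckpointsCleaner()
for cid in range(3):
    cleaner.clean_checkpoint(cid, failing_discard)
```

After the loop, no cleanups are pending even though every discard failed, which is why the failures only surface in the JM logs rather than as a growing backlog.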
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716479#comment-17716479 ] Zhihao Chen commented on FLINK-31135: - Hi [~Swathi Chandrashekar], may I ask whether there is any update on this?
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713796#comment-17713796 ] Zhihao Chen commented on FLINK-31135: - Hi [~Swathi Chandrashekar], in my case, state.checkpoints.num-retained for our Flink jobs is always set to 5, but it looks like that is not respected. Please see the snippet from the FlinkDeployment created via the flink-kubernetes-operator.
{code:yaml}
apiVersion: v1
items:
- apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    creationTimestamp: "2023-04-04T03:02:25Z"
    finalizers:
    - flinkdeployments.flink.apache.org/finalizer
    generation: 2
    labels:
      instanceId: parked-logs-ingestion-16805773-a96408
      jobName: parked-logs-ingestion-16805773
    name: parked-logs-ingestion-16805773-a96408
    namespace: parked-logs-ingestion-16805773-a96408
    resourceVersion: "533476748"
    uid: 182b9c7e-74cc-490b-8045-9fddaa7b8aa9
  spec:
    flinkConfiguration:
      execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
      execution.checkpointing.interval: "6"
      execution.checkpointing.max-concurrent-checkpoints: "1"
      execution.checkpointing.min-pause: 5s
      execution.checkpointing.mode: EXACTLY_ONCE
      execution.checkpointing.prefer-checkpoint-for-recovery: "true"
      execution.checkpointing.timeout: 60min
      high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
      high-availability.storageDir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/ha
      jobmanager.memory.process.size: 1024m
      metrics.reporter.stsd.factory.class: org.apache.flink.metrics.statsd.StatsDReporterFactory
      metrics.reporter.stsd.host: localhost
      metrics.reporter.stsd.interval: 30 SECONDS
      metrics.reporter.stsd.port: "8125"
      metrics.reporters: stsd
      metrics.scope.jm: jobmanager
      metrics.scope.jm.job: jobmanager.
      metrics.scope.operator: taskmanager..
      metrics.scope.task: taskmanager..
      metrics.scope.tm: taskmanager
      metrics.scope.tm.job: taskmanager.
      metrics.system-resource: "true"
      metrics.system-resource-probing-interval: "3"
      restart-strategy: fixed-delay
      restart-strategy.fixed-delay.attempts: "2147483647"
      state.backend: hashmap
      state.checkpoint-storage: filesystem
      state.checkpoints.dir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/checkpoints
      state.checkpoints.num-retained: "5"
      state.savepoints.dir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/savepoints
      taskmanager.memory.managed.size: "0"
      taskmanager.memory.network.fraction: "0.1"
      taskmanager.memory.network.max: 1000m
      taskmanager.memory.network.min: 64m
      taskmanager.memory.process.size: 2048m
      taskmanager.numberOfTaskSlots: "10"
      web.cancel.enable: "false"
    flinkVersion: v1_15
{code}
I got the same issue before we switched to the flink-kubernetes-operator. At that time we were using a Flink standalone deployment on Kubernetes. We set state.checkpoints.num-retained to 5 but hit the same issue.
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713605#comment-17713605 ] Sriram Ganesh commented on FLINK-31135: --- [~Swathi Chandrashekar] - In my case, state.checkpoints.num-retained is the default, which is 1. I tried to reproduce this issue but couldn't.
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713582#comment-17713582 ] SwathiChandrashekar commented on FLINK-31135: - [~mxm], [~sriramgr], [~zhihaochen], let us know if we can mark this issue as resolved, or let me know if any further investigation is pending.
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713581#comment-17713581 ] SwathiChandrashekar commented on FLINK-31135: - Thanks [~zhihaochen]. The maximum ConfigMap size supported by Kubernetes is 1 MB. The ConfigMap you shared is a job-specific ConfigMap used to retain all the checkpoint pointers for that job. Since you have configured the number of retained checkpoints (state.checkpoints.num-retained) to a very high value, this issue has been hit. Please try reducing the state.checkpoints.num-retained configuration, and you should not hit the issue again.
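As a rough illustration of why a very high state.checkpoints.num-retained runs into this: each retained checkpoint contributes one entry to the job's HA ConfigMap, so the entry size bounds how many checkpoints can fit under 1 MiB. The per-entry and overhead sizes below are assumptions for the sketch, not values measured from Flink:

```python
# Back-of-the-envelope bound (assumed sizes, not measured): how many retained
# checkpoint pointers fit into one Kubernetes ConfigMap.
CONFIGMAP_LIMIT = 1048576   # Kubernetes ConfigMap size limit in bytes
ENTRY_BYTES = 500           # assumed size of one serialized checkpoint pointer
OVERHEAD_BYTES = 4096       # assumed metadata/labels/annotations overhead

max_retained = (CONFIGMAP_LIMIT - OVERHEAD_BYTES) // ENTRY_BYTES
print(max_retained)  # about 2,000 entries under these assumptions
```

Under these assumptions a num-retained in the low thousands already saturates the ConfigMap, so any setting near or above that range (or a cleanup that silently fails, as discussed earlier in this thread) will eventually trigger the 422 rejection.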
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712878#comment-17712878 ] Zhihao Chen commented on FLINK-31135: - Hi [~Swathi Chandrashekar], please see the attached ConfigMap file: [^dump_cm.yaml]
The error shown in the Flink dashboard is:
*Checkpoint Detail:*
*Path:* -
*Discarded:* -
*Checkpoint Type:* aligned checkpoint
*Failure Message:* io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.32.228.1/api/v1/namespaces/parked-logs-ingestion-16805773-a96408/configmaps/parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map. Message: ConfigMap "parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map" is invalid: []: Too long: must have at most 1048576 bytes. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have at most 1048576 bytes, reason=FieldValueTooLong, additionalProperties={})], group=null, kind=ConfigMap, name=parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap "parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map" is invalid: []: Too long: must have at most 1048576 bytes, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
Flink jobs > failed with the below error. It seems the config map size went beyond 1 MB > (default size). > Since it is managed by the operator and config maps are not updated with any > manual intervention, I suspect it could be an operator issue. > > {code:java} > Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: PUT at: > https:///api/v1/namespaces//configmaps/-config-map. Message: > ConfigMap "-config-map" is invalid: []: Too long: must have at most > 1048576 bytes. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must > have at most 1048576 bytes, reason=FieldValueTooLong, > additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, > retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, > message=ConfigMap "-config-map" is invalid: []: Too long: must have at > most 1048576 bytes, metadata=ListMeta(_continue=null, > remainingItemCount=null, resourceVersion=null, selfLink=null, > additionalProperties={}), reason=Invalid, status=Failure, > additionalProperties={}). 
> at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41) > ~[flink-dist-1.15.2.jar:1.15.2] > at >
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712861#comment-17712861 ] SwathiChandrashekar commented on FLINK-31135: - Thanks [~zhihaochen], can you please share the ConfigMap that hit the issue?
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712854#comment-17712854 ] Zhihao Chen commented on FLINK-31135: - I have encountered the same issue; in fact, it is an ongoing issue for us. I believe it has nothing to do with the flink-kubernetes-operator, as it happened with both a Flink standalone Kubernetes deployment and a flink-kubernetes-operator deployment. I have checked our configuration but did not find anything suspicious.
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701681#comment-17701681 ] Maximilian Michels commented on FLINK-31135: Oh, just realized this is unrelated to FLINK-31345 but a separate config map issue. Reopening :)
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701679#comment-17701679 ] Maximilian Michels commented on FLINK-31135: This has been addressed in FLINK-31345.
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698917#comment-17698917 ] Sriram Ganesh commented on FLINK-31135: --- [~Swathi Chandrashekar] - It is a pod-template ConfigMap, not a job-specific ConfigMap.
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698367#comment-17698367 ] SwathiChandrashekar commented on FLINK-31135: - For a job-specific ConfigMap, this is not a Flink Operator issue, and we do not pass the checkpoint data to the CR. Whenever JobManager HA is enabled for Flink on Kubernetes, Flink creates certain ConfigMaps (the dispatcher ConfigMap, the ResourceManager leader ConfigMap, etc.). Similarly, whenever a job is created, Flink creates one job ConfigMap (the JobMaster ConfigMap) per job, which keeps track of the pointers to the actual checkpoint data. So when the retained checkpoints setting (state.checkpoints.num-retained) is configured to a high value, many entries are added to this ConfigMap, which can contribute to its size. But for the specific issue reported here, we are not sure which ConfigMap caused it; knowing that would help us investigate further.
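To narrow down which ConfigMap is hitting the limit, one option is to dump the candidates with `kubectl get configmap <name> -o json` and measure the payload per key. The sketch below is illustrative only and not part of Flink or the operator; the 1048576-byte constant matches the limit quoted in the error message above.

```python
import json

# Default Kubernetes cap on a ConfigMap's total payload, matching the
# "must have at most 1048576 bytes" message in the stack trace above.
MAX_CONFIGMAP_BYTES = 1048576


def configmap_data_sizes(cm):
    """Per-key byte sizes for a ConfigMap parsed from
    `kubectl get configmap <name> -o json` (data and binaryData)."""
    sizes = {}
    for key, value in (cm.get("data") or {}).items():
        sizes[key] = len(value.encode("utf-8"))
    for key, value in (cm.get("binaryData") or {}).items():
        sizes[key] = len(value)  # length of the base64 text as stored
    return sizes


def check_configmap(cm):
    """Print keys sorted by size and return True if under the limit."""
    sizes = configmap_data_sizes(cm)
    total = sum(sizes.values())
    for key, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{size:>10}  {key}")
    print(f"{total:>10}  TOTAL (limit {MAX_CONFIGMAP_BYTES})")
    return total <= MAX_CONFIGMAP_BYTES
```

Usage would be something like `check_configmap(json.load(open("dump_cm.json")))` on the output of `kubectl get configmap ... -o json`; the largest key should reveal whether checkpoint-pointer entries or something else is filling the budget.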
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698330#comment-17698330 ] ramkrishna.s.vasudevan commented on FLINK-31135: So are we adding all the checkpoint data back to the CR?
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698321#comment-17698321 ] SwathiChandrashekar commented on FLINK-31135: - [~sriramgr] , which is the configmap which failed in this scenario ? If it is a pod-template configmap, then it indirectly depends on the user applied CR as the user defines entries of this file ( most of them ). If it is a job specific config map ( which has the meta info of all the checkpoints pointers ) , if the retained checkpoints are very high, then I believe, we can hit this issue. > ConfigMap DataSize went > 1 MB and cluster stopped working > -- > > Key: FLINK-31135 > URL: https://issues.apache.org/jira/browse/FLINK-31135 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.2.0 >Reporter: Sriram Ganesh >Priority: Major > > I am Flink Operator to manage clusters. Flink version: 1.15.2. Flink jobs > failed with the below error. It seems the config map size went beyond 1 MB > (default size). > Since it is managed by the operator and config maps are not updated with any > manual intervention, I suspect it could be an operator issue. > > {code:java} > Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: PUT at: > https:///api/v1/namespaces//configmaps/-config-map. Message: > ConfigMap "-config-map" is invalid: []: Too long: must have at most > 1048576 bytes. 
Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must > have at most 1048576 bytes, reason=FieldValueTooLong, > additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, > retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, > message=ConfigMap "-config-map" is invalid: []: Too long: must have at > most 1048576 bytes, metadata=ListMeta(_continue=null, > remainingItemCount=null, resourceVersion=null, selfLink=null, > additionalProperties={}), reason=Invalid, status=Failure, > additionalProperties={}). > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188) > ~[flink-dist-1.15.2.jar:1.15.2] > at > io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130) > ~[flink-dist-1.15.2.jar:1.15.2] > 
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41) ~[flink-dist-1.15.2.jar:1.15.2]
	at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325) ~[flink-dist-1.15.2.jar:1.15.2]
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
	... 3 more
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
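For context on the 422 above: the Kubernetes API server rejects any ConfigMap whose serialized object exceeds 1048576 bytes (1 MiB), so the PUT issued by the client fails with FieldValueTooLong. A minimal sketch of guarding against this before issuing the update; the class and method names here are hypothetical (not part of Flink or fabric8), and the estimate only sums key/value bytes, so it slightly undercounts the real serialized size (metadata and JSON framing are extra):

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class ConfigMapSizeCheck {
    // Kubernetes rejects ConfigMaps whose serialized size exceeds 1 MiB.
    static final int MAX_CONFIGMAP_BYTES = 1_048_576;

    // Lower-bound estimate: sum of UTF-8 key and value bytes in the data map.
    static long estimateDataBytes(Map<String, String> data) {
        long total = 0;
        for (Map.Entry<String, String> e : data.entrySet()) {
            total += e.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += e.getValue().getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    static boolean wouldExceedLimit(Map<String, String> data) {
        return estimateDataBytes(data) > MAX_CONFIGMAP_BYTES;
    }

    public static void main(String[] args) {
        // A ~2 MB value clearly blows past the 1 MiB cap.
        Map<String, String> data = Map.of("checkpoint", "x".repeat(2_000_000));
        System.out.println(wouldExceedLimit(data)); // prints "true"
    }
}
{code}

Such a pre-flight check would only surface the problem earlier; the underlying question in this issue is why the operator-managed HA ConfigMap grew that large in the first place.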
[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working
[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691119#comment-17691119 ] Sriram Ganesh commented on FLINK-31135:
---
[~gyfora] - Please add your thoughts.