[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-04 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719329#comment-17719329
 ] 

Sriram Ganesh commented on FLINK-31135:
---

In my case, it was not a permission issue. I couldn't reproduce the problem; 
my hunch is that there was a network issue at that time. 

Thanks, [~Swathi Chandrashekar] [~zhihaochen]. I am closing it.

> ConfigMap DataSize went > 1 MB and cluster stopped working
> --
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.2.0
> Reporter: Sriram Ganesh
> Priority: Major
> Attachments: dump_cm.yaml,
>   flink--kubernetes-application-0-parked-logs-ingestion-644b80-b4bc58747-lc865.log.zip,
>   image-2023-04-19-09-48-19-089.png, image-2023-05-03-13-47-51-440.png,
>   image-2023-05-03-13-50-54-783.png, image-2023-05-03-13-51-21-685.png,
>   jobmanager_log.txt,
>   parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. 
> Flink jobs failed with the error below. It seems the ConfigMap size went 
> beyond 1 MB (the default limit). 
> Since the ConfigMap is managed by the operator and is never updated 
> manually, I suspect it could be an operator issue. 
>  
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https:///api/v1/namespaces//configmaps/-config-map. Message: ConfigMap "-config-map" is invalid: []: Too long: must have at most 1048576 bytes. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have at most 1048576 bytes, reason=FieldValueTooLong, additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap "-config-map" is invalid: []: Too long: must have at most 1048576 bytes, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41) ~[flink-dist-1.15.2.jar:1.15.2]
>     at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325) ~[flink-dist-1.15.2.jar:1.15.2]
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
>     ... 3 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-03 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719148#comment-17719148
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

That's great [~zhihaochen] :)

[~zhihaochen], [~sriramgr], [~mxm] I don't have permission to close the 
issue. Could you please close it if no other action is needed? 



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-03 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719145#comment-17719145
 ] 

Zhihao Chen commented on FLINK-31135:
-

It's working as expected now after we fixed our S3 deletion issue. Thanks for 
your help!



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-03 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719102#comment-17719102
 ] 

Zhihao Chen commented on FLINK-31135:
-

[~Swathi Chandrashekar] thank you for pointing it out! I believe there are some 
S3 permission issues on our side. I had missed that error message. I'll fix 
it on our side and let you know whether everything works. Please feel free to 
close this ticket.



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-03 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718815#comment-17718815
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

[~zhihaochen], thanks for attaching the JM logs that start from the first job 
submission. The earlier JM logs did not help much with finding the root cause: 
they begin with the recovery of a job that already had multiple checkpoints, 
and every checkpoint triggered after that failed because of the >1 MB issue 
and needs to be cleaned up manually if it was in fact written.

The new JM attachment helped to narrow down the issue because it starts with 
the job submission.

After the initial 5 checkpoints, when the cleanup was happening, Flink threw 
the following error:
{code:java}
{"@timestamp":"2023-05-02T07:21:34.047Z","ecs.version":"1.2.0","log.level":"WARN","message":"Fail
 to subsume the old 
checkpoint.","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointSubsumeHelper","error.type":"java.util.concurrent.ExecutionException","error.message":"java.io.IOException:
 
/parked-logs-ingestion-644b80/ha/parked-logs-ingestion-644b80/completedCheckpointf2bebc94bd32
 could not be deleted for unknown 
reasons.","error.stack_trace":"java.util.concurrent.ExecutionException: 
java.io.IOException: 
/parked-logs-ingestion-644b80/ha/parked-logs-ingestion-644b80/completedCheckpointf2bebc94bd32
 could not be deleted for unknown reasons.\n\tat 
java.base/java.util.concurrent.CompletableFuture.reportGet(Unknown 
Source)\n\tat java.base/java.util.concurrent.CompletableFuture.get(Unknown 
Source)\n\tat 
org.apache.flink.kubernetes.highavailability.KubernetesStateHandleStore.releaseAndTryRemove(KubernetesStateHandleStore.java:526)\n\tat
 
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.tryRemove(DefaultCompletedCheckpointStore.java:242)\n\tat
 
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.tryRemoveCompletedCheckpoint(DefaultCompletedCheckpointStore.java:227)\n\tat
 
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.lambda$addCheckpointAndSubsumeOldestOne$0(DefaultCompletedCheckpointStore.java:145)\n\tat
 
org.apache.flink.runtime.checkpoint.CheckpointSubsumeHelper.subsume(CheckpointSubsumeHelper.java:70)\n\tat
 
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.addCheckpointAndSubsumeOldestOne(DefaultCompletedCheckpointStore.java:141)\n\tat
 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.addCompletedCheckpointToStoreAndSubsumeOldest(CheckpointCoordinator.java:1382)\n\tat
 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1249)\n\tat
 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1134)\n\tat
 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)\n\tat
 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)\n\tat
 java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: 
java.io.IOException: 
/parked-logs-ingestion-644b80/ha/parked-logs-ingestion-644b80/completedCheckpointf2bebc94bd32
 could not be deleted for unknown reasons.\n\tat 
org.apache.flink.fs.s3presto.FlinkS3PrestoFileSystem.deleteObject(FlinkS3PrestoFileSystem.java:135)\n\tat
 
org.apache.flink.fs.s3presto.FlinkS3PrestoFileSystem.delete(FlinkS3PrestoFileSystem.java:66)\n\tat
 
org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.delete(PluginFileSystemFactory.java:155)\n\tat
 
org.apache.flink.runtime.state.filesystem.FileStateHandle.discardState(FileStateHandle.java:89)\n\tat
 
org.apache.flink.runtime.state.RetrievableStreamStateHandle.discardState(RetrievableStreamStateHandle.java:76)\n\tat
 
org.apache.flink.kubernetes.highavailability.KubernetesStateHandleStore.lambda$releaseAndTryRemove$12(KubernetesStateHandleStore.java:510)\n\tat
 java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown 
Source)\n\tat java.base/java.util.concurrent.CompletableFuture.complete(Unknown 
Source)\n\tat 
org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperation$1(FutureUtils.java:201)\n\tat
 java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
 Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown 
Source)\n\t... 3 more\n"} {code}
Flink is encountering an IOException while trying to delete a file in S3. It 
says file .

[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718781#comment-17718781
 ] 

Zhihao Chen commented on FLINK-31135:
-

hey [~Swathi Chandrashekar], thank you for looking into it.
{quote}This error was populated for all the checkpoints due to state 
inconsistency which resulted in storing lot of checkpoints in S3, which 
eventually caused the size of the configMap > 1MB
{quote}
I don't think that's the case. Instead, none of the checkpoint records in the 
CM was ever cleaned up. Once the CM reaches the 1 MB size limit, the error 
"Flink was not able to determine whether the metadata was successfully 
persisted." starts to occur.


I have another flink job running here as an example.

Configmap: 
[^parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml]
The checkpoint IDs are consecutive: "checkpointID-001", 
"checkpointID-002", ... "checkpointID-0001040". Note that the IDs run from 
"1" to "1040", so no entry has been removed. The ConfigMap has reached the 
1 MB size limit.
 
The "Flink was not able to determine whether the metadata was successfully 
persisted." error first appears right after the CM received the record "1040". 
Please see the logs below. The bottom entry is the first error log, and it 
complains about the record "1041". I think that makes sense: that record is 
not in the CM, hence Flink can't determine if the metadata was successfully 
persisted.
!image-2023-05-03-13-47-51-440.png|width=1579,height=861!
 
The Flink dashboard log is also consistent with this assumption.
!image-2023-05-03-13-51-21-685.png|width=1473,height=783!
 
 
My guess is that Flink never cleaned up any of the records in the CM in our 
case.
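
The toy model below illustrates the suspected mechanism. It is not Flink's implementation; it only assumes that a checkpoint pointer is dropped from the CM after its state was deleted successfully, and that a failed delete leaves the pointer in place. `MAX_RETAINED`, `run` and `delete_ok` are hypothetical names, and the retention count of 5 is borrowed from the "initial 5 checkpoints" mentioned elsewhere in this thread.

```python
MAX_RETAINED = 5  # hypothetical retention count, not read from any real config


def run(num_checkpoints, delete_ok):
    """Return the checkpoint IDs still referenced by the simulated ConfigMap.

    delete_ok(cp_id) stands in for deleting the checkpoint state from storage
    and returns True when the delete succeeded.
    """
    entries = []  # stands in for the checkpointID-* keys of the HA ConfigMap
    for cp_id in range(1, num_checkpoints + 1):
        entries.append(cp_id)
        # Subsume: try to drop the oldest entries beyond the retention count.
        while len(entries) > MAX_RETAINED:
            oldest = entries[0]
            if not delete_ok(oldest):
                # State could not be discarded, so the pointer is kept.
                break
            entries.pop(0)
    return entries


if __name__ == "__main__":
    print(len(run(1040, lambda cp_id: True)))   # deletes succeed
    print(len(run(1040, lambda cp_id: False)))  # deletes always fail
```

In this model a storage backend that rejects every delete keeps every pointer, which matches the consecutive IDs "1" to "1040" seen in the attached CM.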
 


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718686#comment-17718686
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

[~zhihaochen], according to the JM logs, Flink tried to recover the job with 
jobId 07bfdfef145a87c2071965081aaff548 and its checkpoints 1012 - 1015. The 
pointers to those checkpoints were present in the ConfigMap 
"parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map".

This ConfigMap was already at 1 MB, so every checkpoint triggered after that 
failed with the following error:

{"@timestamp":"2023-04-26T23:46:54.190Z","ecs.version":"1.2.0","log.level":"WARN","message":"An
 error occurred while writing checkpoint 11211 to the underlying metadata 
store. Flink was not able to determine whether the metadata was successfully 
persisted. The corresponding state located at 
's3://eureka-flink-data-prod/parked-logs-ingestion-16818796-0c2923/checkpoints/07bfdfef145a87c2071965081aaff548/{*}chk-11211'
 won't be discarded and needs to be cleaned up 
manually.{*}","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
 
{"@timestamp":"2023-04-26T23:46:54.248Z","ecs.version":"1.2.0","log.level":"WARN","message":"{*}Error
 while processing AcknowledgeCheckpoint{*} 
message","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.jobmaster.JobMaster","error.type":"org.apache.flink.runtime.checkpoint.CheckpointException","error.message":"Could
 not complete the pending checkpoint 11211. Failure reason: Failure to finalize 
checkpoint.","error.stack_trace":"org.apache.flink.runtime.checkpoint.CheckpointException:
 Could not complete the pending checkpoint 11211. Failure reason: Failure to 
finalize checkpoint.\n\tat 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.addCompletedCheckpointToStoreAndSubsumeOldest(CheckpointCoordinator.java:1404)\n\tat
 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1249)\n\tat
 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1134)\n\tat
 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)\n\tat
 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)\n\tat
 java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: 
org.apache.flink.runtime.persistence.PossibleInconsistentStateException: 
io.fabric8.kubernetes.client.KubernetesClientException: *Failure executing: PUT 
at: 
https://10.32.228.1/api/v1/namespaces/parked-logs-ingestion-16818796-0c2923/configmaps/parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map.
 Message: ConfigMap 
\"parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map\"
 is invalid: []: Too long: must have at most 1048576 bytes. Received status: 
Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have 
at most 1048576 bytes, reason=FieldValueTooLong*

These were the only logs available from the JM. In the attached log, I couldn't 
find the entries from when the previous checkpoints were taken. Since the job 
was recovered in this new JM log, it is possible that there was a JM restart 
and the job was restarted as well.

 


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718663#comment-17718663
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

[~zhihaochen], it looks like Flink was getting an inconsistent state exception 
while trying to write the checkpoint to S3. As a result, it logged the following 
warning for every checkpoint, asking for manual cleanup, so those checkpoints 
were not considered for automatic cleanup.

{"@timestamp":"2023-04-26T23:53:30.704Z","ecs.version":"1.2.0","log.level":"WARN","message":"An
 error occurred while writing checkpoint 11217 to the underlying metadata 
store. Flink was not able to determine whether the metadata was successfully 
persisted. The corresponding state located at 
's3://eureka-flink-data-prod/parked-logs-ingestion-16818796-0c2923/checkpoints/07bfdfef145a87c2071965081aaff548/chk-11217'
 won't be discarded and needs to be cleaned up 
manually.","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}

Link:
https://github.com/apache/flink/blob/release-1.15/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1389

This warning was logged for every checkpoint because of the state 
inconsistency, so a large number of checkpoints accumulated in S3, which 
eventually pushed the size of the ConfigMap above 1 MB.
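
To see how many checkpoints were affected, the JM log can be scanned for this warning. The following is a minimal sketch, assuming the log holds one JSON object per line with the `log.level` and `message` fields shown in the record above. The function name and the regular expression are illustrative and not part of Flink.

```python
import json
import re

# Wording copied from the warning quoted above; it may differ in other Flink
# versions, so treat this pattern as an assumption.
CHECKPOINT_WARNING = re.compile(
    r"while writing checkpoint (\d+) to the underlying metadata store"
)


def find_unpersisted_checkpoints(lines):
    """Return the checkpoint IDs named in 'cleaned up manually' warnings."""
    checkpoint_ids = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank lines, raw stack traces, and other non-JSON output
        if not isinstance(record, dict) or record.get("log.level") != "WARN":
            continue
        # Collapse whitespace so wrapped messages still match.
        message = " ".join(str(record.get("message", "")).split())
        match = CHECKPOINT_WARNING.search(message)
        if match and "cleaned up manually" in message:
            checkpoint_ids.append(int(match.group(1)))
    return checkpoint_ids
```

Passing an open log file to `find_unpersisted_checkpoints` yields the affected checkpoint IDs in log order. A long, unbroken run of IDs would be consistent with the accumulation described above.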


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-27 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717014#comment-17717014
 ] 

Zhihao Chen commented on FLINK-31135:
-

[~Swathi Chandrashekar], please see the attached JM log for this issue. I 
didn't find any error message about discarding completed checkpoints, though.

[^jobmanager_log.txt]




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-26 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716845#comment-17716845
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

I missed your previous comment. The configuration you're using to retain the 
checkpoints seems correct. Could you please check the JM logs for errors 
raised while cleaning up the checkpoints?

[https://github.com/apache/flink/blob/release-1.15/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointsCleaner.java#L85]

In 1.15, the number of checkpoints to clean is always decremented, 
irrespective of whether the cleanup succeeded.

The JM logs might help explain why the cleanup failed. If you're using custom 
cleanup logic, you might need to check that as well.






[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-25 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716479#comment-17716479
 ] 

Zhihao Chen commented on FLINK-31135:
-

Hi [~Swathi Chandrashekar], is there any update on this?






[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-18 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713796#comment-17713796
 ] 

Zhihao Chen commented on FLINK-31135:
-

Hi [~Swathi Chandrashekar], in my case state.checkpoints.num-retained is always 
set to 5 for our Flink jobs, but it looks like that is not respected. 
Please see the snippet below from the FlinkDeployment managed by 
flink-kubernetes-operator.

 

 
{code:yaml}
apiVersion: v1
items:
- apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    creationTimestamp: "2023-04-04T03:02:25Z"
    finalizers:
    - flinkdeployments.flink.apache.org/finalizer
    generation: 2
    labels:
      instanceId: parked-logs-ingestion-16805773-a96408
      jobName: parked-logs-ingestion-16805773
    name: parked-logs-ingestion-16805773-a96408
    namespace: parked-logs-ingestion-16805773-a96408
    resourceVersion: "533476748"
    uid: 182b9c7e-74cc-490b-8045-9fddaa7b8aa9
  spec:
    flinkConfiguration:
      execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
      execution.checkpointing.interval: "6"
      execution.checkpointing.max-concurrent-checkpoints: "1"
      execution.checkpointing.min-pause: 5s
      execution.checkpointing.mode: EXACTLY_ONCE
      execution.checkpointing.prefer-checkpoint-for-recovery: "true"
      execution.checkpointing.timeout: 60min
      high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
      high-availability.storageDir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/ha
      jobmanager.memory.process.size: 1024m
      metrics.reporter.stsd.factory.class: org.apache.flink.metrics.statsd.StatsDReporterFactory
      metrics.reporter.stsd.host: localhost
      metrics.reporter.stsd.interval: 30 SECONDS
      metrics.reporter.stsd.port: "8125"
      metrics.reporters: stsd
      metrics.scope.jm: jobmanager
      metrics.scope.jm.job: jobmanager.
      metrics.scope.operator: taskmanager..
      metrics.scope.task: taskmanager..
      metrics.scope.tm: taskmanager
      metrics.scope.tm.job: taskmanager.
      metrics.system-resource: "true"
      metrics.system-resource-probing-interval: "3"
      restart-strategy: fixed-delay
      restart-strategy.fixed-delay.attempts: "2147483647"
      state.backend: hashmap
      state.checkpoint-storage: filesystem
      state.checkpoints.dir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/checkpoints
      state.checkpoints.num-retained: "5"
      state.savepoints.dir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/savepoints
      taskmanager.memory.managed.size: "0"
      taskmanager.memory.network.fraction: "0.1"
      taskmanager.memory.network.max: 1000m
      taskmanager.memory.network.min: 64m
      taskmanager.memory.process.size: 2048m
      taskmanager.numberOfTaskSlots: "10"
      web.cancel.enable: "false"
    flinkVersion: v1_15
{code}
 

I hit the same issue before we switched to the flink-kubernetes-operator. At 
that time we were using a Flink standalone deployment on Kubernetes. We had set 
state.checkpoints.num-retained to 5, but hit the same issue.
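
One way to check whether the retention setting is being respected is to count the checkpoint pointers held in the job's HA ConfigMap and compare that with state.checkpoints.num-retained. This is a rough sketch, assuming the ConfigMap was dumped with `kubectl get configmap <name> -o json` and that checkpoint entries use keys starting with `checkpointID-`. That prefix is an assumption about Flink's key naming and should be verified against the actual dump. The function names are illustrative.

```python
import json

# Assumed key prefix for checkpoint pointer entries; check it against the
# keys in your own ConfigMap dump before relying on the result.
CHECKPOINT_KEY_PREFIX = "checkpointID-"


def count_checkpoint_entries(configmap_json, prefix=CHECKPOINT_KEY_PREFIX):
    """Count data keys in a ConfigMap JSON dump that look like checkpoint pointers."""
    configmap = json.loads(configmap_json)
    data = configmap.get("data") or {}
    return sum(1 for key in data if key.startswith(prefix))


def exceeds_retention(configmap_json, num_retained):
    """Return True when more checkpoint entries are stored than num_retained allows."""
    return count_checkpoint_entries(configmap_json) > num_retained
```

With num-retained set to 5 as in the deployment above, `exceeds_retention(dump, 5)` returning True would suggest that old entries are not being removed from the ConfigMap.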

 


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-18 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713605#comment-17713605
 ] 

Sriram Ganesh commented on FLINK-31135:
---

[~Swathi Chandrashekar] - In my case, state.checkpoints.num-retained is left at 
the default, which is 1. I tried to reproduce this issue but couldn't. 






[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-18 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713582#comment-17713582
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

[~mxm], [~sriramgr], [~zhihaochen], let us know if we can mark this issue as 
resolved, or let me know if any further investigation is pending.






[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-18 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713581#comment-17713581
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

Thanks [~zhihaochen]. The maximum ConfigMap size supported by Kubernetes is 1 MB.

The ConfigMap you shared is a job-specific ConfigMap, which is used to 
retain all the checkpoint pointers for the job. Since the number of retained 
checkpoints (state.checkpoints.num-retained) is configured to a very high 
value, you hit this issue.

Please try reducing the state.checkpoints.num-retained setting, and you 
should not hit the issue again.
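
To see how close a ConfigMap is to that limit, its payload size can be estimated from a JSON dump. This is a minimal sketch, assuming the dump comes from `kubectl get configmap <name> -o json` and that the API server's check is roughly the total byte size of the `data` and `binaryData` values. The exact accounting is an assumption, and the function names are illustrative.

```python
import base64
import json

# Limit quoted in the API server error reported in this issue.
MAX_CONFIGMAP_BYTES = 1048576


def configmap_payload_bytes(configmap_json):
    """Estimate the payload size of a ConfigMap JSON dump in bytes."""
    configmap = json.loads(configmap_json)
    total = 0
    # String values are measured as UTF-8 bytes.
    for value in (configmap.get("data") or {}).values():
        total += len(value.encode("utf-8"))
    # binaryData values are base64 text in the dump; measure the decoded bytes.
    for value in (configmap.get("binaryData") or {}).values():
        total += len(base64.b64decode(value))
    return total


def over_limit(configmap_json, limit=MAX_CONFIGMAP_BYTES):
    """Return True when the estimated payload exceeds the limit."""
    return configmap_payload_bytes(configmap_json) > limit
```

Running this against successive dumps of the job's ConfigMap shows whether the payload keeps growing between checkpoints, which is the pattern to look for here.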



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-16 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712878#comment-17712878
 ] 

Zhihao Chen commented on FLINK-31135:
-

Hi [~Swathi Chandrashekar] , please see the attached configmap file:

[^dump_cm.yaml]

 

The error shown in the Flink dashboard is:

*Checkpoint Detail:*
*Path:* -
*Discarded:* -
*Checkpoint Type:* aligned checkpoint
*Failure Message:* io.fabric8.kubernetes.client.KubernetesClientException: Failure 
executing: PUT at: 
https://10.32.228.1/api/v1/namespaces/parked-logs-ingestion-16805773-a96408/configmaps/parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map.
 Message: ConfigMap 
"parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map"
 is invalid: []: Too long: must have at most 1048576 bytes. Received status: 
Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have 
at most 1048576 bytes, reason=FieldValueTooLong, additionalProperties={})], 
group=null, kind=ConfigMap, 
name=parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map,
 retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
message=ConfigMap 
"parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map"
 is invalid: []: Too long: must have at most 1048576 bytes, 
metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, 
status=Failure, additionalProperties={}).
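
Since the thread turns on which ConfigMap hit the limit, the name can be pulled out of a failure message like the one above. The following is a minimal sketch: the class ConfigMapNameFromError and its methods are hypothetical, the first pattern only relies on the wording of the message quoted here, and treating a 32-character hexadecimal segment before "-config-map" as a job ID is an assumption about the naming, not something the thread confirms.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the failure message; not part of Flink or fabric8.
class ConfigMapNameFromError {
    // Matches the phrase: ConfigMap "<name>" is invalid
    private static final Pattern NAME =
            Pattern.compile("ConfigMap \"([^\"]+)\" is invalid");
    // Assumption: a job-specific name ends in -<32 hex characters>-config-map.
    private static final Pattern JOB_SPECIFIC =
            Pattern.compile("^(.+)-([0-9a-f]{32})-config-map$");

    // Returns the ConfigMap name quoted in the error message, if any.
    static Optional<String> configMapName(String errorMessage) {
        Matcher m = NAME.matcher(errorMessage);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Returns the 32-character hex segment if the name follows the assumed pattern.
    static Optional<String> jobIdSegment(String configMapName) {
        Matcher m = JOB_SPECIFIC.matcher(configMapName);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String msg = "ConfigMap \"parked-logs-ingestion-16805773-a96408-"
                + "110331249bb495a4d23b4d69849c8224-config-map\" is invalid: []: "
                + "Too long: must have at most 1048576 bytes";
        configMapName(msg).ifPresent(name -> {
            System.out.println("ConfigMap: " + name);
            System.out.println("Job ID segment: "
                    + jobIdSegment(name).orElse("(none found)"));
        });
    }
}
```

A name without such a segment, such as the redacted "-config-map" in the original report, yields no job ID segment, so the heuristic cannot classify it.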


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-16 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712861#comment-17712861
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

Thanks [~zhihaochen], can you please share the ConfigMap that hit the issue?



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-16 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712854#comment-17712854
 ] 

Zhihao Chen commented on FLINK-31135:
-

I have encountered the same issue; it is an ongoing problem for us. I believe 
it has nothing to do with the Flink Kubernetes Operator, as it happened with 
both a Flink standalone Kubernetes deployment and a Flink Kubernetes Operator 
deployment.

I have checked our configuration but didn't find anything notable.



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-03-17 Thread Maximilian Michels (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701681#comment-17701681
 ] 

Maximilian Michels commented on FLINK-31135:


Oh, just realized this is unrelated to FLINK-31345 but a separate config map 
issue. Reopening :)



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-03-17 Thread Maximilian Michels (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701679#comment-17701679
 ] 

Maximilian Michels commented on FLINK-31135:


This has been addressed in FLINK-31345.



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-03-10 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698917#comment-17698917
 ] 

Sriram Ganesh commented on FLINK-31135:
---

[~Swathi Chandrashekar] - It is a pod template config map, not a 
job-specific config map.



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-03-09 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698367#comment-17698367
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

For the job-specific config map, this is not a Flink operator issue, and we 
do not pass the checkpoint data to the CR.

Whenever JobManager HA is set up for Flink on Kubernetes, Flink creates 
certain config maps (dispatcher config map, RM leader config map, etc.).

Similarly, whenever a job is created, Flink creates one job config map (job 
master config map) per job, which keeps track of the pointers to the actual 
checkpoint data.

So, when the number of retained checkpoints (state.checkpoints.num-retained) 
is configured to a high value, many entries are added to this config map, 
which can contribute to its size.

However, for the specific issue reported here, we are not sure which config 
map caused the error; knowing that would help us investigate further.
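
As a rough illustration of how the entry count relates to the limit in the error message, the sketch below sums the UTF-8 size of a ConfigMap's keys and values and compares it with 1048576 bytes. Everything here is an assumption made for illustration: the helper class ConfigMapSizeEstimate, the synthetic key format, and the 1500-byte pointer size are hypothetical, and the check only approximates how the size of the data section is counted.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for estimating ConfigMap data size; not part of Flink or fabric8.
class ConfigMapSizeEstimate {
    // Limit quoted in the error message: "must have at most 1048576 bytes".
    static final long MAX_CONFIG_MAP_BYTES = 1_048_576L;

    // Sums the UTF-8 byte length of every key and value in the data section.
    static long dataBytes(Map<String, String> data) {
        long total = 0;
        for (Map.Entry<String, String> e : data.entrySet()) {
            total += e.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += e.getValue().getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    static boolean exceedsLimit(Map<String, String> data) {
        return dataBytes(data) > MAX_CONFIG_MAP_BYTES;
    }

    // Builds a synthetic data section with one entry per retained checkpoint.
    // The key format and pointer size are illustrative assumptions.
    static Map<String, String> syntheticJobConfigMap(int retained, int pointerBytes) {
        Map<String, String> data = new LinkedHashMap<>();
        String pointer = "x".repeat(pointerBytes);
        for (int i = 1; i <= retained; i++) {
            data.put(String.format("checkpointID-%019d", i), pointer);
        }
        return data;
    }

    public static void main(String[] args) {
        for (int retained : new int[] {1, 100, 1000}) {
            Map<String, String> data = syntheticJobConfigMap(retained, 1500);
            System.out.printf("retained=%d bytes=%d exceeds=%b%n",
                    retained, dataBytes(data), exceedsLimit(data));
        }
    }
}
```

Under these assumed sizes each entry costs 1532 bytes, so the estimate crosses the limit only once several hundred entries are stored; the real per-entry size depends on the serialized pointer.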



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-03-09 Thread ramkrishna.s.vasudevan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698330#comment-17698330
 ] 

ramkrishna.s.vasudevan commented on FLINK-31135:


So are we adding all the checkpoint data back to the CR?






[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-03-09 Thread SwathiChandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698321#comment-17698321
 ] 

SwathiChandrashekar commented on FLINK-31135:
-

[~sriramgr], which ConfigMap failed in this scenario?

If it is the pod-template ConfigMap, then its size depends indirectly on the 
user-applied CR, since the user defines most of the entries in that file.

If it is a job-specific ConfigMap (which holds the metadata for all the 
checkpoint pointers), then I believe we can hit this issue when the number of 
retained checkpoints is very high. 
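
To make the size concern above concrete, here is a minimal Java sketch that estimates how many bytes a ConfigMap's data entries occupy and compares the total with the 1048576-byte limit reported in the issue. It does not use the fabric8 client or any Flink code. The class name, the method names, the key format and the 2000-character pointer payload are all hypothetical, and summing keys plus values is only a rough, conservative estimate; the API server's exact accounting may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink or the Kubernetes operator.
class ConfigMapSizeCheck {
    // Limit taken from the error message in the issue: "must have at most 1048576 bytes".
    public static final long MAX_BYTES = 1_048_576L;

    /** Sums the UTF-8 byte length of every key and value in the data map. */
    public static long dataSizeBytes(Map<String, String> data) {
        long total = 0;
        for (Map.Entry<String, String> entry : data.entrySet()) {
            total += entry.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += entry.getValue().getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** Returns true when the estimated size is above the limit. */
    public static boolean exceedsLimit(Map<String, String> data) {
        return dataSizeBytes(data) > MAX_BYTES;
    }

    public static void main(String[] args) {
        // Assumption: one entry per retained checkpoint, each holding a
        // serialized pointer of about 2000 characters (made-up size).
        Map<String, String> data = new LinkedHashMap<>();
        String pointer = "x".repeat(2_000);
        for (int i = 0; i < 600; i++) {
            data.put(String.format("checkpointID-%019d", i), pointer);
        }
        System.out.println(dataSizeBytes(data) + " bytes, exceeds limit: " + exceedsLimit(data));
    }
}
```

Running the same estimate over the entries of the dumped ConfigMap would show whether the checkpoint entries, or some other key, account for most of the size.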






[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-02-20 Thread Sriram Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691119#comment-17691119
 ] 

Sriram Ganesh commented on FLINK-31135:
---

[~gyfora] - Please add your thoughts.



