[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-03 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719145#comment-17719145
 ] 

Zhihao Chen commented on FLINK-31135:
-

It's working as expected now after we fixed our S3 deletion issue. Thanks for 
your help!

> ConfigMap DataSize went > 1 MB and cluster stopped working
> --
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.2.0
>Reporter: Sriram Ganesh
>Priority: Major
> Attachments: dump_cm.yaml, 
> flink--kubernetes-application-0-parked-logs-ingestion-644b80-b4bc58747-lc865.log.zip,
>  image-2023-04-19-09-48-19-089.png, image-2023-05-03-13-47-51-440.png, 
> image-2023-05-03-13-50-54-783.png, image-2023-05-03-13-51-21-685.png, 
> jobmanager_log.txt, 
> parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml
>
>
> I am using the Flink Kubernetes Operator to manage clusters (Flink version: 
> 1.15.2). Flink jobs failed with the error below. It seems the ConfigMap size 
> went beyond the default 1 MB limit. 
> Since the ConfigMap is managed by the operator and is never updated manually, 
> I suspect this could be an operator issue. 
>  
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: PUT at: 
> https:///api/v1/namespaces//configmaps/-config-map. Message: 
> ConfigMap "-config-map" is invalid: []: Too long: must have at most 
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must 
> have at most 1048576 bytes, reason=FieldValueTooLong, 
> additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, 
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
> message=ConfigMap "-config-map" is invalid: []: Too long: must have at 
> most 1048576 bytes, metadata=ListMeta(_continue=null, 
> remainingItemCount=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>  ~[?:?]
> ... 3 more {code}
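The FieldValueTooLong failure above is the Kubernetes API server rejecting an object whose serialized data exceeds 1048576 bytes (1 MiB). A minimal sketch of how uncleaned checkpoint records can push a ConfigMap's data section past that cap; the key names and per-record sizes are illustrative, not taken from the real cluster:

```python
# Rough estimate of a ConfigMap's data size versus the API server's
# 1 MiB object cap (the 1048576-byte figure in the error above).

LIMIT_BYTES = 1048576  # per the FieldValueTooLong error message

def configmap_data_size(data: dict) -> int:
    """Approximate size of a ConfigMap data section: UTF-8 keys plus values."""
    return sum(len(k.encode()) + len(v.encode()) for k, v in data.items())

def exceeds_limit(data: dict, limit: int = LIMIT_BYTES) -> bool:
    return configmap_data_size(data) > limit

# Illustration: ~1040 retained checkpoint records at ~1 KB of metadata
# each are already past the limit, so the next PUT would be rejected.
data = {f"checkpointID-{i:019d}": "x" * 1024 for i in range(1, 1041)}
print(configmap_data_size(data), exceeds_limit(data))  # → 1098240 True
```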



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-03 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719102#comment-17719102
 ] 

Zhihao Chen commented on FLINK-31135:
-

[~Swathi Chandrashekar] thank you for pointing it out! It looks like there are 
some S3 permission issues on our side; I had missed that error information. 
I'll fix it on our end and let you know if everything is good. Please feel free 
to close this ticket.



[jira] [Comment Edited] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718781#comment-17718781
 ] 

Zhihao Chen edited comment on FLINK-31135 at 5/3/23 4:03 AM:
-

Hey [~Swathi Chandrashekar], thank you for looking into it.
{quote}This error was populated for all the checkpoints due to state 
inconsistency which resulted in storing lot of checkpoints in S3, which 
eventually caused the size of the configMap > 1MB
{quote}
I don't think that's the case. Instead, none of the checkpoint records in the 
CM were ever cleaned up. The error "Flink was not able to determine whether the 
metadata was successfully persisted" starts to happen once the CM reaches the 
1 MB size limit.

I have another Flink job running here as an example.

ConfigMap: 
[^parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml]
The checkpoint IDs appear as "checkpointID-001", 
"checkpointID-002", ... "checkpointID-0001040", in 
consecutive order; note the IDs run from "1" to "1040". The ConfigMap has 
reached the 1 MB size limit.

The "Flink was not able to determine whether the metadata was successfully 
persisted." error actually starts when the CM appends record "1040". Please see 
the logs below. The bottom one is the first error log, which complains about 
record "1041". I think that makes sense: "1041" was never recorded in the CM, 
hence Flink can't determine whether the metadata was successfully persisted.
!image-2023-05-03-13-47-51-440.png|width=1465,height=799!

The Flink dashboard log also supports this assumption:
!image-2023-05-03-13-51-21-685.png|width=1473,height=783!
JM log: 
[^flink--kubernetes-application-0-parked-logs-ingestion-644b80-b4bc58747-lc865.log.zip]

My guess is that Flink never cleaned up any of the records in the CM in our 
case.
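The gap check described above can be sketched as follows; the ConfigMap data here is mocked with hypothetical keys rather than loaded from the attached YAML dump:

```python
# Given the data section of the HA ConfigMap (keys like
# "checkpointID-0000000000000001040"), confirm that the retained checkpoint
# IDs form one unbroken run starting at 1 -- i.e. that no record was ever
# cleaned up. Key contents below are illustrative.
import re

def checkpoint_ids(cm_data: dict) -> list[int]:
    """Extract sorted numeric checkpoint IDs from ConfigMap keys."""
    pat = re.compile(r"^checkpointID-(\d+)$")
    return sorted(int(m.group(1)) for k in cm_data if (m := pat.match(k)))

def never_cleaned(cm_data: dict) -> bool:
    """True if the IDs are exactly 1..N with no gaps (nothing was removed)."""
    ids = checkpoint_ids(cm_data)
    return bool(ids) and ids == list(range(1, ids[-1] + 1))

data = {f"checkpointID-{i:019d}": "..." for i in range(1, 1041)}
data["counter"] = "1041"  # non-checkpoint entries are ignored
print(checkpoint_ids(data)[-1], never_cleaned(data))  # → 1040 True
```

A gap in the sequence would instead indicate that cleanup ran at least once.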
 





[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: 
flink--kubernetes-application-0-parked-logs-ingestion-644b80-b4bc58747-lc865.log.zip





[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718781#comment-17718781
 ] 

Zhihao Chen commented on FLINK-31135:
-

hey [~Swathi Chandrashekar], thank you for looking into it.
{quote}This error was populated for all the checkpoints due to state 
inconsistency which resulted in storing lot of checkpoints in S3, which 
eventually caused the size of the configMap > 1MB
{quote}
I don't think that's the case. Instead, none of the checkpoint records in the 
CM were ever cleaned up. When the CM reaches the 1 MB size limitation, the 
error "Flink was not able to determine whether the metadata was successfully 
persisted." starts to happen.


I have another flink job running here as an example.

Configmap: 
[^parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml]
The checkpoint IDs are stored as "checkpointID-001", 
"checkpointID-002", ... "checkpointID-0001040", in 
consecutive order. Worth noting that the IDs run from "1" to "1040", and the 
configmap has reached the 1 MB size limitation.
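A quick back-of-envelope check makes the "~1040 records" figure plausible. This is a hedged sketch, not Flink's actual code: it assumes the HA ConfigMap keys look like "checkpointID-" followed by a zero-padded ID, and that each serialized value (the pointer to the checkpoint metadata in object storage) is on the order of 1 KB; both numbers are illustrative assumptions.

```python
# Back-of-envelope: how many checkpoint entries fit in a 1 MiB ConfigMap?
# Assumptions (illustrative, not taken from Flink source): keys look like
# "checkpointID-<zero-padded id>" and each serialized value is ~1 KB.

CM_LIMIT_BYTES = 1_048_576   # Kubernetes object size limit (1 MiB)
VALUE_BYTES = 1000           # assumed serialized handle size per entry

def entry_size(checkpoint_id: int) -> int:
    # assumed 19-digit zero-padded key format
    key = f"checkpointID-{checkpoint_id:019d}"
    return len(key) + VALUE_BYTES

def entries_until_limit() -> int:
    total, n = 0, 0
    while total + entry_size(n + 1) <= CM_LIMIT_BYTES:
        total += entry_size(n + 1)
        n += 1
    return n

print(entries_until_limit())  # roughly 1,000 entries under these assumptions
```

Under these assumptions the ConfigMap fills up after roughly a thousand uncleaned entries, which is in the same ballpark as the 1040 records observed here.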
 
The "Flink was not able to determine whether the metadata was successfully 
persisted." error actually starts when the CM appends record "1040". Please see 
the logs below. The bottom one is the first error log, which complains about 
record "1041". I think that makes sense: since "1041" was never recorded in the 
CM, Flink can't determine whether its metadata was successfully persisted.
!image-2023-05-03-13-47-51-440.png|width=1579,height=861!
 
The Flink dashboard log also supports this assumption.
!image-2023-05-03-13-51-21-685.png|width=1473,height=783!
 
 
My guess is that, in our case, Flink never cleaned up any of the checkpoint 
records in the CM at all.
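For contrast, if the completed-checkpoint store were subsuming old entries as intended, the ConfigMap would stay bounded no matter how many checkpoints complete. A minimal sketch of that bounded-retention behavior (a plain Python OrderedDict standing in for the ConfigMap data; the retention count is analogous to Flink's state.checkpoints.num-retained, and the key format and S3 paths are hypothetical):

```python
from collections import OrderedDict

MAX_RETAINED = 3  # analogous to state.checkpoints.num-retained

def add_checkpoint(cm_data, cp_id: int, handle: str) -> None:
    """Record a completed checkpoint and subsume the oldest beyond the limit."""
    cm_data[f"checkpointID-{cp_id:019d}"] = handle
    while len(cm_data) > MAX_RETAINED:
        cm_data.popitem(last=False)  # drop the oldest entry

cm = OrderedDict()
for i in range(1, 1041):  # 1040 completed checkpoints, as in this report
    add_checkpoint(cm, i, f"s3://bucket/chk-{i}/_metadata")

print(len(cm))  # stays at MAX_RETAINED, so the ConfigMap size is bounded
```

With cleanup working, the entry count (and hence the ConfigMap size) stays constant; the observed 1040 consecutive entries indicate the subsume/cleanup step never ran.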
 

> ConfigMap DataSize went > 1 MB and cluster stopped working
> --
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.2.0
>Reporter: Sriram Ganesh
>Priority: Major
> Attachments: dump_cm.yaml, image-2023-04-19-09-48-19-089.png, 
> image-2023-05-03-13-47-51-440.png, image-2023-05-03-13-50-54-783.png, 
> image-2023-05-03-13-51-21-685.png, jobmanager_log.txt, 
> parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. 
> Flink jobs failed with the below error. It seems the config map size went 
> beyond 1 MB (the default limit). 
> Since it is managed by the operator and config maps are not updated with any 
> manual intervention, I suspect it could be an operator issue. 
>  
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: PUT at: 
> https:///api/v1/namespaces//configmaps/-config-map. Message: 
> ConfigMap "-config-map" is invalid: []: Too long: must have at most 
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must 
> have at most 1048576 bytes, reason=FieldValueTooLong, 
> additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, 
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
> message=ConfigMap "-config-map" is invalid: []: Too long: must have at 
> most 1048576 bytes, metadata=ListMeta(_continue=null, 
> remainingItemCount=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188)
>  

[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: image-2023-05-03-13-50-54-783.png




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: image-2023-05-03-13-51-21-685.png



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: image-2023-05-03-13-47-51-440.png



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: 
parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: (was: image-2023-05-03-13-26-51-992.png)



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: (was: image-2023-05-03-13-27-58-256.png)



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: (was: image-2023-05-03-13-27-44-449.png)

> ConfigMap DataSize went > 1 MB and cluster stopped working
> --
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.2.0
>Reporter: Sriram Ganesh
>Priority: Major
> Attachments: dump_cm.yaml, image-2023-04-19-09-48-19-089.png, 
> image-2023-05-03-13-26-51-992.png, jobmanager_log.txt
>
>
> I am Flink Operator to manage clusters. Flink version: 1.15.2. Flink jobs 
> failed with the below error. It seems the config map size went beyond 1 MB 
> (default size). 
> Since it is managed by the operator and config maps are not updated with any 
> manual intervention, I suspect it could be an operator issue. 
>  
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: PUT at: 
> https:///api/v1/namespaces//configmaps/-config-map. Message: 
> ConfigMap "-config-map" is invalid: []: Too long: must have at most 
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must 
> have at most 1048576 bytes, reason=FieldValueTooLong, 
> additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, 
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
> message=ConfigMap "-config-map" is invalid: []: Too long: must have at 
> most 1048576 bytes, metadata=ListMeta(_continue=null, 
> remainingItemCount=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>  ~[?:?]
> ... 3 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
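The `1048576 bytes` figure in the stack trace above is the Kubernetes API server's hard cap on the serialized size of a ConfigMap. As a rough diagnostic for a dump like the attached dump_cm.yaml, one can total the bytes per key across `data` and `binaryData`. The sketch below is illustrative only (the helper name and sample ConfigMap are made up, not from this issue):

```python
# Sketch: estimate per-key payload size of a ConfigMap dump.
# Kubernetes rejects ConfigMaps whose serialized size exceeds 1 MiB,
# which is the failure mode reported in this issue.
import base64

LIMIT = 1_048_576  # 1 MiB ConfigMap size cap enforced by the API server

def key_sizes(config_map: dict) -> dict:
    """Return approximate bytes stored under each key.

    'data' values are plain strings; 'binaryData' values are base64-encoded,
    so they are decoded before measuring.
    """
    sizes = {}
    for key, value in config_map.get("data", {}).items():
        sizes[key] = len(value.encode("utf-8"))
    for key, value in config_map.get("binaryData", {}).items():
        sizes[key] = len(base64.b64decode(value))
    return sizes

# Hypothetical ConfigMap with one oversized entry.
cm = {"data": {"bloated-key": "x" * 2_000_000, "small-key": "ok"}}
sizes = key_sizes(cm)
print(sorted(sizes.items(), key=lambda kv: -kv[1]))
print("over limit:", sum(sizes.values()) > LIMIT)
```

This only approximates the true serialized size (it ignores metadata and key names), but it is usually enough to spot which entry is responsible for blowing past the limit.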


[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: (was: image-2023-05-03-13-27-50-513.png)



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: image-2023-05-03-13-27-50-513.png



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: image-2023-05-03-13-27-58-256.png



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: image-2023-05-03-13-26-51-992.png



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-05-02 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: image-2023-05-03-13-27-44-449.png



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-27 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717014#comment-17717014
 ] 

Zhihao Chen commented on FLINK-31135:
-

[~Swathi Chandrashekar], please see the attached JobManager log for this issue. I didn't find any error message about discarding completed checkpoints, though.

[^jobmanager_log.txt]
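For context on why the JobManager log matters here: with Kubernetes HA enabled, Flink records a pointer to each completed checkpoint in the HA ConfigMap (under keys prefixed `checkpointID-`, if memory serves), so if checkpoint discarding keeps failing, those entries accumulate until the ConfigMap hits the 1 MiB cap. A quick, illustrative way to check a dumped ConfigMap for such buildup (the helper and sample data below are hypothetical, not from this issue):

```python
# Sketch: count retained-checkpoint entries in a dumped Flink HA ConfigMap.
# With state.checkpoints.num-retained set to 5, seeing many more
# checkpointID-* keys than 5 suggests stale entries are never discarded.
import re

CHECKPOINT_KEY = re.compile(r"^checkpointID-\d+$")

def checkpoint_keys(config_map: dict) -> list:
    """Return the sorted checkpoint-pointer keys found in the 'data' section."""
    return sorted(k for k in config_map.get("data", {}) if CHECKPOINT_KEY.match(k))

# Hypothetical dump where 12 checkpoint pointers have piled up.
cm = {"data": {f"checkpointID-{i:019d}": "<serialized handle>" for i in range(12)}}
keys = checkpoint_keys(cm)
print(len(keys))  # many more entries than num-retained would allow
```

On a real dump one would load the YAML first (e.g. with PyYAML) and pass the parsed mapping in; the counting logic stays the same.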



[jira] [Updated] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-27 Thread Zhihao Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihao Chen updated FLINK-31135:

Attachment: jobmanager_log.txt



[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-25 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716479#comment-17716479
 ] 

Zhihao Chen commented on FLINK-31135:
-

Hi [~Swathi Chandrashekar], may I ask whether there is any update on this?



[jira] [Comment Edited] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-18 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713796#comment-17713796
 ] 

Zhihao Chen edited comment on FLINK-31135 at 4/18/23 11:49 PM:
---

Hi [~Swathi Chandrashekar], in my case state.checkpoints.num-retained is always set to 5 for our Flink jobs, but it looks like that is not respected. Please see the snippet below from the FlinkDeployment managed by the flink-kubernetes-operator.

 

 
{code:yaml}
apiVersion: v1
items:
- apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    creationTimestamp: "2023-04-04T03:02:25Z"
    finalizers:
    - flinkdeployments.flink.apache.org/finalizer
    generation: 2
    labels:
      instanceId: parked-logs-ingestion-16805773-a96408
      jobName: parked-logs-ingestion-16805773
    name: parked-logs-ingestion-16805773-a96408
    namespace: parked-logs-ingestion-16805773-a96408
    resourceVersion: "533476748"
    uid: 182b9c7e-74cc-490b-8045-9fddaa7b8aa9
  spec:
    flinkConfiguration:
      execution.checkpointing.externalized-checkpoint-retention: 
RETAIN_ON_CANCELLATION
      execution.checkpointing.interval: "6"
      execution.checkpointing.max-concurrent-checkpoints: "1"
      execution.checkpointing.min-pause: 5s
      execution.checkpointing.mode: EXACTLY_ONCE
      execution.checkpointing.prefer-checkpoint-for-recovery: "true"
      execution.checkpointing.timeout: 60min
      high-availability: 
org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
      high-availability.storageDir: 
s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/ha
      jobmanager.memory.process.size: 1024m
      metrics.reporter.stsd.factory.class: 
org.apache.flink.metrics.statsd.StatsDReporterFactory
      metrics.reporter.stsd.host: localhost
      metrics.reporter.stsd.interval: 30 SECONDS
      metrics.reporter.stsd.port: "8125"
      metrics.reporters: stsd
      metrics.scope.jm: jobmanager
      metrics.scope.jm.job: jobmanager.
      metrics.scope.operator: taskmanager..
      metrics.scope.task: taskmanager..
      metrics.scope.tm: taskmanager
      metrics.scope.tm.job: taskmanager.
      metrics.system-resource: "true"
      metrics.system-resource-probing-interval: "3"
      restart-strategy: fixed-delay
      restart-strategy.fixed-delay.attempts: "2147483647"
      state.backend: hashmap
      state.checkpoint-storage: filesystem
      state.checkpoints.dir: 
s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/checkpoints
      state.checkpoints.num-retained: "5"
      state.savepoints.dir: 
s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/savepoints
      taskmanager.memory.managed.size: "0"
      taskmanager.memory.network.fraction: "0.1"
      taskmanager.memory.network.max: 1000m
      taskmanager.memory.network.min: 64m
      taskmanager.memory.process.size: 2048m
      taskmanager.numberOfTaskSlots: "10"
      web.cancel.enable: "false"
    flinkVersion: v1_15

 {code}
In the UI:

!image-2023-04-19-09-48-19-089.png|width=2730,height=1786!

I hit the same issue before we switched to the flink-kubernetes-operator, when we were running a Flink standalone deployment on Kubernetes with state.checkpoints.num-retained set to 5.
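To illustrate why retention alone may not cap the ConfigMap size: with Kubernetes HA, each retained checkpoint contributes an entry to the HA ConfigMap, and an entry only goes away once the superseded checkpoint is actually discarded. The sketch below is a simplified model, not Flink's actual implementation (the key format and pointer values are made up), showing how entries accumulate when discards keep failing, e.g. because of S3 deletion errors:

```python
# Simplified model of checkpoint entries in an HA ConfigMap's data map.
# Key names and pointer values are illustrative, not Flink's real format.
NUM_RETAINED = 5  # corresponds to state.checkpoints.num-retained

def complete_checkpoint(cm_data: dict, checkpoint_id: int, discard_ok: bool) -> None:
    """Record a completed checkpoint; drop the superseded one only if its
    backing files were successfully discarded."""
    cm_data[f"checkpointID-{checkpoint_id:019d}"] = f"<pointer to chk-{checkpoint_id}>"
    stale_id = checkpoint_id - NUM_RETAINED
    if stale_id >= 0 and discard_ok:
        cm_data.pop(f"checkpointID-{stale_id:019d}", None)

healthy, broken = {}, {}
for i in range(100):
    complete_checkpoint(healthy, i, discard_ok=True)   # old checkpoints discarded
    complete_checkpoint(broken, i, discard_ok=False)   # e.g. S3 deletes failing
print(len(healthy), len(broken))  # 5 vs 100 entries
```

Under this model, num-retained bounds the entry count only while every discard succeeds; if discards silently fail, the ConfigMap keeps growing toward the 1 MiB object limit regardless of the configured retention.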

 


[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-16 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712878#comment-17712878
 ] 

Zhihao Chen commented on FLINK-31135:
-

Hi [~Swathi Chandrashekar] , please see the attached configmap file:

[^dump_cm.yaml]

 

The error shown in the Flink dashboard is:

*Checkpoint Detail:*
*Path:* - *Discarded:* - *Checkpoint Type:* aligned checkpoint *Failure Message:*
{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.32.228.1/api/v1/namespaces/parked-logs-ingestion-16805773-a96408/configmaps/parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map. Message: ConfigMap "parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map" is invalid: []: Too long: must have at most 1048576 bytes. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have at most 1048576 bytes, reason=FieldValueTooLong, additionalProperties={})], group=null, kind=ConfigMap, name=parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap "parked-logs-ingestion-16805773-a96408-110331249bb495a4d23b4d69849c8224-config-map" is invalid: []: Too long: must have at most 1048576 bytes, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
{code}
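As a quick sanity check, the approximate size of a ConfigMap's data map can be computed offline from a dump such as [^dump_cm.yaml] and compared against the 1048576-byte limit in the error above. This is only a sketch: the server-side limit applies to the whole serialized object, not just the data map, and the sample keys and S3 paths below are made up.

```python
# Rough size check against the Kubernetes 1 MiB object limit that produces
# "Too long: must have at most 1048576 bytes" on PUT.
K8S_OBJECT_SIZE_LIMIT = 1048576  # bytes, enforced on the whole serialized object

def configmap_data_size(data: dict) -> int:
    """Approximate byte size of a ConfigMap's data map (keys plus values)."""
    return sum(len(k.encode("utf-8")) + len(v.encode("utf-8")) for k, v in data.items())

# Toy payload standing in for HA metadata entries (illustrative names only).
data = {f"checkpointID-{i:019d}": f"s3://my-bucket/checkpoints/chk-{i}/_metadata"
        for i in range(5)}
size = configmap_data_size(data)
print(size, size < K8S_OBJECT_SIZE_LIMIT)
```

In practice, something like `kubectl get configmap <name> -o yaml | wc -c` gives a closer reading of the serialized size the API server actually validates.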



[jira] [Comment Edited] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-16 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712854#comment-17712854
 ] 

Zhihao Chen edited comment on FLINK-31135 at 4/17/23 1:48 AM:
--

I have encountered the same issue; in fact, it is an ongoing issue for us. I believe it has nothing to do with the flink-kubernetes-operator, as it happened with both the Flink standalone Kubernetes deployment and the flink-kubernetes-operator deployment.

I have checked our configuration but didn't find anything unusual.





[jira] [Commented] (FLINK-31135) ConfigMap DataSize went > 1 MB and cluster stopped working

2023-04-16 Thread Zhihao Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712854#comment-17712854
 ] 

Zhihao Chen commented on FLINK-31135:
-

I have encountered the same issue; in fact, it's an ongoing issue for us. I 
believe it has nothing to do with the flink-kubernetes-operator, as it 
happened both with a Flink standalone Kubernetes deployment and with a 
flink-kubernetes-operator deployment.

I have checked our configuration but didn't find anything interesting.
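For context on how the ConfigMap can grow this large: with Kubernetes HA, Flink keeps pointers to completed checkpoints in the JobManager's HA ConfigMap, and if old entries are never cleaned up (e.g. because deletes against the checkpoint storage fail, as the S3 deletion issue mentioned later in this thread suggests), the map eventually crosses the 1048576-byte object limit from the error above. A minimal sketch for estimating whether a ConfigMap's data section is approaching that limit — all helper names here are hypothetical, not part of any Flink or Kubernetes API:

```python
# Approximate check against the 1048576-byte ConfigMap limit from the error
# above. The limit applies to the whole serialized object, so summing the
# byte lengths of keys and values is a lower-bound estimate, not the exact
# size the API server computes.
MAX_CONFIGMAP_BYTES = 1048576

def configmap_data_size(data: dict) -> int:
    """Approximate size in bytes of a ConfigMap's data section."""
    return sum(len(k.encode("utf-8")) + len(v.encode("utf-8"))
               for k, v in data.items())

def exceeds_limit(data: dict, limit: int = MAX_CONFIGMAP_BYTES) -> bool:
    return configmap_data_size(data) > limit

# Example: an HA ConfigMap accumulating completed-checkpoint pointers.
data = {"checkpointID-%07d" % i: "s3://bucket/chk-%d/_metadata" % i
        for i in range(10)}
print(exceeds_limit(data))  # 10 small entries -> False
```

If checkpoint entries were never removed, each new checkpoint would add another key/value pair until the estimate (and the real serialized size) exceeds the limit.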

> ConfigMap DataSize went > 1 MB and cluster stopped working
> --
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.2.0
>Reporter: Sriram Ganesh
>Priority: Major
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. 
> Flink jobs failed with the error below. It seems the ConfigMap size went 
> beyond 1 MB (the default limit). 
> Since the ConfigMaps are managed by the operator and are not updated 
> manually, I suspect it could be an operator issue. 
>  
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: PUT at: 
> https:///api/v1/namespaces//configmaps/-config-map. Message: 
> ConfigMap "-config-map" is invalid: []: Too long: must have at most 
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must 
> have at most 1048576 bytes, reason=FieldValueTooLong, 
> additionalProperties={})], group=null, kind=ConfigMap, name=-config-map, 
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
> message=ConfigMap "-config-map" is invalid: []: Too long: must have at 
> most 1048576 bytes, metadata=ListMeta(_continue=null, 
> remainingItemCount=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>  ~[?:?]
> ... 3 more {code}





[jira] [Created] (FLINK-6610) WebServer could not be created,when set the "jobmanager.web.submit.enable" to false

2017-05-17 Thread zhihao chen (JIRA)
zhihao chen created FLINK-6610:
--

 Summary: WebServer could not be created when "jobmanager.web.submit.enable" 
is set to false
 Key: FLINK-6610
 URL: https://issues.apache.org/jira/browse/FLINK-6610
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
Affects Versions: 1.3.0
Reporter: zhihao chen
Assignee: zhihao chen


The WebServer cannot be created when "jobmanager.web.submit.enable" is set to 
false, because WebFrontendBootstrap requires uploadDir to be non-null:
this.uploadDir = Preconditions.checkNotNull(directory);
{code}
2017-05-17 15:15:46,938 ERROR 
org.apache.flink.runtime.webmonitor.WebMonitorUtils   - WebServer could 
not be created
java.lang.NullPointerException
at 
org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:58)
at 
org.apache.flink.runtime.webmonitor.utils.WebFrontendBootstrap.(WebFrontendBootstrap.java:73)
at 
org.apache.flink.runtime.webmonitor.WebRuntimeMonitor.(WebRuntimeMonitor.java:359)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.flink.runtime.webmonitor.WebMonitorUtils.startWebRuntimeMonitor(WebMonitorUtils.java:135)
at 
org.apache.flink.runtime.clusterframework.BootstrapTools.createWebMonitorIfConfigured(BootstrapTools.java:242)
at 
org.apache.flink.yarn.YarnApplicationMasterRunner.runApplicationMaster(YarnApplicationMasterRunner.java:352)
at 
org.apache.flink.yarn.YarnApplicationMasterRunner$1.call(YarnApplicationMasterRunner.java:195)
at 
org.apache.flink.yarn.YarnApplicationMasterRunner$1.call(YarnApplicationMasterRunner.java:192)
at 
org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at 
org.apache.flink.yarn.YarnApplicationMasterRunner.run(YarnApplicationMasterRunner.java:192)
at 
org.apache.flink.yarn.YarnApplicationMasterRunner.main(YarnApplicationMasterRunner.java:116)
{code}
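The fix idea implied by the null check above is to always hand the bootstrap a real directory, even when job submission is disabled. A minimal sketch in Python (the actual fix lives in Flink's Java code; `resolve_upload_dir` is a hypothetical illustration, not a real Flink API):

```python
import tempfile
from pathlib import Path

def resolve_upload_dir(configured_dir, submit_enabled: bool) -> Path:
    """Always return a real path: fall back to a throwaway temporary
    directory when job submission is disabled and no directory was
    configured, instead of passing null to the bootstrap."""
    if configured_dir is not None:
        return Path(configured_dir)
    if not submit_enabled:
        # Submission is off, but the bootstrap still null-checks the dir:
        # give it a disposable location rather than None.
        return Path(tempfile.mkdtemp(prefix="flink-web-upload-"))
    raise ValueError('"jobmanager.web.submit.enable" is true '
                     "but no upload directory is configured")

print(resolve_upload_dir(None, submit_enabled=False))  # a fresh temp directory
```

With this guard the NullPointerException path above is never reached when submission is disabled.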



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-6477) The first time to click Taskmanager cannot get the actual data

2017-05-10 Thread zhihao chen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004314#comment-16004314
 ] 

zhihao chen commented on FLINK-6477:


I ran a test comparing the first request with the second: the first request 
cannot get the metrics data, but the second can.

I have an idea; I'm not sure whether it is feasible, please help check it. 
We visit the TM page in these steps:
* Overview -> Task Managers -> Task Manager Metrics

Could we send the request for the metrics data already in the second step?

> The first time to click Taskmanager cannot get the actual data
> --
>
> Key: FLINK-6477
> URL: https://issues.apache.org/jira/browse/FLINK-6477
> Project: Flink
>  Issue Type: Bug
>  Components: Webfrontend
>Affects Versions: 1.2.0
>Reporter: zhihao chen
>Assignee: zhihao chen
> Attachments: errDisplay.jpg
>
>
> On the Flink web UI, the first click on a TaskManager shows less than the 
> actual data. When the parameter "jobmanager.web.refresh-interval" is set to 
> a larger value, e.g. 180, you have to wait for that timeout before the page 
> displays normally, unless you refresh it manually.





[jira] [Comment Edited] (FLINK-5901) DAG can not show properly in IE

2017-05-09 Thread zhihao chen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003962#comment-16003962
 ] 

zhihao chen edited comment on FLINK-5901 at 5/10/17 3:10 AM:
-

[~StephanEwen][~WangTao]
I encountered the same problem and confirmed it following 
[FLINK-5902|https://issues.apache.org/jira/browse/FLINK-5902], but that does 
not solve it. I found that we use the foreignObject element to draw the SVG 
graph, which may be the reason.

In IE9 mode, IE10 mode, and IE11 mode (all versions), the foreignObject 
element is not supported.
[https://msdn.microsoft.com/en-us/library/hh834675%28v=vs.85%29.aspx]


was (Author: chenzio):
[~StephanEwen][~WangTao]
I encountered the same problem and confirmed it following 
[FLINK-5902|https://issues.apache.org/jira/browse/FLINK-5902], but that does 
not solve it. I found that we use the foreignObject element to draw the SVG 
graph, which may be the reason.

[2.1.24 [SVG11] Section 23.3, The 'foreignObject' 
element|https://msdn.microsoft.com/en-us/library/hh834675%28v=vs.85%29.aspx]

> DAG can not show properly in IE
> ---
>
> Key: FLINK-5901
> URL: https://issues.apache.org/jira/browse/FLINK-5901
> Project: Flink
>  Issue Type: Bug
>  Components: Webfrontend
> Environment: IE 11
>Reporter: Tao Wang
>Priority: Critical
> Attachments: using chrom(same job).png, using IE.png
>
>
> The DAG of running jobs cannot be shown properly in IE11 (I am using 
> 11.0.9600.18059, but assume the same for IE9). The description of the task 
> is not shown within the rectangle.
> Chrome works fine. I pasted screenshots under IE and Chrome below.





[jira] [Commented] (FLINK-5901) DAG can not show properly in IE

2017-05-09 Thread zhihao chen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003962#comment-16003962
 ] 

zhihao chen commented on FLINK-5901:


[~StephanEwen][~WangTao]
I encountered the same problem and confirmed it following 
[FLINK-5902|https://issues.apache.org/jira/browse/FLINK-5902], but that does 
not solve it. I found that we use the foreignObject element to draw the SVG 
graph, which may be the reason.

[2.1.24 [SVG11] Section 23.3, The 'foreignObject' 
element|https://msdn.microsoft.com/en-us/library/hh834675%28v=vs.85%29.aspx]

> DAG can not show properly in IE
> ---
>
> Key: FLINK-5901
> URL: https://issues.apache.org/jira/browse/FLINK-5901
> Project: Flink
>  Issue Type: Bug
>  Components: Webfrontend
> Environment: IE 11
>Reporter: Tao Wang
>Priority: Critical
> Attachments: using chrom(same job).png, using IE.png
>
>
> The DAG of running jobs cannot be shown properly in IE11 (I am using 
> 11.0.9600.18059, but assume the same for IE9). The description of the task 
> is not shown within the rectangle.
> Chrome works fine. I pasted screenshots under IE and Chrome below.





[jira] [Commented] (FLINK-6477) The first time to click Taskmanager cannot get the actual data

2017-05-09 Thread zhihao chen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003951#comment-16003951
 ] 

zhihao chen commented on FLINK-6477:


Hi Chesnay Schepler:
Yes, this problem is only visible on the first access: we cannot get the 
metrics data from the TM the first time. But if we repeat the request as 
follows, the display is normal; I don't quite understand why.
{code}
.controller 'SingleTaskManagerController', ($scope, $stateParams, SingleTaskManagerService, $interval, flinkConfig) ->
  $scope.metrics = {}
  SingleTaskManagerService.loadMetrics($stateParams.taskmanagerid).then (data) ->
    $scope.metrics = data[0]
  SingleTaskManagerService.loadMetrics($stateParams.taskmanagerid).then (data) ->
    $scope.metrics = data[0]
{code}
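The CoffeeScript above simply issues the same request twice. The workaround can be sketched more generically as a retry-once-on-empty fetch; `load_metrics_with_retry` and the simulated service below are hypothetical illustrations, not Flink APIs:

```python
def load_metrics_with_retry(fetch, retries: int = 1):
    """fetch() stands in for SingleTaskManagerService.loadMetrics(...).
    If the first response is empty, repeat the request up to `retries`
    times; per the thread, the second request returns the real data."""
    data = fetch()
    while not data and retries > 0:
        data = fetch()  # repeat the request
        retries -= 1
    return data

# Simulated service: the first call returns nothing, the second returns data.
responses = iter([[], [{"heap.used": 1234}]])
print(load_metrics_with_retry(lambda: next(responses)))
```

This avoids always doubling the request volume: the second fetch only fires when the first one comes back empty.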


> The first time to click Taskmanager cannot get the actual data
> --
>
> Key: FLINK-6477
> URL: https://issues.apache.org/jira/browse/FLINK-6477
> Project: Flink
>  Issue Type: Bug
>  Components: Webfrontend
>Affects Versions: 1.2.0
>Reporter: zhihao chen
>Assignee: zhihao chen
> Attachments: errDisplay.jpg
>
>
> On the Flink web UI, the first click on a TaskManager shows less than the 
> actual data. When the parameter "jobmanager.web.refresh-interval" is set to 
> a larger value, e.g. 180, you have to wait for that timeout before the page 
> displays normally, unless you refresh it manually.





[jira] [Updated] (FLINK-6477) The first time to click Taskmanager cannot get the actual data

2017-05-07 Thread zhihao chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihao chen updated FLINK-6477:
---
Attachment: errDisplay.jpg

> The first time to click Taskmanager cannot get the actual data
> --
>
> Key: FLINK-6477
> URL: https://issues.apache.org/jira/browse/FLINK-6477
> Project: Flink
>  Issue Type: Bug
>  Components: Web Client
>Affects Versions: 1.2.0
>Reporter: zhihao chen
>Assignee: zhihao chen
> Attachments: errDisplay.jpg
>
>
> On the Flink web UI, the first click on a TaskManager shows less than the 
> actual data. When the parameter "jobmanager.web.refresh-interval" is set to 
> a larger value, e.g. 180, you have to wait for that timeout before the page 
> displays normally, unless you refresh it manually.





[jira] [Updated] (FLINK-6477) The first time to click Taskmanager cannot get the actual data

2017-05-07 Thread zhihao chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihao chen updated FLINK-6477:
---
Affects Version/s: 1.2.0
  Component/s: Web Client

> The first time to click Taskmanager cannot get the actual data
> --
>
> Key: FLINK-6477
> URL: https://issues.apache.org/jira/browse/FLINK-6477
> Project: Flink
>  Issue Type: Bug
>  Components: Web Client
>Affects Versions: 1.2.0
>Reporter: zhihao chen
>Assignee: zhihao chen
>
> On the Flink web UI, the first click on a TaskManager shows less than the 
> actual data. When the parameter "jobmanager.web.refresh-interval" is set to 
> a larger value, e.g. 180, you have to wait for that timeout before the page 
> displays normally, unless you refresh it manually.





[jira] [Created] (FLINK-6477) The first time to click Taskmanager cannot get the actual data

2017-05-07 Thread zhihao chen (JIRA)
zhihao chen created FLINK-6477:
--

 Summary: The first time to click Taskmanager cannot get the actual 
data
 Key: FLINK-6477
 URL: https://issues.apache.org/jira/browse/FLINK-6477
 Project: Flink
  Issue Type: Bug
Reporter: zhihao chen
Assignee: zhihao chen


On the Flink web UI, the first click on a TaskManager shows less than the 
actual data. When the parameter "jobmanager.web.refresh-interval" is set to a 
larger value, e.g. 180, you have to wait for that timeout before the page 
displays normally, unless you refresh it manually.


