OK – I will increase the value to something higher and see how it does in recovering. Thank you for your help!
___
Julian Cardarelli
CEO
T (800) 961-1549
E jul...@thentia.com

DISCLAIMER
Neither Thentia Corporation, nor its directors, officers, shareholders, representatives, employees, non-arms length companies, subsidiaries, parent, affiliated brands and/or agencies are licensed to provide legal advice. This e-mail may contain among other things legal information. We disclaim any and all responsibility for the content of this e-mail. YOU MUST NOT rely on any of our communications as legal advice. Only a licensed legal professional may give you advice. Our communications are never provided as legal advice, because we are not licensed to provide legal advice nor do we possess the knowledge, skills or capacity to provide legal advice. We disclaim any and all responsibility related to any action you might take based upon our communications and emphasize the need for you to never rely on our communications as the basis of any claim or proceeding.

CONFIDENTIALITY
This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. This message contains confidential information and is intended only for the individual(s) named. If you are not the named addressee(s) you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
From: Guowei Ma <guowei....@gmail.com>
Sent: Thursday, September 2, 2021 11:32 PM
To: Julian Cardarelli <jul...@thentia.com>
Cc: user <user@flink.apache.org>
Subject: [External] Re: Flink on Kubernetes

Hi, Julian

I notice that your configuration includes "restart-strategy.fixed-delay.attempts: 10". This means the job fails permanently after 10 failed restart attempts. That may be why the job does not restart again, so you could try increasing this value. I am not sure this is the root cause, though. If it does not help, please share the logs from that time and the Flink version you use.

Best,
Guowei

On Fri, Sep 3, 2021 at 2:00 AM Julian Cardarelli <jul...@thentia.com> wrote:

Hello –

We have implemented Flink on Kubernetes with Google Cloud Storage in a high-availability configuration, as per the configmap below. Everything appears to be working normally, and state is being saved to GCS.

However, every now and then, perhaps weekly or every other week, all of the submitted jobs are lost and the cluster appears completely reset. Perhaps GKE is doing maintenance or something of this nature, but the point is that the cluster does not come back from this in an operational state with all jobs running.

Is there something we are missing?

Thanks!
-jc

apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: flink
data:
  flink-conf.yaml: |+
    jobmanager.rpc.address: flink-jobmanager
    taskmanager.numberOfTaskSlots: 1
    blob.server.port: 6124
    jobmanager.rpc.port: 6123
    taskmanager.rpc.port: 6122
    jobmanager.heap.size: 1024m
    taskmanager.memory.process.size: 1024m
    kubernetes.cluster-id: cluster1
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: gs://storage-uswest.yyy.com/kubernetes-flink
    state.backend: filesystem
    state.checkpoints.dir: gs://storage-uswest.yyy.com/kubernetes-checkpoint
    state.savepoints.dir: gs://storage-uswest.yyy.com/kubernetes-savepoint
    execution.checkpointing.interval: 3min
    execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
    execution.checkpointing.max-concurrent-checkpoints: 1
    execution.checkpointing.min-pause: 0
    execution.checkpointing.mode: EXACTLY_ONCE
    execution.checkpointing.timeout: 10min
    execution.checkpointing.tolerable-failed-checkpoints: 0
    execution.checkpointing.unaligned: false
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 10s
  log4j.properties: |+
    log4j.rootLogger=INFO, file
    log4j.logger.akka=INFO
    log4j.logger.org.apache.kafka=INFO
    log4j.logger.org.apache.hadoop=INFO
    log4j.logger.org.apache.zookeeper=INFO
    log4j.appender.file=org.apache.log4j.FileAppender
    log4j.appender.file.file=${log.file}
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
    log4j.logger.org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, file
Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it.
If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. This email has been scanned for viruses and malware, and may have been automatically archived by Mimecast, a leader in email security and cyber resilience. Mimecast integrates email defenses with brand protection, security awareness training, web security, compliance and other essential capabilities. Mimecast helps protect large and small organizations from malicious activity, human error and technology failure; and to lead the movement toward building a more resilient world. To find out more, visit our website.
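
For reference, the change discussed in this thread (raising restart-strategy.fixed-delay.attempts) might look like the fragment below inside the flink-conf.yaml section of the ConfigMap. This is only a sketch: the keys are the ones from the posted configuration, but the value 100 is an arbitrary illustration and was not suggested by anyone in the thread. As Guowei notes, it is not confirmed that the restart limit is the root cause.

```yaml
# Restart settings from the posted flink-conf.yaml, with a higher attempt limit.
restart-strategy: fixed-delay
# Was 10 in the original configuration; 100 is an assumed, illustrative value.
restart-strategy.fixed-delay.attempts: 100
# Unchanged from the original configuration.
restart-strategy.fixed-delay.delay: 10s
```

With the fixed-delay strategy, the job is declared failed once the configured number of restart attempts is exhausted, so a larger limit gives the job more chances to come back after a longer disruption.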