Hi Julian,

I noticed that your configuration includes "restart-strategy.fixed-delay.attempts: 10". This means the job fails permanently once its 10 restart attempts are used up. That may be why the jobs do not restart again, so you could try increasing this value. I am not sure this is the root cause, though. If increasing it does not help, could you share the logs from around the time of the reset and the Flink version you are using?
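
For illustration, a higher limit could look like the fragment below. This is only a sketch: the value 100 is an arbitrary example rather than a recommendation, and the other two lines simply repeat the keys from your config. Note that the delay entry in the quoted config appears without a colon after the key; I am assuming that is a copy/paste artifact, so it is written here in the usual "key: value" form.

```yaml
# Sketch only: 100 is an arbitrary illustrative value, not a tuned recommendation.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 100
# Written as "key: value"; the quoted config shows this line without the colon.
restart-strategy.fixed-delay.delay: 10s
```
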
Best,
Guowei

On Fri, Sep 3, 2021 at 2:00 AM Julian Cardarelli <[email protected]> wrote:

> Hello –
>
> We have implemented Flink on Kubernetes with Google Cloud Storage in high
> availability configuration as per the below configmap. Everything appears
> to be working normally, state is being saved to GCS.
>
> However, every now and then – perhaps weekly or every other week, all of
> the submitted jobs are lost and the cluster appears completely reset.
> Perhaps GKE is doing maintenance or something of this nature, but the point
> being that the cluster does not resume from this activity in an operational
> state with all jobs placed into running status.
>
> Is there something we are missing? Thanks!
>
> -jc
>
> apiVersion: v1
> kind: ConfigMap
> metadata:
>   name: flink-config
>   labels:
>     app: flink
> data:
>   flink-conf.yaml: |+
>     jobmanager.rpc.address: flink-jobmanager
>     taskmanager.numberOfTaskSlots: 1
>     blob.server.port: 6124
>     jobmanager.rpc.port: 6123
>     taskmanager.rpc.port: 6122
>     jobmanager.heap.size: 1024m
>     taskmanager.memory.process.size: 1024m
>     kubernetes.cluster-id: cluster1
>     high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>     high-availability.storageDir: gs://storage-uswest.yyy.com/kubernetes-flink
>     state.backend: filesystem
>     state.checkpoints.dir: gs://storage-uswest.yyy.com/kubernetes-checkpoint
>     state.savepoints.dir: gs://storage-uswest.yyy.com/kubernetes-savepoint
>     execution.checkpointing.interval: 3min
>     execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
>     execution.checkpointing.max-concurrent-checkpoints: 1
>     execution.checkpointing.min-pause: 0
>     execution.checkpointing.mode: EXACTLY_ONCE
>     execution.checkpointing.timeout: 10min
>     execution.checkpointing.tolerable-failed-checkpoints: 0
>     execution.checkpointing.unaligned: false
>     restart-strategy: fixed-delay
>     restart-strategy.fixed-delay.attempts: 10
>     restart-strategy.fixed-delay.delay 10s
>
>   log4j.properties: |+
>     log4j.rootLogger=INFO, file
>     log4j.logger.akka=INFO
>     log4j.logger.org.apache.kafka=INFO
>     log4j.logger.org.apache.hadoop=INFO
>     log4j.logger.org.apache.zookeeper=INFO
>     log4j.appender.file=org.apache.log4j.FileAppender
>     log4j.appender.file.file=${log.file}
>     log4j.appender.file.layout=org.apache.log4j.PatternLayout
>     log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
>     log4j.logger.org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, file
>
> ___
> Julian Cardarelli
> CEO
> T (800) 961-1549
> E [email protected]
> LinkedIn <https://www.linkedin.com/in/julian-cardarelli/>
