I think your attached exception has been fixed via FLINK-22597[1]. Could you please have a try with the latest version.
Moreover, it is not the desired Flink behavior that TaskManager could not retrieve the new JobManager address and re-register successfully. I think you need to share the staled TaskManager logs so that we could move forward the debugging. [1]. https://issues.apache.org/jira/browse/FLINK-22597 Best, Yang Jerome Li <l...@vmware.com> 于2021年5月27日周四 上午4:54写道: > Hi Yang, > > > > Thanks for getting back to me. > > > > By “restart master node”, I mean do “kubctl get nodes” to find the node’s > role as master and “ssh” into one of master nodes as ubuntu user. Then run > “sudo /sbin/reboot -f” to restart the master node. > > > > It looks like The JobManager would cancel the running job and log this > after that. > > 2021-05-26 18:28:37,997 [INFO] > org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding > the results produced by task execution 34eb9f5009dc7cf07117e720e7d393de. > > 2021-05-26 18:28:37,999 [INFO] > org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore - > Suspending > > 2021-05-26 18:28:37,999 [INFO] > org.apache.flink.kubernetes.highavailability.KubernetesCheckpointIDCounter > - Shutting down. > > 2021-05-26 18:28:38,000 [INFO] > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job > 74fc5c858e50f5efc91db9ee16c17a8c has been suspended. > > 2021-05-26 18:28:38,007 [INFO] > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Suspending > SlotPool. > > 2021-05-26 18:28:38,007 [INFO] > org.apache.flink.runtime.jobmaster.JobMaster - Close > ResourceManager connection 5bac86fb0b5c984ef429225b8de82cc0: JobManager is > no longer the leader.. > > 2021-05-26 18:28:38,019 [INFO] > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl - JobManager > runner for job hogger (74fc5c858e50f5efc91db9ee16c17a8c) was granted > leadership with session id 14b9004a-3807-42e8-ac03-c0d77efe5611 at > akka.tcp://flink@hoggerflink-jobmanager:6123/user/rpc/jobmanager_2. > > 2021-05-26 18:28:38,292 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,292 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,292 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,293 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,293 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,293 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,293 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,293 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.LocalFencedMessage until processing > is started. > > 2021-05-26 18:28:38,293 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.LocalFencedMessage until processing > is started. > > 2021-05-26 18:28:38,295 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.LocalFencedMessage until processing > is started. > > 2021-05-26 18:28:38,295 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.LocalFencedMessage until processing > is started. > > 2021-05-26 18:28:38,295 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,296 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > > 2021-05-26 18:28:38,296 [INFO] > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.LocalFencedMessage until processing > is started. > > 2021-05-26 18:28:38,299 [ERROR] > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal > error occurred in the cluster entrypoint. > > org.apache.flink.util.FlinkException: JobMaster for job > 74fc5c858e50f5efc91db9ee16c17a8c failed. > > at > org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:887) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.dispatcher.Dispatcher.dispatcherJobFailed(Dispatcher.java:465) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$3(Dispatcher.java:426) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) > ~[?:?] > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) > ~[?:?] > > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) > ~[?:?] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at > akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.actor.Actor.aroundReceive(Actor.scala:517) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.actor.Actor.aroundReceive$(Actor.scala:515) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.actor.ActorCell.invoke(ActorCell.scala:561) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.Mailbox.run(Mailbox.scala:225) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.Mailbox.exec(Mailbox.scala:235) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at > akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [flink-dist_2.12-1.12.2.jar:1.12.2] > > Caused by: org.apache.flink.util.FlinkException: Could not start the job > manager. > > at > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.lambda$handleException$7(JobManagerRunnerImpl.java:456) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) > ~[?:?] > > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) > ~[?:?] > > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > ~[?:?] > > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) > ~[?:?] > > at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1044) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.OnComplete.internal(Future.scala:263) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.OnComplete.internal(Future.scala:261) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.actor.ActorRef.tell(ActorRef.scala:126) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleCallAsync(AkkaRpcActor.java:423) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:210) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:100) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > ... 21 more > > Caused by: java.util.concurrent.CompletionException: > java.lang.IllegalStateException: DefaultLeaderRetrievalService can only be > started once. > > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) > ~[?:?] > > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) > ~[?:?] > > at > java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:704) > ~[?:?] > > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > ~[?:?] > > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) > ~[?:?] > > at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1044) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.OnComplete.internal(Future.scala:263) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.OnComplete.internal(Future.scala:261) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at akka.actor.ActorRef.tell(ActorRef.scala:126) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleCallAsync(AkkaRpcActor.java:423) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:210) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:100) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > ... 21 more > > Caused by: java.lang.IllegalStateException: DefaultLeaderRetrievalService > can only be started once. > > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService.start(DefaultLeaderRetrievalService.java:89) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.jobmaster.JobMaster.startJobMasterServices(JobMaster.java:891) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:864) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.jobmaster.JobMaster.lambda$start$1(JobMaster.java:381) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleCallAsync(AkkaRpcActor.java:419) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:210) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:100) > ~[flink-dist_2.12-1.12.2.jar:1.12.2] > > ... 21 more > > 2021-05-26 18:28:38,310 [INFO] org.apache.flink.runtime.blob.BlobServer > - Stopped BLOB server at 0.0.0.0:6124 > > > > Eventually, it gets back to work but sometime not. Some of the taskmanager > not cannot identify the jobmanager address. I have to manually restart the > staled taskmanager. > > > > Is this the desired Flink behaviors? Or is it a bug? Or if I am missing > something? > > > > Best, > > Jerome > > > > > > *From: *Yang Wang <danrtsey...@gmail.com> > *Date: *Tuesday, May 25, 2021 at 1:03 AM > *To: *Jerome Li <l...@vmware.com> > *Cc: *user@flink.apache.org <user@flink.apache.org> > *Subject: *Re: Jobmanager Crashes with Kubernetes HA When Restart > Kubernetes Master Node > > By "restart master node", do you mean to restart the K8s master > component(e.g. APIServer, ETCD, etc.)? > > > > Even though the master components are restarted, the Flink JobManager and > TaskManager should eventually get to work. > > Could you please share the JobManager logs so that we could debug why it > crashed. > > > > > > Best, > > Yang > > > > Jerome Li <l...@vmware.com> 于2021年5月25日周二 上午3:43写道: > > Hi, > > > > I am running Flink v1.12.2 in Standalone mode on Kubernetes. I set > Kubernetes native as HA. > > > > The HA works well when either jobmanager or taskmanager pod lost or > crashes. > > > > But, when I restart master node, jobmanager pod will always crash and > restart. This results in the entire Flink cluster restart and most of > taskmanager pod will restart as well. > > > > I didn’t see this issue when using zookeeper as HA. Not sure if this is a > bug should be handle or there is some work around. > > > > > > Below is my Flink setting > > Job-Manager > > flink-conf.yaml: > > ---- > > jobmanager.rpc.address: streakerflink-jobmanager > > > > high-availability: > org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory > > high-availability.cluster-id: /streaker > > high-availability.jobmanager.port: 6123 > > high-availability.storageDir: > hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink > > kubernetes.cluster-id: streaker > > > > rest.address: streakerflink-jobmanager > > rest.bind-port: 8081 > > rest.port: 8081 > > > > state.checkpoints.dir: > hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink/streaker > > > > blob.server.port: 6124 > > metrics.internal.query-service.port: 6125 > > metrics.reporters: prom > > metrics.reporter.prom.class: > org.apache.flink.metrics.prometheus.PrometheusReporter > > metrics.reporter.prom.port: 9999 > > > > restart-strategy: fixed-delay > > restart-strategy.fixed-delay.attempts: 2147483647 > > restart-strategy.fixed-delay.delay: 5 s > > > > jobmanager.memory.process.size: 1768m > > > > parallelism.default: 1 > > > > task.cancellation.timeout: 2000 > > > > web.log.path: /opt/flink/log/output.log > > jobmanager.web.log.path: /opt/flink/log/output.log > > > > web.submit.enable: false > > > > Task-Manager > > flink-conf.yaml: > > ---- > > jobmanager.rpc.address: streakerflink-jobmanager > > > > high-availability: > org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory > > high-availability.cluster-id: /streaker > > high-availability.storageDir: > hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink > > kubernetes.cluster-id: streaker > > > > taskmanager.network.bind-policy: ip > > > > taskmanager.data.port: 6121 > > taskmanager.rpc.port: 6122 > > > > restart-strategy: fixed-delay > > restart-strategy.fixed-delay.attempts: 2147483647 > > restart-strategy.fixed-delay.delay: 5 s > > > > taskmanager.memory.task.heap.size: 9728m > > taskmanager.memory.framework.off-heap.size: 512m > > taskmanager.memory.managed.size: 512m > > taskmanager.memory.jvm-metaspace.size: 256m > > taskmanager.memory.jvm-overhead.max: 3g > > taskmanager.memory.jvm-overhead.fraction: 0.035 > > taskmanager.memory.network.fraction: 0.03 > > taskmanager.memory.network.max: 3g > > taskmanager.numberOfTaskSlots: 1 > > > > taskmanager.jvm-exit-on-oom: true > > > > metrics.internal.query-service.port: 6125 > > metrics.reporters: prom > > metrics.reporter.prom.class: > org.apache.flink.metrics.prometheus.PrometheusReporter > > metrics.reporter.prom.port: 9999 > > > > web.log.path: /opt/flink/log/output.log > > taskmanager.log.path: /opt/flink/log/output.log > > > > task.cancellation.timeout: 2000 > > > > Any help will be appreciated! > > > > Thanks, > > Jerome > >