To the community,

Our Flink task managers are frequently being swapped in and out because of Karpenter autoscaling. As a result, task managers repeatedly hit RemoteTransportException (RTE), which leads to job failures.

We are looking for a configuration that prevents these RTEs and lets the Flink Job Manager retry and recover automatically, rather than failing the entire job. We have already tried several approaches without success. Could you please advise on which configuration parameters to tune to address this issue? We are open to exploring all possible solutions.

Configurations

$internal.flink.version: v1_16
akka.ask.timeout: 601s
akka.framesize: 41943040b
akka.lookup.timeout: 30s
akka.ssl.enabled: false
akka.tcp.timeout: 610s
blob.server.port: 6124
blob.service.ssl.enabled: false
classloader.parent-first-patterns.additional: org.apache.flink.statefun;org.apache.kafka
classloader.resolve-order: parent-first
cluster.evenly-spread-out-slots: true
cluster.health-check.checkpoint-progress.enabled: true
execution.checkpointing.interval: 60s
execution.target: kubernetes-session
fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
fs.s3a.multipart.size: 10M
heartbeat.rpc-failure-threshold: 5
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.cluster-id: ABC
high-availability.jobmanager.port: 6123
high-availability.storageDir: ABC
internal.cluster.execution-mode: NORMAL
jobmanager.execution.failover-strategy: region
jobmanager.memory.heap.size: 66437775360b
jobmanager.memory.jvm-metaspace.size: 1073741824b
jobmanager.memory.jvm-overhead.max: 1073741824b
jobmanager.memory.jvm-overhead.min: 1073741824b
jobmanager.memory.off-heap.size: 134217728b
jobmanager.memory.process.size: 64g
jobmanager.rpc.address: ABC
jobmanager.rpc.port: 6123
jobstore.expiration-time: 86400
kubernetes.cluster-id: ABC
kubernetes.container.image: ABC
kubernetes.container.image.pull-policy: Always
kubernetes.flink.conf.dir: /opt/flink/conf
kubernetes.internal.jobmanager.entrypoint.class: org.apache.flink.kubernetes.entrypoint.KubernetesSessionClusterEntrypoint
kubernetes.internal.taskmanager.replicas: 2
kubernetes.jobmanager.annotations: flinkdeployment.flink.apache.org/generation:18
kubernetes.jobmanager.cpu: 16.0
kubernetes.jobmanager.owner.reference: ABC
kubernetes.jobmanager.replicas: 2
kubernetes.namespace: ABC
kubernetes.pod-template-file.jobmanager: ABC
kubernetes.pod-template-file.taskmanager: ABC
kubernetes.rest-service.exposed.type: ClusterIP
kubernetes.service-account: ABC
kubernetes.taskmanager.cpu: 22.0
metrics.internal.query-service.port: 6120
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9999
metrics.reporter.prom.whitelist: _Custom_Source.0
metrics.reporters: prom
parallelism.default: 16
resourcemanager.taskmanager-registration.timeout: 15min
resourcemanager.taskmanager-timeout: 600000
rest.address: 10.69.99.191
rest.connection-timeout: 120000
rest.flamegraph.enabled: true
restart-strategy: exponential-delay
restart-strategy.exponential-delay.initial-backoff: 60s
restart-strategy.exponential-delay.jitter-factor: 0.1
restart-strategy.exponential-delay.max-backoff: 2min
restart-strategy.exponential-delay.reset-backoff-threshold: 10min
restart-strategy.fixed-delay.attempts: 10
s3.entropy.key: _entropy_
s3.entropy.length: 4
security.ssl.internal.enabled: false
security.ssl.rest.enabled: false
security.ssl.verify-hostname: false
slot.idle.timeout: 300000
slot.request.timeout: 600000
slotmanager.redundant-taskmanager-num: 2
state.backend: rocksdb
state.backend.fs.checkpointdir: ABC
state.backend.incremental: true
state.backend.local-recovery: false
state.backend.rocksdb.block.cache-size: 4mb
state.backend.rocksdb.localdir: /data/flink/rocksdb
state.backend.rocksdb.memory.managed: true
state.backend.rocksdb.memory.write-buffer-ratio: 0.70
state.backend.rocksdb.timer-service.factory: ROCKSDB
state.backend.rocksdb.writebuffer.size: 16mb
state.checkpoints.dir: ABC
state.savepoints.dir: ABC
taskmanager.data.port: 6121
taskmanager.data.ssl.enabled: false
taskmanager.heartbeat.interval: 60000
taskmanager.heartbeat.timeout: 180000
taskmanager.memory.framework.heap.size: 256mb
taskmanager.memory.framework.off-heap.size: 256mb
taskmanager.memory.jvm-metaspace.size: 1gb
taskmanager.memory.jvm-overhead.max: 10gb
taskmanager.memory.jvm-overhead.min: 10gb
taskmanager.memory.managed.size: 40gb
taskmanager.memory.network.max: 20gb
taskmanager.memory.network.min: 20gb
taskmanager.memory.process.size: 176g
taskmanager.memory.task.off-heap.size: 1gb
taskmanager.network.memory.exclusive-buffers-request-timeout-ms: 120000
taskmanager.network.request-backoff.max: 120000
taskmanager.network.retries: 10
taskmanager.numberOfTaskSlots: 44
taskmanager.registration.timeout: 30 min
taskmanager.rpc.port: 6122
web.cancel.enable: false
web.tmpdir: ABC

version: OpenJDK 64-Bit Server VM - Azul Systems, Inc. - 11/11.0.21.0.102+1-LTS
arch: amd64
options:
-Dlog4j2.formatMsgNoLookups=true
-Dsfdc.enableFips=false
--patch-module=jdk.jfr=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/patches/jdk.jfr.jar
--patch-module=java.base=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/patches/java.base.jar
--patch-module=java.desktop=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/patches/java.desktop.jar
--patch-module=java.xml=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/patches/java.xml.jar
--add-exports=java.base/sun.security.internal.spec=org.bouncycastle.fips.core
--add-exports=java.base/sun.security.provider=org.bouncycastle.fips.core
-XX:CompileCommand=quiet
-XX:CompileCommand=exclude,sfdc/security/sfdcrng/RdRandSecureRandomSpi.generatePRF
-Dsfdc.crl.check=false
--module-path=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/modules
-Djava.security.properties=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/conf/security/sfdc.java.security
-Xmx66437775360
-Xms66437775360
-XX:MaxMetaspaceSize=1073741824
-Dlog.file=ABC
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties
-Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml

Thank you for your time and expertise.

Amrit Sarkar
Engineer | Search and Kubernetes
https://seamadic.com/
Twitter: https://twitter.com/sarkaramrit2
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2
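
For reference, the exponential-delay restart settings listed in the configuration (initial-backoff 60s, max-backoff 2min, jitter-factor 0.1) imply roughly the retry pacing sketched below. This is a minimal Python sketch of the arithmetic only, not Flink's implementation. The backoff multiplier of 2.0 is an assumption, since restart-strategy.exponential-delay.backoff-multiplier is not set in the configuration, and the function names backoff_schedule and jitter_bounds are hypothetical helpers made up for this illustration.

```python
def backoff_schedule(initial_s, max_s, multiplier, attempts):
    """Nominal delay in seconds (before jitter) for each consecutive restart.

    Simplified model: the delay starts at initial_s, is multiplied by
    `multiplier` after every restart, and is capped at max_s.
    """
    delays = []
    current = initial_s
    for _ in range(attempts):
        delays.append(min(current, max_s))
        current = min(current * multiplier, max_s)
    return delays


def jitter_bounds(delay_s, jitter_factor):
    """Range a delay may fall in if jitter is +/- jitter_factor of the delay."""
    return (delay_s * (1 - jitter_factor), delay_s * (1 + jitter_factor))


if __name__ == "__main__":
    # 60s and 120s come from the configuration; multiplier 2.0 is an ASSUMPTION.
    schedule = backoff_schedule(initial_s=60.0, max_s=120.0, multiplier=2.0, attempts=5)
    for attempt, delay in enumerate(schedule, start=1):
        low, high = jitter_bounds(delay, jitter_factor=0.1)
        print(f"restart {attempt}: ~{delay:.0f}s (between {low:.0f}s and {high:.0f}s)")
```

Under these assumptions the delay reaches the 2min cap on the second consecutive restart and stays there until the job has run for the 10min reset-backoff-threshold without failing. The exponential-delay strategy has no attempt limit of its own, so restart-strategy.fixed-delay.attempts presumably has no effect while restart-strategy is set to exponential-delay.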