To the community,

Our Flink TaskManagers are frequently being swapped in and out by Karpenter
autoscaling. As a result, they repeatedly hit RemoteTransportException
(RTE), which leads to job failures.

We are looking for a configuration that prevents these RTEs and lets the
Flink JobManager retry and recover automatically instead of failing the
entire job. We have already tried various approaches without success.

Could you please provide guidance on the optimal configuration parameters
to address this issue? We are open to exploring all possible solutions.

Configurations
$internal.flink.version v1_16
akka.ask.timeout 601s
akka.framesize 41943040b
akka.lookup.timeout 30s
akka.ssl.enabled false
akka.tcp.timeout 610s
blob.server.port 6124
blob.service.ssl.enabled false
classloader.parent-first-patterns.additional org.apache.flink.statefun;org.apache.kafka
classloader.resolve-order parent-first
cluster.evenly-spread-out-slots true
cluster.health-check.checkpoint-progress.enabled true
execution.checkpointing.interval 60s
execution.target kubernetes-session
fs.s3a.aws.credentials.provider com.amazonaws.auth.WebIdentityTokenCredentialsProvider
fs.s3a.multipart.size 10M
heartbeat.rpc-failure-threshold 5
high-availability org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.cluster-id ABC
high-availability.jobmanager.port 6123
high-availability.storageDir ABC
internal.cluster.execution-mode NORMAL
jobmanager.execution.failover-strategy region
jobmanager.memory.heap.size 66437775360b
jobmanager.memory.jvm-metaspace.size 1073741824b
jobmanager.memory.jvm-overhead.max 1073741824b
jobmanager.memory.jvm-overhead.min 1073741824b
jobmanager.memory.off-heap.size 134217728b
jobmanager.memory.process.size 64g
jobmanager.rpc.address ABC
jobmanager.rpc.port 6123
jobstore.expiration-time 86400
kubernetes.cluster-id ABC
kubernetes.container.image ABC
kubernetes.container.image.pull-policy Always
kubernetes.flink.conf.dir /opt/flink/conf
kubernetes.internal.jobmanager.entrypoint.class org.apache.flink.kubernetes.entrypoint.KubernetesSessionClusterEntrypoint
kubernetes.internal.taskmanager.replicas 2
kubernetes.jobmanager.annotations flinkdeployment.flink.apache.org/generation:18
kubernetes.jobmanager.cpu 16.0
kubernetes.jobmanager.owner.reference ABC
kubernetes.jobmanager.replicas 2
kubernetes.namespace ABC
kubernetes.pod-template-file.jobmanager ABC
kubernetes.pod-template-file.taskmanager ABC
kubernetes.rest-service.exposed.type ClusterIP
kubernetes.service-account ABC
kubernetes.taskmanager.cpu 22.0
metrics.internal.query-service.port 6120
metrics.reporter.prom.factory.class org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port 9999
metrics.reporter.prom.whitelist _Custom_Source.0
metrics.reporters prom
parallelism.default 16
resourcemanager.taskmanager-registration.timeout 15min
resourcemanager.taskmanager-timeout 600000
rest.address 10.69.99.191
rest.connection-timeout 120000
rest.flamegraph.enabled true
restart-strategy exponential-delay
restart-strategy.exponential-delay.initial-backoff 60s
restart-strategy.exponential-delay.jitter-factor 0.1
restart-strategy.exponential-delay.max-backoff 2min
restart-strategy.exponential-delay.reset-backoff-threshold 10min
restart-strategy.fixed-delay.attempts 10
s3.entropy.key _entropy_
s3.entropy.length 4
security.ssl.internal.enabled false
security.ssl.rest.enabled false
security.ssl.verify-hostname false
slot.idle.timeout 300000
slot.request.timeout 600000
slotmanager.redundant-taskmanager-num 2
state.backend rocksdb
state.backend.fs.checkpointdir ABC
state.backend.incremental true
state.backend.local-recovery false
state.backend.rocksdb.block.cache-size 4mb
state.backend.rocksdb.localdir /data/flink/rocksdb
state.backend.rocksdb.memory.managed true
state.backend.rocksdb.memory.write-buffer-ratio 0.70
state.backend.rocksdb.timer-service.factory ROCKSDB
state.backend.rocksdb.writebuffer.size 16mb
state.checkpoints.dir ABC
state.savepoints.dir ABC
taskmanager.data.port 6121
taskmanager.data.ssl.enabled false
taskmanager.heartbeat.interval 60000
taskmanager.heartbeat.timeout 180000
taskmanager.memory.framework.heap.size 256mb
taskmanager.memory.framework.off-heap.size 256mb
taskmanager.memory.jvm-metaspace.size 1gb
taskmanager.memory.jvm-overhead.max 10gb
taskmanager.memory.jvm-overhead.min 10gb
taskmanager.memory.managed.size 40gb
taskmanager.memory.network.max 20gb
taskmanager.memory.network.min 20gb
taskmanager.memory.process.size 176g
taskmanager.memory.task.off-heap.size 1gb
taskmanager.network.memory.exclusive-buffers-request-timeout-ms 120000
taskmanager.network.request-backoff.max 120000
taskmanager.network.retries 10
taskmanager.numberOfTaskSlots 44
taskmanager.registration.timeout 30 min
taskmanager.rpc.port 6122
web.cancel.enable false
web.tmpdir ABC

version OpenJDK 64-Bit Server VM - Azul Systems, Inc. - 11/11.0.21.0.102+1-LTS
arch amd64
options -Dlog4j2.formatMsgNoLookups=true -Dsfdc.enableFips=false
--patch-module=jdk.jfr=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/patches/jdk.jfr.jar
--patch-module=java.base=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/patches/java.base.jar
--patch-module=java.desktop=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/patches/java.desktop.jar
--patch-module=java.xml=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/patches/java.xml.jar
--add-exports=java.base/sun.security.internal.spec=org.bouncycastle.fips.core
--add-exports=java.base/sun.security.provider=org.bouncycastle.fips.core
-XX:CompileCommand=quiet
-XX:CompileCommand=exclude,sfdc/security/sfdcrng/RdRandSecureRandomSpi.generatePRF
-Dsfdc.crl.check=false
--module-path=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/lib/sfdc/modules
-Djava.security.properties=/opt/tools/Linux/jdk/openjdk_11.0.21.0.102_11.69.52_x64/conf/security/sfdc.java.security
-Xmx66437775360 -Xms66437775360 -XX:MaxMetaspaceSize=1073741824
-Dlog.file=ABC
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties
-Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
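
To make the restart settings listed above concrete, here is a minimal
sketch of the nominal delay sequence we would expect from the
exponential-delay strategy. The 60s initial backoff and the 2min cap come
from the configuration above. The multiplier of 2.0 is an assumption: we
have not set restart-strategy.exponential-delay.backoff-multiplier and
believe 2.0 is the default. Jitter (0.1 in our config) is ignored here,
and backoff_sequence is a made-up helper for illustration, not a Flink
API.

```python
def backoff_sequence(initial_backoff_s, max_backoff_s, multiplier, restarts):
    """Return the nominal (jitter-free) delay before each restart, in seconds."""
    delays = []
    delay = initial_backoff_s
    for _ in range(restarts):
        # Each delay is capped at the configured maximum backoff.
        delays.append(min(delay, max_backoff_s))
        delay = min(delay * multiplier, max_backoff_s)
    return delays


# initial-backoff 60s and max-backoff 2min are taken from our configuration;
# multiplier=2.0 is an ASSUMPTION (not set explicitly in the dump above).
delays = backoff_sequence(60.0, 120.0, 2.0, 5)
print(delays)
```

Under these assumptions the delay reaches the 2min cap on the second
restart and stays there. Since restart-strategy is set to
exponential-delay, we also assume the restart-strategy.fixed-delay.attempts
entry has no effect.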

Thank you for your time and expertise.

Amrit Sarkar
Engineer | Search and Kubernetes
https://seamadic.com/
Twitter https://twitter.com/sarkaramrit2
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2
