Simonas created FLINK-28695:
-------------------------------

             Summary: Fail to send partition request to restarted taskmanager
                 Key: FLINK-28695
                 URL: https://issues.apache.org/jira/browse/FLINK-28695
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes, Runtime / Network
    Affects Versions: 1.15.1, 1.15.0
            Reporter: Simonas
         Attachments: deployment.txt, job_log.txt, jobmanager_config.txt, 
jobmanager_logs.txt, pod_restart.txt, taskmanager_config.txt

After upgrading to *1.15.1* we started getting the following error while deploying a JOB:

 
{code:java}
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed.
    at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
    ...
{code}
{code:java}
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
{code}
After investigation we managed to narrow it down to the exact behavior under which this issue happens (a generic sketch of the suspected mechanism follows the steps below):
 # Deploying a JOB on a fresh Kubernetes session cluster with multiple TaskManagers (TM1 and TM2) is successful. The job has multiple partitions running on both TM1 and TM2.
 # One TaskManager, TM2 (XXX.XXX.XX.32), fails for an unrelated reason, for example an OOM exception.
 # The Kubernetes POD with the mentioned TaskManager TM2 is restarted. The POD retains the same IP address as before.
 # The JobManager is able to pick up the restarted TM2 (XXX.XXX.XX.32).
 # The JOB is restarted because it was running on the failed TaskManager TM2.
 # The TM1 data channel to TM2 is closed and we get LocalTransportException: Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed while the JOB is running.
 # When we explicitly delete the POD with TM2, a new POD with a different IP address is created and the JOB is able to start again.
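For illustration only (plain Java sockets, not Flink internals; the class and method names below are hypothetical), the sketch shows the kind of connection caching that seems to be involved: when a cached channel points at a peer that was restarted behind the same IP and port, a write on it fails, and the client has to drop the stale entry and reconnect instead of reusing it.
{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch (not Flink code): a client that caches TCP connections
 * keyed by remote address. If the peer process is restarted behind the same
 * IP:port, the cached socket is stale; a write on it fails, and the cache
 * must drop the dead entry and reconnect rather than keep reusing it.
 */
public class ReconnectingClientSketch {

    private final Map<InetSocketAddress, Socket> connections = new ConcurrentHashMap<>();

    /** Sends a payload, retrying once with a fresh connection if the cached one is dead. */
    public void send(InetSocketAddress remote, byte[] payload) throws IOException {
        try {
            write(getOrConnect(remote), payload);
        } catch (IOException firstAttempt) {
            // The cached channel points at a peer that no longer exists (e.g. a
            // restarted TaskManager pod that kept its old IP). Drop it and retry once.
            closeQuietly(connections.remove(remote));
            write(getOrConnect(remote), payload);
        }
    }

    private Socket getOrConnect(InetSocketAddress remote) throws IOException {
        Socket cached = connections.get(remote);
        if (cached != null && cached.isConnected() && !cached.isClosed()) {
            return cached;
        }
        Socket fresh = new Socket();
        fresh.connect(remote, 5_000);
        connections.put(remote, fresh);
        return fresh;
    }

    private static void write(Socket socket, byte[] payload) throws IOException {
        OutputStream out = socket.getOutputStream();
        out.write(payload);
        out.flush();
    }

    private static void closeQuietly(Socket socket) {
        if (socket == null) {
            return;
        }
        try {
            socket.close();
        } catch (IOException ignored) {
            // best effort cleanup of the stale connection
        }
    }
}
{code}
This is only meant to make the symptom above concrete; the actual Flink code path goes through NettyPartitionRequestClient, as shown in the stack trace.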

It is important to note that we did not encounter this issue with the previous *1.14.4* version; TaskManager restarts did not cause such an error there.

Please see the attached Kubernetes deployments and reduced logs from the JobManager. The TaskManager logs did show errors before the failure, but do not show anything significant after the restart.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
