Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

yidan zhao Wed, 16 Jun 2021 23:08:14 -0700

What is the rationale behind this? I cannot make this change directly; I would have to request approval for it.
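
For readers following along: the kernel settings discussed further down the thread can be inspected without changing anything, which may help when a change needs approval. Below is a minimal Python sketch, not something posted in the thread. The function name read_ipv4_sysctls is hypothetical, and the sketch assumes a Linux host where /proc/sys/net/ipv4 is readable. tcp_tw_recycle was removed in Linux 4.12, so on newer kernels it is reported as not available.

```python
from pathlib import Path

# Settings mentioned in the thread. tcp_tw_recycle no longer exists on
# Linux 4.12 and later, so a missing file is expected on newer kernels.
SETTINGS = ("tcp_tw_reuse", "tcp_tw_recycle", "tcp_timestamps")


def read_ipv4_sysctls(names=SETTINGS, base="/proc/sys/net/ipv4"):
    """Return {name: value or None}; None means the file could not be read."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_ipv4_sysctls().items():
        shown = value if value is not None else "not available"
        print(f"net.ipv4.{name} = {shown}")
```

This only reads the values; changing them still requires root (for example via sysctl) and whatever approval process applies.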


东东 <[email protected]> wrote on Thu, Jun 17, 2021 at 1:36 PM:
>
>
>
> Set one of them to 0.
>
>
> At 2021-06-17 13:11:01, "yidan zhao" <[email protected]> wrote:
> >Yes, it is the host machine's IP.
> >
> >net.ipv4.tcp_tw_reuse = 1
> >net.ipv4.tcp_timestamps = 1
> >
> >东东 <[email protected]> wrote on Thu, Jun 17, 2021 at 12:52 PM:
> >>
> >> Is 10.35.215.18 the host machine's IP?
> >>
> >> Check what values tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
> >> If all else fails, use tcpdump.
> >>
> >>
> >>
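> >> At 2021-06-17 12:41:58, "yidan zhao" <[email protected]> wrote:
> >> >@东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern. There seems to be some correlation with CPU, memory, and network, but I cannot confirm it because it is not very pronounced.
> >> >I investigated several of the exceptions. Their timestamps matched network spikes, but not all of them did, so it is hard to say whether that is the cause.
> >> >
> >> >Also, there is one point I am not clear on. There are very few reports of this error online, and the similar ones are all
> >> >RemoteTransportException, with a message saying the taskmanager may have been lost, and so on. But mine is
> >> >LocalTransportException, and I am not sure whether these two errors mean different things in netty. So far I cannot find much information online about either exception.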
> >> >
> >> >东东 <[email protected]> wrote on Thu, Jun 17, 2021 at 11:19 AM:
> >> >>
> >> >> Single-machine standalone, or Docker/K8s?
> >> >>
> >> >>
> >> >>
> >> >> Is the timing of this exception periodic, or is it correlated with changes in CPU, memory, or even network traffic?
> >> >>
> >> >>
> >> >>
> >> >> At 2021-06-16 19:10:24, "yidan zhao" <[email protected]> wrote:
> >> >> >Hi, yingjie.
> >> >> >If the network is not stable, which config parameter should I adjust?
> >> >> >
> >> >> >yidan zhao <[email protected]> wrote on Wed, Jun 16, 2021 at 6:56 PM:
> >> >> >>
> >> >> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> >> >> >> 142892, so it is not bad.
> >> >> >> 3: stream job.
> >> >> >> 4: I will try configuring taskmanager.network.retries, whose default
> >> >> >> is 0, and taskmanager.network.netty.client.connectTimeoutSec, whose
> >> >> >> default is 120s.
> >> >> >> 5: I checked the number of network fds of the taskmanager; it is
> >> >> >> about 1000+, so I think it is a reasonable value.
> >> >> >>
> >> >> >> 1: I cannot be sure.
> >> >> >>
> >> >> >> Yingjie Cao <[email protected]> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> >> >> >> >
> >> >> >> > Hi yidan,
> >> >> >> >
> >> >> >> > 1. Is the network stable?
> >> >> >> > 2. Is there any GC problem?
> >> >> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for 
> >> >> >> > more information.
> >> >> >> > 4. You may try to configure these two options: 
> >> >> >> > taskmanager.network.retries, 
> >> >> >> > taskmanager.network.netty.client.connectTimeoutSec. More relevant 
> >> >> >> > options can be found in 'Data Transport Network Stack' section of 
> >> >> >> > [2].
> >> >> >> > 5. If it is none of the above cases, it may be related to [3]; you may 
> >> >> >> > need to check the number of TCP connections per TM and node.
> >> >> >> >
> >> >> >> > Hope this helps.
> >> >> >> >
> >> >> >> > [1] 
> >> >> >> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> >> >> >> > [2] 
> >> >> >> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> >> >> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
> >> >> >> >
> >> >> >> > Best,
> >> >> >> > Yingjie
> >> >> >> >
> >> >> >> > yidan zhao <[email protected]> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> >> >> >> >>
> >> >> >> >> Attached is the exception stack from Flink's web UI. Has anyone
> >> >> >> >> else run into this problem?
> >> >> >> >>
> >> >> >> >> Flink 1.12 - Flink 1.13.1. Standalone cluster with 30 containers,
> >> >> >> >> 28G of memory each.
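
Point 5 of Yingjie's reply suggests checking the number of TCP connections. The following is a minimal Python sketch of one way to count sockets by state on a Linux node, assuming /proc/net/tcp and /proc/net/tcp6 are readable. The helper names count_tcp_states and read_lines are hypothetical and do not come from the thread. It counts every socket in the current network namespace, not per TaskManager process, so treat the result as a node-level figure only.

```python
from collections import Counter

# Socket state codes as they appear (in hex) in /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines):
    """Count sockets per state from the lines of /proc/net/tcp (header included)."""
    counts = Counter()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:
            code = fields[3].upper()  # fourth column is the state
            counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_lines(path):
    """Return the file's lines, or an empty list if it cannot be read."""
    try:
        with open(path) as f:
            return f.readlines()
    except OSError:
        return []


if __name__ == "__main__":
    total = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        total += count_tcp_states(read_lines(path))
    for state, n in total.most_common():
        print(f"{state:12} {n}")
```

A large TIME_WAIT count relative to ESTABLISHED would be one thing to look at alongside the tcp_tw_reuse and tcp_timestamps settings discussed above.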
