What's the rationale behind this? I can't make this change directly myself; I'd have to file a request for it.

东东 <dongdongking...@163.com> wrote on Thu, Jun 17, 2021 at 1:36 PM:
>
> Set one of them to 0.
>
> At 2021-06-17 13:11:01, "yidan zhao" <hinobl...@gmail.com> wrote:
> >Yes, it is the host machine's IP.
> >
> >net.ipv4.tcp_tw_reuse = 1
> >net.ipv4.tcp_timestamps = 1
> >
> >东东 <dongdongking...@163.com> wrote on Thu, Jun 17, 2021 at 12:52 PM:
> >>
> >> Is 10.35.215.18 the host machine's IP?
> >>
> >> Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps.
> >> If all else fails, use tcpdump.
> >>
> >> At 2021-06-17 12:41:58, "yidan zhao" <hinobl...@gmail.com> wrote:
> >> >@东东 It is a standalone cluster. The exceptions occur at random times,
> >> >one every so often, with no fixed pattern. There seems to be some
> >> >correlation with CPU, memory, and network, but I can't confirm it
> >> >because it is not very pronounced.
> >> >I investigated several of the exceptions and their timing matched
> >> >network spikes, but not all of them match, so it is hard to say
> >> >whether that is the cause.
> >> >
> >> >Also, there is one point I am not clear on. This error is rarely
> >> >reported online; similar reports are all RemoteTransportException,
> >> >with a message saying the TaskManager may have been lost or the like.
> >> >Mine is LocalTransportException, and I don't know whether these two
> >> >errors mean different things in netty. So far I can't find much
> >> >material online about either exception.
> >> >
> >> >东东 <dongdongking...@163.com> wrote on Thu, Jun 17, 2021 at 11:19 AM:
> >> >>
> >> >> Single-machine standalone, or Docker/K8s?
> >> >>
> >> >> As for when this exception occurs: is it periodic, or is it
> >> >> correlated with changes in CPU, memory, or even network traffic?
> >> >>
> >> >> At 2021-06-16 19:10:24, "yidan zhao" <hinobl...@gmail.com> wrote:
> >> >> >Hi, yingjie.
> >> >> >If the network is not stable, which config parameter should I adjust?
> >> >> >
> >> >> >yidan zhao <hinobl...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
> >> >> >>
> >> >> >> 2: I use G1, and no full GC occurred. Young GC count: 422, time:
> >> >> >> 142892, so it is not bad.
> >> >> >> 3: It is a streaming job.
> >> >> >> 4: I will try configuring taskmanager.network.retries, which
> >> >> >> defaults to 0, and
> >> >> >> taskmanager.network.netty.client.connectTimeoutSec, which
> >> >> >> defaults to 120s.
> >> >> >> 5: I checked the number of network fds of the taskmanager. It is
> >> >> >> about 1000+, so I think it is a reasonable value.
> >> >> >>
> >> >> >> 1: I cannot be sure.
> >> >> >>
> >> >> >> Yingjie Cao <kevin.ying...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> >> >> >> >
> >> >> >> > Hi yidan,
> >> >> >> >
> >> >> >> > 1. Is the network stable?
> >> >> >> > 2. Is there any GC problem?
> >> >> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for
> >> >> >> > more information.
> >> >> >> > 4. You may try to config these two options:
> >> >> >> > taskmanager.network.retries,
> >> >> >> > taskmanager.network.netty.client.connectTimeoutSec. More relevant
> >> >> >> > options can be found in the 'Data Transport Network Stack'
> >> >> >> > section of [2].
> >> >> >> > 5. If it is none of the above cases, it may be related to [3];
> >> >> >> > you may need to check the number of TCP connections per TM and
> >> >> >> > per node.
> >> >> >> >
> >> >> >> > Hope this helps.
> >> >> >> >
> >> >> >> > [1]
> >> >> >> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> >> >> >> > [2]
> >> >> >> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> >> >> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
> >> >> >> >
> >> >> >> > Best,
> >> >> >> > Yingjie
> >> >> >> >
> >> >> >> > yidan zhao <hinobl...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> >> >> >> >>
> >> >> >> >> Attached is the exception stack from Flink's web UI. Has anyone
> >> >> >> >> else run into this problem?
> >> >> >> >>
> >> >> >> >> Flink 1.12 - Flink 1.13.1. Standalone cluster with 30
> >> >> >> >> containers, each with 28G of memory.