Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Yingjie Cao Wed, 16 Jun 2021 05:01:51 -0700

Maybe you can try to
increase taskmanager.network.retries,
taskmanager.network.netty.server.backlog and
taskmanager.network.netty.sendReceiveBufferSize. These options are useful
for our jobs.


yidan zhao <[email protected]> 于2021年6月16日周三 下午7:10写道：

> Hi, yingjie.
> If the network is not stable, which config parameter I should adjust.
>
> yidan zhao <[email protected]> 于2021年6月16日周三 下午6:56写道：
> >
> > 2: I use G1, and no full gc occurred, young gc count: 422, time:
> > 142892, so it is not bad.
> > 3: stream job.
> > 4: I will try to config taskmanager.network.retries which is default
> > 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
> > is 120s。
> > 5: I checked the net fd number of the taskmanager, it is about 1000+,
> > so I think it is a reasonable value.
> >
> > 1: can not be sure.
> >
> > Yingjie Cao <[email protected]> 于2021年6月16日周三 下午4:34写道：
> > >
> > > Hi yidan,
> > >
> > > 1. Is the network stable?
> > > 2. Is there any GC problem?
> > > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more
> information.
> > > 4. You may try to config these two options:
> taskmanager.network.retries,
> taskmanager.network.netty.client.connectTimeoutSec. More relevant options
> can be found in 'Data Transport Network Stack' section of [2].
> > > 5. If it is not the above cases, it is may related to [3], you may
> need to check the number of tcp connection per TM and node.
> > >
> > > Hope this helps.
> > >
> > > [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> > > [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> > > [3] https://issues.apache.org/jira/browse/FLINK-22643
> > >
> > > Best,
> > > Yingjie
> > >
> > > yidan zhao <[email protected]> 于2021年6月16日周三 下午3:36写道：
> > >>
> > >> Attachment is the exception stack from flink's web-ui. Does anyone
> > >> have also met this problem?
> > >>
> > >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> > >> each 28G mem.
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

回复