Hi Rainie,

there are two probable causes:
* Network instability
* A TaskManager died. In that case, you can dig further into that
TaskManager's logs for errors right before that time.
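
For the second case, a small script can help narrow down the TaskManager log. The following is only a sketch: it assumes the TaskManager log uses the same line layout as the JobManager lines quoted below (timestamp, level, logger, message), and the function name `errors_before` and the 5-minute window are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the start of a log line such as
# "2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor - ..."
# (assumed layout; adjust if your log4j pattern differs).
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+(.*)$"
)


def errors_before(lines, incident, window_minutes=5, levels=("WARN", "ERROR")):
    """Return (timestamp, level, rest) for lines logged at one of `levels`
    within `window_minutes` before `incident` (inclusive)."""
    start = incident - timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            # Continuation lines (e.g. stack traces) are skipped in this sketch.
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if start <= ts <= incident and match.group(2) in levels:
            hits.append((ts, match.group(2), match.group(3)))
    return hits
```

You would pass it an open log file (or any iterable of lines) together with the time of the disconnect, e.g. `errors_before(open(path), datetime(2021, 2, 20, 13, 20, 14))`, where `path` points to the log of the affected TaskManager.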

In both cases, Flink should restart the job according to its restart
strategy, if one is configured.
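
As a rough sketch, a fixed-delay restart strategy can be set in flink-conf.yaml like this. The attempt count and delay are placeholder values I picked for illustration, not a recommendation for your job:

```yaml
# Restart the job a limited number of times, waiting between attempts.
restart-strategy: fixed-delay
# Placeholder values; tune them for your workload.
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

If the job still fails after the configured number of attempts, it ends up in a failed state, so it is worth checking how the job actually terminated.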

On Sat, Feb 20, 2021 at 10:07 PM Rainie Li <raini...@pinterest.com> wrote:

> Hello,
>
> I launched a job with a larger load on a Hadoop YARN cluster.
> The job finished after running for 5 hours. I didn't find any errors in the
> JobManager log besides this connection exception.
>
> 2021-02-20 13:20:14,110 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [/10.1.57.146:48368] failed with java.io.IOException: Connection reset by peer
> 2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@host:35241] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host:39493] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@host:38481] has failed, address is now gated for [50] ms. Reason: [Disassociated]
>
> Any idea what caused the job to finish and how to resolve it?
> Any suggestions are appreciated.
>
> Thanks
> Best regards
> Rainie
>
