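As the title says, I often run into this in production: a task fails, the cancel process then brings down the TM, and as a result all the other tasks fail as well.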

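What makes this hard is that I cannot tell which came first: whether an external factor such as the network caused the TM to fail, which in turn failed the task, or whether the task failed first for some reason and the restart process then caused the TM to fail.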

At the moment I am going through the TM logs on each machine, and they do not all look the same.
On some TMs the first abnormal log entry is: Attempting to cancel task, ...., Triggering cancellation of task code...
On other TMs the first abnormal log entry is: xxxx (40/60)#0 (5e91a8139f7858005f4c06bb1b6e9ca6) switched from RUNNING to FAILED with failure cause: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Error at remote task manager '10.xx.94.150/10.35.94.150:136'.
One more looks different again, as follows:
2021-12-16 15:26:33,189 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [] - State change: SUSPENDED
2021-12-16 15:26:33,190 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2021-12-16 15:26:33,190 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 5ae97a7c2319a277520f8dc92d311347 with leader id a4faf21926590158b40b372347a746a9 lost leadership.
2021-12-16 15:26:33,191 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2021-12-16 15:26:33,191 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close JobManager connection for job 5ae97a7c2319a277520f8dc92d311347.
2021-12-16 15:26:33,191 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to fail task externally ip_gap_g4_SidIncludeFilter(39/40)#0 (619938d9cfa52431265bebc836bceafb).

Given the logs from these three TMs, can I conclude that the third one is the root cause?
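
One way to compare the TMs more systematically is to find the first suspicious line in each TM log and sort those lines by timestamp. The following is only a rough sketch, not an established diagnostic procedure. It assumes each log entry sits on one line and starts with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp as in the excerpt above, and that the clocks on the machines are reasonably in sync. The marker strings are copied from the log excerpts above. The function names `first_anomaly` and `rank_by_first_anomaly` are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp prefix as seen in the excerpt above, e.g. "2021-12-16 15:26:33,189".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

# Marker substrings copied from the TM log excerpts; extend the tuple as needed.
MARKERS = (
    "State change: SUSPENDED",
    "Connection to ZooKeeper suspended",
    "lost leadership",
    "Attempting to cancel task",
    "Attempting to fail task externally",
    "switched from RUNNING to FAILED",
)


def first_anomaly(lines):
    """Return (timestamp, marker) for the first line containing a marker, else None."""
    for line in lines:
        match = TS_RE.match(line)
        if match is None:
            continue  # continuation line or stack trace without a timestamp
        for marker in MARKERS:
            if marker in line:
                ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
                return ts, marker
    return None


def rank_by_first_anomaly(logs):
    """Map of TM name -> log lines; returns (timestamp, name, marker) sorted by time."""
    hits = []
    for name, lines in logs.items():
        found = first_anomaly(lines)
        if found is not None:
            hits.append((found[0], name, found[1]))
    return sorted(hits)
```

To use it, read each TM log file into a list of lines, put the lists in a dict keyed by host name, and pass the dict to `rank_by_first_anomaly`. The TM at the top of the result logged the earliest anomaly. That makes it a candidate for the root cause, but it does not prove it, since the JobManager log and the ZooKeeper side may show something earlier still.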

Reply