Web UI screenshot: https://s3.bmp.ovh/imgs/2022/05/20/dd142de9be3a2c99.png
Network view: https://i.bmp.ovh/imgs/2022/05/20/f3c741b28bd208d4.png

Exception log from JM1 (the REST server leader):
WARN  2022-05-20 12:02:12,523 org.apache.flink.runtime.checkpoint.CheckpointsCleaner       - Could not properly discard completed checkpoint 22259.
java.io.IOException: Directory bos://flink-bucket/flink/default-checkpoints/bal_baiduid_ft_job/b03390c8295713fbd79f57f57a1e3bdb/chk-22259 is not empty.
        at org.apache.hadoop.fs.bos.BaiduBosFileSystem.delete(BaiduBosFileSystem.java:209) ~[bos-hdfs-sdk-1.0.1-SNAPSHOT-0.jar:?]
        at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.delete(HadoopFileSystem.java:160) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
        at org.apache.flink.runtime.state.filesystem.FsCompletedCheckpointStorageLocation.disposeStorageLocation(FsCompletedCheckpointStorageLocation.java:74) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
        at org.apache.flink.runtime.checkpoint.CompletedCheckpoint.discard(CompletedCheckpoint.java:263) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
        at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.lambda$cleanCheckpoint$0(CheckpointsCleaner.java:60) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
        at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.lambda$cleanup$2(CheckpointsCleaner.java:85) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_251]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_251]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_251]
INFO  2022-05-20 12:03:22,441 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    - Triggering checkpoint 21979 (type=CHECKPOINT) @ 1653019401517 for job 07950b109ab5c3a0ed8576673ab562f7.
INFO  2022-05-20 12:03:31,061 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    - Completed checkpoint 21979 for job 07950b109ab5c3a0ed8576673ab562f7 (1785911977 bytes in 9066 ms).


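As shown above, my web UI is open, so requests are hitting the JM continuously, yet the log contains no related exceptions (and judging by the HTTP 200 response codes, the requests themselves do not look like failures either).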

Shengkai Fang <fskm...@gmail.com> wrote on Fri, May 20, 2022 at 10:50:
>
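> Hi, the images did not come through; you probably need to use an image hosting service.
>
> Also, could you paste the relevant exception logs?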
>
> Best,
> Shengkai
>
> yidan zhao <hinobl...@gmail.com> wrote on Fri, May 20, 2022 at 10:28:
>
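> > UI view: [image: 1.png].
> >
> > Network view:
> > [image: image.png]
> >
> >
> > Some additional details on the cluster deployment:
> > (1) Flink 1.13, standalone cluster, HA based on ZooKeeper. 3 JMs and a number of TMs.
> > (2) The JM REST API has SSL enabled and is proxied through nginx
> > (but this is most likely not a problem with the mechanism itself, since the issue does not occur 100% of the time; the other jobs on my cluster are all fine, and the issue only appears after a job has been running for a while).
> >          Guess: could this be related to a JM process dying after running for a while, the job being recovered, and the REST JM leader changing?
> >                     So far, some of the JM logs occasionally contain SSL-handshake errors, which is also rather odd.  Note: with the web
> > UI open, I watch the JM log and nothing is printed (I look up the leader via ZK and read the leader JM's log). I keep refreshing the web
> > UI, so in theory any failure should produce an error in the log, but there is none; the errors that do appear are unrelated to this and are all checkpoint-related.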
> >
