Re: Flink fails to trigger savepoint
Hi,

I have run into this error before, in a case where the jobid was 0. In our scenario the job had ZooKeeper-based HA enabled, but the savepoint command was run from a Flink client that had no HA configured. It may be worth checking whether you are hitting the same problem.

Michael Ran wrote on Fri, Jul 23, 2021 at 10:43 AM:

> Could it be a file problem, such as write permissions?
>
> On 2021-07-13 17:31:19, "仙剑……情动人间" <1510603...@qq.com.INVALID> wrote:
> > Hi All,
> >
> > Triggering a Flink savepoint always fails for me with the error below.
> > It keeps reporting a timeout, but there is no further information to
> > look at. I read that the checkpoint timeout can be configured, so I set
> > it to 2 min, but triggering the savepoint fails before the 2 min are up.
> > Also, my state is not large.
> >
> > The program finished with the following exception:
> >
> > org.apache.flink.util.FlinkException: Triggering a savepoint for the job failed.
> >     at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
> >     at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:754)
> >     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
> >     at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
> >     at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
> >     at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:422)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> >     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> >     at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
> > Caused by: java.util.concurrent.TimeoutException
> >     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
> >     at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
> >     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
> >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >     at java.lang.Thread.run(Thread.java:748)
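
One way to test the HA-mismatch theory from the reply above is to compare the HA-related entries of the flink-conf.yaml used by the client with the one used by the cluster. The following is only a rough sketch: the key names are the Flink 1.x option names as I remember them and should be checked against the documentation for your version, and both helper functions are hypothetical, not part of Flink.

```python
# HA-related option names as used in Flink 1.x flink-conf.yaml.
# Assumption: verify these against the docs for your Flink version.
HA_KEYS = (
    "high-availability",
    "high-availability.zookeeper.quorum",
    "high-availability.zookeeper.path.root",
    "high-availability.cluster-id",
    "high-availability.storageDir",
)


def parse_flink_conf(text):
    """Parse the flat 'key: value' lines of a flink-conf.yaml into a dict."""
    conf = {}
    for line in text.splitlines():
        # Drop comments, then skip blank or malformed lines.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values such as
        # 'zk1:2181' or 'hdfs:///path' stay intact.
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def ha_mismatches(client_conf, cluster_conf):
    """Return {key: (client_value, cluster_value)} for HA keys that differ."""
    return {
        key: (client_conf.get(key), cluster_conf.get(key))
        for key in HA_KEYS
        if client_conf.get(key) != cluster_conf.get(key)
    }
```

Feeding it the text of both files and getting a non-empty result, in particular a missing `high-availability` entry on the client side, would match the scenario described in the reply.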
Re: Flink fails to trigger savepoint
Hi,

This looks like the client failing to trigger the savepoint, rather than the savepoint itself timing out end-to-end. I suggest checking the JobManager log to see whether, at the moment of triggering, it contains any entries about a savepoint being triggered. You can also check in the Flink web UI whether the corresponding savepoint shows up in the history on the checkpoint tab.

Best,
唐云

From: 仙剑……情动人间 <1510603...@qq.com.INVALID>
Sent: Tuesday, July 13, 2021 17:31
To: Flink mailing list
Subject: Flink fails to trigger savepoint

Hi All,

Triggering a Flink savepoint always fails for me with the error below. It keeps reporting a timeout, but there is no further information to look at. I read that the checkpoint timeout can be configured, so I set it to 2 min, but triggering the savepoint fails before the 2 min are up. Also, my state is not large.

The program finished with the following exception:

org.apache.flink.util.FlinkException: Triggering a savepoint for the job failed.
[...]
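
The web UI check suggested above can also be done against the JobManager's REST API, which serves the same checkpoint history. This is a minimal sketch, assuming the `/jobs/<jobid>/checkpoints` endpoint and the `history`, `is_savepoint`, and `status` response fields as I recall them from the Flink REST API; please verify them for your version. The function names and the `rest_url` parameter are hypothetical.

```python
import json
import urllib.request


def fetch_checkpoint_stats(rest_url, job_id):
    """Fetch checkpoint statistics for a job from the JobManager REST endpoint."""
    url = f"{rest_url}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def savepoints_in_history(stats):
    """Return (id, status) for each history entry flagged as a savepoint."""
    return [
        (entry.get("id"), entry.get("status"))
        for entry in stats.get("history", [])
        if entry.get("is_savepoint")
    ]
```

If `savepoints_in_history(fetch_checkpoint_stats(...))` comes back empty right after a failed trigger, that would be consistent with the request never reaching the JobManager. If an entry is there, its status is the next thing to look at.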
Re: Flink fails to trigger savepoint
Hi!

This error means that after the client sent the request to trigger a checkpoint, the job manager did not report the checkpoint result back in time. There can be many reasons for that, such as the checkpoint timing out or network communication problems. You can open the Flink web UI to see whether it shows more information, or look at the job manager and task manager logs.

仙剑……情动人间 <1510603...@qq.com.invalid> wrote on Tue, Jul 13, 2021 at 7:19 PM:

> Hi All,
>
> Triggering a Flink savepoint always fails for me with the error below.
> It keeps reporting a timeout, but there is no further information to
> look at. I read that the checkpoint timeout can be configured, so I set
> it to 2 min, but triggering the savepoint fails before the 2 min are up.
> Also, my state is not large.
>
> The program finished with the following exception:
>
> org.apache.flink.util.FlinkException: Triggering a savepoint for the job failed.
> [...]