Hi, is there any error message on the TM? If so, please paste it so we can see what caused the checkpoint (cp) to fail.
--
Best!
Xuyang

At 2022-08-23 20:41:59, "yidan zhao" <hinobl...@gmail.com> wrote:
>Some additional information:
>Looking at the logs, when the savepoint is triggered with flink savepoint xxx, the JM log is simple:
>2022-08-23 20:33:22,307 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering savepoint for job 8d231de75b8227a1b715b1aa665caa91.
>
>2022-08-23 20:33:22,318 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 5 (type=SavepointType{name='Savepoint', postCheckpointAction=NONE, formatType=CANONICAL}) @ 1661258002307 for job 8d231de75b8227a1b715b1aa665caa91.
>
>2022-08-23 20:33:23,701 INFO org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream [] - Cannot create recoverable writer due to Recoverable writers on Hadoop are only supported for HDFS, will use the ordinary writer.
>
>2022-08-23 20:33:23,908 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 5 for job 8d231de75b8227a1b715b1aa665caa91 (1638207 bytes, checkpointDuration=1600 ms, finalizationTime=1 ms).
>
>
>When the job is stopped with stop xxx, the JM log (with the error) is as follows:
>
>2022-08-23 20:35:01,834 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering stop-with-savepoint for job 8d231de75b8227a1b715b1aa665caa91.
>
>2022-08-23 20:35:01,842 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 6 (type=SavepointType{name='Suspend Savepoint', postCheckpointAction=SUSPEND, formatType=CANONICAL}) @ 1661258101834 for job 8d231de75b8227a1b715b1aa665caa91.
>
>2022-08-23 20:35:02,083 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Decline checkpoint 6 by task a65383dad01bc15f654c4afe4aa63b6d of job 8d231de75b8227a1b715b1aa665caa91 at 10.35.95.150:13151-3dfdc5 @ xxx.xxx.com (dataPort=13156).
>(It looks like the checkpoint was declined here. Is the reason that the task failed?)
>org.apache.flink.util.SerializedThrowable: Task name with subtask : Source: XXX_Kafka(startTs:latest) ->... ->... ->... (10/10)#2 Failure reason: Task has failed.
>    at org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1388) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1331) ~[flink-dist-1.15.1.jar:1.15.1]
>    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_251]
>    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_251]
>    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_251]
>    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_251]
>    at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:343) ~[flink-dist-1.15.1.jar:1.15.1]
>Caused by: org.apache.flink.util.SerializedThrowable: org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException
>    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_251]
>    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_251]
>    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) ~[?:1.8.0_251]
>    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) ~[?:1.8.0_251]
>    ... 3 more
>Caused by: org.apache.flink.util.SerializedThrowable
>    at org.apache.flink.streaming.connectors.kafka.internals.Handover.close(Handover.java:177) ~[?:?]
>    at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.cancel(KafkaFetcher.java:164) ~[?:?]
>    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.cancel(FlinkKafkaConsumerBase.java:1002) ~[?:?]
>    at org.apache.flink.streaming.api.operators.StreamSource.stop(StreamSource.java:128) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.stopOperatorForStopWithSavepoint(SourceStreamTask.java:305) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.lambda$triggerStopWithSavepointAsync$1(SourceStreamTask.java:285) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) ~[flink-dist-1.15.1.jar:1.15.1]
>    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) ~[flink-dist-1.15.1.jar:1.15.1]
>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_251]
>
>yidan zhao <hinobl...@gmail.com> wrote on Tue, Aug 23, 2022 at 20:31:
>>
>> As the subject says, stop (stop with savepoint) fails.
>> In my tests, both cancel and cancel -s succeed. cancel -s successfully creates the savepoint and exits.
>>
>> stop does not work. The main error is:
>> Could not stop with a savepoint job "1b87f308e2582f3cc0e3ccc812471201"
>> ...
>> Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Task has failed.
>> ...
>> Caused by: org.apache.flink.util.SerializedThrowable: org.apache.flink.runtime.checkpoint.CheckpointException: Task has failed.
>> ...
>> Caused by: org.apache.flink.util.SerializedThrowable: Task has failed.
>> ...
>>
>> ______Detailed logs: