Hey Jacqlyn,

According to the stack trace, the problem occurs when a checkpoint is triggered. Did this problem appear after the restore? Could you share some logs related to the restore?

Best,
Yanfei

Jacqlyn Bender via user <user@flink.apache.org> wrote on Fri, Sep 8, 2023 at 05:11:
>
> Hey folks,
>
> We experienced a pipeline failure where our job manager restarted and we were
> for some reason unable to restore from our last successful checkpoint. We had
> regularly completed checkpoints every 10 minutes up to this failure and 0
> failed checkpoints logged. Using Flink version 1.17.1.
>
> Wondering if anyone can shed light on what might have happened?
>
> Here's the error from our logs:
>
> Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught exception. Stopping the process...
>
> extendedStackTrace: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669) ~[a-pipeline-name.jar:1.0]
>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) ~[?:?]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
>     at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) [?:?]
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?]
>     ... 7 more
> Caused by: java.lang.NullPointerException
>     at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399) ~[a-pipeline-name.jar:1.0]
>     at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
>     at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) ~[?:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947) ~[a-pipeline-name.jar:1.0]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923) ~[a-pipeline-name.jar:1.0]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655) ~[a-pipeline-name.jar:1.0]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?]
>     ... 7 more