Hey Jacqlyn,
According to the stack trace, the failure occurs while a checkpoint is
being triggered. Did this happen after the job was restored? If so,
could you share some logs related to the restore?
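
For reference, the nested shape of the trace (a CompletionException
wrapping a CompletionException wrapping a NullPointerException) is what
plain CompletableFuture chaining produces when a failure-handling callback
itself throws. Below is a minimal standalone sketch of that behaviour. It
is not Flink code: the class name, MISSING_STATE, and the "trigger failed"
message are hypothetical stand-ins, and it assumes Java 9+ for
failedFuture.

```java
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class NestedCompletionExceptionDemo {

    // Hypothetical stand-in for state that was never initialised.
    static final Object MISSING_STATE = null;

    static Throwable run() {
        CompletableFuture<Void> result =
                CompletableFuture.<String>failedFuture(new IllegalStateException("trigger failed"))
                        // First callback: reacts to the failure, but hits a null itself.
                        .handle((value, failure) -> {
                            if (failure != null) {
                                Objects.requireNonNull(MISSING_STATE, "state was never set");
                            }
                            return (Void) null;
                        })
                        // Second callback: rethrows whatever the previous stage failed with.
                        .exceptionally(error -> {
                            throw new CompletionException(error);
                        });
        try {
            result.join();
            return null;
        } catch (CompletionException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // Walk the cause chain and print each exception type.
        for (Throwable t = run(); t != null; t = t.getCause()) {
            System.out.println(t.getClass().getName());
        }
    }
}
```

The point of the sketch is only that the NullPointerException at the bottom
of the chain is the real failure; the two CompletionException layers are
wrappers added by the future stages around it.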

Best,
Yanfei

Jacqlyn Bender via user <user@flink.apache.org> wrote on Fri, Sep 8, 2023 at 05:11:
>
> Hey folks,
>
>
> We experienced a pipeline failure where our job manager restarted and we were 
> for some reason unable to restore from our last successful checkpoint. We had 
> regularly completed checkpoints every 10 minutes up to this failure and 0 
> failed checkpoints logged. Using Flink version 1.17.1.
>
>
> Wondering if anyone can shed light on what might have happened?
>
>
> Here's the error from our logs:
>
>
> Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught exception. 
> Stopping the process...
>
>
> extendedStackTrace: java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: java.lang.NullPointerException
>
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669)
>  ~[a-pipeline-name.jar:1.0]
>
> at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>  ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>  ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>  [?:?]
>
> at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) 
> [?:?]
>
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
>  [?:?]
>
> at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>  [?:?]
>
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 
> [?:?]
>
> at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>
> at java.lang.Thread.run(Thread.java:829) [?:?]
>
> Caused by: java.util.concurrent.CompletionException: 
> java.lang.NullPointerException
>
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>  ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>  ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) 
> ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>  ~[?:?]
>
> ... 7 more
>
> Caused by: java.lang.NullPointerException
>
> at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399)
>  ~[a-pipeline-name.jar:1.0]
>
> at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
>
> at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) 
> ~[?:?]
>
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947)
>  ~[a-pipeline-name.jar:1.0]
>
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923)
>  ~[a-pipeline-name.jar:1.0]
>
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655)
>  ~[a-pipeline-name.jar:1.0]
>
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) 
> ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>  ~[?:?]
>
> ... 7 more
>
>
