Re: Failure to restore from last completed checkpoint
Hello, Just a shot in the dark here, but could it be related to https://issues.apache.org/jira/browse/FLINK-32241 ? Such failures can cause many exceptions, but I think the ones you've included aren't pointing to the root cause, so I'm not sure if that issue applies to you. Regards, Alexis. On Fri, 8 Sept 2023, 20:43 Jacqlyn Bender via user, wrote: > Hi Yanfei, > > We were never able to restore from a checkpoint, we ended up restoring > from a savepoint as fallback. Would those logs suggest we failed to take a > checkpoint before the job manager restarted? Our observabillity monitors > showed no failed checkpoints. > > Here is an exception that occurred before the failure to restore from the > checkpoint: > > java.io.IOException: Cannot register Closeable, registry is already > closed. Closing argument. > > at > org.apache.flink.util.AbstractAutoCloseableRegistry.registerCloseable(AbstractAutoCloseableRegistry.java:89) > ~[a-pipeline-name.jar:1.0] > > at > org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:128) > ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8] > > at > org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110) > ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8] > > at > org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49) > ~[a-pipeline-name.jar:1.0] > > at > java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) > ~[?:?] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > > at java.lang.Thread.run(Thread.java:829) [?:?] > > Thanks, > Jacqlyn > > On Thu, Sep 7, 2023 at 7:42 PM Yanfei Lei wrote: > >> Hey Jacqlyn, >> According to the stack trace, it seems that there is a problem when >> the checkpoint is triggered. Is this the problem after the restore? >> would you like to share some logs related to restoring? >> >> Best, >> Yanfei >> >> Jacqlyn Bender via user 于2023年9月8日周五 05:11写道: >> > >> > Hey folks, >> > >> > >> > We experienced a pipeline failure where our job manager restarted and >> we were for some reason unable to restore from our last successful >> checkpoint. We had regularly completed checkpoints every 10 minutes up to >> this failure and 0 failed checkpoints logged. Using Flink version 1.17.1. >> > >> > >> > Wondering if anyone can shed light on what might have happened? >> > >> > >> > Here's the error from our logs: >> > >> > >> > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught >> exception. Stopping the process... >> > >> > >> > extendedStackTrace: java.util.concurrent.CompletionException: >> java.util.concurrent.CompletionException: java.lang.NullPointerException >> > >> > at >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669) >> ~[a-pipeline-name.jar:1.0] >> > >> > at >> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) >> [?:?] >> > >> > at >> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) >> [?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) >> [?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) >> [?:?] >> > >> > at >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) >> [?:?] >> > >> > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] >> > >> > at >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) >> [?:?] >> > >> > at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) >> [?:?] >> > >> > at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) >> [?:?] >> > >> > at java.lang.Thread.run(Thread.java:829) [?:?] >> > >> > Caused by: java.util.concurrent.CompletionException: >> java.lang.NullPointerException >> > >> > at >> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) >> ~[?:?] >> > >> > ... 7 more >> > >> > Caused by: java.lang.NullPointerException >> > >> > at >> org.apache.flink.runtime.operators.coordi
Re: Failure to restore from last completed checkpoint
Hi Yanfei, We were never able to restore from a checkpoint, we ended up restoring from a savepoint as fallback. Would those logs suggest we failed to take a checkpoint before the job manager restarted? Our observabillity monitors showed no failed checkpoints. Here is an exception that occurred before the failure to restore from the checkpoint: java.io.IOException: Cannot register Closeable, registry is already closed. Closing argument. at org.apache.flink.util.AbstractAutoCloseableRegistry.registerCloseable(AbstractAutoCloseableRegistry.java:89) ~[a-pipeline-name.jar:1.0] at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:128) ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8] at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110) ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8] at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49) ~[a-pipeline-name.jar:1.0] at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) ~[?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] at java.lang.Thread.run(Thread.java:829) [?:?] Thanks, Jacqlyn On Thu, Sep 7, 2023 at 7:42 PM Yanfei Lei wrote: > Hey Jacqlyn, > According to the stack trace, it seems that there is a problem when > the checkpoint is triggered. Is this the problem after the restore? > would you like to share some logs related to restoring? > > Best, > Yanfei > > Jacqlyn Bender via user 于2023年9月8日周五 05:11写道: > > > > Hey folks, > > > > > > We experienced a pipeline failure where our job manager restarted and we > were for some reason unable to restore from our last successful checkpoint. > We had regularly completed checkpoints every 10 minutes up to this failure > and 0 failed checkpoints logged. Using Flink version 1.17.1. > > > > > > Wondering if anyone can shed light on what might have happened? > > > > > > Here's the error from our logs: > > > > > > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught > exception. Stopping the process... > > > > > > extendedStackTrace: java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: java.lang.NullPointerException > > > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669) > ~[a-pipeline-name.jar:1.0] > > > > at > java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > [?:?] > > > > at > java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) > [?:?] > > > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) > [?:?] > > > > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) > [?:?] > > > > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > [?:?] > > > > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] > > > > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > [?:?] > > > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > > > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > > > > at java.lang.Thread.run(Thread.java:829) [?:?] > > > > Caused by: java.util.concurrent.CompletionException: > java.lang.NullPointerException > > > > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) > ~[?:?] > > > > ... 7 more > > > > Caused by: java.lang.NullPointerException > > > > at > org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399) > ~[a-pipeline-name.jar:1.0] > > > > at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?] > > > > at > java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) > ~[?:?] > > > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947) > ~[a-pipeline-name.jar:1.0] > > > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordina
Re: Failure to restore from last completed checkpoint
Hey Jacqlyn, According to the stack trace, it seems that there is a problem when the checkpoint is triggered. Is this the problem after the restore? would you like to share some logs related to restoring? Best, Yanfei Jacqlyn Bender via user 于2023年9月8日周五 05:11写道: > > Hey folks, > > > We experienced a pipeline failure where our job manager restarted and we were > for some reason unable to restore from our last successful checkpoint. We had > regularly completed checkpoints every 10 minutes up to this failure and 0 > failed checkpoints logged. Using Flink version 1.17.1. > > > Wondering if anyone can shed light on what might have happened? > > > Here's the error from our logs: > > > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught exception. > Stopping the process... > > > extendedStackTrace: java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: java.lang.NullPointerException > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669) > ~[a-pipeline-name.jar:1.0] > > at > java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) > ~[?:?] > > at > java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) > ~[?:?] > > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > [?:?] > > at > java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) > [?:?] > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) > [?:?] > > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) > [?:?] > > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > [?:?] > > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] > > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > [?:?] > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > > at java.lang.Thread.run(Thread.java:829) [?:?] > > Caused by: java.util.concurrent.CompletionException: > java.lang.NullPointerException > > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) > ~[?:?] > > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) > ~[?:?] > > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) > ~[?:?] > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) > ~[?:?] > > ... 7 more > > Caused by: java.lang.NullPointerException > > at > org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399) > ~[a-pipeline-name.jar:1.0] > > at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?] > > at > java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) > ~[?:?] > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947) > ~[a-pipeline-name.jar:1.0] > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923) > ~[a-pipeline-name.jar:1.0] > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655) > ~[a-pipeline-name.jar:1.0] > > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) > ~[?:?] > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) > ~[?:?] > > ... 7 more > >
Failure to restore from last completed checkpoint
Hey folks, We experienced a pipeline failure where our job manager restarted and we were for some reason unable to restore from our last successful checkpoint. We had regularly completed checkpoints every 10 minutes up to this failure and 0 failed checkpoints logged. Using Flink version 1.17.1. Wondering if anyone can shed light on what might have happened? Here's the error from our logs: Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught exception. Stopping the process... extendedStackTrace: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669) ~[a-pipeline-name.jar:1.0] at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) ~[?:?] at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?] at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?] at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) [?:?] at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?] at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) ~[?:?] at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?] ... 7 more Caused by: java.lang.NullPointerException at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399) ~[a-pipeline-name.jar:1.0] at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?] at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) ~[?:?] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947) ~[a-pipeline-name.jar:1.0] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923) ~[a-pipeline-name.jar:1.0] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655) ~[a-pipeline-name.jar:1.0] at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[?:?] at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?] ... 7 more