Re: Failure to restore from last completed checkpoint

2023-09-08 Thread Alexis Sarda-Espinosa
Hello,

Just a shot in the dark here, but could it be related to
https://issues.apache.org/jira/browse/FLINK-32241 ?

Such failures can cause many exceptions, but I think the ones you've
included aren't pointing to the root cause, so I'm not sure if that issue
applies to you.

Regards,
Alexis.

On Fri, 8 Sept 2023, 20:43 Jacqlyn Bender via user, 
wrote:

> Hi Yanfei,
>
> We were never able to restore from a checkpoint, we ended up restoring
> from a savepoint as fallback. Would those logs suggest we failed to take a
> checkpoint before the job manager restarted? Our observabillity monitors
> showed no failed checkpoints.
>
> Here is an exception that occurred before the failure to restore from the
> checkpoint:
>
> java.io.IOException: Cannot register Closeable, registry is already
> closed. Closing argument.
>
> at
> org.apache.flink.util.AbstractAutoCloseableRegistry.registerCloseable(AbstractAutoCloseableRegistry.java:89)
> ~[a-pipeline-name.jar:1.0]
>
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:128)
> ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8]
>
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110)
> ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8]
>
> at
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
> ~[a-pipeline-name.jar:1.0]
>
> at
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
> ~[?:?] at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> ~[?:?]
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> ~[?:?]
>
> at java.lang.Thread.run(Thread.java:829) [?:?]
>
> Thanks,
> Jacqlyn
>
> On Thu, Sep 7, 2023 at 7:42 PM Yanfei Lei  wrote:
>
>> Hey Jacqlyn,
>> According to the stack trace, it seems that there is a problem when
>> the checkpoint is triggered. Is this the problem after the restore?
>> would you like to share some logs related to restoring?
>>
>> Best,
>> Yanfei
>>
>> Jacqlyn Bender via user  于2023年9月8日周五 05:11写道:
>> >
>> > Hey folks,
>> >
>> >
>> > We experienced a pipeline failure where our job manager restarted and
>> we were for some reason unable to restore from our last successful
>> checkpoint. We had regularly completed checkpoints every 10 minutes up to
>> this failure and 0 failed checkpoints logged. Using Flink version 1.17.1.
>> >
>> >
>> > Wondering if anyone can shed light on what might have happened?
>> >
>> >
>> > Here's the error from our logs:
>> >
>> >
>> > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught
>> exception. Stopping the process...
>> >
>> >
>> > extendedStackTrace: java.util.concurrent.CompletionException:
>> java.util.concurrent.CompletionException: java.lang.NullPointerException
>> >
>> > at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669)
>> ~[a-pipeline-name.jar:1.0]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>> [?:?]
>> >
>> > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>> >
>> > at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> [?:?]
>> >
>> > at java.lang.Thread.run(Thread.java:829) [?:?]
>> >
>> > Caused by: java.util.concurrent.CompletionException:
>> java.lang.NullPointerException
>> >
>> > at
>> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>> ~[?:?]
>> >
>> > ... 7 more
>> >
>> > Caused by: java.lang.NullPointerException
>> >
>> > at
>> org.apache.flink.runtime.operators.coordi

Re: Failure to restore from last completed checkpoint

2023-09-08 Thread Jacqlyn Bender via user
Hi Yanfei,

We were never able to restore from a checkpoint, we ended up restoring from
a savepoint as fallback. Would those logs suggest we failed to take a
checkpoint before the job manager restarted? Our observabillity monitors
showed no failed checkpoints.

Here is an exception that occurred before the failure to restore from the
checkpoint:

java.io.IOException: Cannot register Closeable, registry is already closed.
Closing argument.

at
org.apache.flink.util.AbstractAutoCloseableRegistry.registerCloseable(AbstractAutoCloseableRegistry.java:89)
~[a-pipeline-name.jar:1.0]

at
org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:128)
~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8]

at
org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110)
~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8]

at
org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
~[a-pipeline-name.jar:1.0]

at
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
~[?:?] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
~[?:?]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
~[?:?]

at java.lang.Thread.run(Thread.java:829) [?:?]

Thanks,
Jacqlyn

On Thu, Sep 7, 2023 at 7:42 PM Yanfei Lei  wrote:

> Hey Jacqlyn,
> According to the stack trace, it seems that there is a problem when
> the checkpoint is triggered. Is this the problem after the restore?
> would you like to share some logs related to restoring?
>
> Best,
> Yanfei
>
> Jacqlyn Bender via user  于2023年9月8日周五 05:11写道:
> >
> > Hey folks,
> >
> >
> > We experienced a pipeline failure where our job manager restarted and we
> were for some reason unable to restore from our last successful checkpoint.
> We had regularly completed checkpoints every 10 minutes up to this failure
> and 0 failed checkpoints logged. Using Flink version 1.17.1.
> >
> >
> > Wondering if anyone can shed light on what might have happened?
> >
> >
> > Here's the error from our logs:
> >
> >
> > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught
> exception. Stopping the process...
> >
> >
> > extendedStackTrace: java.util.concurrent.CompletionException:
> java.util.concurrent.CompletionException: java.lang.NullPointerException
> >
> > at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669)
> ~[a-pipeline-name.jar:1.0]
> >
> > at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> [?:?]
> >
> > at
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
> [?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
> [?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
> [?:?]
> >
> > at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> [?:?]
> >
> > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> >
> > at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> [?:?]
> >
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
> >
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
> >
> > at java.lang.Thread.run(Thread.java:829) [?:?]
> >
> > Caused by: java.util.concurrent.CompletionException:
> java.lang.NullPointerException
> >
> > at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
> ~[?:?]
> >
> > ... 7 more
> >
> > Caused by: java.lang.NullPointerException
> >
> > at
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399)
> ~[a-pipeline-name.jar:1.0]
> >
> > at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
> >
> > at
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
> ~[?:?]
> >
> > at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947)
> ~[a-pipeline-name.jar:1.0]
> >
> > at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordina

Re: Failure to restore from last completed checkpoint

2023-09-07 Thread Yanfei Lei
Hey Jacqlyn,
According to the stack trace, it seems that there is a problem when
the checkpoint is triggered. Is this the problem after the restore?
would you like to share some logs related to restoring?

Best,
Yanfei

Jacqlyn Bender via user  于2023年9月8日周五 05:11写道:
>
> Hey folks,
>
>
> We experienced a pipeline failure where our job manager restarted and we were 
> for some reason unable to restore from our last successful checkpoint. We had 
> regularly completed checkpoints every 10 minutes up to this failure and 0 
> failed checkpoints logged. Using Flink version 1.17.1.
>
>
> Wondering if anyone can shed light on what might have happened?
>
>
> Here's the error from our logs:
>
>
> Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught exception. 
> Stopping the process...
>
>
> extendedStackTrace: java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: java.lang.NullPointerException
>
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669)
>  ~[a-pipeline-name.jar:1.0]
>
> at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>  ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>  ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>  [?:?]
>
> at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) 
> [?:?]
>
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
>  [?:?]
>
> at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>  [?:?]
>
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 
> [?:?]
>
> at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>
> at java.lang.Thread.run(Thread.java:829) [?:?]
>
> Caused by: java.util.concurrent.CompletionException: 
> java.lang.NullPointerException
>
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>  ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>  ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) 
> ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>  ~[?:?]
>
> ... 7 more
>
> Caused by: java.lang.NullPointerException
>
> at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399)
>  ~[a-pipeline-name.jar:1.0]
>
> at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
>
> at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) 
> ~[?:?]
>
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947)
>  ~[a-pipeline-name.jar:1.0]
>
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923)
>  ~[a-pipeline-name.jar:1.0]
>
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655)
>  ~[a-pipeline-name.jar:1.0]
>
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) 
> ~[?:?]
>
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>  ~[?:?]
>
> ... 7 more
>
>


Failure to restore from last completed checkpoint

2023-09-07 Thread Jacqlyn Bender via user
Hey folks,


We experienced a pipeline failure where our job manager restarted and we
were for some reason unable to restore from our last successful checkpoint.
We had regularly completed checkpoints every 10 minutes up to this failure
and 0 failed checkpoints logged. Using Flink version 1.17.1.


Wondering if anyone can shed light on what might have happened?


Here's the error from our logs:


Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught exception.
Stopping the process...


extendedStackTrace: java.util.concurrent.CompletionException:
java.util.concurrent.CompletionException: java.lang.NullPointerException

at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669)
~[a-pipeline-name.jar:1.0]

at
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
~[?:?]

at
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
~[?:?]

at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
[?:?]

at
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
[?:?]

at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
[?:?]

at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
[?:?]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[?:?]

at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]

at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
[?:?]

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[?:?]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[?:?]

at java.lang.Thread.run(Thread.java:829) [?:?]

Caused by: java.util.concurrent.CompletionException:
java.lang.NullPointerException

at
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
~[?:?]

at
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
~[?:?]

at
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
~[?:?]

at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
~[?:?]

... 7 more

Caused by: java.lang.NullPointerException

at
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399)
~[a-pipeline-name.jar:1.0]

at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]

at
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
~[?:?]

at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947)
~[a-pipeline-name.jar:1.0]

at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923)
~[a-pipeline-name.jar:1.0]

at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655)
~[a-pipeline-name.jar:1.0]

at
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
~[?:?]

at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
~[?:?]

... 7 more