Hello,
If I recall correctly, savepoints are always self-contained, even if
incremental checkpointing is enabled.
However, this doesn't appear to be documented anywhere.
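For example, even with incremental checkpointing enabled on the RocksDB
backend, triggering a savepoint from the CLI should still write a full,
self-contained snapshot (job ID and target directory below are placeholders):

    bin/flink savepoint <jobID> [targetDirectory]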
As for the missing file, I'm looping in Stefan, who is more knowledgeable
about incremental checkpointing (and any potential known issues).
Regards,
Chesnay
On 17.07.2017 13:12, Shai Kaplan wrote:
Hi.
I'm running Flink 1.3.1 with checkpoints stored in Azure blobs.
The incremental checkpointing feature is enabled.
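For reference, the backend is set up roughly as in the sketch below; the
wasbs:// URI and checkpoint interval are placeholders, and the Hadoop Azure
filesystem needs to be on the classpath for wasbs:// paths to resolve.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            // checkpoint every 60 seconds (placeholder interval)
            env.enableCheckpointing(60000);
            // second constructor argument = true enables incremental checkpoints (Flink 1.3+)
            env.setStateBackend(new RocksDBStateBackend(
                "wasbs://<container>@<account>.blob.core.windows.net/checkpoints",
                true));
            // ... job topology and env.execute(...) omitted ...
        }
    }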
The job is trying to restore a checkpoint and consistently gets:
java.lang.IllegalStateException: Could not initialize keyed state backend.
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:321)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:217)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:676)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:663)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:252)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: wasbs://***/5065c840d4d4ba8c7cd91f793012bab1/chk-37/f52d633b-9d3f-47c1-bf23-58dcc54572e3
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem.open(NativeAzureFileSystem.java:1905)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:404)
    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:48)
    at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
    at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.readStateData(RocksDBKeyedStateBackend.java:1276)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.readAllStateData(RocksDBKeyedStateBackend.java:1458)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstance(RocksDBKeyedStateBackend.java:1319)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:1493)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:965)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.createKeyedStateBackend(StreamTask.java:772)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:311)
The name of the missing file varies, but it is always a file in
checkpoint 37. The last successful checkpoint was number 41, so I'm
guessing that's the checkpoint it's trying to restore, but because of
incremental checkpointing it also needs files from previous checkpoints,
which are apparently missing. Could this be a problem in the Azure
filesystem integration? If some files failed to write, why didn't the
checkpoint fail?
When I realized nothing was going to change, I canceled the job and
started it from a savepoint, which corresponded to checkpoint number 40.
I actually expected it to fail, and that I would have to restore from a
savepoint taken before the apparently corrupted checkpoint 37, but it
didn't fail. Should I infer that savepoints are self-contained and not
incremental?
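For reference, the sequence I used was roughly the following (job ID,
savepoint path, and jar file are placeholders):

    bin/flink cancel <jobID>
    bin/flink run -s <savepointPath> my-job.jar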