Hi Aljoscha,

Yeah I'm using S3. Is this a known problem when using S3? Do you have any
ideas on how to restore my job from this state, or prevent it from
happening again?

Thanks,
Josh


On Tue, Oct 11, 2016 at 1:58 PM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> Hi,
> you are using S3 to store the checkpoints, right? It might be that you're
> running into a problem with S3 "directory listings" not being consistent.
>
> Cheers,
> Aljoscha
>
> On Tue, 11 Oct 2016 at 12:40 Josh <jof...@gmail.com> wrote:
>
> Hi all,
>
>
> I just have a couple of questions about checkpointing and restoring state 
> from RocksDB.
>
>
> 1) In some cases, I find that it is impossible to restore a job from a 
> checkpoint, due to an exception such as the one pasted below[*]. In this 
> case, it appears that the last checkpoint is somehow corrupt. Does anyone 
> know why this might happen?
>
>
> 2) When the above happens, I have no choice but to cancel the job, as it 
> repeatedly attempts to restart and keeps getting the same exception. Given 
> that no savepoint was taken recently, is it possible for me to restore the 
> job from an older checkpoint (e.g. the second-last checkpoint)?
>
>
> The version of Flink I'm using Flink-1.1-SNAPSHOT, from mid-June.
>
>
> Thanks,
>
> Josh
>
>
> [*]The exception when restoring state:
>
> java.lang.Exception: Could not restore checkpointed state to operators and 
> functions
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:480)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:219)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Error while restoring RocksDB state 
> from 
> /mnt/yarn/usercache/hadoop/appcache/application_1476181294189_0001/flink-io-09ad1cb1-8dff-4f9a-9f61-6cae27ee6f1d/d236820a793043bd63360df6f175cae9/StreamFlatMap_9_8/dummy_state/dc5beab1-68fb-48b3-b3d6-272497d15a09/chk-1
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.restoreFromSemiAsyncSnapshot(RocksDBStateBackend.java:537)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.injectKeyValueStateSnapshots(RocksDBStateBackend.java:489)
>       at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.restoreState(AbstractStreamOperator.java:204)
>       at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:154)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:472)
>       ... 3 more
> Caused by: org.rocksdb.RocksDBException: NotFound: Backup not found
>       at org.rocksdb.BackupEngine.restoreDbFromLatestBackup(Native Method)
>       at 
> org.rocksdb.BackupEngine.restoreDbFromLatestBackup(BackupEngine.java:177)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.restoreFromSemiAsyncSnapshot(RocksDBStateBackend.java:535)
>       ... 7 more
>
>
>

Reply via email to