Hi Josh!

There are two ways to improve the RocksDB / S3 behavior

(1) Use the FullyAsync mode. It stores the data in one file, not in a
directory. Since directories are the "eventual consistent" part of S3, this
prevents many issues.

(2) Flink 1.2-SNAPSHOT has some additional fixes that circumvent additional
S3 issues.

Hope that helps,
Stephan


On Tue, Oct 11, 2016 at 4:42 PM, Josh <jof...@gmail.com> wrote:

> Hi Aljoscha,
>
> Yeah I'm using S3. Is this a known problem when using S3? Do you have any
> ideas on how to restore my job from this state, or prevent it from
> happening again?
>
> Thanks,
> Josh
>
>
> On Tue, Oct 11, 2016 at 1:58 PM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
>> Hi,
>> you are using S3 to store the checkpoints, right? It might be that you're
>> running into a problem with S3 "directory listings" not being consistent.
>>
>> Cheers,
>> Aljoscha
>>
>> On Tue, 11 Oct 2016 at 12:40 Josh <jof...@gmail.com> wrote:
>>
>> Hi all,
>>
>>
>> I just have a couple of questions about checkpointing and restoring state 
>> from RocksDB.
>>
>>
>> 1) In some cases, I find that it is impossible to restore a job from a 
>> checkpoint, due to an exception such as the one pasted below[*]. In this 
>> case, it appears that the last checkpoint is somehow corrupt. Does anyone 
>> know why this might happen?
>>
>>
>> 2) When the above happens, I have no choice but to cancel the job, as it 
>> repeatedly attempts to restart and keeps getting the same exception. Given 
>> that no savepoint was taken recently, is it possible for me to restore the 
>> job from an older checkpoint (e.g. the second-last checkpoint)?
>>
>>
>> The version of Flink I'm using Flink-1.1-SNAPSHOT, from mid-June.
>>
>>
>> Thanks,
>>
>> Josh
>>
>>
>> [*]The exception when restoring state:
>>
>> java.lang.Exception: Could not restore checkpointed state to operators and 
>> functions
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:480)
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:219)
>>      at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>>      at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.RuntimeException: Error while restoring RocksDB state 
>> from 
>> /mnt/yarn/usercache/hadoop/appcache/application_1476181294189_0001/flink-io-09ad1cb1-8dff-4f9a-9f61-6cae27ee6f1d/d236820a793043bd63360df6f175cae9/StreamFlatMap_9_8/dummy_state/dc5beab1-68fb-48b3-b3d6-272497d15a09/chk-1
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.restoreFromSemiAsyncSnapshot(RocksDBStateBackend.java:537)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.injectKeyValueStateSnapshots(RocksDBStateBackend.java:489)
>>      at 
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.restoreState(AbstractStreamOperator.java:204)
>>      at 
>> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:154)
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:472)
>>      ... 3 more
>> Caused by: org.rocksdb.RocksDBException: NotFound: Backup not found
>>      at org.rocksdb.BackupEngine.restoreDbFromLatestBackup(Native Method)
>>      at 
>> org.rocksdb.BackupEngine.restoreDbFromLatestBackup(BackupEngine.java:177)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.restoreFromSemiAsyncSnapshot(RocksDBStateBackend.java:535)
>>      ... 7 more
>>
>>
>>
>

Reply via email to