Hi,

From what I can see in the log here, it looks like your RocksDB is not 
recovering from local state but from a remote filesystem. This recovery 
basically has three steps:

1: Create a temporary directory (in your example, this is the dir that ends 
with …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files, mainly 
sst files, from the remote fs to the temporary directory in the local fs.

2: List all the downloaded files in the temporary directory and either hardlink 
(for sst files) or copy (for all other files) the listed files into the new 
RocksDB instance path (the path that ends with …/db).

3: Open the new db from the instance path and delete the temporary directory.
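
To make steps 2 and 3 concrete, here is a minimal sketch using only java.nio. 
It is not the Flink code: the class and method names (RestoreSketch, 
linkOrCopy, deleteRecursively) are made up for illustration, the download in 
step 1 and the actual opening of the db are left out, and I am assuming that 
sst files are recognized by their ".sst" suffix.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of the restore steps, not Flink's implementation.
final class RestoreSketch {

    /** Step 2: hard-link sst files and copy all other files into the instance dir. */
    static void linkOrCopy(Path tempDir, Path instanceDir) throws IOException {
        Files.createDirectories(instanceDir);
        try (DirectoryStream<Path> listing = Files.newDirectoryStream(tempDir)) {
            for (Path source : listing) {
                Path target = instanceDir.resolve(source.getFileName());
                if (source.getFileName().toString().endsWith(".sst")) {
                    // Argument order is (link, existing). This is the call that
                    // throws NoSuchFileException in the reported stack trace.
                    Files.createLink(target, source);
                } else {
                    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    /** Step 3 (cleanup part): remove the temporary directory and its contents. */
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Because the sst files are only hard-linked, deleting the temporary directory 
afterwards does not remove their data; the instance path keeps its own link to 
the same file.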

Now what is very surprising here is that it claims some file was not found (it 
is not clear which one, but I assume the downloaded file). How the file could 
be lost between downloading/listing and the attempt to hardlink it is very 
mysterious. Can you check the logs for any other exceptions, and can you check 
which files exist during the recovery (e.g. what was downloaded, whether the 
instance path is there, …)? For now, I cannot see how a listed file could 
suddenly disappear: Flink will only delete the temporary directory once 
recovery has completed or failed.
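
As a rough aid for that check, the sketch below inspects the two paths from 
the exception message. Files.createLink takes (link, existing) and its 
exception message prints them as "link -> existing", so in your trace the 
…/db/000495.sst path is the link to be created and the path under the 
temporary directory is the existing file. On POSIX systems the underlying 
link call also reports "no such file" when a directory component of the link 
path is missing, so the instance directory is worth checking too. The class 
name LinkFailureCheck and its enum are hypothetical, and the check is only 
meaningful while the directories still exist.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic helper, not part of Flink.
final class LinkFailureCheck {

    enum Finding { SOURCE_MISSING, TARGET_DIR_MISSING, TARGET_EXISTS, NOTHING_OBVIOUS }

    /** Inspects the arguments of a failed Files.createLink(link, existing) call. */
    static Finding inspect(Path link, Path existing) {
        if (!Files.exists(existing)) {
            // The downloaded file itself is gone.
            return Finding.SOURCE_MISSING;
        }
        Path parent = link.getParent();
        if (parent != null && !Files.isDirectory(parent)) {
            // The instance directory that should receive the link is gone.
            return Finding.TARGET_DIR_MISSING;
        }
        if (Files.exists(link)) {
            // Would normally surface as FileAlreadyExistsException instead.
            return Finding.TARGET_EXISTS;
        }
        return Finding.NOTHING_OBVIOUS;
    }
}
```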

Also: is this problem deterministic or was it a one-off? Did you use a 
different Flink version before (which worked)?

Best,
Stefan

> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote:
> 
> Hi. I am using flink-1.7.0 with RocksDB and HDFS as the state backend, but 
> recently I found the following exception when the job resumed from the 
> checkpoint. Task-local state is always considered a secondary copy; the 
> ground truth of the checkpoint state is the primary copy in the distributed 
> store. But it seems that the job did not recover from HDFS, and it failed 
> directly. I hope someone can give me advice or hints about the problem that I 
> encountered.
> 
> 
> 2018-12-06 22:54:04,171 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess 
> (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>       at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
> state backend for 
> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 
> 1 provided restore options.
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>       ... 5 more
> Caused by: java.nio.file.NoSuchFileException: 
> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst
>  -> 
> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>       at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>       at 
> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>       at java.nio.file.Files.createLink(Files.java:1086)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>       ... 7 more
> 
> Best
> Ben
