I think then you need to investigate what goes wrong in 
RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If you 
look at the code it lists the files in a directory and tries to hard link them 
into another directory, and I would only expect to see the mentioned exception 
if the original file that we try to link does not exist. However, imo it must 
exist because we list it in the directory right before the link attempt and 
Flink is not delete anything in the meantime. So the question is, why can a 
file that was listed before just suddenly disappear when it is hard linked? The 
only potential problem could be in the path transformations and concatenations, 
but they look good to me and also pass all tests, including end-to-end tests 
that do exactly such a restore. I suggest to either observe the created files 
and what happens with the one that is mentioned in the exception or introduce 
debug logging in the code, in particular a check if the listed file (the link 
target) does exist before linking, which it should in my opinion because it is 
listed in the directory. 

> On 7. Dec 2018, at 16:33, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote:
> 
> The version of the recovered checkpoint is also 1.7.0 . 
> 
> Stefan Richter <s.rich...@data-artisans.com 
> <mailto:s.rich...@data-artisans.com>> 于2018年12月7日周五 下午11:06写道:
> Just to clarify, the checkpoint from which you want to resume in 1.7, was 
> that taken by 1.6 or by 1.7? So far this is a bit mysterious because it says 
> FileNotFound, but the whole iteration is driven by listing the existing 
> files. Can you somehow monitor which files and directories are created during 
> the restore attempt?
> 
>> On 7. Dec 2018, at 15:53, Ben Yan <yan.xiao.bin.m...@gmail.com 
>> <mailto:yan.xiao.bin.m...@gmail.com>> wrote:
>> 
>> hi ,Stefan
>> 
>> Thank you for your explanation. I used flink1.6.2, which is without any 
>> problems. I have tested it a few times with version 1.7.0, but every time I 
>> resume from the checkpoint, the job will show the exception I showed 
>> earlier, which will make the job unrecoverable.And I checked all the logs, 
>> except for this exception, there are no other exceptions.
>> 
>> The following is all the logs when an exception occurs:
>> 2018-12-06 22:53:41,282 INFO  
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess 
>> (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to 
>> RUNNING.
>> 2018-12-06 22:53:41,285 INFO  
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess 
>> (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>      at 
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>      at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>      at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
>> state backend for 
>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of 
>> the 1 provided restore options.
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>      ... 5 more
>> Caused by: java.nio.file.NoSuchFileException: 
>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst
>>  -> 
>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>      at 
>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>      at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>      at 
>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>      at java.nio.file.Files.createLink(Files.java:1086)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>      ... 7 more
>> 2018-12-06 22:53:41,286 INFO  
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job 
>> Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state 
>> RUNNING to FAILING.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>      at 
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>      at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>      at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
>> state backend for 
>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of 
>> the 1 provided restore options.
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>      ... 5 more
>> Caused by: java.nio.file.NoSuchFileException: 
>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst
>>  -> 
>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>      at 
>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>      at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>      at 
>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>      at java.nio.file.Files.createLink(Files.java:1086)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>      ... 7 more
>> 2018-12-06 22:53:41,287 INFO  
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: 
>> topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING 
>> to CANCELING.
>> 
>> 
>> Best, 
>> Ben
>> 
>> Stefan Richter <s.rich...@data-artisans.com 
>> <mailto:s.rich...@data-artisans.com>> 于2018年12月7日周五 下午10:00写道:
>> Hi,
>> 
>> From what I can see in the log here, it looks like your RocksDB is not 
>> recovering from local but from a remote filesystem. This recovery basically 
>> has steps:
>> 
>> 1: Create a temporary directory (in your example, this is the dir that ends 
>> …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files, mainly 
>> sst files from remote fs to the temporary directory in local fs.
>> 
>> 2: List all the downloaded files in the temporary directory and either 
>> hardlink (for sst files) or copy (for all other files) the listed files into 
>> the new RocksDb instance path (the path that ends with …/db)
>> 
>> 3: Open the new db from the instance path, delete the temporary directory.
>> 
>> Now what is very surprising here is that it claims some file was not found 
>> (not clear which one, but I assume the downloaded file). However, how the 
>> file can be lost between downloading/listing and the attempt to hardlink it 
>> is very mysterious. Can you check the logs for any other exceptions and can 
>> you check what files exist in the recovery (e.g. what is downloaded, if the 
>> instance path is there, …). For now, I cannot see how a listed file could 
>> suddenly disappear, Flink will only delete the temporary directory if 
>> recovery is completed or failed. 
>> 
>> Also: is this problem deterministic or was this a singularity? Did you use a 
>> different Flink version before (which worked)?
>> 
>> Best,
>> Stefan
>> 
>>> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.m...@gmail.com 
>>> <mailto:yan.xiao.bin.m...@gmail.com>> wrote:
>>> 
>>> hi . I am using flink-1.7.0. I am using RockDB and hdfs as statebackend, 
>>> but recently I found the following exception when the job resumed from the 
>>> checkpoint. Task-local state is always considered a secondary copy, the 
>>> ground truth of the checkpoint state is the primary copy in the distributed 
>>> store. But it seems that the job did not recover from hdfs, and it failed 
>>> directly.Hope someone can give me advices or hints about the problem that I 
>>> encountered.
>>> 
>>> 
>>> 2018-12-06 22:54:04,171 INFO  
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
>>> KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from 
>>> RUNNING to FAILED.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>     at 
>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>     at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>     at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
>>> state backend for 
>>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of 
>>> the 1 provided restore options.
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>     ... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: 
>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst
>>>  -> 
>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>>>     at 
>>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>     at 
>>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>     at java.nio.file.Files.createLink(Files.java:1086)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>     ... 7 more
>>> 
>>> Best
>>> Ben
>> 
> 

Reply via email to