The version of the recovered checkpoint is also 1.7.0 .

Stefan Richter <s.rich...@data-artisans.com> 于2018年12月7日周五 下午11:06写道:

> Just to clarify, the checkpoint from which you want to resume in 1.7, was
> that taken by 1.6 or by 1.7? So far this is a bit mysterious because it
> says FileNotFound, but the whole iteration is driven by listing the
> existing files. Can you somehow monitor which files and directories are
> created during the restore attempt?
>
> On 7. Dec 2018, at 15:53, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote:
>
> hi ,Stefan
>
> Thank you for your explanation. I used flink1.6.2, which is without any
> problems. I have tested it a few times with version 1.7.0, but every time I
> resume from the checkpoint, the job will show the exception I showed
> earlier, which will make the job unrecoverable.And I checked all the logs,
> except for this exception, there are no other exceptions.
>
> The following is all the logs when an exception occurs:
>
> 2018-12-06 22:53:41,282 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess 
> (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to 
> RUNNING.
>
> 2018-12-06 22:53:41,285 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess 
> (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>       at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
> state backend for 
> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 
> 1 provided restore options.
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>       ... 5 more
> Caused by: java.nio.file.NoSuchFileException: 
> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst
>  -> 
> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>       at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>       at 
> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>       at java.nio.file.Files.createLink(Files.java:1086)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>       ... 7 more
> 2018-12-06 22:53:41,286 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job 
> Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state 
> RUNNING to FAILING.
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>       at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
> state backend for 
> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 
> 1 provided restore options.
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>       at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>       ... 5 more
> Caused by: java.nio.file.NoSuchFileException: 
> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst
>  -> 
> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>       at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>       at 
> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>       at java.nio.file.Files.createLink(Files.java:1086)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>       at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>       ... 7 more
> 2018-12-06 22:53:41,287 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: 
> topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING to 
> CANCELING.
>
>
>
> Best,
> Ben
>
> Stefan Richter <s.rich...@data-artisans.com> 于2018年12月7日周五 下午10:00写道:
>
>> Hi,
>>
>> From what I can see in the log here, it looks like your RocksDB is not
>> recovering from local but from a remote filesystem. This recovery basically
>> has steps:
>>
>> 1: Create a temporary directory (in your example, this is the dir that
>> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files,
>> mainly sst files from remote fs to the temporary directory in local fs.
>>
>> 2: List all the downloaded files in the temporary directory and either
>> hardlink (for sst files) or copy (for all other files) the listed files
>> into the new RocksDb instance path (the path that ends with …/db)
>>
>> 3: Open the new db from the instance path, delete the temporary directory.
>>
>> Now what is very surprising here is that it claims some file was not
>> found (not clear which one, but I assume the downloaded file). However, how
>> the file can be lost between downloading/listing and the attempt to
>> hardlink it is very mysterious. Can you check the logs for any other
>> exceptions and can you check what files exist in the recovery (e.g. what is
>> downloaded, if the instance path is there, …). For now, I cannot see how a
>> listed file could suddenly disappear, Flink will only delete the temporary
>> directory if recovery is completed or failed.
>>
>> Also: is this problem deterministic or was this a singularity? Did you
>> use a different Flink version before (which worked)?
>>
>> Best,
>> Stefan
>>
>> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote:
>>
>> hi . I am using flink-1.7.0. I am using RockDB and hdfs as statebackend,
>> but recently I found the following exception when the job resumed from the
>> checkpoint. Task-local state is always considered a secondary copy, the
>> ground truth of the checkpoint state is the primary copy in the distributed
>> store. But it seems that the job did not recover from hdfs, and it
>> failed directly.Hope someone can give me advices or hints about the
>> problem that I encountered.
>>
>>
>> 2018-12-06 22:54:04,171 INFO  
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess 
>> (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>      at 
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>      at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>      at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>      at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
>> state backend for 
>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of 
>> the 1 provided restore options.
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>      at 
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>      ... 5 more
>> Caused by: java.nio.file.NoSuchFileException: 
>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst
>>  -> 
>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>>      at 
>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>      at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>      at 
>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>      at java.nio.file.Files.createLink(Files.java:1086)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>      at 
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>      at 
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>      ... 7 more
>>
>>
>> Best
>>
>> Ben
>>
>>
>>
>

Reply via email to