Hi Lars,

Have you used any of the new restore modes that were introduced with 1.15?
https://flink.apache.org/2022/05/06/restore-modes.html

Best regards,

Martijn

On Fri, Dec 9, 2022 at 2:52 PM Lars Skjærven <lar...@gmail.com> wrote:

> Lifecycle rules: None
>
> On Fri, Dec 9, 2022 at 3:17 AM Hangxiang Yu <master...@gmail.com> wrote:
>
>> Hi, Lars.
>> Could you check whether you have configured lifecycle rules for Google Cloud
>> Storage [1]? They are not recommended for buckets used for Flink checkpoints.
>>
>> [1] https://cloud.google.com/storage/docs/lifecycle
>>
>> On Fri, Dec 9, 2022 at 2:02 AM Lars Skjærven <lar...@gmail.com> wrote:
>>
>>> Hello,
>>> We had an incident today with a job that could not restore after a crash
>>> (for an unknown reason). Specifically, it fails due to a missing checkpoint
>>> file. We've experienced this a total of three times with Flink 1.15.2, but
>>> never with 1.14.x. Last time was during a node upgrade, but that was not
>>> the case this time.
>>>
>>> I've not been able to reproduce this issue. I've checked that I can kill
>>> the taskmanager and jobmanager (using kubectl delete pod), and the job
>>> restores as expected.
>>>
>>> The job is running with Kubernetes high availability, RocksDB, and
>>> incremental checkpointing.
>>>
>>> Any tips are highly appreciated.
>>>
>>> Thanks,
>>> Lars
>>>
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed
>>> state backend for
>>> KeyedProcessOperator_bf374b554824ef28e76619f4fa153430_(2/2) from any of the
>>> 1 provided restore options.
>>> at
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
>>> at
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
>>> at
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
>>> ... 11 more
>>> Caused by: org.apache.flink.runtime.state.BackendBuildingException:
>>> Caught unexpected exception.
>>> at
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:395)
>>> at
>>> org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:483)
>>> at
>>> org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:97)
>>> at
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
>>> at
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
>>> at
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>>> ... 13 more
>>> Caused by: java.io.FileNotFoundException: Item not found:
>>> 'gs://some-bucket-name/flink-jobs/namespaces/default/jobs/d60a6c94-ddbc-42a1-947e-90f62749835a/checkpoints/d60a6c94ddbc42a1947e90f62749835a/shared/3cb2bb55-b4b0-44e5-948a-5d38ec088253'.
>>> Note, it is possible that the live version is still available but the
>>> requested generation is deleted.
>>> at
>>> com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(GoogleCloudStorageExceptions.java:46)
>>>
>>>
>>
>> --
>> Best,
>> Hangxiang.
>>
>
