Same error again today. Any tips? I'm considering downgrading to Flink
1.14.
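
In case it helps anyone else debugging this: the "Item not found" message in the trace quoted below names the exact object that is missing, and its note about generations suggests checking whether a noncurrent version still exists. Below is a minimal sketch for pulling the bucket and object name out of a pasted trace. The function name `missing_object` and the regex are my own illustration, not part of Flink or the GCS connector, and the sketch assumes the gs:// path itself has not been wrapped across lines.

```python
import re

# Matches the quoted gs:// URI in an "Item not found" message.
# The pattern is an illustrative assumption based on the traces in this thread.
GCS_URI = re.compile(r"Item not found:\s*'gs://([^/']+)/([^']+)'")


def missing_object(message):
    """Return (bucket, object_name) from an 'Item not found' message, or None."""
    # Mail clients wrap long lines and add '>' quote markers, so replace each
    # line break plus any following quote markers with a single space.
    flat = re.sub(r"\n[> ]*", " ", message)
    match = GCS_URI.search(flat)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    trace = (
        "Caused by: java.io.FileNotFoundException: Item not found: \n"
        "> 'gs://bucketname/namespace/flink-jobs/namespaces/default/jobs/"
        "fbdde9e7-cf5a-44b4-a3d4-d3ed517432a0/checkpoints/"
        "fbdde9e7cf5a44b4a3d4d3ed517432a0/shared/"
        "ae551eda-a588-45be-ba08-32bfbc50e965'.\n"
        ">  Note, it is possible that the live version is still available"
    )
    found = missing_object(trace)
    if found is not None:
        bucket, name = found
        # 'gsutil ls -a' also lists noncurrent object versions.
        print(f"gsutil ls -a gs://{bucket}/{name}")
```

The printed command only shows older generations if object versioning is enabled on the bucket; otherwise it just confirms whether the object currently exists.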

On Wed, Dec 14, 2022 at 11:51 AM Lars Skjærven <lar...@gmail.com> wrote:

> As far as I understand, we are not specifying a restore mode, so
> I guess we're using the default (NO_CLAIM).
>
> We're using Ververica Platform to handle deploys, and it is a bit
> obscure what happens underneath.
>
> It happened again this morning:
>
> Caused by: java.io.FileNotFoundException: Item not found:
> 'gs://bucketname/namespace/flink-jobs/namespaces/default/jobs/fbdde9e7-cf5a-44b4-a3d4-d3ed517432a0/checkpoints/fbdde9e7cf5a44b4a3d4d3ed517432a0/shared/ae551eda-a588-45be-ba08-32bfbc50e965'.
> Note, it is possible that the live version is still available but the
> requested generation is deleted.
>
>
> On Tue, Dec 13, 2022 at 11:37 PM Martijn Visser <martijnvis...@apache.org>
> wrote:
>
>> Hi Lars,
>>
>> Have you used any of the new restore modes that were introduced with
>> 1.15? https://flink.apache.org/2022/05/06/restore-modes.html
>>
>> Best regards,
>>
>> Martijn
>>
>> On Fri, Dec 9, 2022 at 2:52 PM Lars Skjærven <lar...@gmail.com> wrote:
>>
>>> Lifecycle rules: None
>>>
>>> On Fri, Dec 9, 2022 at 3:17 AM Hangxiang Yu <master...@gmail.com> wrote:
>>>
>>>> Hi, Lars.
>>>> Could you check whether you have configured a lifecycle policy on the
>>>> Google Cloud Storage bucket [1]? That is not recommended for Flink
>>>> checkpoint storage.
>>>>
>>>> [1] https://cloud.google.com/storage/docs/lifecycle
>>>>
>>>> On Fri, Dec 9, 2022 at 2:02 AM Lars Skjærven <lar...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>> We had an incident today with a job that could not restore after a crash
>>>>> (for an unknown reason). Specifically, it fails due to a missing checkpoint
>>>>> file. We've experienced this a total of three times with Flink 1.15.2, but
>>>>> never with 1.14.x. Last time was during a node upgrade, but that was not
>>>>> the case this time.
>>>>>
>>>>> I've not been able to reproduce this issue. I've checked that I can
>>>>> kill the taskmanager and jobmanager (using kubectl delete pod), and the
>>>>> job restores as expected.
>>>>>
>>>>> The job is running with Kubernetes high availability, RocksDB and
>>>>> incremental checkpointing.
>>>>>
>>>>> Any tips are highly appreciated.
>>>>>
>>>>> Thanks,
>>>>> Lars
>>>>>
>>>>> Caused by: org.apache.flink.util.FlinkException: Could not restore
>>>>> keyed state backend for
>>>>> KeyedProcessOperator_bf374b554824ef28e76619f4fa153430_(2/2) from any of
>>>>> the 1 provided restore options.
>>>>> at
>>>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
>>>>> at
>>>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
>>>>> at
>>>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
>>>>> ... 11 more
>>>>> Caused by: org.apache.flink.runtime.state.BackendBuildingException:
>>>>> Caught unexpected exception.
>>>>> at
>>>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:395)
>>>>> at
>>>>> org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:483)
>>>>> at
>>>>> org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:97)
>>>>> at
>>>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
>>>>> at
>>>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
>>>>> at
>>>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>>>>> ... 13 more
>>>>> Caused by: java.io.FileNotFoundException: Item not found:
>>>>> 'gs://some-bucket-name/flink-jobs/namespaces/default/jobs/d60a6c94-ddbc-42a1-947e-90f62749835a/checkpoints/d60a6c94ddbc42a1947e90f62749835a/shared/3cb2bb55-b4b0-44e5-948a-5d38ec088253'.
>>>>> Note, it is possible that the live version is still available but the
>>>>> requested generation is deleted.
>>>>> at
>>>>> com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(GoogleCloudStorageExceptions.java:46)
>>>>>
>>>>>
>>>>
>>>> --
>>>> Best,
>>>> Hangxiang.
>>>>
>>>
