Same error again today. Any tips? I'm considering downgrading to Flink 1.14.
On Wed, Dec 14, 2022 at 11:51 AM Lars Skjærven <lar...@gmail.com> wrote:

> As far as I understand we are not specifying anything on restore mode, so
> I guess the default (NO_CLAIM) is what we're using.
>
> We're using Ververica Platform to handle deploys, and things are a bit
> obscure on what happens underneath.
>
> It happened again this morning:
>
> Caused by: java.io.FileNotFoundException: Item not found:
> 'gs://bucketname/namespace/flink-jobs/namespaces/default/jobs/fbdde9e7-cf5a-44b4-a3d4-d3ed517432a0/checkpoints/fbdde9e7cf5a44b4a3d4d3ed517432a0/shared/ae551eda-a588-45be-ba08-32bfbc50e965'.
> Note, it is possible that the live version is still available but the
> requested generation is deleted.
>
> On Tue, Dec 13, 2022 at 11:37 PM Martijn Visser <martijnvis...@apache.org> wrote:
>
>> Hi Lars,
>>
>> Have you used any of the new restore modes that were introduced with
>> 1.15? https://flink.apache.org/2022/05/06/restore-modes.html
>>
>> Best regards,
>>
>> Martijn
>>
>> On Fri, Dec 9, 2022 at 2:52 PM Lars Skjærven <lar...@gmail.com> wrote:
>>
>>> Lifecycle rules: None
>>>
>>> On Fri, Dec 9, 2022 at 3:17 AM Hangxiang Yu <master...@gmail.com> wrote:
>>>
>>>> Hi, Lars.
>>>> Could you check whether you have configured a lifecycle policy on
>>>> Google Cloud Storage [1]? That is not recommended when the bucket is
>>>> used for Flink checkpoints.
>>>>
>>>> [1] https://cloud.google.com/storage/docs/lifecycle
>>>>
>>>> On Fri, Dec 9, 2022 at 2:02 AM Lars Skjærven <lar...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>> We had an incident today with a job that could not restore after a
>>>>> crash (for an unknown reason). Specifically, it fails due to a missing
>>>>> checkpoint file. We've experienced this a total of three times with
>>>>> Flink 1.15.2, but never with 1.14.x. Last time it was during a node
>>>>> upgrade, but that was not the case this time.
>>>>>
>>>>> I've not been able to reproduce this issue. I've checked that I can
>>>>> kill the taskmanager and jobmanager (using kubectl delete pod), and
>>>>> the job restores as expected.
>>>>>
>>>>> The job is running with Kubernetes high availability, RocksDB, and
>>>>> incremental checkpointing.
>>>>>
>>>>> Any tips are highly appreciated.
>>>>>
>>>>> Thanks,
>>>>> Lars
>>>>>
>>>>> Caused by: org.apache.flink.util.FlinkException: Could not restore
>>>>> keyed state backend for
>>>>> KeyedProcessOperator_bf374b554824ef28e76619f4fa153430_(2/2) from any
>>>>> of the 1 provided restore options.
>>>>>     at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
>>>>>     at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
>>>>>     at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
>>>>>     ... 11 more
>>>>> Caused by: org.apache.flink.runtime.state.BackendBuildingException:
>>>>> Caught unexpected exception.
>>>>>     at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:395)
>>>>>     at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:483)
>>>>>     at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:97)
>>>>>     at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
>>>>>     at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
>>>>>     at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>>>>>     ... 13 more
>>>>> Caused by: java.io.FileNotFoundException: Item not found:
>>>>> 'gs://some-bucket-name/flink-jobs/namespaces/default/jobs/d60a6c94-ddbc-42a1-947e-90f62749835a/checkpoints/d60a6c94ddbc42a1947e90f62749835a/shared/3cb2bb55-b4b0-44e5-948a-5d38ec088253'.
>>>>> Note, it is possible that the live version is still available but the
>>>>> requested generation is deleted.
>>>>>     at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(GoogleCloudStorageExceptions.java:46)
>>>>
>>>> --
>>>> Best,
>>>> Hangxiang.
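
[Editor's note: since the thread mentions relying on the NO_CLAIM default restore mode, here is a minimal sketch of pinning the restore mode explicitly. It assumes Flink 1.15+ and uses the `execution.savepoint-restore-mode` option described in the restore-modes post Martijn linked; treat it as a starting point, not a verified fix for this incident.]

```yaml
# flink-conf.yaml (assumes Flink 1.15+; a sketch, not a verified fix).
# NO_CLAIM (the default) takes no ownership of the restored snapshot and
# forces the first checkpoint to be a full one, so Flink never depends on
# the old artifacts. CLAIM instead lets Flink take ownership of the
# restored snapshot files and delete them itself once they are no longer
# needed.
execution.savepoint-restore-mode: CLAIM
```

Whether this helps depends on what deleted the `shared/` object in the first place: if a bucket lifecycle rule or object versioning on the GCS side removed the generation (as the "requested generation is deleted" message hints), the restore mode alone won't prevent the FileNotFoundException.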