Hi Richard,

Googling a bit indicates that this might actually be a GCS problem [1, 2, 3]. The proposed solution/workaround so far is to retry the whole upload operation as part of the application logic. Since I assume that you are writing to GCS via Hadoop's file system, this should actually fall into the realm of the Hadoop file system implementation and not Flink.

What you could do to mitigate the problem a bit is to set the number of tolerable checkpoint failures to a non-zero value via `CheckpointConfig.setTolerableCheckpointFailureNumber`. Setting this to `n` means that the job will only fail and then restart after `n` checkpoint failures. Unfortunately, we do not support a failure rate yet.
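To make that concrete, here is a rough, untested sketch of how this could look in the job setup. Note that this setter only exists in newer Flink versions, and the checkpoint interval and the tolerance value of 5 are just placeholders, not recommendations:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerantCheckpointsExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s; the interval is just a placeholder.
        env.enableCheckpointing(60_000L);

        // Tolerate up to 5 failed checkpoints before the job fails and restarts.
        // Pick a value that matches how often the GCS uploads fail for you.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(5);

        // ... define sources, state and sinks as usual, then call env.execute(...)
    }
}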
[1] https://github.com/googleapis/google-cloud-java/issues/3586
[2] https://github.com/googleapis/google-cloud-java/issues/5704
[3] https://issuetracker.google.com/issues/137168102

Cheers,
Till

On Tue, Jan 28, 2020 at 6:25 PM Richard Deurwaarder <rich...@xeli.eu> wrote:

> Hi all,
>
> We've got a Flink job running on 1.8.0 which writes its state (RocksDB) to
> Google Cloud Storage [1]. We've noticed that jobs with a large amount of
> state (500 GB range) are becoming *very* unstable, in the order of
> restarting once an hour or even more.
>
> The reason for this instability is that we run into "410 Gone" [4] errors
> from Google Cloud Storage. This indicates an upload (a write from Flink's
> perspective) took place, and it wanted to resume the write [2] but could
> not find the file it needed to resume. My guess is that this happens because
> the previous attempt failed, or perhaps because it uploads in chunks of 67 MB [3].
>
> The library logs this line when this happens:
>
> "Encountered status code 410 when accessing URL
> https://www.googleapis.com/upload/storage/v1/b/<project>/o?ifGenerationMatch=0&name=job-manager/15aa2391-a055-4bfd-8d82-e9e4806baa9c/8ae818761055cdc022822010a8b4a1ed/chk-52224/_metadata&uploadType=resumable&upload_id=AEnB2UqJwkdrQ8YuzqrTp9Nk4bDnzbuJcTlD5E5hKNLNz4xQ7vjlYrDzYC29ImHcp0o6OjSCmQo6xkDSj5OHly7aChH0JxxXcg.
> Delegating to response handler for possible retry."
>
> We're kind of stuck on these questions:
> * Is Flink capable of doing these retries?
> * Does anyone successfully write their (RocksDB) state to Google Cloud
> Storage for bigger state sizes?
> * Is it possible Flink renames or deletes certain directories before all
> flushes have been done, based on an atomicity guarantee provided by HDFS
> that does not hold on other implementations? A race condition of sorts.
>
> Basically, does anyone recognize this behavior?
>
> Regards,
>
> Richard Deurwaarder
>
> [1] We use an HDFS implementation provided by Google:
> https://github.com/GoogleCloudDataproc/bigdata-interop/tree/master/gcs
> [2] https://cloud.google.com/storage/docs/json_api/v1/status-codes#410_Gone
> [3] https://github.com/GoogleCloudDataproc/bigdata-interop/blob/master/gcs/CONFIGURATION.md
> (see fs.gs.outputstream.upload.chunk.size)
> [4] Stacktrace: https://gist.github.com/Xeli/da4c0af2c49c060139ad01945488e492