Hi all,

We've got a Flink job running on 1.8.0 which writes its state (RocksDB) to Google Cloud Storage [1]. We've noticed that jobs with a large amount of state (in the 500 GB range) are becoming *very* unstable, on the order of restarting once an hour or even more often.
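For context, the job is wired up roughly like the sketch below (bucket, path and interval are placeholders rather than our exact values; the gs:// scheme is served by the GCS Hadoop connector [1]):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToGcs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // RocksDB state backend, incremental checkpoints written to a gs:// path
        // handled by the GCS Hadoop connector [1]. Bucket and path are placeholders.
        env.setStateBackend(new RocksDBStateBackend("gs://<bucket>/flink/checkpoints", true));
        env.enableCheckpointing(60_000); // illustrative interval (ms), not our real setting
        // ... actual job topology goes here; trivial pipeline just so the sketch runs ...
        env.fromElements(1, 2, 3).print();
        env.execute("example-job");
    }
}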
The reason for this instability is that we run into "410 Gone" [4] errors from Google Cloud Storage. This indicates an upload (a write, from Flink's perspective) took place and the client wanted to resume the write [2], but could not find the file it needed to resume. My guess is this is because the previous attempt either failed, or perhaps because it uploads in chunks of 67 MB [3].

The library logs this line when it happens:

"Encountered status code 410 when accessing URL https://www.googleapis.com/upload/storage/v1/b/<project>/o?ifGenerationMatch=0&name=job-manager/15aa2391-a055-4bfd-8d82-e9e4806baa9c/8ae818761055cdc022822010a8b4a1ed/chk-52224/_metadata&uploadType=resumable&upload_id=AEnB2UqJwkdrQ8YuzqrTp9Nk4bDnzbuJcTlD5E5hKNLNz4xQ7vjlYrDzYC29ImHcp0o6OjSCmQo6xkDSj5OHly7aChH0JxxXcg. Delegating to response handler for possible retry."

We're kind of stuck on these questions:

* Is Flink capable of doing these retries?
* Does anyone successfully write their (RocksDB) state to Google Cloud Storage for bigger state sizes?
* Is it possible Flink renames or deletes certain directories before all flushes have completed, relying on an atomicity guarantee provided by HDFS that does not hold for other implementations? A race condition of sorts.

Basically, does anyone recognize this behavior?

Regards,

Richard Deurwaarder

[1] We use an HDFS implementation provided by Google: https://github.com/GoogleCloudDataproc/bigdata-interop/tree/master/gcs
[2] https://cloud.google.com/storage/docs/json_api/v1/status-codes#410_Gone
[3] https://github.com/GoogleCloudDataproc/bigdata-interop/blob/master/gcs/CONFIGURATION.md (see fs.gs.outputstream.upload.chunk.size)
[4] Stacktrace: https://gist.github.com/Xeli/da4c0af2c49c060139ad01945488e492