Hi all,

We've got a Flink job running on 1.8.0 which writes its state (RocksDB) to Google Cloud Storage [1]. We've noticed that jobs with a large amount of state (in the 500 GB range) are becoming *very* unstable, on the order of restarting once an hour or even more often.
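For context, the job is wired up roughly like the sketch below (bucket, path and interval are placeholders rather than our exact values; the gs:// scheme is served by the GCS Hadoop connector [1]):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToGcs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // RocksDB state backend, incremental checkpoints written to a gs:// path
        // handled by the GCS Hadoop connector [1]. Bucket and path are placeholders.
        env.setStateBackend(new RocksDBStateBackend("gs://<bucket>/flink/checkpoints", true));
        env.enableCheckpointing(60_000); // illustrative interval (ms), not our real setting
        // ... actual job topology goes here; trivial pipeline just so the sketch runs ...
        env.fromElements(1, 2, 3).print();
        env.execute("example-job");
    }
}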
The reason for this instability is that we run into "410 Gone" [4] errors from Google Cloud Storage. This indicates an upload (a write, from Flink's perspective) took place and the client wanted to resume the write [2], but could not find the file it needed to resume. My guess is this is because the previous attempt either failed, or perhaps because it uploads in chunks of 67 MB [3].

The library logs this line when it happens:

"Encountered status code 410 when accessing URL https://www.googleapis.com/upload/storage/v1/b/<project>/o?ifGenerationMatch=0&name=job-manager/15aa2391-a055-4bfd-8d82-e9e4806baa9c/8ae818761055cdc022822010a8b4a1ed/chk-52224/_metadata&uploadType=resumable&upload_id=AEnB2UqJwkdrQ8YuzqrTp9Nk4bDnzbuJcTlD5E5hKNLNz4xQ7vjlYrDzYC29ImHcp0o6OjSCmQo6xkDSj5OHly7aChH0JxxXcg. Delegating to response handler for possible retry."

We're kind of stuck on these questions:

* Is Flink capable of doing these retries?
* Does anyone successfully write their (RocksDB) state to Google Cloud Storage for bigger state sizes?
* Is it possible Flink renames or deletes certain directories before all flushes have completed, relying on an atomicity guarantee provided by HDFS that does not hold for other implementations? A race condition of sorts.

Basically, does anyone recognize this behavior?

Regards,

Richard Deurwaarder

[1] We use an HDFS implementation provided by Google: https://github.com/GoogleCloudDataproc/bigdata-interop/tree/master/gcs
[2] https://cloud.google.com/storage/docs/json_api/v1/status-codes#410_Gone
[3] https://github.com/GoogleCloudDataproc/bigdata-interop/blob/master/gcs/CONFIGURATION.md (see fs.gs.outputstream.upload.chunk.size)
[4] Stacktrace: https://gist.github.com/Xeli/da4c0af2c49c060139ad01945488e492