[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329672#comment-17329672 ]

Flink Jira Bot commented on FLINK-13808:

This issue has been labeled "stale-minor" for 7 days, so it is closed now. If you are still affected by this or would like to raise the priority of this ticket, please re-open it, remove the "auto-closed" label, and raise the ticket priority accordingly.

> Checkpoints expired by timeout may leak RocksDB files
> -----------------------------------------------------
>
>                 Key: FLINK-13808
>                 URL: https://issues.apache.org/jira/browse/FLINK-13808
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.8.0, 1.8.1
>         Environment: So far only reliably reproducible on a 4-node cluster with parallelism ≥ 100. But do try https://github.com/jcaesar/flink-rocksdb-file-leak
>            Reporter: Julius Michaelis
>            Priority: Minor
>              Labels: stale-minor
>
> A RocksDB state backend with HDFS checkpoints, with or without local recovery, may leak files in {{io.tmp.dirs}} on checkpoint expiry by timeout.
> If the size of a checkpoint exceeds what can be transferred within one checkpoint timeout, checkpoints will continue to fail forever. If this is combined with a quick rollover of SST files (e.g. due to a high density of writes), it may quickly exhaust available disk space (or memory, as /tmp is the default location).
> As a workaround, the jobmanager's REST API can be queried frequently for failed checkpoints, and the associated files deleted accordingly.
>
> I've tried investigating the cause a little bit, but I'm stuck:
> * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before completing.}} and similar gets printed, so
> * [{{abortExpired}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547-L549], so
> * [{{dispose}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L416], so
> * [{{cancelCaller}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488], so
> * [the canceler is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497] ([through one more layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]), so
> * [{{cleanup}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95] (possibly [not from {{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]), so
> * [{{cleanupProvidedResources}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162] (this is the indirection that made me give up), so
> * [this trace log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372] should be printed, but it isn't.
>
> I have some time to further investigate, but I'd appreciate help finding out where in this chain things go wrong.
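A minimal sketch of the workaround described above, as a standalone watchdog. The {{/jobs/:jobid/checkpoints}} REST endpoint exists, but the exact JSON field layout, the {{chk-<id>}} directory naming, and the scan depth below are assumptions that need to be checked against the actual cluster; the regex matching is deliberately crude and a real tool should use a JSON parser.

{code:java}
// Hypothetical watchdog: poll the JobManager REST API for failed checkpoints
// and delete their leftover local directories from io.tmp.dirs.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CheckpointLeakWatchdog {
    // Assumed addresses and paths; adjust to the cluster at hand.
    static final String REST = "http://jobmanager:8081";
    static final String JOB_ID = "ac7efce3457d9d73b0a4f775a6ef46f8";
    static final Path TMP_DIR = Paths.get("/tmp");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder(
                URI.create(REST + "/jobs/" + JOB_ID + "/checkpoints")).build();
        String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();

        // Crude extraction of the ids of failed checkpoints from the JSON response.
        Matcher m = Pattern
                .compile("\"id\"\\s*:\\s*(\\d+)[^}]*\"status\"\\s*:\\s*\"FAILED\"")
                .matcher(body);
        while (m.find()) {
            long id = Long.parseLong(m.group(1));
            // Collect leftover local directories named chk-<id> (assumed layout),
            // then delete them after the walk so the stream is not invalidated.
            List<Path> leaked;
            try (var dirs = Files.walk(TMP_DIR, 3)) {
                leaked = dirs.filter(p -> p.getFileName() != null
                                && p.getFileName().toString().equals("chk-" + id))
                        .collect(Collectors.toList());
            }
            leaked.forEach(CheckpointLeakWatchdog::deleteRecursively);
        }
    }

    static void deleteRecursively(Path dir) {
        try (var paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try { Files.delete(p); } catch (Exception ignored) { }
            });
        } catch (Exception ignored) { }
    }
}
{code}

Run periodically (e.g. from cron) on each node while the job is suffering from expiring checkpoints.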
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321470#comment-17321470 ]

Flink Jira Bot commented on FLINK-13808:

This issue and all of its Sub-Tasks have not been updated for 180 days, so it has been labeled "stale-minor". If you are still affected by this bug or are still interested in this issue, please give an update and remove the label. In 7 days the issue will be closed automatically.
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995375#comment-16995375 ]

Congxian Qiu (klion26) commented on FLINK-13808:

[~sewen] FYI, I have created FLINK-15236 to track this.
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994786#comment-16994786 ]

Stephan Ewen commented on FLINK-13808:

[~Caesar] I think a cancellation via timeout on the TaskManager makes sense, but I would assume it would be more of a safety net here as well.

[~klion26] The {{maxConcurrentCheckpoints}} limit you suggested would not take manually triggered checkpoints / savepoints into account.
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953299#comment-16953299 ]

Julius Michaelis commented on FLINK-13808:

As for something else that may be easier to implement than a cancellation message: why don't the taskmanagers also cancel checkpoints based on timeouts? (Of course, that doesn't catch other cancellation reasons, but…)
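A minimal sketch of that idea: schedule a local abort when the snapshot starts, using the same timeout the coordinator uses to expire the checkpoint. This assumes the TaskManager can be told the timeout along with the trigger message (an assumption; it may currently only be known to the JobManager), and {{SnapshotHandle}} with its cleanup method is a hypothetical stand-in, not Flink's actual API.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the TM's handle on a running async snapshot.
interface SnapshotHandle {
    boolean isDone();
    void cancelAndCleanupLocalFiles(); // would also delete the local temp directory
}

class TaskManagerSnapshotTimeout {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    void watch(SnapshotHandle snapshot, long timeoutMillis) {
        // Same timeout the CheckpointCoordinator uses to expire the checkpoint,
        // applied locally so temp files and threads cannot linger on the TM.
        timer.schedule(() -> {
            if (!snapshot.isDone()) {
                snapshot.cancelAndCleanupLocalFiles();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
{code}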
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946458#comment-16946458 ]

Congxian Qiu (klion26) commented on FLINK-13808:

Thanks for the reply, [~sewen]. I agree we should add a {{taskmanager.checkpoints.max-concurrent}} as a safety net, but this may be tricky if maxConcurrentCheckpoints > 1. For example, with maxConcurrentCheckpoints = 2: checkpoints 1 and 2 are running, checkpoint 2 is declined for some reason, and checkpoint 3 is triggered while on some TM the snapshots of checkpoints 1 and 2 are still running. If we now cancel the earliest checkpoint, we would choose checkpoint 1, but checkpoint 1 might still succeed.

So I propose to add the configuration {{taskmanager.checkpoints.max-concurrent}} with the following defaults (see the sketch after this list):
* If maxConcurrentCheckpoints = 1, the default {{taskmanager.checkpoints.max-concurrent}} is 1.
* If maxConcurrentCheckpoints > 1, the default {{taskmanager.checkpoints.max-concurrent}} is unlimited.

Yes, as you described, if the cancellation messages do not cancel quickly enough, no checkpoint may complete anymore. As described above, I'll create an issue to add the taskmanager.checkpoints.max-concurrent option. What do you think about this?

Aside from this, I reported another issue (FLINK-13861): if cancelling an expired checkpoint throws an exception, the JM will never trigger a checkpoint again.
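A minimal sketch of the limiting logic behind the proposal, assuming a value of 0 or less means "unlimited"; the configuration plumbing and the decline path are hypothetical, only the gate itself is shown.

{code:java}
import java.util.concurrent.Semaphore;

// Per-TaskManager gate for the proposed taskmanager.checkpoints.max-concurrent.
class ConcurrentSnapshotLimiter {
    private final Semaphore permits;

    ConcurrentSnapshotLimiter(int tmMaxConcurrent) {
        // tmMaxConcurrent <= 0 stands for "unlimited", the proposed default
        // when the job-level maxConcurrentCheckpoints is greater than 1.
        this.permits = new Semaphore(
                tmMaxConcurrent > 0 ? tmMaxConcurrent : Integer.MAX_VALUE);
    }

    /** Returns false if the new snapshot should be declined instead of started. */
    boolean tryStartSnapshot() {
        return permits.tryAcquire();
    }

    void snapshotFinished() {
        permits.release();
    }
}
{code}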
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939689#comment-16939689 ]

Stephan Ewen commented on FLINK-13808:

Thank you [~klion26] for the debugging and the suggestion to fix this.

We should definitely have a "cancellation" call from the JobManager to the TaskManagers that both cancels the asynchronous materialization threads and cleans up the checkpoint's temp files.

I am contemplating whether an additional {{taskmanager.checkpoints.max-concurrent}} value that limits the number of concurrent checkpoints on the TM might be a good safety net. Could the cases where such a limit is needed (because the cancellation messages do not cancel quickly enough) also regress into a situation where no checkpoints go through any more, because different TMs have different checkpoints lingering and there is always at least one TM declining a checkpoint? Would the cases where the cancellation messages are not enough be exactly those dangerous cases where the system regresses into a state where no checkpoints can happen any more?
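The shape such a cancellation call could take, as a hedged sketch; the gateway interface and the thread bookkeeping below are hypothetical, not Flink's actual TaskExecutor RPC.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical JM -> TM call: on expiry, abort the checkpoint on every TM so
// the materialization thread is stopped and the local temp files are removed.
interface CheckpointCancellationGateway {
    void abortCheckpointOnTaskManager(long checkpointId, Throwable cause);
}

class TaskManagerCheckpointHandler implements CheckpointCancellationGateway {
    private final Map<Long, Thread> materializationThreads = new ConcurrentHashMap<>();

    @Override
    public void abortCheckpointOnTaskManager(long checkpointId, Throwable cause) {
        Thread t = materializationThreads.remove(checkpointId);
        if (t != null) {
            t.interrupt(); // stop the asynchronous upload
        }
        deleteLocalTempFiles(checkpointId); // drop the checkpoint's directory from io.tmp.dirs
    }

    private void deleteLocalTempFiles(long checkpointId) {
        // Placeholder: the real cleanup would remove the local snapshot directory.
    }
}
{code}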
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939683#comment-16939683 ]

Stephan Ewen commented on FLINK-13808:

[~Caesar] Yes, each checkpoint spawns a background thread per task to asynchronously materialize the state snapshot. It is the same problem as with the temp files: too many concurrent and lingering checkpoints.
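To make the thread angle concrete: if each snapshot runs on its own background thread and lingering snapshots are never interrupted, threads accumulate until the OS refuses to create new ones, matching the {{OutOfMemoryError: unable to create new native thread}} reported in the comment below. A deliberately simplified, self-contained illustration, not Flink code:

{code:java}
// Each "checkpoint" spawns an uploader thread that never finishes (standing in
// for a stuck HDFS transfer). Nothing interrupts or joins the threads, so the
// JVM eventually fails with "unable to create new native thread".
class SnapshotThreadLeakDemo {
    public static void main(String[] args) throws InterruptedException {
        for (long checkpointId = 1; ; checkpointId++) {
            Thread uploader = new Thread(() -> {
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) { }
            }, "snapshot-" + checkpointId);
            uploader.start();
            Thread.sleep(10); // a new "checkpoint" every 10 ms
        }
    }
}
{code}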
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932283#comment-16932283 ]

Julius Michaelis commented on FLINK-13808:

Hm, is it possible that checkpoints also spawn additional threads that "leak"? I've just had several {{java.lang.OutOfMemoryError: unable to create new native thread}} in a similar scenario that led to a catastrophic failure of the entire cluster.
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923874#comment-16923874 ]

Congxian Qiu (klion26) commented on FLINK-13808:

After analyzing this with [~Caesar], this issue is caused by an IO/network problem. The local directory is only deleted in the following scenarios:
* the snapshot failed
* the snapshot succeeded (when local recovery is disabled)
* a newer completed checkpoint has been received (when local recovery is enabled)

In this issue, the snapshot is still ongoing (uploading the SST files) when the leaked files are observed, so the local directory is not deleted. I think FLINK-8871 helps with this issue. Aside from FLINK-8871, I want to propose two improvements (a sketch of the first follows below):
* Keep only {{maxConcurrentCheckpoint}} snapshots on the TM side. For example, if {{maxConcurrentCheckpoint}} is 2 and the current checkpoint is 5, we cancel all checkpoints before 4 (their complete/cancel RPC message may simply be late).
* Add some debug/trace logging to track the steps of the snapshot on the TM side, so users can tell where a snapshot currently is.

[~StephanEwen] [~carp84] What do you think about the above two improvements? If this is OK, I'll file issues and contribute them.
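A minimal sketch of the first improvement, with hypothetical bookkeeping (this is not Flink's snapshot registry): when checkpoint n starts locally and maxConcurrentCheckpoint is k, every snapshot with an id below n - k + 1 can no longer be completed by the coordinator and is cancelled.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

class StaleSnapshotReaper {
    private final int maxConcurrentCheckpoints;
    // Running snapshots on this TM, ordered by checkpoint id.
    private final ConcurrentSkipListMap<Long, Runnable> cancelers =
            new ConcurrentSkipListMap<>();

    StaleSnapshotReaper(int maxConcurrentCheckpoints) {
        this.maxConcurrentCheckpoints = maxConcurrentCheckpoints;
    }

    void onSnapshotStarted(long checkpointId, Runnable canceler) {
        cancelers.put(checkpointId, canceler);
        // With k = 2 and checkpoint 5 starting, this cancels everything
        // before checkpoint 4, exactly as in the example above.
        long oldestLive = checkpointId - maxConcurrentCheckpoints + 1;
        Map<Long, Runnable> stale = cancelers.headMap(oldestLive);
        stale.values().forEach(Runnable::run);
        stale.clear();
    }

    void onSnapshotFinished(long checkpointId) {
        cancelers.remove(checkpointId);
    }
}
{code}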
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918254#comment-16918254 ]

Congxian Qiu (klion26) commented on FLINK-13808:

Thanks for the ping [~carp84], I'll take a look at this now.
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918253#comment-16918253 ]

Yu Li commented on FLINK-13808:

Thanks for the ping Stephan, just noticed. [~klion26] Please take a look here, thanks.
[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files
[ https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913079#comment-16913079 ]

Stephan Ewen commented on FLINK-13808:

[~carp84] FYI