[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197658#comment-17197658 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

Ah, sure! I did not notice that he'll be back that soon. Thanks.

> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
>                 Key: FLINK-17073
>                 URL: https://issues.apache.org/jira/browse/FLINK-17073
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Assignee: Etienne Chauchot
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as
> CPU cores. This change might have caused the decline in completed checkpoint
> discard throughput. This suspicion needs to be validated before trying to fix
> it!
> [1]
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
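To illustrate the suspicion in the issue description: a fixed-size thread pool backed by an unbounded queue accepts every submitted task, so when discards complete more slowly than they are submitted, the backlog (and the memory it holds) keeps growing. The following is a minimal Java sketch under that assumption only. The class and method names are hypothetical and are not Flink code, and the latch merely stands in for a slow checkpoint deletion.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo class; not part of Flink. */
class UnboundedQueueBacklogDemo {

    /**
     * Submits {@code tasks} blocked "discard" tasks to a fixed pool of {@code threads}
     * workers and returns how many of them are left waiting in the unbounded queue.
     */
    static int backlogAfterSubmitting(int threads, int tasks) {
        // Same shape as Executors.newFixedThreadPool: fixed size, unbounded queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(Math.min(threads, tasks));
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // stands in for a slow checkpoint deletion
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // every worker thread is now busy
            return pool.getQueue().size(); // tasks that only occupy memory
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }
}
```

With 2 threads and 10 submitted tasks, 8 tasks remain queued while the workers are busy. Nothing in this setup pushes back on the submitter, which is the behaviour the discussion in this thread tries to address.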
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197547#comment-17197547 ]

Piotr Nowojski commented on FLINK-17073:
----------------------------------------

Hi [~echauchot], [sorry for the delay|https://twitter.com/schneems/status/1191844272682786822]. [~roman_khachatryan] should be back on Monday. Can we postpone the review until then? I think it would be most efficient to wait for him to finish this off. I can ping [~roman_khachatryan] to take a look at it as soon as he is back online.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197538#comment-17197538 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

Hi, I addressed all the comments in the PR and added an integration test. Is someone available to finish the review and merge while Roman is on vacation?
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177057#comment-17177057 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

Hi all, I'll be off for 2 weeks starting tonight, so I'll need to wait until my return to tackle these subjects.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175585#comment-17175585 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

[@ifndef-SleePy|https://github.com/ifndef-SleePy] [@rkhachatryan|https://github.com/rkhachatryan] I applied all your comments from the design doc to the PR. PTAL.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170657#comment-17170657 ]

Roman Khachatryan commented on FLINK-17073:
-------------------------------------------

Hi Etienne,

You're right, I missed option 3.b in your design, which I think is implemented in the PR. However, I'm not sure that we can use it. I've commented in the design doc, please take a look.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170643#comment-17170643 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

Hi Roman, thanks for your feedback. Yes, indeed, it diverges in some points because while coding I worked out a better split of responsibilities, coupling, etc. The reasons are explained in the comments in the doc. Please take a look at them and tell me what you think.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170150#comment-17170150 ]

Roman Khachatryan commented on FLINK-17073:
-------------------------------------------

Thanks for the PR [~echauchot]. I found that it significantly diverges both from [your design doc|#comment-17144726] and [my proposal above|#comment-17162168], in that it doesn't have a CheckpointCleaner and it doesn't trigger a checkpoint request upon discard completion.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170015#comment-17170015 ]

Biao Liu commented on FLINK-17073:
----------------------------------

[~echauchot], sorry for the late reply. Thanks for pushing this!

I'm OK with [~roman_khachatryan]'s plan. It's indeed simpler to implement in some aspects. In my plan, we would have to consider how to avoid the synchronous cleaning that you mentioned, because in the near future {{CheckpointCoordinator}} will no longer be guarded by a big lock.

{quote}...we can drop new checkpoint requests when there are too many checkpoints to clean...{quote}

I think we should take care of the cleaning for both successful and failed checkpoints. I have left some comments in the doc.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168891#comment-17168891 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

[~roman_khachatryan] [~SleePy] I just submitted the PR: [https://github.com/apache/flink/pull/13040]

Also, please read the comments on the design doc's implementation plan for an explanation of the implementation choices (responsibilities, interfaces, threshold, resume event, ...).
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166501#comment-17166501 ]

Roman Khachatryan commented on FLINK-17073:
-------------------------------------------

Thanks for your analysis [~echauchot]. Sure, go ahead!
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166480#comment-17166480 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

[~roman_khachatryan] When [~SleePy] and I discussed this in [the design doc|https://docs.google.com/document/d/1q0y0aWlJMoUWNW7jjsM8uWfHsy2dM6YmmcmhpQzgLMA/edit?usp=sharing], the idea was to wait until the last checkpoint was cleaned before accepting another one (this is what we called making cleaning part of checkpoint processing). Thus, checking only the existing number of pending checkpoints was enough (no need for a new queue) to foresee a flood of checkpoints to clean.

But the solution you propose (managing the queue of checkpoints to clean and monitoring its size) seems even simpler to me: it avoids having to sync normal checkpointing and checkpoint cleaning. As you said, when we choose a checkpoint trigger request to execute (*CheckpointRequestDecider.chooseRequestToExecute*), we can drop new checkpoint requests when there are too many checkpoints to clean and thus regulate the whole checkpointing system. Syncing cleaning and checkpointing might not be necessary for this regulation, you're right.

If you don't mind, I'll go for this implementation proposal in the design doc.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166310#comment-17166310 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

[~SleePy] Sure, I'll update the Google doc to add an implementation plan.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166244#comment-17166244 ]

Biao Liu commented on FLINK-17073:
----------------------------------

BTW [~echauchot], before writing any code, it would be great to write an implementation plan first. That's a better place to discuss implementation details. I heard that some other people are also interested in this issue, and a plan would help them understand what is happening. Besides that, there will be some other PRs on {{CheckpointCoordinator}} at the same time. We have to make sure there are no big conflicts between these changes.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165710#comment-17165710 ]

Roman Khachatryan commented on FLINK-17073:
-------------------------------------------

Thanks [~echauchot], [~SleePy]!

[~SleePy], sure, I just shared my view of how it can be implemented.

[~echauchot],
for (1) I don't think having a separate queue is an issue, rather the opposite (the class + thread manages its own work queue).
for (2) I think checking the aforementioned queue size (+ number of deletions in progress) is enough, isn't it?
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165518#comment-17165518 ]

Biao Liu commented on FLINK-17073:
----------------------------------

To [~roman_khachatryan], thanks for the nice suggestions!

{quote}I think an alternative (or complementary) temporary solution is to use a bounded queue when creating ioExecutor.{quote}

I'm not a fan of this temporary solution. We have to consider how to treat the callers that launch asynchronous IO operations through {{ioExecutor}} when the queue is full. Should they fail, or wait until some space is available? I'm afraid it's not a small amount of work to review all the places that call {{ioExecutor}}. If we want a temporary solution, maybe we could just increase the thread count.

Regarding the long-term solution: Etienne and I have actually not discussed many of the implementation details. I just gave some suggestions to make sure it's going in the right direction. It's cool to have your detailed suggestions. They may help a lot for a contributor who is not familiar with this part. I just thought we don't have to discuss too many details here. It might be better to give the contributor more freedom. We could pay more attention during code review to guarantee it's correct and reasonable.

BTW, just a tiny suggestion: code refactoring is not necessary, and we should focus on solving the issue first. After that, we could consider whether some refactoring would make the code more readable or elegant.

To [~echauchot], besides the implementation, is there any question about the plan? Please feel free to ask anything that you don't understand.
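The bounded-queue trade-off raised in the comment above ("fail or wait") can be made concrete with the standard {{java.util.concurrent}} classes. The sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, it uses a single worker thread for determinism, and it does not reflect how Flink actually creates {{ioExecutor}}. It shows the "make them fail" option via {{ThreadPoolExecutor.AbortPolicy}}, where every caller must be prepared to handle a {{RejectedExecutionException}}.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo class; not part of Flink. */
class BoundedQueueRejectionDemo {

    /**
     * Submits {@code tasks} blocked tasks to a single-threaded pool whose queue holds
     * at most {@code capacity} entries, and returns how many submissions were rejected.
     */
    static int countRejected(int capacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(capacity),
                new ThreadPoolExecutor.AbortPolicy()); // reject when the queue is full
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // stands in for a slow IO operation
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // the caller has to decide what to do with this failure
                }
            }
            return rejected;
        } finally {
            release.countDown(); // let the accepted tasks finish
            pool.shutdown();
        }
    }
}
```

With a capacity of 2 and 5 submissions, one task runs, two are queued and two are rejected. The other option would be a policy such as {{ThreadPoolExecutor.CallerRunsPolicy}}, which runs the task in the submitting thread and thereby slows the caller down instead of failing it. Either way, each call site of the executor has to tolerate the new behaviour, which is the review effort the comment warns about.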
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164623#comment-17164623 ]

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

Hi [~roman_khachatryan], thanks for the suggestions! Overall this is what I intended to do, modulo these minor things:

In 1. I meant to use the _CompletedCheckpoints_ deque to keep track of the checkpoints to clean and avoid adding a new queue.

In 2. Yes indeed, it needs to call _executeQueuedRequest_, but it also needs not to call it when the previous checkpoint cleaning is not done (I still need to figure out how to sync cleaning with the work in CheckpointCoordinator), so that checkpoint cleaning becomes part of the checkpoint process and not a side fire-and-forget process. This behavior will be configurable to avoid lowering the checkpoint rate when the CP cleaning rate is not a problem. Once cleaning is part of the standard checkpointing process, checking the in-flight checkpoints will tell how many checkpoints potentially need cleaning, and if there are too many, any new CP trigger request is dropped.

In 3. Yes, that is what I meant by "drop any new CP trigger request" above.

In 4. I'm not clear yet about concurrency in checkpointing.
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162168#comment-17162168 ] Roman Khachatryan commented on FLINK-17073: --- As for the long-term solution, I'd propose the following: # Extract *ZooKeeperCompletedCheckpointStore.tryRemoveCompletedCheckpoint* (along with *executor*) to a new class, e.g. *CheckpointCleaner*, that maintains a queue of checkpoints to remove # On removal completion, it calls *CheckpointCoordinator*.timer.execute(*executeQueuedRequest*) # In *CheckpointRequestDecider.chooseRequestToExecute*, check *CheckpointCleaner.numberOfCheckpointsToRemove* and return if it's greater than the threshold (see below) # *CheckpointCleaner* reports this count in a thread-safe manner (performance isn't an issue here) This way, we ensure these properties: # if a checkpoint can't be started, it's queued (if the queue has space) but not started # once a subsumed checkpoint is removed, we check the queue and, if possible, start the checkpoint It's possible that we check the queue twice (1st in chooseRequestToExecute, 2nd in CheckpointCleaner), but that's OK, we'll just execute the next request if possible. Regarding adding an additional configuration parameter for the threshold, I don't see much value in it. Conceptually, we don't want to proceed as long as there are checkpoints to remove from the previous completion. So we can use max-concurrent-checkpoints as a threshold (maybe multiplied by some constant factor to account for spikes and savepoints). What do you think [~echauchot], [~SleePy]?
> Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
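The long-term proposal in the comment above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Flink's implementation: `CheckpointCleanerSketch` and `RequestDeciderSketch` are hypothetical, heavily simplified stand-ins for the proposed *CheckpointCleaner* and for *CheckpointRequestDecider*, checkpoint requests are plain strings, and the threshold is simply passed to the constructor.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

/** Hypothetical stand-in for the proposed CheckpointCleaner (not a Flink class). */
final class CheckpointCleanerSketch {
    private final Executor executor;
    // Thread-safe count of checkpoints whose removal has not finished yet.
    private final AtomicInteger numberOfCheckpointsToRemove = new AtomicInteger();

    CheckpointCleanerSketch(Executor executor) {
        this.executor = executor;
    }

    /** Schedules one removal; postCleanup runs once the removal has finished. */
    void cleanCheckpoint(Runnable discardAction, Runnable postCleanup) {
        numberOfCheckpointsToRemove.incrementAndGet();
        executor.execute(() -> {
            try {
                discardAction.run();
            } finally {
                numberOfCheckpointsToRemove.decrementAndGet();
                // In the proposal this would be
                // CheckpointCoordinator.timer.execute(executeQueuedRequest).
                postCleanup.run();
            }
        });
    }

    int getNumberOfCheckpointsToRemove() {
        return numberOfCheckpointsToRemove.get();
    }
}

/** Hypothetical, simplified stand-in for CheckpointRequestDecider. */
final class RequestDeciderSketch {
    private final Queue<String> queuedRequests = new ArrayDeque<>();
    private final IntSupplier numberOfCheckpointsToRemove;
    private final int threshold;

    RequestDeciderSketch(IntSupplier numberOfCheckpointsToRemove, int threshold) {
        this.numberOfCheckpointsToRemove = numberOfCheckpointsToRemove;
        this.threshold = threshold;
    }

    synchronized void enqueue(String request) {
        queuedRequests.add(request);
    }

    /** Returns the next request, or empty while too many removals are pending. */
    synchronized Optional<String> chooseRequestToExecute() {
        if (numberOfCheckpointsToRemove.getAsInt() > threshold) {
            return Optional.empty(); // the request stays queued, nothing is started
        }
        return Optional.ofNullable(queuedRequests.poll());
    }
}
```

With this shape, a request that cannot start stays in the queue, and each finished removal re-checks the queue, which matches the two properties listed in the comment. Checking the queue twice is harmless because `chooseRequestToExecute` just returns empty when nothing is eligible.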
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162112#comment-17162112 ] Roman Khachatryan commented on FLINK-17073: --- I think an alternative (or complementary) temporary solution is to use a bounded queue when creating ioExecutor. This way, we solve this issue and possibly others that the design doc doesn't account for. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
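As an illustration of the bounded-queue idea in the comment above, here is a minimal sketch using only the JDK. It is not how Flink creates its ioExecutor: the class name, the factory method, and the choice of `CallerRunsPolicy` as the overflow behaviour are assumptions made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical factory for an I/O executor with a bounded work queue. */
final class BoundedIoExecutorSketch {

    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads,
                threads,                 // fixed-size pool, like a FixedThreadPool
                0L,
                TimeUnit.MILLISECONDS,
                // At most queueCapacity tasks can wait, so the backlog cannot grow without limit.
                new ArrayBlockingQueue<>(queueCapacity),
                // Assumption: when the queue is full, the submitting thread runs the task itself,
                // which slows down the producer instead of throwing or dropping work.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

`CallerRunsPolicy` is only one option. `AbortPolicy` (the JDK default) would instead throw `RejectedExecutionException`, which the submitting code would then have to handle.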
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161882#comment-17161882 ] Etienne Chauchot commented on FLINK-17073: -- thanks guys ! > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161704#comment-17161704 ] Zhijiang commented on FLINK-17073: -- I have assigned it to [~echauchot] > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161657#comment-17161657 ] Biao Liu commented on FLINK-17073: -- [~echauchot], sorry, I don't have permission to assign issues. [~pnowojski], could you help assign the ticket to him? > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161316#comment-17161316 ] Etienne Chauchot commented on FLINK-17073: -- [~SleePy] can you assign this ticket to me please ? > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151881#comment-17151881 ] Etienne Chauchot commented on FLINK-17073: -- [~pnowojski] no problem. Sorry for bothering you during your days off :) . [~SleePy] thanks for reviewing this doc, I'll look at your comments. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151618#comment-17151618 ] Biao Liu commented on FLINK-17073: -- Hi [~echauchot], thanks for doing so much. I left a couple of comments in design doc. The second proposal seems to be a reliable and light solution :) > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150791#comment-17150791 ] Piotr Nowojski commented on FLINK-17073: Hey [~echauchot], I've been OoO for the last two weeks and will be back on July 13th. Can we sync offline on this issue? > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144733#comment-17144733 ] Etienne Chauchot commented on FLINK-17073: -- I'm always glad to help! No problem about syncing with [~pnowojski] and [~SleePy]. Not a trivial problem indeed, but I figured that it could be a good way to learn quite a lot about Flink internals :) > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144726#comment-17144726 ] Etienne Chauchot commented on FLINK-17073: -- [~trohrmann], I wrote [this|https://s.apache.org/checkpoint-backpressure] FLIP-style design document for checkpoint backpressure. Can you tell me what you think? Also, I don't have the rights to create FLIP design documents in the Flink Confluence workspace, so I wrote the FLIP in a Google doc. Can you give me the rights? > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144715#comment-17144715 ] Till Rohrmann commented on FLINK-17073: --- Yes, your help is highly appreciated. Given that this is not a trivial problem I would suggest to sync with [~pnowojski] and [~SleePy] about the next steps. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143678#comment-17143678 ] Etienne Chauchot commented on FLINK-17073: -- [~trohrmann] I'm thinking about this issue, I'll start a first design doc for the checkpoint backpressure system. I'll send it there and on the ML. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116660#comment-17116660 ] Etienne Chauchot commented on FLINK-17073: -- Anyway, I'd like to take part in the design discussions regarding checkpoint backpressure in order to learn about these topics. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108315#comment-17108315 ] Etienne Chauchot commented on FLINK-17073: -- ok, fair enough, I'll pick another one. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.11.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108305#comment-17108305 ] Till Rohrmann commented on FLINK-17073: --- Hi [~echauchot], I think what needs to happen is that we backpressure the whole checkpointing mechanism if the cleanup cannot keep up with it. This means that we don't trigger new checkpoints. This is far from trivial to realize, though, and on top of it, it needs to be properly designed. I'm not entirely sure whether this is really a good task to get started with. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.11.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108270#comment-17108270 ] Etienne Chauchot commented on FLINK-17073: -- Hi, I just started contributing to Flink. I'd like to take a look at this subject as it can be a good introduction to Flink architecture IMHO. As the temporary workaround discussed above has been implemented [here|https://github.com/apache/flink/pull/11957], maybe it is time to tackle the above subject. One thing I wonder is: if we want to limit the number of CompletedCheckpoints submitted to the IOExecutor for cleaning, what happens if _ZooKeeperCompletedCheckpointStore_ tries to submit a new CompletedCheckpoint when the limit has already been reached? Should it delay the submission, waiting for the current number of submitted tasks to decrease? > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.11.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it!
> [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
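The question in the comment above (what to do when the limit of submitted cleanup tasks is reached) has at least two possible answers, sketched below with a JDK `Semaphore`. This is only an illustration: `LimitedCleanupSubmitter` and its methods are hypothetical names, and nothing here is taken from _ZooKeeperCompletedCheckpointStore_.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Semaphore;

/** Hypothetical helper that caps the number of cleanup tasks handed to an executor. */
final class LimitedCleanupSubmitter {
    private final Executor ioExecutor;
    private final Semaphore permits;

    LimitedCleanupSubmitter(Executor ioExecutor, int maxInFlight) {
        this.ioExecutor = ioExecutor;
        this.permits = new Semaphore(maxInFlight);
    }

    /** Option 1: delay the submission until an earlier cleanup has finished. */
    void submitBlocking(Runnable cleanup) throws InterruptedException {
        permits.acquire();
        dispatch(cleanup);
    }

    /** Option 2: do not wait; tell the caller that the limit is reached. */
    boolean trySubmit(Runnable cleanup) {
        if (!permits.tryAcquire()) {
            return false;
        }
        dispatch(cleanup);
        return true;
    }

    private void dispatch(Runnable cleanup) {
        try {
            ioExecutor.execute(() -> {
                try {
                    cleanup.run();
                } finally {
                    permits.release(); // frees one slot for the next submission
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // the executor refused the task, so give the permit back
            throw e;
        }
    }
}
```

Option 1 blocks whichever thread completes the checkpoint, which may be undesirable in a coordinator thread. Option 2 leaves the decision to the caller, for example to hold back new checkpoint triggers.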
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098770#comment-17098770 ] Piotr Nowojski commented on FLINK-17073: As it's a rare issue that in this particular case will have a workaround (increasing number of threads in the thread pool), I'm reducing the priority. Also because of other independent efforts in the {{CheckpointCoordinator}} code, it's unlikely the bug fix will be back-ported to previous releases. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Major > Fix For: 1.11.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094277#comment-17094277 ] Till Rohrmann commented on FLINK-17073: --- I would like to keep this ticket and FLINK-17248 separate. The reason is that I believe the latter will just mitigate the problem, and the true bug is that we don't limit the number of concurrent checkpoint cleanup tasks. In that sense, it has always been broken. Moreover, this ticket is not about a regression we want to fix. The underlying problem only surfaced due to changing the executor. What we can do is say that FLINK-17421 will fix this bug. Hence, once FLINK-17421 is closed, we can close this bug issue. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Fix For: 1.10.1, 1.11.0, 1.9.4 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093738#comment-17093738 ] Piotr Nowojski commented on FLINK-17073: I haven't analysed the issue, so I'm not sure whether backpressuring is indeed the right thing to do. Assuming it is, it's not a bug fix but an improvement/new feature, so I've created a separate ticket for it: FLINK-17421. Btw, IMO FLINK-17248 is a duplicate of this ticket, as it fixes the regression reported in this issue. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Fix For: 1.10.1, 1.11.0, 1.9.4 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084149#comment-17084149 ] Yun Tang commented on FLINK-17073: -- I prefer the configuration solution, which keeps the behavior the same as before. The current Flink architecture cannot completely prevent this problem if checkpoints are created faster than previous checkpoints are deleted. Increasing the pool size cannot prevent this either; it only makes it less likely. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Fix For: 1.9.3, 1.10.1, 1.11.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084107#comment-17084107 ] Till Rohrmann commented on FLINK-17073: --- Maybe we could create a related issue that introduces the configuration option and keep this one for the proper fix, which might entail throttling the checkpoint throughput based on the cleanup backlog. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Fix For: 1.9.3, 1.10.1, 1.11.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
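For illustration, throttling checkpoints based on the cleanup backlog could look roughly like the sketch below. This is not Flink code: {{CleanupBacklogGate}} and its methods are hypothetical names, and it is assumed that the component triggering checkpoints would consult the gate before starting a new checkpoint.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper, not a Flink class: tracks pending cleanups and gates new checkpoints. */
final class CleanupBacklogGate {
    private final int maxBacklog;
    private final AtomicInteger pendingCleanups = new AtomicInteger();

    CleanupBacklogGate(int maxBacklog) {
        if (maxBacklog < 1) {
            throw new IllegalArgumentException("maxBacklog must be >= 1");
        }
        this.maxBacklog = maxBacklog;
    }

    /** Call when a completed checkpoint has been handed over for cleanup. */
    void cleanupScheduled() {
        pendingCleanups.incrementAndGet();
    }

    /** Call when a cleanup task has finished, successfully or not. */
    void cleanupFinished() {
        pendingCleanups.decrementAndGet();
    }

    /** A new checkpoint may only be triggered while the backlog is below the limit. */
    boolean mayTriggerCheckpoint() {
        return pendingCleanups.get() < maxBacklog;
    }
}
```

With such a gate, a checkpoint that cannot be triggered is skipped or postponed instead of adding yet another cleanup task to the queue, so the backlog stays bounded by {{maxBacklog}}.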
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084096#comment-17084096 ] Till Rohrmann commented on FLINK-17073: --- I agree, but this would require bigger architectural changes. In the meantime, I would suggest making the number of IO threads configurable for the user. That way, users can work around this problem until the proper fix has been put in place. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Fix For: 1.9.3, 1.10.1, 1.11.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
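As a rough sketch of the workaround, the IO pool size could be read from the configuration and fall back to the number of CPU cores when it is unset. The class {{IoPoolConfig}} and the key {{example.io-pool.size}} are made up for this example and are not the actual Flink option; plain {{java.util.Properties}} stands in for the Flink configuration.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper, not a Flink class: resolves the IO pool size from configuration. */
final class IoPoolConfig {
    /** Made-up key for illustration only. */
    static final String POOL_SIZE_KEY = "example.io-pool.size";

    private IoPoolConfig() {}

    /** Returns the configured pool size, or the number of cores if the key is not set. */
    static int resolvePoolSize(Properties config, int availableCores) {
        String raw = config.getProperty(POOL_SIZE_KEY);
        if (raw == null) {
            return availableCores;
        }
        int size = Integer.parseInt(raw.trim());
        if (size < 1) {
            throw new IllegalArgumentException(POOL_SIZE_KEY + " must be >= 1, but was " + size);
        }
        return size;
    }

    /** Creates the fixed-size IO executor using the resolved pool size. */
    static ExecutorService createIoExecutor(Properties config) {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(resolvePoolSize(config, cores));
    }
}
```

Setting the key to 64 would restore roughly the parallelism of the former {{ForkJoinPool}}; leaving it unset keeps the current behavior of one thread per CPU core.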
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084058#comment-17084058 ] Chesnay Schepler commented on FLINK-17073: -- It seems the underlying issue is that we aren't limiting the number of tasks that can be queued up (a limit which I _think_ would implicitly slow down checkpointing, since it would delay the completion of a pending checkpoint). Ignoring potential architectural problems, this should make the system far more resilient to these kinds of issues than an increase of the pool size would. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Priority: Critical > Fix For: 1.9.3, 1.10.1, 1.11.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
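To make the idea of limiting queued tasks concrete, here is a minimal sketch using only the JDK. {{BoundedExecutor}} is a hypothetical wrapper, not a Flink class, and it assumes that blocking the submitting thread is acceptable, which is exactly the implicit slow-down described above.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Semaphore;

/** Hypothetical wrapper, not a Flink class: caps the number of queued plus running tasks. */
final class BoundedExecutor {
    private final Executor delegate;
    private final Semaphore permits;

    BoundedExecutor(Executor delegate, int maxPending) {
        this.delegate = delegate;
        this.permits = new Semaphore(maxPending);
    }

    /** Blocks the caller while maxPending tasks are already queued or running. */
    void submit(Runnable task) throws InterruptedException {
        permits.acquire();
        try {
            delegate.execute(() -> {
                try {
                    task.run();
                } finally {
                    // Free the slot even if the task throws.
                    permits.release();
                }
            });
        } catch (RuntimeException e) {
            // The delegate rejected the task, so it will never release the permit itself.
            permits.release();
            throw e;
        }
    }

    /** Number of additional tasks that can be submitted without blocking. */
    int freeSlots() {
        return permits.availablePermits();
    }
}
```

Because the submitter waits for a free slot, memory held by pending cleanup tasks is bounded by {{maxPending}} regardless of the pool size. Whether blocking, rejecting, or postponing is the right reaction when the limit is hit is a design decision that this sketch does not settle.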