[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-09-17 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197658#comment-17197658
 ] 

Etienne Chauchot commented on FLINK-17073:
--

Ah, sure !  I did not notice, he'll be back that soon.

Thanks.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-09-17 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197547#comment-17197547
 ] 

Piotr Nowojski commented on FLINK-17073:


Hi [~echauchot], [sorry for the 
delay|https://twitter.com/schneems/status/1191844272682786822]. 
[~roman_khachatryan] should be back on Monday, can we postpone the review until 
then? I think it would be the most efficient to wait for him to finish this 
off. I can ping [~roman_khachatryan] to take a look at it as soon as he is back 
online.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-09-17 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197538#comment-17197538
 ] 

Etienne Chauchot commented on FLINK-17073:
--

Hi, I addressed all the comments in the PR and added an integration test. Is 
someone available for finishing review/merging while Roman is on vacation ?

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-08-13 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177057#comment-17177057
 ] 

Etienne Chauchot commented on FLINK-17073:
--

Hi all, 

I'll be off for 2 weeks starting tonight, so I'll need to wait until my return 
to tackle these subjects.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-08-11 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175585#comment-17175585
 ] 

Etienne Chauchot commented on FLINK-17073:
--

[@ifndef-SleePy|https://github.com/ifndef-SleePy] 
[@rkhachatryan|https://github.com/rkhachatryan] I applied all your comments of 
the design doc to the PR PTAL

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-08-04 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170657#comment-17170657
 ] 

Roman Khachatryan commented on FLINK-17073:
---

Hi Etienne, 

You're right, I missed option 3.b in your design, which I think is implemented 
in the PR.

However, I'm not sure that we can use it. I've commented in the design doc, 
please take a look.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-08-04 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170643#comment-17170643
 ] 

Etienne Chauchot commented on FLINK-17073:
--

Hi Roman, thanks for your feedback. Yes indeed it diverges in some points 
because while coding I figured out better responsibilities, coupling etc... The 
reasons are explained in the comments in the doc. Please take a look at them 
and tell me what you think.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-08-03 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170150#comment-17170150
 ] 

Roman Khachatryan commented on FLINK-17073:
---

Thanks for the PR [~echauchot] ,

I found that it significantly diverges both from [your design 
doc|#comment-17144726] and [my proposal above|#comment-17162168],
in that it doesn't have CheckpointCleaner and it doesn't trigger checkpoint 
request upon discard completion.

 

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-08-03 Thread Biao Liu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170015#comment-17170015
 ] 

Biao Liu commented on FLINK-17073:
--

[~echauchot], sorry for the late reply. Thanks for pushing this!

I'm OK with [~roman_khachatryan]'s plan. It's simpler to implement in some 
aspects indeed. In my plan, we have to consider how to avoid synchronous 
cleaning which you mentioned. Because in the near future, 
{{CheckpointCoordinator}} would be no big lock anymore. 

{quote}...we can drop new checkpoint requests when there are too many 
checkpoints to clean...{quote}
I think we should take care of the cleaning for both successful checkpoint and 
failed checkpoint. 

I have left some comments in the doc.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-31 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168891#comment-17168891
 ] 

Etienne Chauchot commented on FLINK-17073:
--

[~roman_khachatryan] [~SleePy] I just submitted the PR: 
[https://github.com/apache/flink/pull/13040] also please read the design doc's 
implementation plan comments for an explanation of the impl choices 
(responsibilities, interfaces, threshold, resume event ...)

 

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-28 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166501#comment-17166501
 ] 

Roman Khachatryan commented on FLINK-17073:
---

Thanks for your analysis [~echauchot].
Sure, go ahead!

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-28 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166480#comment-17166480
 ] 

Etienne Chauchot commented on FLINK-17073:
--

[~roman_khachatryan] 

When [~SleePy] and I discussed in [the deisgn 
doc|https://docs.google.com/document/d/1q0y0aWlJMoUWNW7jjsM8uWfHsy2dM6YmmcmhpQzgLMA/edit?usp=sharing],
 the idea was to wait until last checkpoint was cleaned before accepting 
another (that is what we called make cleaning part of checkpoint processing). 
Thus, checking only existing number of pending checkpoints was enough (no need 
for a new queue) to foresee an flood of checkpoints to clean. 

But the solution you propose (managing the queue of the checkpoints to clean 
and monitor its size) seems even simpler to me: it avoids having to sync normal 
checkpointing and checkpoint cleaning:

As you said, when we chose a checkpoint trigger request to execute 
(*CheckpointRequestDecider.chooseRequestToExecute*), we can drop new checkpoint 
requests when there are too many checkpoints to clean and thus regulate the 
whole checkpointing system. Syncing cleaning and checkpointing might not be 
necessary for this regulation, you're right.

If you don't mind, I'll go for this implementation proposal in the design doc.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-28 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166310#comment-17166310
 ] 

Etienne Chauchot commented on FLINK-17073:
--

[~SleePy] sure, I'll update the google doc to add impl plan.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-28 Thread Biao Liu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166244#comment-17166244
 ] 

Biao Liu commented on FLINK-17073:
--

BTW [~echauchot], before writing any codes, it would be great to write an 
implementation plan first. That's a better place to discuss implementation 
detail.
I heard some other guys are also interested in this issue. It would be helpful 
fo them to understand what is happening. Besides that, there would be some 
other PRs on {{CheckpointCoordinator}} at the same time. We have to make sure 
there would be no big conflict between these changes.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-27 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165710#comment-17165710
 ] 

Roman Khachatryan commented on FLINK-17073:
---

Thanks [~echauchot], [~SleePy]!

[~SleePy], sure, I just shared my view of how it can be implemented.

[~echauchot], 

for (1) I don't think having a separate queue is an issue, rather the opposite 
(the class + thread manages its own work queue).

for (2) I think checking the aforementioned queue size (+ number of deletions 
in progress) is enough, isn't it?

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-27 Thread Biao Liu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165518#comment-17165518
 ] 

Biao Liu commented on FLINK-17073:
--

To [~roman_khachatryan], thanks for nice suggestions!

{quote}I think an alternative (or complementary) temporary solution is to use a 
bounded queue when creating ioExecutor.{quote}
I'm not a fan of this temporary solution. We have to consider how to treat the 
invoker which launches asynchronous IO operations through {{ioExecutor}} if the 
queue is full. Make them failed or wait till there is some space available? I'm 
afraid it's not a small work to review all the places calls {{ioExecutor}}. If 
we want a temporary solution, maybe we could just increase the thread count. 

Regarding to the long-term solution. Actually Etienne and me have not discuss 
many of the implementation details. I just gave some suggestions to make sure 
it's in the right direction. It's cool to have your detailed suggestions. It 
may help a lot for the contributor who is not familiar with this part. I just 
thought we don't have to discuss too much details here. It might be better to 
give contributor more free space. We could pay more attention on code review to 
guarantee it's correct and reasonable.

BTW, just a tiny suggestion, code refactoring is not necessary, we should focus 
on solving the issue first. After that, we could consider if we could do some 
refactoring to make the codes more readable or elegant. 

To [~echauchot], besides the implementation, is there any question about the 
plan? Please feel free to ask anything that you don't understand. 

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-24 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164623#comment-17164623
 ] 

Etienne Chauchot commented on FLINK-17073:
--

Hi [~roman_khachatryan],

thanks for the suggestions! Overall this is what I intended to do modulo these 
minor things:

In 1. I meant to use the _CompletedCheckpoints_ dequeue to keep track of the 
checkpoints to clean and avoid adding a new queue

In 2. Yes indeed, it needs to call _executeQueuedRequest,_ but it needs also 
not to call it when the previous checkpoint cleaning is not done (still need to 
figure out how to sync cleaning with work in CheckpointCoordinator) so that 
checkpoint cleaning becomes part of the checkpoint process and not a side 
fire-and-forget process. This behavior will be configurable to avoid lowering 
checkpoint rate when CP cleaning rate is not a problem. Once cleaning is part 
of the standard checkpointing process, checking in flight checkpoints will tell 
how many potential cleaning checkpoints there are and if there are too much, 
drop any new CP trigger request.

In 3. yes that is what I meant in "drop any new CP trigger request" above.

In 4.  I'm not clear yet about concurrency in checkpointing.  

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-21 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162168#comment-17162168
 ] 

Roman Khachatryan commented on FLINK-17073:
---

As for the long-term solution, I'd propose the following:
 # Extract *ZooKeeperCompletedCheckpointStore.tryRemoveCompletedCheckpoint* 
(along with *executor*) to a new class, e.g. *CheckpointCleaner* that maintains 
a queue of checkpoints to remove
 # On removal completion, it calls 
*CheckpointCoordinator*.timer.execute(*executeQueuedRequest*)
 # In *CheckpointRequestDecider.chooseRequestToExecute*, check 
*CheckpointCleaner.numberOfCheckpointsToRemove* and return if it's greater than 
the threshold (see below) 
 # *CheckpointCleaner* reports **this count in a thread-safe manner 
(performance isn't an issue here)

 

This way, we ensure these properties:
 # if checkpoint can't be started, it's queued (if the queue has space) but not 
started
 # once a subsumed checkpoint is removed, we check the queue, and if possible, 
start the checkpoint 

It's possible that we check the queue twice (1st in chooseRequestToExecute, 2nd 
in CheckpointCleaner), but that's OK, we'll just execute the next request if 
possible.

 

Regarding adding an additional configuration parameter for the threshold, I 
don't see much value in it. Conceptually, we don't want to proceed as long as 
there are checkpoints to remove from the previous completion.

So we can use max-concurrent-checkpoints as a threshold (maybe multiplied by 
some constant factor to account for spikes and savepoints). 

 

What do you think [~echauchot], [~SleePy]?

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-21 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162112#comment-17162112
 ] 

Roman Khachatryan commented on FLINK-17073:
---

I think an alternative (or complementary) temporary solution is to use a 
bounded queue when creating ioExecutor.

This way, we solve this issue and also possible others, which we don't account 
for in design doc.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-21 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161882#comment-17161882
 ] 

Etienne Chauchot commented on FLINK-17073:
--

thanks guys !

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-20 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161704#comment-17161704
 ] 

Zhijiang commented on FLINK-17073:
--

I have assigned it to [~echauchot]

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-20 Thread Biao Liu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161657#comment-17161657
 ] 

Biao Liu commented on FLINK-17073:
--

[~echauchot], sorry I don't have the authorization of issue assignment. 
[~pnowojski], could you help to assign the ticket to him?

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-20 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161316#comment-17161316
 ] 

Etienne Chauchot commented on FLINK-17073:
--

[~SleePy] can you assign this ticket to me please ?

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-06 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151881#comment-17151881
 ] 

Etienne Chauchot commented on FLINK-17073:
--

[~pnowojski] no problem. Sorry for bothering you during your days off :) . 
[~SleePy] thanks for reviewing this doc, I'll look at your comments.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-05 Thread Biao Liu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151618#comment-17151618
 ] 

Biao Liu commented on FLINK-17073:
--

Hi [~echauchot], thanks for doing so much. I left a couple of comments in 
design doc. The second proposal seems to be a reliable and light solution :)

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-07-03 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150791#comment-17150791
 ] 

Piotr Nowojski commented on FLINK-17073:


Hey [~echauchot], I'm OoO since last two weeks and will be back on July 13th. 
Can we sync offline on this issue?

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-06-25 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144733#comment-17144733
 ] 

Etienne Chauchot commented on FLINK-17073:
--

I'm always glad to help ! No problem about syncing with [~pnowojski] and 
[~SleePy]. Not an trivial problem indeed, but I figured out that it could be a 
good way to learn quite a lot about Flink internals :)

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-06-25 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144726#comment-17144726
 ] 

Etienne Chauchot commented on FLINK-17073:
--

[~trohrmann], I wrote [this|[https://s.apache.org/checkpoint-backpressure]] 
FLIP style design document for checkpoint backpressure. Can you tell me what 
you think? Also I don't have the rights to create FLIP design documents in 
flink confluence workspace so I did the FLIP in a google doc. Can you give me 
the rights?

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-06-25 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144715#comment-17144715
 ] 

Till Rohrmann commented on FLINK-17073:
---

Yes, your help is highly appreciated. Given that this is not a trivial problem 
I would suggest to sync with [~pnowojski] and [~SleePy] about the next steps.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-06-24 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143678#comment-17143678
 ] 

Etienne Chauchot commented on FLINK-17073:
--

[~trohrmann]  I'm thinking about this issue, I'll start a first design doc for 
the checkpoint backpressure system. I'll send it there and on the ML.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-05-26 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116660#comment-17116660
 ] 

Etienne Chauchot commented on FLINK-17073:
--

Anyway I'd like to take part to the design discussions regarding checkpoint 
backpressure in order to learn about these topics.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-05-15 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108315#comment-17108315
 ] 

Etienne Chauchot commented on FLINK-17073:
--

ok, fair enough, I'll pick another one.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-05-15 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108305#comment-17108305
 ] 

Till Rohrmann commented on FLINK-17073:
---

Hi [~echauchot], I think what needs to happen is that we backpressure the whole 
checkpointing mechanism if the cleanup cannot keep up with it. This means that 
we don't trigger new checkpoints. This is far from trivial to realize, though, 
and on top of it, it needs to be properly designed. I'm not entirely sure 
whether this is really a good task to get started with.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-05-15 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108270#comment-17108270
 ] 

Etienne Chauchot commented on FLINK-17073:
--

Hi, I just started contributing to Flink. I'd like to take a look at this 
subject as it can be a good introduction to Flink architecture IMHO. As the 
temporary workaround discussed above has been implemented 
[here|[https://github.com/apache/flink/pull/11957]] maybe it is time to tackle 
the above subject. One thing I wonder is: if we want to limit the number of 
CompletedCheckpoints submitted to the IOExecutor for cleaning, what happens if 

_ZooKeeperCompletedCheckpointStore_  tries to submit a new CompletedCheckpoint 
when the limit has already been reached ? Shall it delay the submission waiting 
for the current number of submitted tasks to decrease?

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-05-04 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098770#comment-17098770
 ] 

Piotr Nowojski commented on FLINK-17073:


As it's a rare issue that in this particular case will have a workaround 
(increasing number of threads in the thread pool), I'm reducing the priority. 
Also because of other independent efforts in the {{CheckpointCoordinator}} 
code, it's unlikely the bug fix will be back-ported to previous releases.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-04-28 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094277#comment-17094277
 ] 

Till Rohrmann commented on FLINK-17073:
---

I would like to keep this ticket and FLINK-17248 separate. The reason is that I 
believe that the latter will just mitigate the problem and the true bug is that 
we don't limit the number of concurrent checkpoint clean up tasks. In that 
sense, it has always been broken. Moreover, this ticket is not about a 
regression we want to fix. The underlying problem only surfaced due to changing 
the executor.

What we can do is to say that FLINK-17421 will fix this bug here. Hence, once 
FLINK-17421 is closed, we can close this bug issue.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.10.1, 1.11.0, 1.9.4
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-04-27 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093738#comment-17093738
 ] 

Piotr Nowojski commented on FLINK-17073:


I haven't analysed the issue, so I'm not sure if indeed backpressuring is the 
right thing to do, but assuming that is the case, it's not a bug fix, but an 
improvement/new feature, so I've created another ticket for that FLINK-17421.

Btw, IMO FLINK-17248 is a duplicate of this ticket, as it's fixing regression 
reported in this issue.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.10.1, 1.11.0, 1.9.4
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-04-15 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084149#comment-17084149
 ] 

Yun Tang commented on FLINK-17073:
--

I prefer to the configuration solution to keep the behavior the same as before. 
Current Flink architecture cannot totally prevent this problem if the speed of 
creating checkpoint larger than the speed of deleting previous checkpoints. 
Increase the pool size could not prevent this but only mitigate the possibility.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.9.3, 1.10.1, 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-04-15 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084107#comment-17084107
 ] 

Till Rohrmann commented on FLINK-17073:
---

Maybe we could create a related issue which introduces the configuration option 
and keep this one for the proper fix which might entail to throttle the 
checkpoint throughput based on the cleanup backlog.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.9.3, 1.10.1, 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-04-15 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084096#comment-17084096
 ] 

Till Rohrmann commented on FLINK-17073:
---

I agree, but this would require bigger architectural changes. In the meantime I 
would suggest to make the number of IO threads configurable for the user. That 
way, users can work around this problem until the proper fix has been put in 
place.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.9.3, 1.10.1, 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

2020-04-15 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084058#comment-17084058
 ] 

Chesnay Schepler commented on FLINK-17073:
--

Seems like the underlying issue is that we aren't limiting the number of tasks 
that can be queued up (which I _think_ would implicitly slow down checkpointing 
since it would delay the completion of a pending checkpoint). Ignoring 
potential architectural problems, this should make the system way more 
resilient to these kind of issues than an increase of the pool size would.

> Slow checkpoint cleanup causing OOMs
> 
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.9.3, 1.10.1, 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we use the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)