[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333089#comment-17333089 ]

Stephan Ewen commented on FLINK-13698:
--------------------------------------

I think it is not this ticket that we need, or at least not what is described here. We need to change the {{CompletedCheckpointStore}} to do asynchronous loading of checkpoint metadata. That is an independent issue from this ticket.

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Priority: Critical
>             Fix For: 1.14.0
>
>
> Currently {{CheckpointCoordinator}} and {{CheckpointFailureManager}} code is
> executed by multiple different threads (mostly {{ioExecutor}}, but not only).
> This is causing multiple concurrency issues, for example:
> https://issues.apache.org/jira/browse/FLINK-13497
> The proper fix would be to rethink the threading model there. At first glance it
> doesn't seem that this code should be multi-threaded, except for the parts doing
> the actual I/O operations, so it should be possible to run everything in a
> single thread (the ExecutionGraph's) and just run the necessary I/O
> operations asynchronously with some feedback loop ("mailbox style").
> I would strongly recommend fixing this issue before adding new features to
> the {{CheckpointCoordinator}} component.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330080#comment-17330080 ]

Till Rohrmann commented on FLINK-13698:
---------------------------------------

What is the state of this problem, [~pnowojski]? I think we should raise the priority of this ticket to critical because it causes cluster instabilities which are quite serious. I think we should try to solve this problem eventually.

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Priority: Major
>             Fix For: 1.13.0
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229069#comment-17229069 ]

Roman Khachatryan commented on FLINK-13698:
-------------------------------------------

From what I know from a recent discussion with Piotr, it's not being included in 1.12 (he is currently on vacation).

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Major
>             Fix For: 1.12.0
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228487#comment-17228487 ]

Stephan Ewen commented on FLINK-13698:
--------------------------------------

As far as I know, there has been no progress here.

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Major
>             Fix For: 1.12.0
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228484#comment-17228484 ]

Till Rohrmann commented on FLINK-13698:
---------------------------------------

What's the progress of this issue, [~pnowojski]? Will this improvement be included in Flink {{1.12.0}}?

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Major
>             Fix For: 1.12.0
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931119#comment-16931119 ]

Piotr Nowojski commented on FLINK-13698:
----------------------------------------

Thanks for the update [~SleePy].

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Critical
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924472#comment-16924472 ]

Biao Liu commented on FLINK-13698:
----------------------------------

Just a progress update: I'm now doing some POC development. I found it a bit hard to separate the issue and the work into subtasks, because the changes in the subtasks are tightly coupled. So I stopped creating new subtasks, to avoid having to adjust them during further development.

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Critical
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915430#comment-16915430 ]

Biao Liu commented on FLINK-13698:
----------------------------------

Sorry for the somewhat late response. I was tied up with other things last week :( Thanks for the detailed review, [~pnowojski] and [~till.rohrmann]. It helps a lot, and I really appreciate it. I will split this proposal into a series of subtasks.

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Critical
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906075#comment-16906075 ]

Biao Liu commented on FLINK-13698:
----------------------------------

Thanks for your attention [~StephanEwen]! Piotr and I had a short discussion last week. I believe my proposal matches your suggestions well. :)

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Critical
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906060#comment-16906060 ]

Stephan Ewen commented on FLINK-13698:
--------------------------------------

I would suggest having the {{CheckpointCoordinator}} and the {{Scheduler}}/{{ExecutionGraph}} in the same thread. That makes a lot of things much easier.

  - The checkpoint timer logic should be factored out into the periodic checkpoint trigger.
  - The actual call
  - All cleanup and writing of state should be delegated to the I/O executor.
      -> mainly state release as a "fire and forget" execution in the I/O executor
      -> writing out master-hook data and checkpoint metadata needs to happen in the I/O executor, with a Future whose completion is again handled in the main thread executor.

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Critical
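For readers following the thread, the split suggested in the comment above (all coordinator state on one main thread, blocking work on an I/O executor, completion handled back on the main thread) could be sketched roughly as below. This is only an illustration using plain java.util.concurrent, not Flink code: {{SketchCoordinator}}, {{writeMetadata}}, {{releaseState}}, {{describe}} and the thread names are all hypothetical, and the real {{CheckpointCoordinator}} is far more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class MailboxStyleSketch {

    // Hypothetical stand-in for the coordinator; NOT Flink's CheckpointCoordinator.
    static final class SketchCoordinator {
        private final ExecutorService mainThread;
        private final ExecutorService ioExecutor;
        // Unsynchronized state: only ever read or written on the main thread.
        private final List<String> completed = new ArrayList<>();
        private String lastHandlerThread = "none";

        SketchCoordinator(ExecutorService mainThread, ExecutorService ioExecutor) {
            this.mainThread = mainThread;
            this.ioExecutor = ioExecutor;
        }

        // The blocking write runs on the I/O executor; the completion handler
        // is posted back to the main thread, which alone mutates the state.
        CompletableFuture<Void> completeCheckpoint(long id) {
            return CompletableFuture
                    .supplyAsync(() -> writeMetadata(id), ioExecutor)
                    .thenAcceptAsync(path -> {
                        completed.add(path);
                        lastHandlerThread = Thread.currentThread().getName();
                    }, mainThread);
        }

        // "Fire and forget": nothing is reported back to the main thread.
        void releaseState(String path) {
            ioExecutor.execute(() -> {
                // Placeholder: a real implementation would delete files for 'path' here.
            });
        }

        // Reads of the state are routed through the main thread as well.
        CompletableFuture<String> describe() {
            return CompletableFuture.supplyAsync(
                    () -> lastHandlerThread + ":" + completed, mainThread);
        }

        // Placeholder for slow I/O; returns the (made-up) metadata path.
        private static String writeMetadata(long id) {
            return "checkpoint-" + id + ".metadata";
        }
    }

    // Single daemon thread with a fixed name, so the handler thread can be observed.
    private static ExecutorService singleThread(String name) {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        });
    }

    static String demo() {
        ExecutorService main = singleThread("main-thread");
        ExecutorService io = singleThread("io-executor");
        try {
            SketchCoordinator coordinator = new SketchCoordinator(main, io);
            coordinator.completeCheckpoint(1L).join();
            coordinator.releaseState("checkpoint-0.metadata");
            return coordinator.describe().join();
        } finally {
            io.shutdown();
            main.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the completion handler is registered with {{thenAcceptAsync}} on the single-threaded main executor, the state needs no locks; running {{main}} prints {{main-thread:[checkpoint-1.metadata]}}, showing that the handler ran on the main thread and not on the I/O thread.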
[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905943#comment-16905943 ]

Biao Liu commented on FLINK-13698:
----------------------------------

Hi [~pnowojski], I have attached a design document. It would be great if you have time to review it.

> Rework threading model of CheckpointCoordinator
> -----------------------------------------------
>
>                 Key: FLINK-13698
>                 URL: https://issues.apache.org/jira/browse/FLINK-13698
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Biao Liu
>            Priority: Critical