[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2021-04-27 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333089#comment-17333089
 ] 

Stephan Ewen commented on FLINK-13698:
--

I think it is not this ticket that we need, or at least not what is described 
here.

We need to change the {{CompletedCheckpointStore}} to do asynchronous loading 
of checkpoint metadata. That is an independent issue from this ticket.

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Priority: Critical
> Fix For: 1.14.0
>
>
> Currently {{CheckpointCoordinator}} and {{CheckpointFailureManager}} code is 
> executed by multiple different threads (mostly {{ioExecutor}}, but not only). 
> This causes multiple concurrency issues, for example: 
> https://issues.apache.org/jira/browse/FLINK-13497
> A proper fix would be to rethink the threading model there. At first glance it 
> doesn't seem that this code should be multi-threaded, except for the parts doing 
> the actual IO operations, so it should be possible to run everything in the 
> ExecutionGraph's single thread and run only the necessary IO operations 
> asynchronously, with some feedback loop ("mailbox style").
> I would strongly recommend fixing this issue before adding new features in 
> the {{CheckpointCoordinator}} component.
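
One way to picture the "mailbox style" mentioned in the description is sketched below in plain Java. It is an illustration only, not Flink code: the class MailboxSketch, its methods, and the thread name are hypothetical. Any thread may post work, but only the single mailbox thread runs it, so the state it owns needs no locks.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustration only: a hand-rolled mailbox, not a Flink class. */
final class MailboxSketch {

    // Sentinel mail that tells the loop to exit.
    private static final Runnable STOP = () -> {};

    private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    private final Thread mailboxThread = new Thread(this::runLoop, "coordinator-mailbox");

    // Owned by the mailbox thread: mutated only inside posted mails, so no lock.
    private int handledEvents;

    MailboxSketch() {
        mailboxThread.setDaemon(true);
        mailboxThread.start();
    }

    // Runs posted mails one at a time, in arrival order, until STOP is seen.
    private void runLoop() {
        try {
            for (Runnable mail = mailbox.take(); mail != STOP; mail = mailbox.take()) {
                mail.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Callable from any thread, e.g. an I/O thread reporting a finished write. */
    void post(Runnable mail) {
        mailbox.add(mail);
    }

    /** Stops the loop after all previously posted mails and returns the counter. */
    int stopAndCount() {
        mailbox.add(STOP);
        join(mailboxThread);
        return handledEvents;
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Posts events from several "I/O" threads and returns how many were handled. */
    static int demo(int ioThreads, int eventsPerThread) {
        MailboxSketch sketch = new MailboxSketch();
        Thread[] producers = new Thread[ioThreads];
        for (int i = 0; i < ioThreads; i++) {
            producers[i] = new Thread(() -> {
                for (int n = 0; n < eventsPerThread; n++) {
                    // The increment itself executes on the mailbox thread.
                    sketch.post(() -> sketch.handledEvents++);
                }
            });
            producers[i].start();
        }
        for (Thread p : producers) {
            join(p);
        }
        return sketch.stopAndCount();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 250));
    }
}
```

Although four threads post concurrently, the unsynchronized counter loses no updates, because every increment is executed by the one mailbox thread.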



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2021-04-23 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330080#comment-17330080
 ] 

Till Rohrmann commented on FLINK-13698:
---

What is the state of this problem, [~pnowojski]? I think we should raise the 
priority of this ticket to critical because it causes quite serious cluster 
instabilities. I think we should try to solve this problem eventually.

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Priority: Major
> Fix For: 1.13.0


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2020-11-10 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229069#comment-17229069
 ] 

Roman Khachatryan commented on FLINK-13698:
---

From what I know from a recent discussion with Piotr, it's not being included 
in 1.12 (he is currently on vacation).

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Major
> Fix For: 1.12.0


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2020-11-09 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228487#comment-17228487
 ] 

Stephan Ewen commented on FLINK-13698:
--

As far as I know, there has been no progress here.

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Major
> Fix For: 1.12.0


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2020-11-09 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228484#comment-17228484
 ] 

Till Rohrmann commented on FLINK-13698:
---

What's the progress of this issue [~pnowojski]? Will this improvement be 
included in Flink {{1.12.0}}?

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Major
> Fix For: 1.12.0


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2019-09-17 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931119#comment-16931119
 ] 

Piotr Nowojski commented on FLINK-13698:


Thanks for the update [~SleePy].

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Critical


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2019-09-06 Thread Biao Liu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924472#comment-16924472
 ] 

Biao Liu commented on FLINK-13698:
--

Just a progress update: I'm now doing some POC development. I found it a bit 
hard to separate the issue and the work into subtasks, because the changes in 
the subtasks are tightly coupled. So I stopped creating new subtasks, to avoid 
having to adjust them during further development.

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Critical


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2019-08-25 Thread Biao Liu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915430#comment-16915430
 ] 

Biao Liu commented on FLINK-13698:
--

Sorry for the slightly late response. I was tied up with some other things last week :(

Thanks for the detailed review, [~pnowojski], [~till.rohrmann]. It helps a lot, 
and I really appreciate it.

I will separate this proposal into a series of subtasks.


> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Critical


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2019-08-13 Thread Biao Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906075#comment-16906075
 ] 

Biao Liu commented on FLINK-13698:
--

Thanks for your attention [~StephanEwen]!

Piotr and I had a short discussion last week.

I believe my proposal matches your suggestions well. :)

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Critical


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2019-08-13 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906060#comment-16906060
 ] 

Stephan Ewen commented on FLINK-13698:
--

I would suggest having the {{CheckpointCoordinator}} and the 
{{Scheduler}}/{{ExecutionGraph}} in the same thread.
That makes a lot of things much easier.

  - The checkpoint timer logic should be factored out into the periodic 
checkpoint trigger.
  - The actual call 
  - All cleanup and writing of state should be delegated to the I/O executor.
    -> mainly state release as a "fire and forget" execution in the I/O executor
    -> writing out master-hook data and checkpoint metadata needs to happen in 
the I/O executor, with a Future whose completion is again handled in the main 
thread executor.
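
The last bullet can be sketched with the JDK's CompletableFuture. This is only a sketch under assumptions, not the actual Flink implementation: the class CoordinatorThreadingSketch, its methods, the fake metadata string, and the thread names are all hypothetical stand-ins. The blocking write runs on an I/O pool, the completion handler is scheduled back onto a single-threaded "main thread" executor, and state release is submitted without waiting for a result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch only: none of these names are Flink classes. */
final class CoordinatorThreadingSketch {

    // Stand-in for the single main thread shared with the scheduler/ExecutionGraph.
    private final ExecutorService mainThreadExecutor =
            Executors.newSingleThreadExecutor(r -> daemon(r, "main-thread"));

    // Stand-in for the I/O executor that performs blocking writes and cleanup.
    private final ExecutorService ioExecutor =
            Executors.newFixedThreadPool(2, r -> daemon(r, "io-thread"));

    // Coordinator state: read and written only on the main thread, so no locks.
    private long latestCompletedCheckpointId = -1L;

    private static Thread daemon(Runnable r, String name) {
        Thread t = new Thread(r, name);
        t.setDaemon(true);
        return t;
    }

    /** Hypothetical blocking write of checkpoint metadata; returns a fake location. */
    private static String writeMetadata(long checkpointId) {
        return "metadata-of-checkpoint-" + checkpointId;
    }

    /** Writes metadata on the I/O executor, then finishes the checkpoint on the main thread. */
    CompletableFuture<String> completeCheckpoint(long checkpointId) {
        return CompletableFuture
                .supplyAsync(() -> writeMetadata(checkpointId), ioExecutor)
                .thenApplyAsync(location -> {
                    // Back on the main thread: safe to mutate coordinator state.
                    latestCompletedCheckpointId =
                            Math.max(latestCompletedCheckpointId, checkpointId);
                    return Thread.currentThread().getName() + " completed " + location;
                }, mainThreadExecutor);
    }

    /** Fire-and-forget state release: nobody waits for the result. */
    void discardCheckpoint(long checkpointId) {
        ioExecutor.execute(() -> {
            // A real implementation would release the state of checkpointId here.
        });
    }

    /** Reads the state through the main thread, like every other access. */
    long latestCompletedCheckpointId() {
        return CompletableFuture
                .supplyAsync(() -> latestCompletedCheckpointId, mainThreadExecutor)
                .join();
    }

    void shutdown() {
        ioExecutor.shutdown();
        mainThreadExecutor.shutdown();
    }

    public static void main(String[] args) {
        CoordinatorThreadingSketch sketch = new CoordinatorThreadingSketch();
        System.out.println(sketch.completeCheckpoint(7L).join());
        sketch.discardCheckpoint(6L);
        System.out.println(sketch.latestCompletedCheckpointId());
        sketch.shutdown();
    }
}
```

The point of the pattern is that only the blocking work leaves the main thread; every read or write of coordinator state is funneled through one executor, so it needs no synchronization of its own.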

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Critical


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2019-08-13 Thread Biao Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905943#comment-16905943
 ] 

Biao Liu commented on FLINK-13698:
--

Hi [~pnowojski], I have attached a design document. It would be great if you 
had time to review it.

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Critical