[jira] [Commented] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure

2019-07-26 Thread Kostas Kloudas (JIRA)


[ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893881#comment-16893881 ]

Kostas Kloudas commented on FLINK-12858:
---

The test is ready and there is a PR for it here: 
https://github.com/apache/flink/pull/9240

> Potential distributed deadlock in case of synchronous savepoint failure
> ---
>
> Key: FLINK-12858
> URL: https://issues.apache.org/jira/browse/FLINK-12858
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Alex
>Assignee: Alex
>Priority: Blocker
> Fix For: 1.9.0
>
>
> The current implementation of stop-with-savepoint (FLINK-11458) blocks the 
> thread (on {{syncSavepointLatch}}) that runs 
> {{StreamTask.performCheckpoint()}}. For non-source tasks, this thread is 
> implied to be the task's main thread (stop-with-savepoint deliberately stops 
> any activity in the task's main thread).
> Unblocking happens either when the task is cancelled or when the corresponding 
> checkpoint is acknowledged.
> It is possible that other downstream tasks of the same Flink job "soft" fail 
> the checkpoint/savepoint for various reasons (for example, due to the max 
> buffered bytes limit in {{BarrierBuffer.checkSizeLimit()}}). In that case, the 
> checkpoint abortion would be reported to the JM. But it looks like the 
> checkpoint coordinator would handle such an abortion as usual and assume that 
> the Flink job keeps running.
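
To make the blocking mechanism concrete, here is a minimal, hypothetical Java sketch (not Flink's actual {{StreamTask}} implementation; the class and method names below are illustrative) of the pattern described above: the savepoint-performing thread waits on a latch that is released only by an acknowledgement or a cancellation, so if the savepoint is aborted elsewhere and neither signal reaches the task, the thread stays blocked.

{code:java}
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch, not Flink's actual code: a synchronous-savepoint latch
// that blocks the task's main thread until acknowledgement or cancellation.
public class SyncSavepointLatchSketch {

    private final CountDownLatch syncSavepointLatch = new CountDownLatch(1);

    /** Runs in the task's main thread when a synchronous savepoint is taken. */
    public void performCheckpoint() throws InterruptedException {
        // ... emit the savepoint barrier and snapshot state ...
        // Block the main thread until one of the two unlock paths fires.
        syncSavepointLatch.await();
    }

    /** Unlock path 1: the savepoint was acknowledged. */
    public void notifyCheckpointComplete() {
        syncSavepointLatch.countDown();
    }

    /** Unlock path 2: the task is cancelled. */
    public void cancel() {
        syncSavepointLatch.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        SyncSavepointLatchSketch task = new SyncSavepointLatchSketch();
        // Happy path: an acknowledgement arrives from another thread and unblocks the task.
        new Thread(task::notifyCheckpointComplete).start();
        task.performCheckpoint();
        System.out.println("performCheckpoint() returned after acknowledgement");
        // In the scenario described above, neither notifyCheckpointComplete() nor
        // cancel() is ever invoked, so performCheckpoint() would block forever.
    }
}
{code}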





[jira] [Commented] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure

2019-07-23 Thread Till Rohrmann (JIRA)


[ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891070#comment-16891070 ]

Till Rohrmann commented on FLINK-12858:
---

[~kkl0u] had a comment about whether to add a test or not. I would be in favour of 
guarding this fix with a test case.






[jira] [Commented] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure

2019-07-16 Thread Alex (JIRA)


[ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886211#comment-16886211 ]

Alex commented on FLINK-12858:
--

I'm not sure, but failing the task (the one that originates the discarded 
savepoint/checkpoint) may not be an option due to region recovery (which would 
not restart the tasks that are actually blocked).






[jira] [Commented] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure

2019-07-16 Thread Alex (JIRA)


[ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886197#comment-16886197 ]

Alex commented on FLINK-12858:
--

Some recap of the discussions around this issue:
the proposed workaround is to fail execution of the whole job.
In the case of stop-with-savepoint with {{drain=true}}, we cannot unlock the tasks 
and continue job execution (as that may have side effects on the job results). 
Handling the two cases differently may be a little involved.
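
As a small, hypothetical illustration of the workaround recapped above (not actual Flink coordinator code; names are illustrative), the reaction to an aborted synchronous savepoint ends up being the same with or without {{drain}}: fail the whole job, since the drained tasks cannot safely resume and distinguishing the two cases adds complexity.

{code:java}
// Hypothetical sketch of the proposed workaround: fail the whole job when a
// synchronous savepoint is aborted, regardless of the drain flag.
public class SyncSavepointAbortSketch {

    static String onSynchronousSavepointAborted(boolean drain) {
        if (drain) {
            // With drain=true the blocked tasks cannot simply be unlocked and
            // resumed, as continuing could affect the job results.
            return "FAIL_WHOLE_JOB";
        }
        // Without drain, resuming might be possible in principle, but handling
        // the two cases differently is involved, so the job is failed here too.
        return "FAIL_WHOLE_JOB";
    }

    public static void main(String[] args) {
        System.out.println(onSynchronousSavepointAborted(true));   // FAIL_WHOLE_JOB
        System.out.println(onSynchronousSavepointAborted(false));  // FAIL_WHOLE_JOB
    }
}
{code}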



