[jira] [Commented] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure
[ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893881#comment-16893881 ]

Kostas Kloudas commented on FLINK-12858:
----------------------------------------

The test is ready, and there is a PR for it here: https://github.com/apache/flink/pull/9240

> Potential distributed deadlock in case of synchronous savepoint failure
> -----------------------------------------------------------------------
>
>                 Key: FLINK-12858
>                 URL: https://issues.apache.org/jira/browse/FLINK-12858
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0
>            Reporter: Alex
>            Assignee: Alex
>            Priority: Blocker
>             Fix For: 1.9.0
>
> The current implementation of stop-with-savepoint (FLINK-11458) blocks the
> thread (on {{syncSavepointLatch}}) that carries out
> {{StreamTask.performCheckpoint()}}. For non-source tasks, this thread is
> implied to be the task's main thread (stop-with-savepoint deliberately stops
> any activity in the task's main thread).
> Unblocking happens either when the task is cancelled or when the corresponding
> checkpoint is acknowledged.
> It is possible that other downstream tasks of the same Flink job "soft" fail
> the checkpoint/savepoint for various reasons (for example, due to exceeding the
> maximum of buffered bytes in {{BarrierBuffer.checkSizeLimit()}}). In such a
> case, the checkpoint abortion would be reported to the JM. But it looks like
> the checkpoint coordinator handles such an abortion as usual and assumes that
> the Flink job continues running.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
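For illustration, the blocking behaviour described above can be sketched with a minimal, hypothetical Java model. The class names and methods below are stand-ins, not Flink's actual implementation: a latch models {{syncSavepointLatch}}, it is released only on acknowledgement or cancellation, and a "soft" abort leaves the task's main thread blocked.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical minimal model (NOT Flink's actual classes) of the
// syncSavepointLatch behaviour described in the issue.
class SyncSavepointModel {
    private final CountDownLatch syncSavepointLatch = new CountDownLatch(1);

    // Models StreamTask.performCheckpoint() for a synchronous savepoint:
    // the calling (task main) thread blocks until the latch is released.
    void performCheckpoint() throws InterruptedException {
        syncSavepointLatch.await();
    }

    // The latch is released when the checkpoint is acknowledged...
    void acknowledgeCheckpoint() { syncSavepointLatch.countDown(); }

    // ...or when the task is cancelled.
    void cancel() { syncSavepointLatch.countDown(); }

    // The problem: a "soft" abort reported by a downstream task does NOT
    // release the latch, so the main thread stays blocked indefinitely.
    void abortCheckpoint() { /* no countDown() here */ }
}

public class DeadlockSketch {
    public static void main(String[] args) throws Exception {
        SyncSavepointModel task = new SyncSavepointModel();
        Thread mainThread = new Thread(() -> {
            try {
                task.performCheckpoint();
            } catch (InterruptedException ignored) {
            }
        });
        mainThread.start();

        task.abortCheckpoint();   // the abort does not unblock the thread...
        mainThread.join(500);     // ...so it is still alive after the wait
        System.out.println("blocked after abort: " + mainThread.isAlive());

        task.cancel();            // only cancel/ack releases the latch
        mainThread.join(1000);
        System.out.println("blocked after cancel: " + mainThread.isAlive());
    }
}
```

Running the sketch shows the thread still blocked after the abort and released only by the cancel, which is the distributed-deadlock scenario: the coordinator believes the job keeps running while the task's main thread waits forever.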
[jira] [Commented] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure
[ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891070#comment-16891070 ]

Till Rohrmann commented on FLINK-12858:
---------------------------------------

[~kkl0u] had a comment on whether to add a test or not. I would be in favour of guarding this fix with a test case.
[jira] [Commented] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure
[ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886211#comment-16886211 ]

Alex commented on FLINK-12858:
------------------------------

I'm not sure, but failing the task (the one that originates the discarded savepoint/checkpoint) may not be an option because of region recovery (which would not restart the tasks that are actually blocked).
[jira] [Commented] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure
[ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886197#comment-16886197 ]

Alex commented on FLINK-12858:
------------------------------

A recap of the discussions around this issue: the proposed workaround is to fail the execution of the whole job. In the case of stop-with-savepoint with {{drain=true}}, we cannot unblock the tasks and continue job execution (as that may have side effects on the job's results). Handling the two cases differently may be somewhat involved.