[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625387#comment-15625387 ] ASF GitHub Bot commented on FLINK-4933: --- Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2701 Has been merged to the release branch 1.1. > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625388#comment-15625388 ] ASF GitHub Bot commented on FLINK-4933: --- Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/2701 > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622930#comment-15622930 ] ASF GitHub Bot commented on FLINK-4933: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2700 > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622443#comment-15622443 ] ASF GitHub Bot commented on FLINK-4933: --- Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2701 Looks good, +1 Merging this... > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622419#comment-15622419 ] ASF GitHub Bot commented on FLINK-4933: --- Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2700 Looks good to me, +1 Merging this... > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621577#comment-15621577 ] ASF GitHub Bot commented on FLINK-4933: --- Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2701 Will rebase, because this PR is based on the release-1.1 branch which contained the failing `SpanningRecordSerializerTest`. > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621570#comment-15621570 ] ASF GitHub Bot commented on FLINK-4933: --- Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2700 Thanks for the review @uce. This is a problem which is relevant for the current master. It could make the `ExecutionGraph` go into state `FAILED` when being in state `RESTARTING` in case that the `scheduleOrUpdateConsumers` call failed. Will rebase the PR because it was based on the failing `SpanningRecordSerializerTest`. > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15615486#comment-15615486 ] ASF GitHub Bot commented on FLINK-4933: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/2700#discussion_r85533357 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -917,8 +917,15 @@ class JobManager( case ScheduleOrUpdateConsumers(jobId, partitionId) => currentJobs.get(jobId) match { case Some((executionGraph, _)) => - sender ! decorateMessage(Acknowledge) - executionGraph.scheduleOrUpdateConsumers(partitionId) + try { +executionGraph.scheduleOrUpdateConsumers(partitionId) +sender ! decorateMessage(Acknowledge) + } catch { +case e: ExecutionGraphException => --- End diff -- Does it make sense to catch the more generic `Exception` type here in order to make the sender notice any problems sooner? I see that the method only throws EGExceptions currently but maybe at some point in time someone introduces a runtime exception etc. This would only be logged at the JM and the task's ask would timeout. > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611564#comment-15611564 ] ASF GitHub Bot commented on FLINK-4933: --- GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2701 [backport] [FLINK-4933] [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call This is a backport for the release-1.1 branch. The only thing adapted is the added test case. Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call will be reported back to the caller. The caller can then decide what to do. Per default, it will fail the calling task. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink backportFixScheduleOrUpdateConsumers Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2701.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2701 commit b17fada0e60ad9986680c89effa628944973d999 Author: Till RohrmannDate: 2016-10-27T09:41:29Z [FLINK-4933] [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call will be reported back to the caller. The caller can then decide what to do. Per default, it will fail the calling task. Adapt TaskManagerTest > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611364#comment-15611364 ] ASF GitHub Bot commented on FLINK-4933: --- GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2700 [FLINK-4933] [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call will be reported back to the caller. The caller can then decide what to do. Per default, it will fail the calling task. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixScheduleOrUpdateConsumers Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2700.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2700 commit e51f0a56762c7f12acc215a53ffce2af28d38583 Author: Till RohrmannDate: 2016-10-27T09:41:29Z [FLINK-4933] [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call will be reported back to the caller. The caller can then decide what to do. Per default, it will fail the calling task. > ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph > > > Key: FLINK-4933 > URL: https://issues.apache.org/jira/browse/FLINK-4933 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.2.0, 1.1.3 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.2.0, 1.1.4 > > > Currently the {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole > {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This > situation can occur in the restarting scenario where we have a late callback > trying to update its consumers. In this case, the call should forward the > exception back to the caller and not fail the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)