[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625387#comment-15625387
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/2701
  
This has been merged into the release-1.1 branch.


> ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph
> 
>
>               Key: FLINK-4933
>               URL: https://issues.apache.org/jira/browse/FLINK-4933
>           Project: Flink
>        Issue Type: Bug
>        Components: Distributed Coordination
>  Affects Versions: 1.2.0, 1.1.3
>          Reporter: Till Rohrmann
>          Assignee: Till Rohrmann
>           Fix For: 1.2.0, 1.1.4
>
>
> Currently, {{ExecutionGraph.scheduleOrUpdateConsumers}} can fail the whole
> {{ExecutionGraph}} if it cannot find the corresponding {{Execution}}. This
> can happen during a restart, when a late callback tries to update its
> consumers. In this case, the call should forward the exception to the
> caller rather than fail the {{ExecutionGraph}}.
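
The behaviour change described above can be illustrated with a minimal sketch. It is a toy model and not Flink's implementation: `ToyExecutionGraph`, `JobState`, `registerProducer`, `restart`, and `scheduleOrUpdateConsumersOld` are hypothetical names, and `ExecutionGraphException` is redefined locally as a stand-in for the class named in the pull request diff. The assumption is that a restart discards the old executions, so a late callback no longer finds its producer.

```scala
import scala.collection.mutable

// Hypothetical, simplified job states; Flink's real job status has more values.
object JobState extends Enumeration {
  val Running, Restarting, Failed = Value
}

// Local stand-in for the exception type named in the pull request diff.
class ExecutionGraphException(message: String) extends Exception(message)

// Toy model of a graph that tracks which execution produced which partition.
class ToyExecutionGraph {
  var state: JobState.Value = JobState.Running
  private val producers = mutable.Map.empty[String, String]

  def registerProducer(partitionId: String, execution: String): Unit =
    producers(partitionId) = execution

  // Assumption: a restart discards the old executions.
  def restart(): Unit = {
    producers.clear()
    state = JobState.Restarting
  }

  // Behaviour described as the bug: an unknown partition fails the whole graph.
  def scheduleOrUpdateConsumersOld(partitionId: String): Unit =
    if (!producers.contains(partitionId)) state = JobState.Failed

  // Behaviour described as the fix: report to the caller, keep the state.
  def scheduleOrUpdateConsumers(partitionId: String): Unit =
    if (!producers.contains(partitionId))
      throw new ExecutionGraphException(
        s"Cannot find execution for partition $partitionId")
}
```

With the second variant, a late callback during a restart surfaces as an exception at the call site and the graph stays in `Restarting`.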



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625388#comment-15625388
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

Github user tillrohrmann closed the pull request at:

https://github.com/apache/flink/pull/2701




[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622930#comment-15622930
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/2700




[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622443#comment-15622443
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2701
  
Looks good, +1

Merging this...




[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622419#comment-15622419
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2700
  
Looks good to me, +1

Merging this...




[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621577#comment-15621577
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/2701
  
Will rebase, because this PR is based on the release-1.1 branch, which
contained the failing `SpanningRecordSerializerTest`.




[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621570#comment-15621570
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/2700
  
Thanks for the review @uce. This problem is relevant for the current master.
If the `scheduleOrUpdateConsumers` call failed while the `ExecutionGraph` was
in state `RESTARTING`, the graph could go into state `FAILED`.

Will rebase the PR because it was based on a revision in which
`SpanningRecordSerializerTest` was failing.




[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15615486#comment-15615486
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

Github user uce commented on a diff in the pull request:

https://github.com/apache/flink/pull/2700#discussion_r85533357
  
--- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
@@ -917,8 +917,15 @@ class JobManager(
     case ScheduleOrUpdateConsumers(jobId, partitionId) =>
       currentJobs.get(jobId) match {
         case Some((executionGraph, _)) =>
-          sender ! decorateMessage(Acknowledge)
-          executionGraph.scheduleOrUpdateConsumers(partitionId)
+          try {
+            executionGraph.scheduleOrUpdateConsumers(partitionId)
+            sender ! decorateMessage(Acknowledge)
+          } catch {
+            case e: ExecutionGraphException =>
--- End diff --

Does it make sense to catch the more generic `Exception` type here so that
the sender notices any problems sooner? I see that the method currently only
throws `ExecutionGraphException`s, but at some point someone might introduce a
runtime exception etc. That would only be logged at the JM, and the task's ask
would time out.
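
To make the suggestion concrete, here is a minimal sketch without Akka. It assumes the handler is given two functions: `update`, standing in for `executionGraph.scheduleOrUpdateConsumers(partitionId)`, and `reply`, standing in for `sender ! decorateMessage(...)`. `ConsumerUpdateHandler` and `UpdateFailed` are hypothetical names, and `Acknowledge` is redefined locally; none of this is taken from the actual patch.

```scala
// Local stand-ins for the messages sent back to the task.
case object Acknowledge
final case class UpdateFailed(cause: Throwable) // hypothetical failure reply

object ConsumerUpdateHandler {
  // Runs the update and always answers the sender, whether it succeeds or not.
  def handle(update: () => Unit, reply: Any => Unit): Unit =
    try {
      update()
      reply(Acknowledge)
    } catch {
      // Catching Exception rather than one specific subtype means that an
      // unexpected runtime exception still produces a reply, so the asking
      // side is told about the failure instead of waiting for a timeout.
      case e: Exception => reply(UpdateFailed(e))
    }
}
```

The trade-off is that the handler no longer distinguishes expected from unexpected failures; the sender has to inspect the cause if it cares.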




[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-10-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611564#comment-15611564
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/2701

[backport] [FLINK-4933] [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call

This is a backport for the release-1.1 branch. The only thing adapted is
the added test case.

Instead of failing the complete ExecutionGraph, a failing
scheduleOrUpdateConsumers call will be reported back to the caller. The
caller can then decide what to do. By default, it will fail the calling
task.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink backportFixScheduleOrUpdateConsumers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2701.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2701


commit b17fada0e60ad9986680c89effa628944973d999
Author: Till Rohrmann 
Date:   2016-10-27T09:41:29Z

[FLINK-4933] [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call

Instead of failing the complete ExecutionGraph, a failing
scheduleOrUpdateConsumers call will be reported back to the caller. The
caller can then decide what to do. Per default, it will fail the calling
task.

Adapt TaskManagerTest






[jira] [Commented] (FLINK-4933) ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

2016-10-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611364#comment-15611364
 ] 

ASF GitHub Bot commented on FLINK-4933:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/2700

[FLINK-4933] [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call

Instead of failing the complete ExecutionGraph, a failing
scheduleOrUpdateConsumers call will be reported back to the caller. The
caller can then decide what to do. By default, it will fail the calling
task.
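
The caller side of this contract might look roughly like the following sketch. It assumes the JobManager's answer arrives as a `Future[Unit]` that fails when the `scheduleOrUpdateConsumers` call failed. `ToyTask`, `failExternally`, and `ConsumerNotification.watch` are hypothetical names chosen for illustration and are not Flink's API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Toy task that only records why it was failed.
class ToyTask {
  @volatile var failureCause: Option[Throwable] = None

  def failExternally(cause: Throwable): Unit = {
    failureCause = Some(cause)
  }
}

object ConsumerNotification {
  // Default policy from the description: a failed notification fails only
  // the calling task, not the whole job.
  def watch(task: ToyTask, answer: Future[Unit]): Future[Unit] =
    answer.recover { case e: Exception => task.failExternally(e) }
}
```

A caller with a different policy, for example one that ignores late callbacks, would only need to swap the body of the `recover` block.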

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink fixScheduleOrUpdateConsumers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2700.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2700


commit e51f0a56762c7f12acc215a53ffce2af28d38583
Author: Till Rohrmann 
Date:   2016-10-27T09:41:29Z

[FLINK-4933] [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call

Instead of failing the complete ExecutionGraph, a failing
scheduleOrUpdateConsumers call will be reported back to the caller. The
caller can then decide what to do. Per default, it will fail the calling
task.



