[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186961#comment-15186961
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/1657


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-03-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179633#comment-15179633
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on the pull request:

https://github.com/apache/flink/pull/1657#issuecomment-192209616
  
If there are no objections, I am going to merge this to master later today.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150499#comment-15150499
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

GitHub user uce opened a pull request:

https://github.com/apache/flink/pull/1657

[FLINK-3396] [runtime] Suppress job restart if adding to job graph store 
fails

@tillrohrmann, this leaves everything as is, but suppresses the restart in 
case of a failure to add the job graph to ZooKeeper (the standalone case should 
never fail). This applies only to the initial job submission.

The behaviour before was possibly leading to restart and the job being 
submitted w/o being added to ZooKeeper. Furthermore, the submission was not 
ACK'd in such a case. Now it will simply fail and the user can/has to react.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uce/flink 3396-submit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1657.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1657


commit 3847e304c93a30c18560719d8f169ca424d734e6
Author: Ufuk Celebi 
Date:   2016-02-12T21:09:29Z

[hotfix] Rename UnrecoverableException to SuppressRestartsException

commit 07753af54db020ddef5540ab0ae3dcbb3af69c54
Author: Ufuk Celebi 
Date:   2016-02-16T16:49:13Z

[FLINK-3396] [runtime] Suppress job restart if adding to job graph store 
fails

A failure to add the job graph to the submitted job graphs in ZooKeeper
could lead to a job restart w/o the job graph ever being added to the
submitted graphs store. Although the job submission was not ACK'd in
this case before, it received job status messages. Now the job will
not be ACK'd as before, but the job will be failed.




> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150374#comment-15150374
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce closed the pull request at:

https://github.com/apache/flink/pull/1656


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150365#comment-15150365
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

GitHub user uce opened a pull request:

https://github.com/apache/flink/pull/1656

[FLINK-3396] [runtime] Fix JobGraph submission and client ACK logic

A failure when recovering savepoint state could lead to not ACKing
the job submission. For detached submissions, this could have lead
to a submission timeout although the job eventually starts to run.

Moreover, a failure to restore savepoint state, could lead to a job
graph skipping the submitted graph store for HA.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uce/flink 3396-submit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1656.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1656


commit 3847e304c93a30c18560719d8f169ca424d734e6
Author: Ufuk Celebi 
Date:   2016-02-12T21:09:29Z

[hotfix] Rename UnrecoverableException to SuppressRestartsException

commit 92e900b3266b93204ea84050d189f00bbbca6678
Author: Ufuk Celebi 
Date:   2016-02-16T16:49:13Z

[FLINK-3396] [runtime] Fix JobGraph submission and client ACK logic

A failure when recovering savepoint state could lead to not ACKing
the job submission. For detached submissions, this could have lead
to a submission timeout although the job eventually starts to run.

Moreover, a failure to restore savepoint state, could lead to a job
graph skipping the submitted graph store for HA.




> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148774#comment-15148774
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on the pull request:

https://github.com/apache/flink/pull/1633#issuecomment-184719993
  
Closing this...


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148775#comment-15148775
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce closed the pull request at:

https://github.com/apache/flink/pull/1633


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148700#comment-15148700
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r53019879
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1079,6 +1079,9 @@ class JobManager(
   executionGraph.registerExecutionListener(gateway)
   executionGraph.registerJobStatusListener(gateway)
 }
+
+// All good. Submission succeeded!
+jobInfo.client ! 
decorateMessage(JobSubmitSuccess(jobGraph.getJobID))
--- End diff --

Oh boy... not my day today. Thanks for catching that. This was not expected.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148695#comment-15148695
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r53019468
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1079,6 +1079,9 @@ class JobManager(
   executionGraph.registerExecutionListener(gateway)
   executionGraph.registerJobStatusListener(gateway)
 }
+
+// All good. Submission succeeded!
+jobInfo.client ! 
decorateMessage(JobSubmitSuccess(jobGraph.getJobID))
--- End diff --

Hmm but now we have the problem that the user might see a 
`JobSubmitSuccess` without the job being stored in the 
`SubmittedJobGraphStore`, right? This means that if the `JobManager` dies 
before the job is persisted, it will never be recovered. I think this violates 
the `JobSubmitSuccess` contract.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148628#comment-15148628
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on the pull request:

https://github.com/apache/flink/pull/1633#issuecomment-184694713
  
Travis has passed. Did you have another look at this @tillrohrmann?


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148495#comment-15148495
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r53000913
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1073,57 +1073,73 @@ class JobManager(
   // execute the recovery/writing the jobGraph into the 
SubmittedJobGraphStore asynchronously
   // because it is a blocking operation
   future {
-try {
-  if (isRecovery) {
-executionGraph.restoreLatestCheckpointedState()
-  }
-  else {
-val snapshotSettings = jobGraph.getSnapshotSettings
-if (snapshotSettings != null) {
-  val savepointPath = snapshotSettings.getSavepointPath()
+val restoreStateSuccess =
+  try {
+if (isRecovery) {
+  executionGraph.restoreLatestCheckpointedState()
--- End diff --

Regarding the `JobSubmitSuccess`: we had it as a follow up to have more 
fine-grained integration with the the client and left it as a duplicate submit 
message for the time being (instead of something like `JobRecovered`).

The other behaviour is back to the previous state now. I hear you that it 
makes sense to integrate the state restore behaviour with the execution graph 
restart.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148485#comment-15148485
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r52999704
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1079,6 +1079,9 @@ class JobManager(
   executionGraph.registerExecutionListener(gateway)
   executionGraph.registerJobStatusListener(gateway)
 }
+
+// All good. Submission succeeded!
+jobInfo.client ! 
decorateMessage(JobSubmitSuccess(jobGraph.getJobID))
--- End diff --

Moved this one up to have correct ACKing behaviour.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148480#comment-15148480
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r52998947
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1073,57 +1073,73 @@ class JobManager(
   // execute the recovery/writing the jobGraph into the 
SubmittedJobGraphStore asynchronously
   // because it is a blocking operation
   future {
-try {
-  if (isRecovery) {
-executionGraph.restoreLatestCheckpointedState()
-  }
-  else {
-val snapshotSettings = jobGraph.getSnapshotSettings
-if (snapshotSettings != null) {
-  val savepointPath = snapshotSettings.getSavepointPath()
+val restoreStateSuccess =
+  try {
+if (isRecovery) {
+  executionGraph.restoreLatestCheckpointedState()
--- End diff --

Had an offline discussion with Stephan. He agrees with you that the failure 
in this case is too hard. I'll undo that change by ACK'ing the submission 
earlier.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148478#comment-15148478
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r52998567
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1073,57 +1073,73 @@ class JobManager(
   // execute the recovery/writing the jobGraph into the 
SubmittedJobGraphStore asynchronously
   // because it is a blocking operation
   future {
-try {
-  if (isRecovery) {
-executionGraph.restoreLatestCheckpointedState()
-  }
-  else {
-val snapshotSettings = jobGraph.getSnapshotSettings
-if (snapshotSettings != null) {
-  val savepointPath = snapshotSettings.getSavepointPath()
+val restoreStateSuccess =
+  try {
+if (isRecovery) {
+  executionGraph.restoreLatestCheckpointedState()
--- End diff --

The behaviour right now for a failure while doing a job recovery would 
simply fail the `ExecutionGraph` triggering a restart. A successful job 
recovery would send a `JobSubmitSuccess` to the client. I'm not sure whether 
this is actually correct, since the client already received a 
`JobSubmitMessage` from the `JobManager` while initially submitting the job. 
But I think this will simply be ignored.

Thus, suppressing the restart behaviour in case of a job recovery would 
actually change the behaviour.

If it makes sense and if it is possible to recover from failures while 
recovering a job or restoring a savepoint, it would make sense to not directly 
fail the job without restarting. Maybe one should distinguish that based on the 
actually occurring exception.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148466#comment-15148466
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r52997375
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1073,57 +1073,73 @@ class JobManager(
   // execute the recovery/writing the jobGraph into the 
SubmittedJobGraphStore asynchronously
   // because it is a blocking operation
   future {
-try {
-  if (isRecovery) {
-executionGraph.restoreLatestCheckpointedState()
-  }
-  else {
-val snapshotSettings = jobGraph.getSnapshotSettings
-if (snapshotSettings != null) {
-  val savepointPath = snapshotSettings.getSavepointPath()
+val restoreStateSuccess =
+  try {
+if (isRecovery) {
+  executionGraph.restoreLatestCheckpointedState()
--- End diff --

But then I would not keep the behaviour as it is right now. Instead, we 
should then consider the job submitted before trying to recover any checkpoint 
state and keep the restart behaviour. What do you think?


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148413#comment-15148413
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r52992451
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1073,57 +1073,73 @@ class JobManager(
   // execute the recovery/writing the jobGraph into the 
SubmittedJobGraphStore asynchronously
   // because it is a blocking operation
   future {
-try {
-  if (isRecovery) {
-executionGraph.restoreLatestCheckpointedState()
-  }
-  else {
-val snapshotSettings = jobGraph.getSnapshotSettings
-if (snapshotSettings != null) {
-  val savepointPath = snapshotSettings.getSavepointPath()
+val restoreStateSuccess =
+  try {
+if (isRecovery) {
+  executionGraph.restoreLatestCheckpointedState()
--- End diff --

I'm not so sure about that, to be honest. What if the 
`restoreLatestCheckpointedState` fails because of some HDFS/ZooKeeper problems. 
Then you would like to try restarting the job, wouldn't you? The client should 
then be notified once all restarting attempts have been exhausted.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148354#comment-15148354
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on the pull request:

https://github.com/apache/flink/pull/1633#issuecomment-184599724
  
Thanks for review Till! I've [opened an 
issue](https://issues.apache.org/jira/browse/FLINK-3411) for the failure 
behaviour in case of checkpoint state recovery.

I'll merge this today.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147615#comment-15147615
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/1633#issuecomment-184313039
  
Changes look good to me @uce. I had only one inline question concerning a 
semantic change.

Apart from that +1 for merging.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147613#comment-15147613
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1633#discussion_r52924511
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -1073,57 +1073,73 @@ class JobManager(
   // execute the recovery/writing the jobGraph into the 
SubmittedJobGraphStore asynchronously
   // because it is a blocking operation
   future {
-try {
-  if (isRecovery) {
-executionGraph.restoreLatestCheckpointedState()
-  }
-  else {
-val snapshotSettings = jobGraph.getSnapshotSettings
-if (snapshotSettings != null) {
-  val savepointPath = snapshotSettings.getSavepointPath()
+val restoreStateSuccess =
+  try {
+if (isRecovery) {
+  executionGraph.restoreLatestCheckpointedState()
--- End diff --

Is it intended that now failures in the `restoreLatestCheckpointedState` 
are non recoverable as well? This seems to be different from the former 
implementation.


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147465#comment-15147465
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

Github user uce commented on the pull request:

https://github.com/apache/flink/pull/1633#issuecomment-184250973
  
@tillrohrmann, could you have a look at this change?


> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

2016-02-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145312#comment-15145312
 ] 

ASF GitHub Bot commented on FLINK-3396:
---

GitHub user uce opened a pull request:

https://github.com/apache/flink/pull/1633

[FLINK-3396] [runtime] Fail job submission after state restore failure

If state restore fails during job graph submission, the submitting client 
never gets notified about the submission failure. For detached submissions, 
this results in a time out rather than the actual failure cause. The actual 
failure can only be seen in the logs.

This PR changes the behaviour to fail the submission.

The failure cause is wrapped in a `SuppressRestartsException` in order to 
prevent the restart strategy from kicking in. We call 
`ExecutionGraph#fail(Throwable)` in the first place in order to have proper 
clean up.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uce/flink 3396-state_restore

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1633.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1633


commit 982857bc0f5248a694d908141df0330de2befa3c
Author: Ufuk Celebi 
Date:   2016-02-12T20:59:58Z

[FLINK-3396] [runtime] Fail job submission after state restore failure

Problem: If state restore fails during job graph submission, the submitting
client never gets notified about the submission failure. For detached
submissions, this results in a time out rather than the actual failure 
cause.
The actual failure can only be seen in the logs.

Solution: Fail the submission if state restore fails.

commit 9d2f4d88e99c3e50d8c001c8aa2cdb2848c80478
Author: Ufuk Celebi 
Date:   2016-02-12T21:09:29Z

[hotfix] Rename UnrecoverableException to SuppressRestartsException




> Job submission Savepoint restore logic flawed
> -
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)