[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419265#comment-16419265
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/5774


> Jobs can be dropped in HA when job submission fails
> ---
>
> Key: FLINK-9097
> URL: https://issues.apache.org/jira/browse/FLINK-9097
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> Jobs can be dropped in HA mode if the job submission step fails. In such a 
> case, we should fail fatally to let the {{Dispatcher}} restart and retry to 
> recover all jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418988#comment-16418988
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/5774
  
Thanks for the review @GJL. Merging this PR and #5784.




[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418711#comment-16418711
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/5774
  
I had to put another commit on top of it to fix a problem with the failing 
`DispatcherTest#testWaitingForJobMasterLeadership` @GJL. The new commit makes 
sure that we first recover all jobs before we set the fencing token of the 
`Dispatcher`. That way, no other action, e.g. another job submission, can 
interfere with the job recovery.
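
The ordering described above can be sketched as follows. This is a minimal, single-threaded illustration and not Flink's actual `Dispatcher` code: the class `FencedRecoverySketch`, its methods, and the string job IDs are all hypothetical names chosen for this example. It only shows why publishing the fencing token last keeps other submissions out while jobs are being recovered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

// Hypothetical stand-in for a fenced component; not Flink's Dispatcher.
public class FencedRecoverySketch {

    private final List<String> runningJobs = new ArrayList<>();

    // Stays null until recovery has finished, so nothing can pass the fence.
    private UUID fencingToken;

    /** Accepts a submission only if it carries the currently valid token. */
    boolean submitJob(UUID token, String jobId) {
        if (fencingToken == null || !fencingToken.equals(token)) {
            return false;
        }
        runningJobs.add(jobId);
        return true;
    }

    /** Recovers the persisted jobs first and only then sets the token. */
    void grantLeadership(UUID newToken, List<String> persistedJobs) {
        runningJobs.addAll(persistedJobs);
        fencingToken = newToken;
    }

    List<String> runningJobs() {
        return runningJobs;
    }

    public static void main(String[] args) {
        FencedRecoverySketch dispatcher = new FencedRecoverySketch();
        UUID token = UUID.randomUUID();

        // Before leadership is granted, a submission is rejected.
        System.out.println("early submission accepted: "
                + dispatcher.submitJob(token, "job-c"));

        dispatcher.grantLeadership(token, Arrays.asList("job-a", "job-b"));

        // After recovery, the same submission goes through.
        System.out.println("late submission accepted: "
                + dispatcher.submitJob(token, "job-c"));
        System.out.println("running jobs: " + dispatcher.runningJobs());
    }
}
```

In this sketch the recovered jobs always precede any regularly submitted job, because the fence stays closed until `grantLeadership` has finished adding them.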




[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417298#comment-16417298
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/5774
  
Rebased on the latest master. Sorry for the inconvenience @GJL. I've 
addressed all your comments so far.




[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417297#comment-16417297
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/5774#discussion_r177739972
  
--- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherTest.java ---
@@ -507,19 +539,21 @@ private TestingDispatcher(
jobManagerMetricGroup,
metricQueryServicePath,
archivedExecutionGraphStore,
-   new ExpectedJobIdJobManagerRunnerFactory(expectedJobId),
+   jobManagerRunnerFactory,
fatalErrorHandler,
null);
}
 
-   @Override
-   public CompletableFuture submitJob(final JobGraph jobGraph, final Time timeout) {
-   final CompletableFuture submitJobFuture = super.submitJob(jobGraph, timeout);
-
-   submitJobFuture.thenAccept(ignored -> submitJobLatch.countDown());
 
-   return submitJobFuture;
-   }
+//
+// @Override
--- End diff --

Will remove it




[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417287#comment-16417287
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5774#discussion_r177726769
  
--- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherTest.java ---
@@ -507,19 +539,21 @@ private TestingDispatcher(
jobManagerMetricGroup,
metricQueryServicePath,
archivedExecutionGraphStore,
-   new ExpectedJobIdJobManagerRunnerFactory(expectedJobId),
+   jobManagerRunnerFactory,
fatalErrorHandler,
null);
}
 
-   @Override
-   public CompletableFuture submitJob(final JobGraph jobGraph, final Time timeout) {
-   final CompletableFuture submitJobFuture = super.submitJob(jobGraph, timeout);
-
-   submitJobFuture.thenAccept(ignored -> submitJobLatch.countDown());
 
-   return submitJobFuture;
-   }
+//
+// @Override
--- End diff --

delete commented code?




[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417281#comment-16417281
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/5774#discussion_r177734109
  
--- Diff: flink-runtime/src/test/resources/log4j-test.properties ---
@@ -16,7 +16,7 @@
 # limitations under the License.
 

 
-log4j.rootLogger=OFF, console
+log4j.rootLogger=INFO, console
--- End diff --

Good catch. Will revert it.




[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417244#comment-16417244
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5774#discussion_r177725828
  
--- Diff: flink-runtime/src/test/resources/log4j-test.properties ---
@@ -16,7 +16,7 @@
 # limitations under the License.
 

 
-log4j.rootLogger=OFF, console
+log4j.rootLogger=INFO, console
--- End diff --

No




[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415378#comment-16415378
 ] 

ASF GitHub Bot commented on FLINK-9097:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/5774

[FLINK-9097] Fail fatally if job submission fails when recovering jobs

## What is the purpose of the change

In order to not drop jobs, we have to fail fatally if a job submission fails
when recovering jobs. In HA mode, this will restart the Dispatcher and let it
retry to recover all jobs.

This PR is based on #5746.

cc @GJL 

## Brief change log

- Restructured `Dispatcher#submitJob` method
- Registered callback to listen to job submission result
- Fail the `Dispatcher` if a job submission fails while recovering a job
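
The callback idea from the change log can be sketched roughly as below. This is an illustration under stated assumptions, not the code from this PR: `RecoveryFailureSketch`, `watchSubmission`, and the nested `FatalErrorHandler` interface are hypothetical stand-ins defined only for this example. It shows a callback registered on a submission future that escalates a failure to a fatal error handler only when the job was being recovered.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; names are not taken from the Flink code base.
public class RecoveryFailureSketch {

    /** Local stand-in for a fatal error handler. */
    interface FatalErrorHandler {
        void onFatalError(Throwable cause);
    }

    /**
     * Registers a callback on the submission result. A failure during
     * recovery is escalated to the handler; a failure of a regular
     * submission is only visible on the returned future.
     */
    static CompletableFuture<Void> watchSubmission(
            CompletableFuture<Void> submission,
            boolean recovering,
            FatalErrorHandler handler) {
        return submission.whenComplete((ignored, failure) -> {
            if (failure != null && recovering) {
                handler.onFatalError(failure);
            }
        });
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> fatal = new AtomicReference<>();

        CompletableFuture<Void> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("submission failed"));

        // A regular submission failure is not escalated.
        watchSubmission(failed, false, fatal::set);
        System.out.println("regular failure escalated: " + (fatal.get() != null));

        // The same failure during recovery reaches the fatal error handler.
        watchSubmission(failed, true, fatal::set);
        System.out.println("recovery failure escalated: " + (fatal.get() != null));
    }
}
```

Escalating only in the recovery case matches the stated purpose: a client can retry a failed regular submission, whereas a recovered job has no client to retry it, so the process has to fail and let a restarted `Dispatcher` try again.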

## Verifying this change

- Added `DispatcherTest#testJobSubmissionErrorAfterJobRecovery`

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink reintroduceFatalErrorHandler

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5774


commit 6c69755c077e78c46a40a0d9f35e435d7ef1618b
Author: Till Rohrmann 
Date:   2018-03-22T09:46:04Z

[hotfix] Extend TestingFatalErrorHandler to return an error future

commit 080132d8ec938eddb545ff3a80d0039402c48e94
Author: Till Rohrmann 
Date:   2018-03-22T09:46:28Z

[hotfix] Add BiFunctionWithException

commit ff19155c1bcca8610ca78ae41fa607ede94ddffc
Author: Till Rohrmann 
Date:   2018-03-21T21:36:33Z

[FLINK-8943] [ha] Fail Dispatcher if jobs cannot be recovered from HA store

In HA mode, the Dispatcher should fail if it cannot recover the persisted
jobs. The idea is that another Dispatcher will be brought up and tries it
again. This is better than simply dropping the not recovered jobs.

commit 4656c2adeb93500c02d63adbfb90b8eecabb474b
Author: Till Rohrmann 
Date:   2018-03-27T07:45:13Z

[hotfix] Re-introduce FatalErrorHandler to JobManagerRunner

commit 93bb2799c08b23398b6927fb599770260fad2c8f
Author: Till Rohrmann 
Date:   2018-03-27T08:00:56Z

[hotfix] Correct JavaDocs in SubmittedJobGraphStore and add Nullable annotation

commit 2ba75d09e38d23c242e147adf613af91328b219a
Author: Till Rohrmann 
Date:   2018-03-27T08:59:54Z

[FLINK-9097] Fail fatally if job submission fails when recovering jobs

In order to not drop jobs, we have to fail fatally if a job submission fails
when recovering jobs. In HA mode, this will restart the Dispatcher and let it
retry to recover all jobs.



