[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-08-14 Thread Chesnay Schepler (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907117#comment-16907117
 ] 

Chesnay Schepler commented on FLINK-11835:
--

I'm still somewhat working on it. I checked out Florian's branch and could 
reproduce one instance of the problem.

What seems to happen is that the slot allocation fails on one of the first 
dispatchers, resulting in a failure of the job. Given that we shut down the 
ResourceManager in the test, it kind of makes sense that this can happen. The 
job was marked as done in ZK, and subsequent dispatchers never recovered the 
job, hence the job status cannot be queried later.

It is still unknown why the RM shutdown usually results in a JM failover, but 
sometimes in a job failure.

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.8.0
> Reporter: Gary Yao
> Assignee: Chesnay Schepler
> Priority: Critical
> Labels: test-stability
> Fix For: 1.10.0
>
> Attachments: scratch_22.txt
>
>
> {noformat}
> 20:44:07.264 [ERROR] testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)  Time elapsed: 4.625 s  <<< ERROR!
> java.util.concurrent.ExecutionException: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149)
> {noformat}
> https://api.travis-ci.org/v3/job/502210892/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-08-14 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907096#comment-16907096
 ] 

Till Rohrmann commented on FLINK-11835:
---

The update from Florian is the following: he created a git branch to make the 
problem reproducible: 
https://github.com/florianschmidt1994/flink/tree/detect-zookeeper-it-case-bug. 
In particular, if one lets the thread sleep in {{JobManagerRunner::closeAsync}} 
(line 192 ff.), the problem occurs.

The problem occurs in the second iteration when running the test in a 
loop/repeatedly. The problem seems to be that the {{Dispatcher}} requests the 
{{JobStatus}} of a job which no longer exists (for whatever reason). The 
{{JobStatus}} future will be completed exceptionally at {{Dispatcher.java:817}} 
because the {{JobManagerFuture}} for the given {{JobID}} is no longer in the 
{{JobManagerFutures}} collection.
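
The lookup described above can be illustrated with a minimal sketch. It is not 
Flink's actual Dispatcher code: the class name, the String job IDs and statuses, 
and the JobNotFoundException stand-in are assumptions made for illustration. It 
only shows how a status request completes exceptionally once the entry for a job 
is no longer in the map of futures.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;

/** Simplified, hypothetical stand-in for the dispatcher's job bookkeeping. */
final class JobStatusLookupSketch {

    /** Hypothetical stand-in for FlinkJobNotFoundException. */
    static final class JobNotFoundException extends Exception {
        JobNotFoundException(String jobId) {
            super("Could not find Flink job (" + jobId + ")");
        }
    }

    // jobId -> future holding the job's status (assumed shape, not Flink's types)
    private final Map<String, CompletableFuture<String>> jobManagerFutures =
            new ConcurrentHashMap<>();

    void register(String jobId, String status) {
        jobManagerFutures.put(jobId, CompletableFuture.completedFuture(status));
    }

    void remove(String jobId) {
        jobManagerFutures.remove(jobId);
    }

    /** Returns the status future, or an exceptionally completed future if the job is unknown. */
    CompletableFuture<String> requestJobStatus(String jobId) {
        CompletableFuture<String> statusFuture = jobManagerFutures.get(jobId);
        if (statusFuture == null) {
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(new JobNotFoundException(jobId));
            return failed;
        }
        return statusFuture;
    }

    public static void main(String[] args) throws InterruptedException {
        JobStatusLookupSketch dispatcher = new JobStatusLookupSketch();
        String jobId = "2e957dc4f49feaed042eb8b4a7932610";
        dispatcher.register(jobId, "RUNNING");
        dispatcher.remove(jobId); // the job disappears before the caller asks for its status
        try {
            dispatcher.requestJobStatus(jobId).get();
        } catch (ExecutionException e) {
            // get() wraps the failure in an ExecutionException, as in the reported stack trace
            System.out.println(e.getCause().getMessage());
        }
    }
}
```

Calling get() on the failed future surfaces the failure as an ExecutionException 
whose cause is the not-found exception, which matches the shape of the stack 
trace in the issue description.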



[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-08-14 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907097#comment-16907097
 ] 

Till Rohrmann commented on FLINK-11835:
---

[~Zentol] are you still working on this issue?




[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-08-02 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898883#comment-16898883
 ] 

Till Rohrmann commented on FLINK-11835:
---

This could be the case, but without looking into the logs and trying it out it 
is hard to tell. I think [~florianschmidt] looked into the issue, but I can't 
recall his latest analysis results. I'll try to reach out to him.

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.8.0
> Reporter: Gary Yao
> Assignee: Chesnay Schepler
> Priority: Critical
> Labels: test-stability
> Fix For: 1.9.0
>
> Attachments: scratch_22.txt
>


[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-08-01 Thread Chesnay Schepler (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898018#comment-16898018
 ] 

Chesnay Schepler commented on FLINK-11835:
--

[~trohrm...@apache.org] Could this simply be a case of us querying the job 
status while the currently leading dispatcher is still initializing the 
JobMaster? This happens asynchronously 
({{Dispatcher#waitForTerminatingJobManager}}); after introducing a delay I got 
an exception similar to the one we saw on Travis.
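
If this hypothesis holds, the caller would see a not-found failure only during 
the window in which the new leader has not yet registered the job. The sketch 
below shows one way a caller could tolerate such a window by retrying the query 
a bounded number of times. It is an illustration under that assumption, not a 
proposed fix for the test: the class, the queryWithRetry helper, and its 
parameters are hypothetical and not part of Flink.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

/** Hypothetical helper that retries an asynchronous query a bounded number of times. */
final class RetryStatusQuerySketch {

    /**
     * Issues the query up to maxAttempts times, sleeping delayMillis between
     * attempts, and returns the first successful result. If every attempt
     * fails, the last failure is rethrown.
     */
    static <T> T queryWithRetry(
            Supplier<CompletableFuture<T>> query, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        CompletionException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // join() throws CompletionException if the future completed exceptionally
                return query.get().join();
            } catch (CompletionException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    sleep(delayMillis);
                }
            }
        }
        throw lastFailure;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        // A query that succeeds immediately, just to show the call shape.
        String status = queryWithRetry(
                () -> CompletableFuture.completedFuture("RUNNING"), 3, 10L);
        System.out.println(status);
    }
}
```

A bounded retry hides the race rather than explaining it, so it would only be 
appropriate if the not-found answer during leader initialization turns out to be 
expected behavior.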



[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-05-14 Thread Florian Schmidt (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839359#comment-16839359
 ] 

Florian Schmidt commented on FLINK-11835:
-

After running it ~800 times I was able to reproduce the bug. The log level was 
set to INFO, and I uploaded the logs.

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.8.0
> Reporter: Gary Yao
> Assignee: Florian Schmidt
> Priority: Critical
> Labels: test-stability
> Attachments: scratch_22.txt
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-05-08 Thread Chesnay Schepler (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835453#comment-16835453
 ] 

Chesnay Schepler commented on FLINK-11835:
--

Another instance: https://travis-ci.org/apache/flink/jobs/529230782

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.8.0
> Reporter: Gary Yao
> Priority: Critical
> Labels: test-stability


[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-03-14 Thread Yun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792421#comment-16792421
 ] 

Yun Tang commented on FLINK-11835:
--

Another instance [https://api.travis-ci.org/v3/job/505826891/log.txt]



[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-03-11 Thread chunpinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790135#comment-16790135
 ] 

chunpinghe commented on FLINK-11835:


I can't reproduce this bug.



[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-03-11 Thread chunpinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789411#comment-16789411
 ] 

chunpinghe commented on FLINK-11835:


Is it possible that the recoveryOperation hadn't finished, which causes the 
requestJobResult method to throw FlinkJobNotFoundException?

Should requestJobResult wait for the recoveryOperation to complete?
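
The idea of making the request wait for recovery can be sketched by chaining the 
lookup onto a recovery future. This is a simplified illustration, not the actual 
Dispatcher code: the class, the String-based job IDs and results, and the 
IllegalStateException used in place of FlinkJobNotFoundException are assumptions.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: result requests are deferred until recovery has completed. */
final class WaitForRecoverySketch {

    // Completed once all jobs have been recovered (assumed to model recoveryOperation)
    private final CompletableFuture<Void> recoveryOperation = new CompletableFuture<>();

    // jobId -> job result, filled in while recovery is running
    private final Map<String, String> jobResults = new ConcurrentHashMap<>();

    void recoverJob(String jobId, String result) {
        jobResults.put(jobId, result);
    }

    void finishRecovery() {
        recoveryOperation.complete(null);
    }

    /** The lookup runs only after recovery has finished, so a recovered job is always visible. */
    CompletableFuture<String> requestJobResult(String jobId) {
        return recoveryOperation.thenApply(ignored -> {
            String result = jobResults.get(jobId);
            if (result == null) {
                // stand-in for FlinkJobNotFoundException
                throw new IllegalStateException("Could not find Flink job (" + jobId + ")");
            }
            return result;
        });
    }

    public static void main(String[] args) {
        WaitForRecoverySketch dispatcher = new WaitForRecoverySketch();
        CompletableFuture<String> result = dispatcher.requestJobResult("job-1");
        System.out.println("done before recovery: " + result.isDone());
        dispatcher.recoverJob("job-1", "FINISHED");
        dispatcher.finishRecovery();
        System.out.println("result after recovery: " + result.join());
    }
}
```

With this shape, a request issued before recovery finishes stays pending instead 
of failing, and it fails only if the job is still unknown once recovery is done.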

 
