[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange failed

2019-11-12 Thread chunpinghe (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972997#comment-16972997
 ] 

chunpinghe commented on FLINK-11835:


As {{JobManagerRunner::closeAsync}} runs asynchronously, the submitted jobs
have a chance to finish if the unblock method is invoked before the
task is cancelled.

I think we can fix this by waiting for the job status to become
`JobStatus.RUNNING` before we unblock the operators.
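
The proposed ordering can be sketched in plain Java. This is only an illustration under stated assumptions, not the actual test fix: the `Status` enum, `awaitRunning`, and the latch are hypothetical stand-ins for Flink's `JobStatus`, the test's status query, and the operator gate.

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class WaitForRunningSketch {

    /** Hypothetical stand-in for Flink's JobStatus; only the states used here. */
    enum Status { CREATED, RUNNING, FINISHED }

    /** Polls statusSupplier until it reports RUNNING, or fails after the timeout. */
    static void awaitRunning(Supplier<Status> statusSupplier, Duration timeout)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (statusSupplier.get() != Status.RUNNING) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("job did not reach RUNNING within " + timeout);
            }
            Thread.sleep(10); // back off briefly between polls
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<Status> status = new AtomicReference<>(Status.CREATED);
        CountDownLatch unblock = new CountDownLatch(1); // stands in for the operator gate

        // Simulated cluster: the job becomes RUNNING shortly after submission.
        Thread cluster = new Thread(() -> {
            try {
                Thread.sleep(50);
                status.set(Status.RUNNING);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        cluster.start();

        awaitRunning(status::get, Duration.ofSeconds(5));
        unblock.countDown(); // only now let the blocked operators proceed
        cluster.join();
        System.out.println("unblocked after status " + status.get());
    }
}
```

The point of the sketch is the ordering: the gate is opened only after the status check has observed RUNNING.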

> ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.0
>Reporter: Gary Yao
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.0
>
> Attachments: scratch_22.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> 20:44:07.264 [ERROR] 
> testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)
>   Time elapsed: 4.625 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could 
> not find Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149)
> {noformat}
> https://api.travis-ci.org/v3/job/502210892/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint

2019-03-15 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11929:
---
Component/s: Runtime / Coordination

> remove useless transientBlobCache in ClusterEntrypoint
> --
>
> Key: FLINK-11929
> URL: https://issues.apache.org/jira/browse/FLINK-11929
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Minor
> Fix For: 1.9.0
>
>
> The transientBlobCache in ClusterEntrypoint is initialized using
> commonRpcService's address instead of blobServer's.
> Besides, in my view it is no longer needed after FLINK-10411. I suggest
> removing it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint

2019-03-15 Thread chunpinghe (JIRA)
chunpinghe created FLINK-11929:
--

 Summary: remove useless transientBlobCache in ClusterEntrypoint
 Key: FLINK-11929
 URL: https://issues.apache.org/jira/browse/FLINK-11929
 Project: Flink
  Issue Type: Improvement
Reporter: chunpinghe
Assignee: chunpinghe
 Fix For: 1.9.0


The transientBlobCache in ClusterEntrypoint is initialized using
commonRpcService's address instead of blobServer's.

Besides, in my view it is no longer needed after
[FLINK-10411|https://issues.apache.org/jira/browse/FLINK-10411]. I suggest
removing it.





[jira] [Updated] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint

2019-03-15 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11929:
---
External issue ID: FLINK-10411

> remove useless transientBlobCache in ClusterEntrypoint
> --
>
> Key: FLINK-11929
> URL: https://issues.apache.org/jira/browse/FLINK-11929
> Project: Flink
>  Issue Type: Improvement
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Minor
> Fix For: 1.9.0
>
>
> The transientBlobCache in ClusterEntrypoint is initialized using
> commonRpcService's address instead of blobServer's.
> Besides, in my view it is no longer needed after
> [FLINK-10411|https://issues.apache.org/jira/browse/FLINK-10411]. I suggest
> removing it.





[jira] [Updated] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint

2019-03-15 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11929:
---
Description: 
The transientBlobCache in ClusterEntrypoint is initialized using
commonRpcService's address instead of blobServer's.

Besides, in my view it is no longer needed after FLINK-10411. I suggest
removing it.

  was:
The transientBlobCache in ClusterEntrypoint is initialized using
commonRpcService's address instead of blobServer's.

Besides, in my view it is no longer needed after
[FLINK-10411|https://issues.apache.org/jira/browse/FLINK-10411]. I suggest
removing it.


> remove useless transientBlobCache in ClusterEntrypoint
> --
>
> Key: FLINK-11929
> URL: https://issues.apache.org/jira/browse/FLINK-11929
> Project: Flink
>  Issue Type: Improvement
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Minor
> Fix For: 1.9.0
>
>
> The transientBlobCache in ClusterEntrypoint is initialized using
> commonRpcService's address instead of blobServer's.
> Besides, in my view it is no longer needed after FLINK-10411. I suggest
> removing it.





[jira] [Updated] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint

2019-03-15 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11929:
---
External issue ID:   (was: FLINK-10411)

> remove useless transientBlobCache in ClusterEntrypoint
> --
>
> Key: FLINK-11929
> URL: https://issues.apache.org/jira/browse/FLINK-11929
> Project: Flink
>  Issue Type: Improvement
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Minor
> Fix For: 1.9.0
>
>
> The transientBlobCache in ClusterEntrypoint is initialized using
> commonRpcService's address instead of blobServer's.
> Besides, in my view it is no longer needed after
> [FLINK-10411|https://issues.apache.org/jira/browse/FLINK-10411]. I suggest
> removing it.





[jira] [Updated] (FLINK-11897) ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed

2019-03-13 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11897:
---
Component/s: (was: Runtime / Operators)
 Runtime / Coordination

> ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed 
> ---
>
> Key: FLINK-11897
> URL: https://issues.apache.org/jira/browse/FLINK-11897
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Tests
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.9.0
>
>
> 11:41:09.042 [INFO] Running org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest
> 11:41:11.009 [ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.964 s <<< FAILURE! - in org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest
> 11:41:11.010 [ERROR] testSuspendedOutOfRunning(org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest)  Time elapsed: 0.052 s <<< FAILURE!
> java.lang.AssertionError: Expected: is <0> but: was <3>
>   at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.validateNoInteractions(ExecutionGraphSuspendTest.java:271)
>   at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.ensureCannotLeaveSuspendedState(ExecutionGraphSuspendTest.java:255)
>   at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.testSuspendedOutOfRunning(ExecutionGraphSuspendTest.java:110)
>
> [https://api.travis-ci.org/v3/job/505154324/log.txt|https://api.travis-ci.org/v3/job/505154324/log.txt]





[jira] [Updated] (FLINK-11897) ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed

2019-03-13 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11897:
---
Labels: test-stability  (was: )

> ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed 
> ---
>
> Key: FLINK-11897
> URL: https://issues.apache.org/jira/browse/FLINK-11897
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Operators, Tests
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.9.0
>
>
> 11:41:09.042 [INFO] Running org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest
> 11:41:11.009 [ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.964 s <<< FAILURE! - in org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest
> 11:41:11.010 [ERROR] testSuspendedOutOfRunning(org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest)  Time elapsed: 0.052 s <<< FAILURE!
> java.lang.AssertionError: Expected: is <0> but: was <3>
>   at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.validateNoInteractions(ExecutionGraphSuspendTest.java:271)
>   at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.ensureCannotLeaveSuspendedState(ExecutionGraphSuspendTest.java:255)
>   at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.testSuspendedOutOfRunning(ExecutionGraphSuspendTest.java:110)
>
> [https://api.travis-ci.org/v3/job/505154324/log.txt|https://api.travis-ci.org/v3/job/505154324/log.txt]





[jira] [Created] (FLINK-11897) ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed

2019-03-13 Thread chunpinghe (JIRA)
chunpinghe created FLINK-11897:
--

 Summary: ExecutionGraphSuspendTest.testSuspendedOutOfRunning 
failed 
 Key: FLINK-11897
 URL: https://issues.apache.org/jira/browse/FLINK-11897
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Operators, Tests
Reporter: chunpinghe
Assignee: chunpinghe
 Fix For: 1.9.0


11:41:09.042 [INFO] Running org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest
11:41:11.009 [ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.964 s <<< FAILURE! - in org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest
11:41:11.010 [ERROR] testSuspendedOutOfRunning(org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest)  Time elapsed: 0.052 s <<< FAILURE!
java.lang.AssertionError: Expected: is <0> but: was <3>
  at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.validateNoInteractions(ExecutionGraphSuspendTest.java:271)
  at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.ensureCannotLeaveSuspendedState(ExecutionGraphSuspendTest.java:255)
  at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.testSuspendedOutOfRunning(ExecutionGraphSuspendTest.java:110)

[https://api.travis-ci.org/v3/job/505154324/log.txt|https://api.travis-ci.org/v3/job/505154324/log.txt]





[jira] [Updated] (FLINK-11886) update the output of cluster management script in jobmanager_high_availability doc

2019-03-12 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11886:
---
Attachment: ha_doc.png

> update the output of cluster management script in 
> jobmanager_high_availability doc
> --
>
> Key: FLINK-11886
> URL: https://issues.apache.org/jira/browse/FLINK-11886
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Major
> Fix For: 1.9.0
>
> Attachments: ha_doc.png, ha_doc_updated.png
>
>
> After FLIP-6 was released, the start and stop cluster scripts print
> "standalonesession" instead of "jobmanager" and "taskexecutor" instead of
> "taskmanager".





[jira] [Updated] (FLINK-11886) update the output of cluster management script in jobmanager_high_availability doc

2019-03-12 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11886:
---
Attachment: ha_doc_updated.png

> update the output of cluster management script in 
> jobmanager_high_availability doc
> --
>
> Key: FLINK-11886
> URL: https://issues.apache.org/jira/browse/FLINK-11886
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Major
> Fix For: 1.9.0
>
> Attachments: ha_doc.png, ha_doc_updated.png
>
>
> After FLIP-6 was released, the start and stop cluster scripts print
> "standalonesession" instead of "jobmanager" and "taskexecutor" instead of
> "taskmanager".





[jira] [Updated] (FLINK-11886) update the output of cluster management script in jobmanager_high_availability doc

2019-03-12 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11886:
---
Attachment: (was: jira_cluster.sh.png)

> update the output of cluster management script in 
> jobmanager_high_availability doc
> --
>
> Key: FLINK-11886
> URL: https://issues.apache.org/jira/browse/FLINK-11886
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: chunpinghe
>Assignee: chunpinghe
>Priority: Major
> Fix For: 1.9.0
>
>
> After FLIP-6 was released, the start and stop cluster scripts print
> "standalonesession" instead of "jobmanager" and "taskexecutor" instead of
> "taskmanager".





[jira] [Created] (FLINK-11886) update the output of cluster management script in jobmanager_high_availability doc

2019-03-12 Thread chunpinghe (JIRA)
chunpinghe created FLINK-11886:
--

 Summary: update the output of cluster management script in 
jobmanager_high_availability doc
 Key: FLINK-11886
 URL: https://issues.apache.org/jira/browse/FLINK-11886
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Reporter: chunpinghe
Assignee: chunpinghe
 Fix For: 1.9.0
 Attachments: jira_cluster.sh.png

After FLIP-6 was released, the start and stop cluster scripts print
"standalonesession" instead of "jobmanager" and "taskexecutor" instead of
"taskmanager".





[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-03-11 Thread chunpinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790135#comment-16790135
 ] 

chunpinghe commented on FLINK-11835:


I can't reproduce this bug.

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.0
>Reporter: Gary Yao
>Priority: Critical
>  Labels: test-stability
>
> {noformat}
> 20:44:07.264 [ERROR] 
> testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)
>   Time elapsed: 4.625 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could 
> not find Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149)
> {noformat}
> https://api.travis-ci.org/v3/job/502210892/log.txt





[jira] [Issue Comment Deleted] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-03-11 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-11835:
---
Comment: was deleted

(was: Is it possible that the recoveryOperation hasn't finished, which causes
the requestJobResult method to throw FlinkJobNotFoundException?

Should requestJobResult wait for the recoveryOperation to complete?)

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.0
>Reporter: Gary Yao
>Priority: Critical
>  Labels: test-stability
>
> {noformat}
> 20:44:07.264 [ERROR] 
> testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)
>   Time elapsed: 4.625 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could 
> not find Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149)
> {noformat}
> https://api.travis-ci.org/v3/job/502210892/log.txt





[jira] [Comment Edited] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-03-11 Thread chunpinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789411#comment-16789411
 ] 

chunpinghe edited comment on FLINK-11835 at 3/12/19 1:14 AM:
-

Is it possible that the recoveryOperation hasn't finished, which causes
the requestJobResult method to throw FlinkJobNotFoundException?

Should requestJobResult wait for the recoveryOperation to complete?


was (Author: moxian):
Is it possible that the recoveryOperation wasn't finished, which causes
the requestJobResult method to throw FlinkJobNotFoundException?

Should requestJobResult wait for the recoveryOperation to complete?

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.0
>Reporter: Gary Yao
>Priority: Critical
>  Labels: test-stability
>
> {noformat}
> 20:44:07.264 [ERROR] 
> testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)
>   Time elapsed: 4.625 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could 
> not find Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149)
> {noformat}
> https://api.travis-ci.org/v3/job/502210892/log.txt





[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed

2019-03-11 Thread chunpinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789411#comment-16789411
 ] 

chunpinghe commented on FLINK-11835:


Is it possible that the recoveryOperation wasn't finished, which causes
the requestJobResult method to throw FlinkJobNotFoundException?

Should requestJobResult wait for the recoveryOperation to complete?

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> --
>
> Key: FLINK-11835
> URL: https://issues.apache.org/jira/browse/FLINK-11835
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.0
>Reporter: Gary Yao
>Priority: Critical
>  Labels: test-stability
>
> {noformat}
> 20:44:07.264 [ERROR] 
> testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)
>   Time elapsed: 4.625 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could 
> not find Flink job (2e957dc4f49feaed042eb8b4a7932610)
>   at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149)
> {noformat}
> https://api.travis-ci.org/v3/job/502210892/log.txt





[jira] [Issue Comment Deleted] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2019-03-07 Thread chunpinghe (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunpinghe updated FLINK-10884:
---
Comment: was deleted

(was: What's your solution?

YARN checks the physical memory used by a container by default; you can
disable this check by setting yarn.nodemanager.pmem-check-enabled to false.
In your example, if your container uses so much off-heap memory (direct
memory or JNI malloc) that its total memory exceeds 3g, the container will
be killed anyway.

So, if your container is always killed by the NodeManager, you should check
whether the total memory you provided for it is insufficient or whether your
code has a memory leak (mainly a native memory leak).)

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Coordination
>Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Assignee: wgcn
>Priority: Major
>  Labels: pull-request-available, yarn
>
> The TM container will be killed by the NodeManager because it exceeds the
> physical memory limit. I found that the launch context that launches the TM
> container sets "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters,
> lines 160 to 166. I set a safety margin for the memory the whole container
> uses. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?





[jira] [Commented] (FLINK-11852) Improve Processing function example

2019-03-07 Thread chunpinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786624#comment-16786624
 ] 

chunpinghe commented on FLINK-11852:


You are right, it's meaningful!

> Improve Processing function example
> ---
>
> Key: FLINK-11852
> URL: https://issues.apache.org/jira/browse/FLINK-11852
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.7.2
>Reporter: Flavio Pompermaier
>Priority: Minor
>
> In the processing function documentation
> ([https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html])
> there's an "abusive" usage of timers, since a new timer is registered for
> every new tuple coming in. This could cause problems in terms of allocated
> objects and could burden the overall application.
> It could be worth mentioning this problem and removing useless timers, e.g.:
>
> {code:java}
> CountWithTimestamp current = state.value();
> if (current == null) {
>     current = new CountWithTimestamp();
>     current.key = value.f0;
> } else {
>     // drop the timer registered for the previous element of this key
>     ctx.timerService().deleteEventTimeTimer(current.lastModified + timeout);
> }
> {code}
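
The effect of the suggested cleanup can be illustrated without Flink. The following is a minimal sketch, assuming a simplified in-memory timer service: `FakeTimerService` and `pendingTimers` are hypothetical names, and only the method names `registerEventTimeTimer` and `deleteEventTimeTimer` mirror the calls used in the snippet above.

```java
import java.util.TreeSet;

public class TimerCleanupSketch {

    /** Hypothetical, simplified stand-in for a per-key event-time timer service. */
    static final class FakeTimerService {
        private final TreeSet<Long> timers = new TreeSet<>();

        void registerEventTimeTimer(long time) { timers.add(time); }

        void deleteEventTimeTimer(long time) { timers.remove(time); }

        int size() { return timers.size(); }
    }

    /**
     * Processes the events of one key and returns how many timers are still pending.
     * With deletePrevious set, the timer of the previous element is removed first.
     */
    static int pendingTimers(long[] eventTimestamps, long timeout, boolean deletePrevious) {
        FakeTimerService timerService = new FakeTimerService();
        Long lastModified = null; // null until the first element of the key arrives
        for (long ts : eventTimestamps) {
            if (deletePrevious && lastModified != null) {
                timerService.deleteEventTimeTimer(lastModified + timeout);
            }
            lastModified = ts;
            timerService.registerEventTimeTimer(lastModified + timeout);
        }
        return timerService.size();
    }

    public static void main(String[] args) {
        long[] events = {1000L, 2000L, 3000L, 4000L};
        System.out.println("without cleanup: " + pendingTimers(events, 60_000L, false));
        System.out.println("with cleanup:    " + pendingTimers(events, 60_000L, true));
    }
}
```

In this simulation the number of pending timers grows with the number of elements when nothing is deleted, and stays at one per key when the previous timer is removed before a new one is registered.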





[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.

2018-11-14 Thread chunpinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687577#comment-16687577
 ] 

chunpinghe commented on FLINK-10884:


What's your solution?

YARN checks the physical memory used by a container by default; you can
disable this check by setting yarn.nodemanager.pmem-check-enabled to false.
In your example, if your container uses so much off-heap memory (direct
memory or JNI malloc) that its total memory exceeds 3g, the container will
be killed anyway.

So, if your container is always killed by the NodeManager, you should check
whether the total memory you provided for it is insufficient or whether your
code has a memory leak (mainly a native memory leak).

> Flink on yarn  TM container will be killed by nodemanager because of  the 
> exceeded  physical memory.
> 
>
> Key: FLINK-10884
> URL: https://issues.apache.org/jira/browse/FLINK-10884
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Core
>Affects Versions: 1.6.2
> Environment: version  : 1.6.2 
> module : flink on yarn
> centos  jdk1.8
> hadoop 2.7
>Reporter: wgcn
>Priority: Major
>  Labels: yarn
>
> The TM container will be killed by the NodeManager because it exceeds the
> physical memory limit. I found that the launch context that launches the TM
> container sets "container memory = heap memory + offHeapSizeMB" in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters,
> lines 160 to 166. I set a safety margin for the memory the whole container
> uses. For example, if the container limit is 3g of memory, the sum
> "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from
> being killed. Do we have a ready-made solution, or can I commit my solution?
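
The arithmetic behind the safety margin described in the issue can be sketched as follows. This is an illustration only, not the code in ContaineredTaskManagerParameters: the class name, `usableMemoryMb`, and the 20 percent margin are assumptions chosen to match the 3g and 2.4g figures in the description.

```java
public class ContainerMemoryMarginSketch {

    /**
     * Returns the memory in MB left for "heap memory + offHeapSizeMB" after
     * reserving marginPercent of the container limit as a safety margin.
     */
    static long usableMemoryMb(long containerMemoryMb, int marginPercent) {
        if (containerMemoryMb <= 0 || marginPercent < 0 || marginPercent >= 100) {
            throw new IllegalArgumentException("invalid container size or margin");
        }
        long marginMb = containerMemoryMb * marginPercent / 100; // integer MB, rounded down
        return containerMemoryMb - marginMb;
    }

    public static void main(String[] args) {
        long containerMb = 3072; // 3g container limit from the example in the issue
        long usableMb = usableMemoryMb(containerMb, 20); // hypothetical 20% margin
        System.out.println("heap + off-heap budget: " + usableMb + " MB of " + containerMb + " MB");
    }
}
```

With a 3072 MB container and a 20 percent margin, the budget is 2458 MB, which is roughly the 2.4g mentioned in the description.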


