[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange failed
[ https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972997#comment-16972997 ] chunpinghe commented on FLINK-11835: As {{JobManagerRunner::closeAsync}} runs asynchronously, the submitted jobs may finish if the unblock method is invoked before the tasks are cancelled. I think we can fix this by waiting for the job status to become `JobStatus.RUNNING` before we unblock the operators. > ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange failed > -- > > Key: FLINK-11835 > URL: https://issues.apache.org/jira/browse/FLINK-11835 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.8.0 >Reporter: Gary Yao >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.10.0 > > Attachments: scratch_22.txt > > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > 20:44:07.264 [ERROR] > testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase) > Time elapsed: 4.625 s <<< ERROR! > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find > Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152) > Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could > not find Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149) > {noformat} > https://api.travis-ci.org/v3/job/502210892/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
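The fix proposed in the comment above (wait for the job to reach RUNNING before unblocking the operators) could be sketched roughly as follows. This is only a minimal polling helper under assumptions: the `JobStatus` enum here is a hypothetical stand-in for Flink's class, and `waitForStatus` is an illustrative name, not a method of the test or of Flink.

```java
import java.time.Duration;
import java.util.function.Supplier;

public class WaitForJobStatus {

    /** Hypothetical stand-in for Flink's JobStatus; only the values needed here. */
    public enum JobStatus { CREATED, RUNNING, FINISHED }

    /**
     * Polls the supplier until it reports the expected status or the timeout expires.
     * Returns true if the status was observed, false on timeout or interruption.
     */
    public static boolean waitForStatus(
            Supplier<JobStatus> statusSupplier, JobStatus expected, Duration timeout) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (statusSupplier.get() == expected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the expected status
            }
            try {
                Thread.sleep(10L); // short back-off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

In the test, the supplier would query the cluster for the job's status, and the operators would be unblocked only after `waitForStatus(..., JobStatus.RUNNING, ...)` returns true.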
[jira] [Updated] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint
[ https://issues.apache.org/jira/browse/FLINK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11929: --- Component/s: Runtime / Coordination > remove useless transientBlobCache in ClusterEntrypoint > -- > > Key: FLINK-11929 > URL: https://issues.apache.org/jira/browse/FLINK-11929 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Minor > Fix For: 1.9.0 > > > the transientBlobCache in ClusterEntrypoint is initialized using > commonRpcService's Address instead of blobServer's. > Besides, it is useless after FLINK-10411 from my side. I suggest to remove > it. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint
chunpinghe created FLINK-11929: -- Summary: remove useless transientBlobCache in ClusterEntrypoint Key: FLINK-11929 URL: https://issues.apache.org/jira/browse/FLINK-11929 Project: Flink Issue Type: Improvement Reporter: chunpinghe Assignee: chunpinghe Fix For: 1.9.0 the transientBlobCache in ClusterEntrypoint is initialized using commonRpcService's Address instead of blobServer's. Besides, it is useless after [FLINK-10411|https://issues.apache.org/jira/browse/FLINK-10411] from my side. I suggest remove it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint
[ https://issues.apache.org/jira/browse/FLINK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11929: --- External issue ID: FLINK-10411 > remove useless transientBlobCache in ClusterEntrypoint > -- > > Key: FLINK-11929 > URL: https://issues.apache.org/jira/browse/FLINK-11929 > Project: Flink > Issue Type: Improvement >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Minor > Fix For: 1.9.0 > > > the transientBlobCache in ClusterEntrypoint is initialized using > commonRpcService's Address instead of blobServer's. > Besides, it is useless after > [FLINK-10411|https://issues.apache.org/jira/browse/FLINK-10411] from my > side. I suggest remove it. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint
[ https://issues.apache.org/jira/browse/FLINK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11929: --- Description: the transientBlobCache in ClusterEntrypoint is initialized using commonRpcService's Address instead of blobServer's. Besides, it is useless after FLINK-10411 from my side. I suggest to remove it. was: the transientBlobCache in ClusterEntrypoint is initialized using commonRpcService's Address instead of blobServer's. Besides, it is useless after [FLINK-10411|https://issues.apache.org/jira/browse/FLINK-10411] from my side. I suggest remove it. > remove useless transientBlobCache in ClusterEntrypoint > -- > > Key: FLINK-11929 > URL: https://issues.apache.org/jira/browse/FLINK-11929 > Project: Flink > Issue Type: Improvement >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Minor > Fix For: 1.9.0 > > > the transientBlobCache in ClusterEntrypoint is initialized using > commonRpcService's Address instead of blobServer's. > Besides, it is useless after FLINK-10411 from my side. I suggest to remove > it. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11929) remove useless transientBlobCache in ClusterEntrypoint
[ https://issues.apache.org/jira/browse/FLINK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11929: --- External issue ID: (was: FLINK-10411) > remove useless transientBlobCache in ClusterEntrypoint > -- > > Key: FLINK-11929 > URL: https://issues.apache.org/jira/browse/FLINK-11929 > Project: Flink > Issue Type: Improvement >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Minor > Fix For: 1.9.0 > > > the transientBlobCache in ClusterEntrypoint is initialized using > commonRpcService's Address instead of blobServer's. > Besides, it is useless after > [FLINK-10411|https://issues.apache.org/jira/browse/FLINK-10411] from my > side. I suggest remove it. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11897) ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed
[ https://issues.apache.org/jira/browse/FLINK-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11897: --- Component/s: (was: Runtime / Operators) Runtime / Coordination > ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed > --- > > Key: FLINK-11897 > URL: https://issues.apache.org/jira/browse/FLINK-11897 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Critical > Labels: test-stability > Fix For: 1.9.0 > > > 11:41:09.042 [INFO] Running > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest > 11:41:11.009 [ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time > elapsed: 1.964 s <<< FAILURE! - in > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest > 11:41:11.010 [ERROR] > testSuspendedOutOfRunning(org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest) > Time elapsed: 0.052 s <<< FAILURE! java.lang.AssertionError: Expected: is > <0> but: was <3> at > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.validateNoInteractions(ExecutionGraphSuspendTest.java:271) > at > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.ensureCannotLeaveSuspendedState(ExecutionGraphSuspendTest.java:255) > at > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.testSuspendedOutOfRunning(ExecutionGraphSuspendTest.java:110) > > [https://api.travis-ci.org/v3/job/505154324/log.txt|https://api.travis-ci.org/v3/job/505154324/log.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11897) ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed
[ https://issues.apache.org/jira/browse/FLINK-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11897: --- Labels: test-stability (was: ) > ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed > --- > > Key: FLINK-11897 > URL: https://issues.apache.org/jira/browse/FLINK-11897 > Project: Flink > Issue Type: Bug > Components: Runtime / Operators, Tests >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Critical > Labels: test-stability > Fix For: 1.9.0 > > > 11:41:09.042 [INFO] Running > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest > 11:41:11.009 [ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time > elapsed: 1.964 s <<< FAILURE! - in > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest > 11:41:11.010 [ERROR] > testSuspendedOutOfRunning(org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest) > Time elapsed: 0.052 s <<< FAILURE! java.lang.AssertionError: Expected: is > <0> but: was <3> at > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.validateNoInteractions(ExecutionGraphSuspendTest.java:271) > at > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.ensureCannotLeaveSuspendedState(ExecutionGraphSuspendTest.java:255) > at > org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.testSuspendedOutOfRunning(ExecutionGraphSuspendTest.java:110) > > [https://api.travis-ci.org/v3/job/505154324/log.txt|https://api.travis-ci.org/v3/job/505154324/log.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11897) ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed
chunpinghe created FLINK-11897: -- Summary: ExecutionGraphSuspendTest.testSuspendedOutOfRunning failed Key: FLINK-11897 URL: https://issues.apache.org/jira/browse/FLINK-11897 Project: Flink Issue Type: Bug Components: Runtime / Operators, Tests Reporter: chunpinghe Assignee: chunpinghe Fix For: 1.9.0 11:41:09.042 [INFO] Running org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest 11:41:11.009 [ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.964 s <<< FAILURE! - in org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest 11:41:11.010 [ERROR] testSuspendedOutOfRunning(org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest) Time elapsed: 0.052 s <<< FAILURE! java.lang.AssertionError: Expected: is <0> but: was <3> at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.validateNoInteractions(ExecutionGraphSuspendTest.java:271) at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.ensureCannotLeaveSuspendedState(ExecutionGraphSuspendTest.java:255) at org.apache.flink.runtime.executiongraph.ExecutionGraphSuspendTest.testSuspendedOutOfRunning(ExecutionGraphSuspendTest.java:110) [https://api.travis-ci.org/v3/job/505154324/log.txt|https://api.travis-ci.org/v3/job/505154324/log.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11886) update the output of cluster management script in jobmanager_high_availability doc
[ https://issues.apache.org/jira/browse/FLINK-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11886: --- Attachment: ha_doc.png > update the output of cluster management script in > jobmanager_high_availability doc > -- > > Key: FLINK-11886 > URL: https://issues.apache.org/jira/browse/FLINK-11886 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Major > Fix For: 1.9.0 > > Attachments: ha_doc.png, ha_doc_updated.png > > > Since FLIP-6 was released, the start and stop cluster scripts print > "standalonesession" instead of "jobmanager" and "taskexecutor" instead of > "taskmanager". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11886) update the output of cluster management script in jobmanager_high_availability doc
[ https://issues.apache.org/jira/browse/FLINK-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11886: --- Attachment: ha_doc_updated.png > update the output of cluster management script in > jobmanager_high_availability doc > -- > > Key: FLINK-11886 > URL: https://issues.apache.org/jira/browse/FLINK-11886 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Major > Fix For: 1.9.0 > > Attachments: ha_doc.png, ha_doc_updated.png > > > Since FLIP-6 was released, the start and stop cluster scripts print > "standalonesession" instead of "jobmanager" and "taskexecutor" instead of > "taskmanager". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11886) update the output of cluster management script in jobmanager_high_availability doc
[ https://issues.apache.org/jira/browse/FLINK-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11886: --- Attachment: (was: jira_cluster.sh.png) > update the output of cluster management script in > jobmanager_high_availability doc > -- > > Key: FLINK-11886 > URL: https://issues.apache.org/jira/browse/FLINK-11886 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: chunpinghe >Assignee: chunpinghe >Priority: Major > Fix For: 1.9.0 > > > Since FLIP-6 was released, the start and stop cluster scripts print > "standalonesession" instead of "jobmanager" and "taskexecutor" instead of > "taskmanager". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11886) update the output of cluster management script in jobmanager_high_availability doc
chunpinghe created FLINK-11886: -- Summary: update the output of cluster management script in jobmanager_high_availability doc Key: FLINK-11886 URL: https://issues.apache.org/jira/browse/FLINK-11886 Project: Flink Issue Type: Improvement Components: Documentation Reporter: chunpinghe Assignee: chunpinghe Fix For: 1.9.0 Attachments: jira_cluster.sh.png Since FLIP-6 was released, the start and stop cluster scripts print "standalonesession" instead of "jobmanager" and "taskexecutor" instead of "taskmanager". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
[ https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790135#comment-16790135 ] chunpinghe commented on FLINK-11835: i can't reproduce this bug. > ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed > -- > > Key: FLINK-11835 > URL: https://issues.apache.org/jira/browse/FLINK-11835 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.8.0 >Reporter: Gary Yao >Priority: Critical > Labels: test-stability > > {noformat} > 20:44:07.264 [ERROR] > testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase) > Time elapsed: 4.625 s <<< ERROR! > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find > Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152) > Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could > not find Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149) > {noformat} > https://api.travis-ci.org/v3/job/502210892/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
[ https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-11835: --- Comment: was deleted (was: is it possible that the recoveryOperation hasn't finished which causes requestJobResult method to throw FlinkJobNotFoundException. requestJobResult should wait recoveryOperation complete ? ) > ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed > -- > > Key: FLINK-11835 > URL: https://issues.apache.org/jira/browse/FLINK-11835 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.8.0 >Reporter: Gary Yao >Priority: Critical > Labels: test-stability > > {noformat} > 20:44:07.264 [ERROR] > testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase) > Time elapsed: 4.625 s <<< ERROR! > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find > Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152) > Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could > not find Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149) > {noformat} > https://api.travis-ci.org/v3/job/502210892/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
[ https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789411#comment-16789411 ] chunpinghe edited comment on FLINK-11835 at 3/12/19 1:14 AM: - is it possible that the recoveryOperation hasn't finished which causes requestJobResult method to throw FlinkJobNotFoundException. requestJobResult should wait recoveryOperation complete ? was (Author: moxian): is it possible that the recoveryOperation wasn't finished which causes requestJobResult method to throw FlinkJobNotFoundException. requestJobResult should wait recoveryOperation complete ? > ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed > -- > > Key: FLINK-11835 > URL: https://issues.apache.org/jira/browse/FLINK-11835 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.8.0 >Reporter: Gary Yao >Priority: Critical > Labels: test-stability > > {noformat} > 20:44:07.264 [ERROR] > testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase) > Time elapsed: 4.625 s <<< ERROR! > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find > Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152) > Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could > not find Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149) > {noformat} > https://api.travis-ci.org/v3/job/502210892/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11835) ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
[ https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789411#comment-16789411 ] chunpinghe commented on FLINK-11835: is it possible that the recoveryOperation wasn't finished which causes requestJobResult method to throw FlinkJobNotFoundException. requestJobResult should wait recoveryOperation complete ? > ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed > -- > > Key: FLINK-11835 > URL: https://issues.apache.org/jira/browse/FLINK-11835 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.8.0 >Reporter: Gary Yao >Priority: Critical > Labels: test-stability > > {noformat} > 20:44:07.264 [ERROR] > testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase) > Time elapsed: 4.625 s <<< ERROR! > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find > Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152) > Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could > not find Flink job (2e957dc4f49feaed042eb8b4a7932610) > at > org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149) > {noformat} > https://api.travis-ci.org/v3/job/502210892/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
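The idea raised in the question above (let the job-result lookup wait for recovery to complete) might look like the sketch below. It is not a claim about how Flink's Dispatcher behaves; the class `RecoveryAwareLookup`, its string-based job results, and the helper methods are hypothetical and only borrow the names `recoveryOperation` and `requestJobResult` from the comment.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class RecoveryAwareLookup {

    // Completed once all jobs have been recovered (hypothetical model of "recoveryOperation").
    private final CompletableFuture<Void> recoveryOperation = new CompletableFuture<>();

    // Recovered jobs, keyed by job id; the String value stands in for a real job result.
    private final Map<String, String> jobResults = new ConcurrentHashMap<>();

    /** Registers a recovered job; meant to be called before completeRecovery(). */
    public void recoverJob(String jobId, String result) {
        jobResults.put(jobId, result);
    }

    /** Signals that recovery has finished, releasing any pending lookups. */
    public void completeRecovery() {
        recoveryOperation.complete(null);
    }

    /** The lookup runs only after recovery has completed, so it cannot miss a recovered job. */
    public CompletableFuture<String> requestJobResult(String jobId) {
        return recoveryOperation.thenApply(ignored -> {
            String result = jobResults.get(jobId);
            if (result == null) {
                throw new IllegalStateException("Could not find job (" + jobId + ")");
            }
            return result;
        });
    }
}
```

A request made before recovery finishes simply stays pending; a job that is still unknown after recovery fails the returned future instead of failing early.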
[jira] [Issue Comment Deleted] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunpinghe updated FLINK-10884: --- Comment: was deleted (was: What's your solution? By default, YARN checks the physical memory used by each container; you can disable the check by setting yarn.nodemanager.pmem-check-enabled to false. In your example, if your container uses so much off-heap memory (direct memory or JNI malloc) that total memory exceeds 3g, the container will be killed anyway. So, if your container is always killed by the NodeManager, you should check whether the total memory you provided for it is insufficient or whether your code has a memory leak (mainly a native memory leak). ) > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN, Runtime / Coordination >Affects Versions: 1.5.5, 1.6.2, 1.7.0 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Assignee: wgcn >Priority: Major > Labels: pull-request-available, yarn > > The TM container will be killed by the NodeManager for exceeding its > physical memory. I found that the launch context launching the TM container uses > "container memory = heap memory + offHeapSizeMB" in the class > org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters, > lines 160 to 166. I set a safety margin for the whole memory the container > uses. For example, if the container limit is 3g, the sum > "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container > from being killed. Do we have a ready-made solution, or can I commit my > solution? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11852) Improve Processing function example
[ https://issues.apache.org/jira/browse/FLINK-11852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786624#comment-16786624 ] chunpinghe commented on FLINK-11852: You are right, it's meaningful! > Improve Processing function example > --- > > Key: FLINK-11852 > URL: https://issues.apache.org/jira/browse/FLINK-11852 > Project: Flink > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.7.2 >Reporter: Flavio Pompermaier >Priority: Minor > > In the processing function documentation > ([https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html]) > there's an "abusive" use of timers, since a new timer is registered for > every incoming tuple. This could cause problems in terms of allocated > objects and could burden the overall application. > It could be worth mentioning this problem and removing useless timers, e.g.: > > {code:java} > CountWithTimestamp current = state.value(); > if (current == null) { > current = new CountWithTimestamp(); > current.key = value.f0; > } else { > ctx.timerService().deleteEventTimeTimer(current.lastModified + timeout); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
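The effect of the snippet quoted in the issue above can be illustrated without Flink by a small simulation. This is only a sketch under assumptions: `SingleTimerPerKey` and its in-memory timer sets are hypothetical stand-ins for keyed state and the timer service, and `TIMEOUT_MS` plays the role of `timeout` in the quoted code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SingleTimerPerKey {

    public static final long TIMEOUT_MS = 60_000L;

    // Stand-in for the per-key state holding lastModified.
    private final Map<String, Long> lastModified = new HashMap<>();

    // Stand-in for the timer service: pending timer timestamps per key.
    private final Map<String, Set<Long>> timers = new HashMap<>();

    /** Handles one element: drops the timer registered for the previous element, then registers a new one. */
    public void processElement(String key, long timestamp) {
        Set<Long> keyTimers = timers.computeIfAbsent(key, k -> new TreeSet<>());
        Long previous = lastModified.get(key);
        if (previous != null) {
            // Mirrors deleteEventTimeTimer(current.lastModified + timeout) from the issue.
            keyTimers.remove(previous + TIMEOUT_MS);
        }
        lastModified.put(key, timestamp);
        keyTimers.add(timestamp + TIMEOUT_MS);
    }

    /** Number of timers currently pending for the key. */
    public int pendingTimers(String key) {
        Set<Long> keyTimers = timers.get(key);
        return keyTimers == null ? 0 : keyTimers.size();
    }
}
```

With the deletion in place, a key keeps at most one pending timer no matter how many elements arrive; without it, the set would grow by one entry per element.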
[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687577#comment-16687577 ] chunpinghe commented on FLINK-10884: What's your solution? By default, YARN checks the physical memory used by each container; you can disable the check by setting yarn.nodemanager.pmem-check-enabled to false. In your example, if your container uses so much off-heap memory (direct memory or JNI malloc) that total memory exceeds 3g, the container will be killed anyway. So, if your container is always killed by the NodeManager, you should check whether the total memory you provided for it is insufficient or whether your code has a memory leak (mainly a native memory leak). > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Core >Affects Versions: 1.6.2 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Priority: Major > Labels: yarn > > The TM container will be killed by the NodeManager for exceeding its > physical memory. I found that the launch context launching the TM container uses > "container memory = heap memory + offHeapSizeMB" in the class > org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters, > lines 160 to 166. I set a safety margin for the whole memory the container > uses. For example, if the container limit is 3g, the sum > "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container > from being killed. Do we have a ready-made solution, or can I commit my > solution? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
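The arithmetic behind the safety margin in the issue above (a 3g container limit, with "heap memory + offHeapSizeMB" held to about 2.4g) can be sketched as below. The class name, the method `usableMemoryMb`, and the 0.8 ratio are assumptions chosen to reproduce the reporter's numbers; they are not taken from Flink's code.

```java
public class ContainerMemoryMargin {

    /**
     * Returns how many megabytes heap plus off-heap memory may use when only
     * the given fraction of the container limit is handed to the JVM.
     */
    public static long usableMemoryMb(long containerLimitMb, double safetyRatio) {
        if (containerLimitMb <= 0) {
            throw new IllegalArgumentException("container limit must be positive");
        }
        if (safetyRatio <= 0.0 || safetyRatio > 1.0) {
            throw new IllegalArgumentException("safety ratio must be in (0, 1]");
        }
        // Round down so the result never exceeds the intended share.
        return (long) Math.floor(containerLimitMb * safetyRatio);
    }

    public static void main(String[] args) {
        long limitMb = 3L * 1024L; // 3g container limit from the issue
        long usableMb = usableMemoryMb(limitMb, 0.8);
        System.out.println("usable MB: " + usableMb + ", headroom MB: " + (limitMb - usableMb));
    }
}
```

With a 3072 MB limit and a ratio of 0.8, roughly 2457 MB (about 2.4g) goes to heap plus off-heap, leaving the remainder as headroom for native allocations that YARN's physical-memory check also counts.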