[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
[ https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902748#comment-16902748 ] Till Rohrmann edited comment on FLINK-13489 at 8/8/19 7:17 AM: --- The latest [release-1.9 cron job|https://travis-ci.org/apache/flink/builds/568771238] passed. Hence I assume that only the Akka timeout problem is left. Assuming that this is most likely a test instability and not a fundamental problem, I'll decrease the priority of this issue to critical. We should nevertheless try to fix the test instability [~kevin.cyj]. was (Author: till.rohrmann): The latest [release-1.9 cron job]|https://travis-ci.org/apache/flink/builds/568771238] passed. Hence I assume that only the Akka timeout problem is left. Assuming that this is most likely a test instability and not a fundamental problem, I'll decrease the priority of this issue to critical. We should nevertheless try to fix the test instability [~kevin.cyj]. > Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout > -- > > Key: FLINK-13489 > URL: https://issues.apache.org/jira/browse/FLINK-13489 > Project: Flink > Issue Type: Bug > Components: Test Infrastructure >Reporter: Tzu-Li (Gordon) Tai >Assignee: Yingjie Cao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 20m > Remaining Estimate: 0h > > https://api.travis-ci.org/v3/job/564925128/log.txt > {code} > > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: Job failed. > (JobID: 1b4f1807cc749628cfc1bdf04647527a) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338) > at > org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60) > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507) > at > org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746) > at > org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. > at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247) > ... 21 more > Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager > with id ea456d6a590eca7598c19c4d35e56db9 timed out. > at > org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149) > at > org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) > at >
[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
[ https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901917#comment-16901917 ] Till Rohrmann edited comment on FLINK-13489 at 8/7/19 9:21 AM: --- Thanks for the update [~kevin.cyj]. The corresponding JIRA issue for the failing tpch end-to-end test is FLINK-13607. I would suggest that we unblock this issue if we see that the cron jobs are passing a couple of times (given that the akka timeout issue happens rarely). was (Author: till.rohrmann): Thanks for the update [~kevin.cyj]. The corresponding JIRA issue for the failing tpch end-to-end test is FLINK-13607. > Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout > -- > > Key: FLINK-13489 > URL: https://issues.apache.org/jira/browse/FLINK-13489 > Project: Flink > Issue Type: Bug > Components: Test Infrastructure >Reporter: Tzu-Li (Gordon) Tai >Assignee: Yingjie Cao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.9.0 > > > https://api.travis-ci.org/v3/job/564925128/log.txt > {code} > > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: Job failed. > (JobID: 1b4f1807cc749628cfc1bdf04647527a) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338) > at > org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60) > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507) > at > org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746) > at > org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. > at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247) > ... 21 more > Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager > with id ea456d6a590eca7598c19c4d35e56db9 timed out. > at > org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149) > at > org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > at
[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
[ https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901816#comment-16901816 ] Till Rohrmann edited comment on FLINK-13489 at 8/7/19 7:37 AM: --- -[~kevin.cyj] are you still working on this issue? I've seen the test still fail after FLINK-13579.- In the latest {{release-1.9}} cron job the tests weren't executed because the tpch end-to-end test failed consistently. was (Author: till.rohrmann): [~kevin.cyj] are you still working on this issue? I've seen the test still fail after FLINK-13579. > Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout > -- > > Key: FLINK-13489 > URL: https://issues.apache.org/jira/browse/FLINK-13489 > Project: Flink > Issue Type: Bug > Components: Test Infrastructure >Reporter: Tzu-Li (Gordon) Tai >Assignee: Yingjie Cao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.9.0 > > > https://api.travis-ci.org/v3/job/564925128/log.txt > {code} > > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: Job failed. > (JobID: 1b4f1807cc749628cfc1bdf04647527a) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338) > at > org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60) > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507) > at > org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746) > at > org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. > at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247) > ... 21 more > Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager > with id ea456d6a590eca7598c19c4d35e56db9 timed out. > at > org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149) > at > org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > at
[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
[ https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899343#comment-16899343 ] Till Rohrmann edited comment on FLINK-13489 at 8/5/19 9:50 AM: --- [~StephanEwen] I run the test for many times, but only encountered the akka timeout problem only once, and nerve encountered the heartbeat timeout problem. But unfortunately, I did not get the JM/TM log of that failure. Latter, I modified the test script to print gc and JM/TM log out and run the test for many times, but the timeout problem did not occur. I noticed the gc time is a little long, many 2, 3, 4 seconds (these are for successfully finished job). I guess the previous failure may result by GC. Another problem is that the Travis test platform seems not stable, the test time varies. As for containerized.heap-cutoff-min, it is because it was used for memory calculation. If the default value (600) is used, the standalone cluster can not start up. I agree with you that this config option should not be considered by standalone mode, but it seems reusing the same code (I think it also should be fixed). The flowing is the exception stack: {code} 2019-08-01 18:42:29,289 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint StandaloneSessionClusterEntrypoint. org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint StandaloneSessionClusterEntrypoint. at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501) at org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65) Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent. at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:259) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:210) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:164) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:163) ... 2 more Caused by: java.lang.IllegalArgumentException: The configuration value 'containerized.heap-cutoff-min'='600' is larger than the total container memory 512 at org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters.calculateCutoffMB(ContaineredTaskManagerParameters.java:133) at org.apache.flink.runtime.util.ResourceManagerUtil.getResourceManagerConfiguration(ResourceManagerUtil.java:34) at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:171) ... 6 more {code} was (Author: kevin.cyj): [~StephanEwen] I run the test for many times, but only encountered the akka timeout problem only once, and nerve encountered the heartbeat timeout problem. But unfortunately, I did not get the JM/TM log of that failure. Latter, I modified the test script to print gc and JM/TM log out and run the test for many times, but the timeout problem did not occur. I noticed the gc time is a little long, many 2, 3, 4 seconds (these are for successfully finished job). I guess the previous failure may result by GC. Another problem is that the Travis test platform seems not stable, the test time varies. As for containerized.heap-cutoff-min, it is because it was used for memory calculation. If the default value (600) is used, the standalone cluster can not start up. I agree with you that this config option should not be considered by standalone mode, but it seems reusing the same code (I think it also should be fixed). The flowing is the exception stack: 2019-08-01 18:42:29,289 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint StandaloneSessionClusterEntrypoint. org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint StandaloneSessionClusterEntrypoint. at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501) at org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65) Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent. at
[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
[ https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899957#comment-16899957 ] Yingjie Cao edited comment on FLINK-13489 at 8/5/19 9:50 AM: - The GC time is still high after I Increased the heap size of TM and JM (though lower than before, there are still many 1 ~ 2 seconds entries). And I notice that the real time if much longer than user plus sys time. Usually, the user and sys time are longer than real time because of multi thread GC. Longer real time usually means that the GC threads are waiting for CPU or IO. Like the GC thread, the RPC main thread may also need to wait for CPU or IO. If so, may be the proper solution for this case is just to increase the akka and heartbeat timeout config value. I will collect more information about this case. was (Author: kevin.cyj): The GC time is still high after I Increased the heap size of TM and JM (though lower than before, there are still many 1 ~ 2 seconds entries). And I notice that the real time if much longer than user plus sys time. Usually, the real and sys time are longer than real time because of multi thread GC. Longer real time usually means that the GC threads are waiting for CPU or IO. Like the GC thread, the RPC main thread may also need to wait for CPU or IO. If so, may be the proper solution for this case is just to increase the akka and heartbeat timeout config value. I will collect more information about this case. > Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout > -- > > Key: FLINK-13489 > URL: https://issues.apache.org/jira/browse/FLINK-13489 > Project: Flink > Issue Type: Bug > Components: Test Infrastructure >Reporter: Tzu-Li (Gordon) Tai >Assignee: Yingjie Cao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.9.0 > > > https://api.travis-ci.org/v3/job/564925128/log.txt > {code} > > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: Job failed. > (JobID: 1b4f1807cc749628cfc1bdf04647527a) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338) > at > org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60) > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507) > at > org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746) > at > org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. > at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247) > ... 21 more > Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager > with id ea456d6a590eca7598c19c4d35e56db9 timed out. > at >
[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
[ https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899343#comment-16899343 ] Yingjie Cao edited comment on FLINK-13489 at 8/3/19 3:34 AM: - [~StephanEwen] I run the test for many times, but only encountered the akka timeout problem only once, and nerve encountered the heartbeat timeout problem. But unfortunately, I did not get the JM/TM log of that failure. Latter, I modified the test script to print gc and JM/TM log out and run the test for many times, but the timeout problem did not occur. I noticed the gc time is a little long, many 2, 3, 4 seconds (these are for successfully finished job). I guess the previous failure may result by GC. Another problem is that the Travis test platform seems not stable, the test time varies. As for containerized.heap-cutoff-min, it is because it was used for memory calculation. If the default value (600) is used, the standalone cluster can not start up. I agree with you that this config option should not be considered by standalone mode, but it seems reusing the same code (I think it also should be fixed). The flowing is the exception stack: 2019-08-01 18:42:29,289 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint StandaloneSessionClusterEntrypoint. org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint StandaloneSessionClusterEntrypoint. at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501) at org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65) Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent. at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:259) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:210) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:164) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:163) ... 2 more Caused by: java.lang.IllegalArgumentException: The configuration value 'containerized.heap-cutoff-min'='600' is larger than the total container memory 512 at org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters.calculateCutoffMB(ContaineredTaskManagerParameters.java:133) at org.apache.flink.runtime.util.ResourceManagerUtil.getResourceManagerConfiguration(ResourceManagerUtil.java:34) at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:171) ... 6 more was (Author: kevin.cyj): [~StephanEwen] I run the test for many times, but only encountered the akka timeout problem only once, and nerve encountered the heartbeat timeout problem. But unfortunately, I did not get the JM/TM log of that failure. Latter, I modified the test script to print gc and JM/TM log out and run the test for many times, but the timeout problem did not occur. I noticed the gc time is a little long, many 2, 3, 4 seconds (these are for successfully finished job). I guess the previous failure may result by GC. Another problem is that the Travis test platform seems not stable, the test time varies. As for containerized.heap-cutoff-min, it is because it was used for memory calculation. If the default value (600) is used, the standalone cluster can start up. I agree with you that this config option should not be considered by standalone mode, but it seems reusing the same code (I think it also should be fixed). The flowing is the exception stack: 2019-08-01 18:42:29,289 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint StandaloneSessionClusterEntrypoint. org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint StandaloneSessionClusterEntrypoint. at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501) at org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65) Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent. at