[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout

2019-08-08 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902748#comment-16902748
 ] 

Till Rohrmann edited comment on FLINK-13489 at 8/8/19 7:17 AM:
---

The latest [release-1.9 cron 
job|https://travis-ci.org/apache/flink/builds/568771238] passed. Hence I assume 
that only the Akka timeout problem is left. Assuming that this is most likely a 
test instability and not a fundamental problem, I'll decrease the priority of 
this issue to critical. We should nevertheless try to fix the test instability 
[~kevin.cyj].


was (Author: till.rohrmann):
The latest [release-1.9 cron 
job]|https://travis-ci.org/apache/flink/builds/568771238] passed. Hence I 
assume that only the Akka timeout problem is left. Assuming that this is most 
likely a test instability and not a fundamental problem, I'll decrease the 
priority of this issue to critical. We should nevertheless try to fix the test 
instability [~kevin.cyj].

> Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
> --
>
> Key: FLINK-13489
> URL: https://issues.apache.org/jira/browse/FLINK-13489
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://api.travis-ci.org/v3/job/564925128/log.txt
> {code}
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: 1b4f1807cc749628cfc1bdf04647527a)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507)
>   at 
> org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247)
>   ... 21 more
> Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager 
> with id ea456d6a590eca7598c19c4d35e56db9 timed out.
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
>   at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>   at 
> 

[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout

2019-08-07 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901917#comment-16901917
 ] 

Till Rohrmann edited comment on FLINK-13489 at 8/7/19 9:21 AM:
---

Thanks for the update [~kevin.cyj]. The corresponding JIRA issue for the 
failing tpch end-to-end test is FLINK-13607.

I would suggest that we unblock this issue if we see that the cron jobs are 
passing a couple of times (given that the akka timeout issue happens rarely).


was (Author: till.rohrmann):
Thanks for the update [~kevin.cyj]. The corresponding JIRA issue for the 
failing tpch end-to-end test is FLINK-13607.

> Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
> --
>
> Key: FLINK-13489
> URL: https://issues.apache.org/jira/browse/FLINK-13489
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>
> https://api.travis-ci.org/v3/job/564925128/log.txt
> {code}
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: 1b4f1807cc749628cfc1bdf04647527a)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507)
>   at 
> org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247)
>   ... 21 more
> Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager 
> with id ea456d6a590eca7598c19c4d35e56db9 timed out.
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
>   at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>   at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>   at 

[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout

2019-08-07 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901816#comment-16901816
 ] 

Till Rohrmann edited comment on FLINK-13489 at 8/7/19 7:37 AM:
---

-[~kevin.cyj] are you still working on this issue? I've seen the test still 
fail after FLINK-13579.-

In the latest {{release-1.9}} cron job the tests weren't executed because the 
tpch end-to-end test failed consistently.


was (Author: till.rohrmann):
[~kevin.cyj] are you still working on this issue? I've seen the test still fail 
after FLINK-13579.

> Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
> --
>
> Key: FLINK-13489
> URL: https://issues.apache.org/jira/browse/FLINK-13489
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>
> https://api.travis-ci.org/v3/job/564925128/log.txt
> {code}
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: 1b4f1807cc749628cfc1bdf04647527a)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507)
>   at 
> org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247)
>   ... 21 more
> Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager 
> with id ea456d6a590eca7598c19c4d35e56db9 timed out.
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
>   at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>   at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>   at 

[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout

2019-08-05 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899343#comment-16899343
 ] 

Till Rohrmann edited comment on FLINK-13489 at 8/5/19 9:50 AM:
---

[~StephanEwen] I run the test for many times, but only encountered the akka 
timeout problem only once, and nerve encountered the heartbeat timeout problem. 
But unfortunately, I did not get the JM/TM log of that failure. Latter, I 
modified the test script to print gc and JM/TM log out and run the test for 
many times, but the timeout problem did not occur. I noticed the gc time is a 
little long, many 2, 3, 4 seconds (these are for successfully finished job). I 
guess the previous failure may result by GC. Another problem is that the Travis 
test platform seems not stable, the test time varies.

As for containerized.heap-cutoff-min, it is because it was used for memory 
calculation. If the default value (600) is used, the standalone cluster can not 
start up. I agree with you that this config option should not be considered by 
standalone mode, but it seems reusing the same code (I think it also should be 
fixed). The flowing is the exception stack:

{code}
 2019-08-01 18:42:29,289 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster 
entrypoint StandaloneSessionClusterEntrypoint.
 org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
initialize the cluster entrypoint StandaloneSessionClusterEntrypoint.
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501)
 at 
org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
 Caused by: org.apache.flink.util.FlinkException: Could not create the 
DispatcherResourceManagerComponent.
 at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:259)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:210)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:164)
 at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:163)
 ... 2 more
 Caused by: java.lang.IllegalArgumentException: The configuration value 
'containerized.heap-cutoff-min'='600' is larger than the total container memory 
512
 at 
org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters.calculateCutoffMB(ContaineredTaskManagerParameters.java:133)
 at 
org.apache.flink.runtime.util.ResourceManagerUtil.getResourceManagerConfiguration(ResourceManagerUtil.java:34)
 at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:171)
 ... 6 more
{code}


was (Author: kevin.cyj):
[~StephanEwen] I run the test for many times, but only encountered the akka 
timeout problem only once, and nerve encountered the heartbeat timeout problem. 
But unfortunately, I did not get the JM/TM log of that failure. Latter, I 
modified the test script to print gc and JM/TM log out and run the test for 
many times, but the timeout problem did not occur. I noticed the gc time is a 
little long, many 2, 3, 4 seconds (these are for successfully finished job). I 
guess the previous failure may result by GC. Another problem is that the Travis 
test platform seems not stable, the test time varies.

As for containerized.heap-cutoff-min, it is because it was used for memory 
calculation. If the default value (600) is used, the standalone cluster can not 
start up. I agree with you that this config option should not be considered by 
standalone mode, but it seems reusing the same code (I think it also should be 
fixed). The flowing is the exception stack:


 2019-08-01 18:42:29,289 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster 
entrypoint StandaloneSessionClusterEntrypoint.
 org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
initialize the cluster entrypoint StandaloneSessionClusterEntrypoint.
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501)
 at 
org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
 Caused by: org.apache.flink.util.FlinkException: Could not create the 
DispatcherResourceManagerComponent.
 at 

[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout

2019-08-05 Thread Yingjie Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899957#comment-16899957
 ] 

Yingjie Cao edited comment on FLINK-13489 at 8/5/19 9:50 AM:
-

The GC time is still high after I Increased the heap size of TM and JM (though 
lower than before, there are still many 1 ~ 2 seconds entries). And I notice 
that the real time if much longer than user plus sys time. Usually, the user 
and sys time are longer than real time because of multi thread GC. Longer real 
time usually means that the GC threads are waiting for CPU or IO. Like the GC 
thread, the RPC main thread  may also need to wait for CPU or IO. If so, may be 
the proper solution for this case is just to increase the akka and heartbeat 
timeout config value.  I will collect more information about this case.


was (Author: kevin.cyj):
The GC time is still high after I Increased the heap size of TM and JM (though 
lower than before, there are still many 1 ~ 2 seconds entries). And I notice 
that the real time if much longer than user plus sys time. Usually, the real 
and sys time are longer than real time because of multi thread GC. Longer real 
time usually means that the GC threads are waiting for CPU or IO. Like the GC 
thread, the RPC main thread  may also need to wait for CPU or IO. If so, may be 
the proper solution for this case is just to increase the akka and heartbeat 
timeout config value.  I will collect more information about this case.

> Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
> --
>
> Key: FLINK-13489
> URL: https://issues.apache.org/jira/browse/FLINK-13489
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>
> https://api.travis-ci.org/v3/job/564925128/log.txt
> {code}
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: 1b4f1807cc749628cfc1bdf04647527a)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507)
>   at 
> org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247)
>   ... 21 more
> Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager 
> with id ea456d6a590eca7598c19c4d35e56db9 timed out.
>   at 
> 

[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout

2019-08-02 Thread Yingjie Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899343#comment-16899343
 ] 

Yingjie Cao edited comment on FLINK-13489 at 8/3/19 3:34 AM:
-

[~StephanEwen] I run the test for many times, but only encountered the akka 
timeout problem only once, and nerve encountered the heartbeat timeout problem. 
But unfortunately, I did not get the JM/TM log of that failure. Latter, I 
modified the test script to print gc and JM/TM log out and run the test for 
many times, but the timeout problem did not occur. I noticed the gc time is a 
little long, many 2, 3, 4 seconds (these are for successfully finished job). I 
guess the previous failure may result by GC. Another problem is that the Travis 
test platform seems not stable, the test time varies.

As for containerized.heap-cutoff-min, it is because it was used for memory 
calculation. If the default value (600) is used, the standalone cluster can not 
start up. I agree with you that this config option should not be considered by 
standalone mode, but it seems reusing the same code (I think it also should be 
fixed). The flowing is the exception stack:


 2019-08-01 18:42:29,289 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster 
entrypoint StandaloneSessionClusterEntrypoint.
 org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
initialize the cluster entrypoint StandaloneSessionClusterEntrypoint.
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501)
 at 
org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
 Caused by: org.apache.flink.util.FlinkException: Could not create the 
DispatcherResourceManagerComponent.
 at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:259)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:210)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:164)
 at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:163)
 ... 2 more
 Caused by: java.lang.IllegalArgumentException: The configuration value 
'containerized.heap-cutoff-min'='600' is larger than the total container memory 
512
 at 
org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters.calculateCutoffMB(ContaineredTaskManagerParameters.java:133)
 at 
org.apache.flink.runtime.util.ResourceManagerUtil.getResourceManagerConfiguration(ResourceManagerUtil.java:34)
 at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:171)
 ... 6 more


was (Author: kevin.cyj):
[~StephanEwen] I run the test for many times, but only encountered the akka 
timeout problem only once, and nerve encountered the heartbeat timeout problem. 
But unfortunately, I did not get the JM/TM log of that failure. Latter, I 
modified the test script to print gc and JM/TM log out and run the test for 
many times, but the timeout problem did not occur. I noticed the gc time is a 
little long, many 2, 3, 4 seconds (these are for successfully finished job). I 
guess the previous failure may result by GC. Another problem is that the Travis 
test platform seems not stable, the test time varies.

 

As for containerized.heap-cutoff-min, it is because it was used for memory 
calculation. If the default value (600) is used, the standalone cluster can 
start up. I agree with you that this config option should not be considered by 
standalone mode, but it seems reusing the same code (I think it also should be 
fixed). The flowing is the exception stack:
2019-08-01 18:42:29,289 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster 
entrypoint StandaloneSessionClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
initialize the cluster entrypoint StandaloneSessionClusterEntrypoint.
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501)
at 
org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
Caused by: org.apache.flink.util.FlinkException: Could not create the 
DispatcherResourceManagerComponent.
at