Hi. 

Did you find a reason for the detaching?
I sometimes see the same on our system running Flink 1.4 on DC/OS. I have 
enabled taskmanager.debug.memory.startLogThread for debugging. 
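
For reference, this is roughly what we set in flink-conf.yaml (key names from 
memory, so please double-check them against the docs for your version):

    taskmanager.debug.memory.startLogThread: true
    taskmanager.debug.memory.logIntervalMs: 5000

That makes the task managers log their memory usage periodically, which helps 
to see whether memory pressure lines up with the disassociation.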

Med venlig hilsen / Best regards
Lasse Nedergaard


> On 20 Jan 2018, at 12:57, Kien Truong <duckientru...@gmail.com> wrote:
> 
> Hi,
> 
> You should enable and check your garbage collection log.
> 
> We've encountered cases where the Task Manager disassociated due to long GC pauses.
> 
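> In Flink, GC logging can usually be turned on through env.java.opts (or the 
> per-process env.java.opts.taskmanager / env.java.opts.jobmanager) in 
> flink-conf.yaml; a rough example for a Java 8 JVM, adjust paths and flags to 
> your setup:
> 
>     env.java.opts.taskmanager: "-Xloggc:/tmp/tm-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
> 
> Long pauses in that log around the time of the disassociation would point to 
> GC as the cause.
> 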
> Regards,
> 
> Kien
>> On 1/20/2018 1:27 AM, ashish pok wrote:
>> Hi All,
>> 
>> We have hit some load-related issues and were wondering if anyone has 
>> suggestions. We are noticing task managers and job managers becoming detached 
>> from each other under load and never really syncing up again. As a result, 
>> the Flink session shows 0 slots available for processing. Even though the apps 
>> are configured to restart, that isn't really helping as there are no slots 
>> available to run them.
>> 
>> 
>> Here are excerpts from the logs that seemed relevant. (I am trimming out the 
>> rest of the logs for brevity.)
>> 
>> Job Manager:
>> 2018-01-19 12:38:00,423 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>>                -  Starting JobManager (Version: 1.4.0, Rev:3a9d9f2, 
>> Date:06.12.2017 @ 11:08:40 UTC)
>> 
>> 2018-01-19 12:38:00,792 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>>                -  Maximum heap size: 16384 MiBytes
>> 2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>>                -  Hadoop version: 2.6.5
>> 2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>>                -  JVM Options:
>> 2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>>                -     -Xms16384m
>> 2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>>                -     -Xmx16384m
>> 2018-01-19 12:38:00,795 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>>                -     -XX:+UseG1GC
>> 
>> 2018-01-19 12:38:00,908 INFO  
>> org.apache.flink.configuration.GlobalConfiguration            - Loading 
>> configuration property: jobmanager.rpc.port, 6123
>> 2018-01-19 12:38:00,908 INFO  
>> org.apache.flink.configuration.GlobalConfiguration            - Loading 
>> configuration property: jobmanager.heap.mb, 16384
>> 
>> 
>> 2018-01-19 12:53:34,671 WARN  akka.remote.RemoteWatcher                      
>>                - Detected unreachable: [akka.tcp://flink@<jm-host>:37840]
>> 2018-01-19 12:53:34,676 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>>                - Task manager 
>> akka.tcp://flink@<jm-host>:37840/user/taskmanager terminated.
>> 
>> -- So once the Flink session boots up, we hit it with pretty heavy load, 
>> which typically results in the WARN above.
>> 
>> Task Manager:
>> 2018-01-19 12:38:01,002 INFO  
>> org.apache.flink.runtime.taskmanager.TaskManager              -  Starting 
>> TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
>> 2018-01-19 12:38:01,367 INFO  
>> org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop 
>> version: 2.6.5
>> 2018-01-19 12:38:01,367 INFO  
>> org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
>> 2018-01-19 12:38:01,367 INFO  
>> org.apache.flink.runtime.taskmanager.TaskManager              -     
>> -Xms16384M
>> 2018-01-19 12:38:01,367 INFO  
>> org.apache.flink.runtime.taskmanager.TaskManager              -     
>> -Xmx16384M
>> 2018-01-19 12:38:01,367 INFO  
>> org.apache.flink.runtime.taskmanager.TaskManager              -     
>> -XX:MaxDirectMemorySize=8388607T
>> 2018-01-19 12:38:01,367 INFO  
>> org.apache.flink.runtime.taskmanager.TaskManager              -     
>> -XX:+UseG1GC
>> 
>> 2018-01-19 12:38:01,392 INFO  
>> org.apache.flink.configuration.GlobalConfiguration            - Loading 
>> configuration property: jobmanager.rpc.port, 6123
>> 2018-01-19 12:38:01,392 INFO  
>> org.apache.flink.configuration.GlobalConfiguration            - Loading 
>> configuration property: jobmanager.heap.mb, 16384
>> 
>> 
>> 2018-01-19 12:54:48,626 WARN  akka.remote.RemoteWatcher                      
>>                - Detected unreachable: [akka.tcp://flink@<jm-host>:6123]
>> 2018-01-19 12:54:48,690 INFO  akka.remote.Remoting                           
>>                - Quarantined address [akka.tcp://flink@<jm-host>:6123] is 
>> still unreachable or has not been restarted. Keeping it quarantined.
>> 2018-01-19 12:54:48,774 WARN  akka.remote.Remoting                           
>>                - Tried to associate with unreachable remote address 
>> [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all 
>> messages to this address will be delivered to dead letters. Reason: [The 
>> remote system has a UID that has been quarantined. Association aborted.] 
>> 2018-01-19 12:54:48,833 WARN  akka.remote.Remoting                           
>>                - Tried to associate with unreachable remote address 
>> [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all 
>> messages to this address will be delivered to dead letters. Reason: [The 
>> remote system has quarantined this system. No further associations to the 
>> remote system are possible until this system is restarted.] 
>> <bunch of ERRORs about operations not shutting down properly - assuming this 
>> is because the JM is unreachable>
>> 
>> 2018-01-19 12:56:51,244 INFO  
>> org.apache.flink.runtime.taskmanager.TaskManager              - Trying to 
>> register at JobManager akka.tcp://flink@<jm-host>:6123/user/jobmanager 
>> (attempt 10, timeout: 30000 milliseconds)
>> 2018-01-19 12:56:51,253 WARN  akka.remote.Remoting                           
>>                - Tried to associate with unreachable remote address 
>> [akka.tcp://flink@<jm-host>:6123]. Address is now gated for 5000 ms, all 
>> messages to this address will be delivered to dead letters. Reason: [The 
>> remote system has quarantined this system. No further associations to the 
>> remote system are possible until this system is restarted.] 
>> 
>> So the bottom line is that the JM and TM couldn't communicate under load, 
>> which is obviously not good. I tried bumping up akka.tcp.timeout as well, but 
>> it didn't help either. So my question is: after all processing is halted and 
>> there is no new data being picked up, shouldn't this environment self-heal? 
>> Are there other things I could be looking at besides extending timeouts?
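>> 
>> For reference, bumping akka.tcp.timeout looked roughly like this in 
>> flink-conf.yaml; I assume the related ask/heartbeat timeouts could be tuned 
>> the same way, though I haven't confirmed they help here (example values only):
>> 
>>     akka.ask.timeout: 60 s
>>     akka.tcp.timeout: 60 s
>>     akka.watch.heartbeat.interval: 10 s
>>     akka.watch.heartbeat.pause: 120 s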
>> 
>> Thanks,
>> 
>> Ashish
>> 
