Hi Till,

I haven't tried newer versions, as it is not possible for us to update the Flink version at the moment. If you could give any pointers for debugging, that would be great.
On Thu, Oct 11, 2018 at 2:44 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Abdul,
>
> have you tried whether this problem also occurs with newer Flink versions
> (1.5.4 or 1.6.1)?
>
> Cheers,
> Till
>
> On Thu, Oct 11, 2018 at 9:24 AM Dawid Wysakowicz <dwysakow...@apache.org>
> wrote:
>
>> Hi Abdul,
>>
>> I've added Till and Gary to cc, who might be able to help you.
>>
>> Best,
>>
>> Dawid
>>
>> On 11/10/18 03:05, Abdul Qadeer wrote:
>>
>> Hi,
>>
>> We are facing an issue in standalone HA mode in Flink 1.4.0 where
>> Taskmanager restarts and is not able to register with the Jobmanager. It
>> times out awaiting *AcknowledgeRegistration/AlreadyRegistered* message
>> from Jobmanager Actor and keeps sending *RegisterTaskManager* message.
>> The logs at Jobmanager don’t show anything about registration
>> failure/request. It doesn’t print log.debug(s"RegisterTaskManager: $msg")
>> (from JobManager.scala) either. The network connection between
>> taskmanager and jobmanager seems fine; tcpdump shows message sent to
>> jobmanager and TCP ACK received from jobmanager. Note that the
>> communication is happening between docker containers.
>>
>> Following are the logs from Taskmanager:
>>
>> {"timeMillis":1539189572438,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1400, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}
>>
>> {"timeMillis":1539189580229,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}
>>
>> {"timeMillis":1539189600247,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}
>>
>> {"timeMillis":1539189602458,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1401, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}
>>
>> {"timeMillis":1539189620251,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}
>>
>> {"timeMillis":1539189632478,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1402, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}
>>
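
A minimal sketch for checking how the registration attempts in the Taskmanager logs above are spaced, assuming the log is available as one JSON record per line. The helper names `registration_attempts` and `gaps_ms` are hypothetical, and the sample records are trimmed copies of the log lines above with only the `timeMillis`, `level` and `message` fields kept.

```python
import json
import re

# Extracts the attempt counter and timeout from the TaskManager registration message.
ATTEMPT_RE = re.compile(
    r"Trying to register at JobManager \S+ \(attempt (\d+), timeout: (\d+) milliseconds\)"
)


def registration_attempts(lines):
    """Return (timeMillis, attempt, timeout_ms) for every registration record."""
    attempts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not complete JSON records
        if not isinstance(record, dict):
            continue
        match = ATTEMPT_RE.search(record.get("message", ""))
        if match:
            attempts.append(
                (record["timeMillis"], int(match.group(1)), int(match.group(2)))
            )
    return attempts


def gaps_ms(attempts):
    """Return the time in milliseconds between consecutive registration attempts."""
    return [later[0] - earlier[0] for earlier, later in zip(attempts, attempts[1:])]


# Trimmed copies of the records quoted above (assumption: one record per line).
SAMPLE = [
    '{"timeMillis":1539189572438,"level":"INFO","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1400, timeout: 30000 milliseconds)"}',
    '{"timeMillis":1539189580229,"level":"DEBUG","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms"}',
    '{"timeMillis":1539189602458,"level":"INFO","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1401, timeout: 30000 milliseconds)"}',
    '{"timeMillis":1539189632478,"level":"INFO","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1402, timeout: 30000 milliseconds)"}',
]

if __name__ == "__main__":
    attempts = registration_attempts(SAMPLE)
    for (ts, attempt, timeout), gap in zip(attempts[1:], gaps_ms(attempts)):
        print(f"attempt {attempt}: {gap} ms after previous (timeout {timeout} ms)")
```

On the quoted records the gaps come out at roughly 30 seconds, matching the 30000 ms timeout in the messages, which fits the description of each attempt timing out without a reply from the Jobmanager.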