Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-16 Thread Till Rohrmann
Done, you are assigned now Weike.
Cheers,
Till
On Fri, Oct 16, 2020 at 1:33 PM DONG, Weike wrote:
> Hi Till,
>
> Thank you for the kind reminder, and I have created a JIRA ticket for this
> issue https://issues.apache.org/jira/browse/FLINK-19677
>
> Could you please assign it to me? I will try

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-16 Thread DONG, Weike
Hi Till,
Thank you for the kind reminder, and I have created a JIRA ticket for this issue https://issues.apache.org/jira/browse/FLINK-19677
Could you please assign it to me? I will try to submit a PR this weekend to fix this : )
Sincerely,
Weike
On Fri, Oct 16, 2020 at 5:54 PM Till Rohrmann

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-16 Thread Till Rohrmann
Great, thanks a lot Weike. I think the first step would be to open a JIRA issue, get assigned and then start on fixing it and opening a PR.
Cheers,
Till
On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike wrote:
> Hi all,
>
> Thanks for all the replies, and I agree with Yang, as we have found that
>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-16 Thread DONG, Weike
Hi all, Thanks for all the replies. I agree with Yang: we have found that for a pod without a Service (such as a TaskManager pod), the reverse DNS lookup always fails, so this lookup is unnecessary in a Kubernetes environment. I would be glad to help fix this issue to make Flink better : )

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-15 Thread Yang Wang
I am afraid the InetAddress cache will not take effect, because Kubernetes only creates A and SRV records for Services. It does not generate A records for pods as you might expect; see [1][2] for more information. So the reverse DNS lookup will always fail. IIRC, the default timeout is 5s. This
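To illustrate the failure mode described in this message: when a reverse lookup finds no name, `InetAddress.getCanonicalHostName()` does not throw but falls back to the textual IP address. The sketch below detects that fallback by comparing the two strings. `ReverseLookupCheck` and `resolvedToName` are hypothetical names used for illustration, not part of Flink, and the IP in `main` is a made-up example value.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class ReverseLookupCheck {

    /**
     * Returns true when the canonical host name differs from the literal IP,
     * i.e. the reverse lookup actually produced a name instead of falling back.
     */
    public static boolean resolvedToName(String canonicalHostName, String hostAddress) {
        return !canonicalHostName.equals(hostAddress);
    }

    public static void main(String[] args) throws UnknownHostException {
        // Hypothetical pod IP; pass a real one as the first argument.
        // getByName on a literal IP only parses it and performs no DNS query.
        InetAddress address = InetAddress.getByName(args.length > 0 ? args[0] : "10.0.0.1");

        // This call may block while the resolver waits for a reverse (PTR) answer.
        String canonical = address.getCanonicalHostName();

        System.out.println("canonical name: " + canonical);
        System.out.println("reverse lookup produced a name: "
                + resolvedToName(canonical, address.getHostAddress()));
    }
}
```

Running `main` inside a TaskManager pod with that pod's own IP would show whether the cluster DNS returns a name for it.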

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-15 Thread Chesnay Schepler
The InetAddress caches the result of `getCanonicalHostName()`, so it is not a problem to call it twice.
On 10/15/2020 1:57 PM, Till Rohrmann wrote:
Hi Weike, thanks for getting back to us with your findings. Looking at the `TaskManagerLocation`, we are actually calling

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-15 Thread Till Rohrmann
Hi Weike, thanks for getting back to us with your findings. Looking at the `TaskManagerLocation`, we are actually calling `InetAddress.getCanonicalHostName` twice for every creation of a `TaskManagerLocation` instance. This does not look right. I think it should be fine to make the lookup
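One generic way to avoid paying for the same reverse lookup twice is to resolve the name once, on first use, and reuse the result afterwards. The sketch below shows that pattern in isolation. It is not Flink's actual `TaskManagerLocation` code; `CachedHostName` and its `Supplier` parameter are hypothetical names chosen for illustration.

```java
import java.net.InetAddress;
import java.util.function.Supplier;

class CachedHostName {

    private final Supplier<String> lookup;
    private String value; // null until the first successful lookup

    public CachedHostName(Supplier<String> lookup) {
        this.lookup = lookup;
    }

    /** Runs the (possibly slow) lookup on first access only, then returns the stored result. */
    public synchronized String get() {
        if (value == null) {
            value = lookup.get();
        }
        return value;
    }

    public static void main(String[] args) {
        // The loopback address stands in for a TaskManager address here.
        InetAddress address = InetAddress.getLoopbackAddress();
        CachedHostName hostName = new CachedHostName(address::getCanonicalHostName);

        System.out.println(hostName.get()); // performs the lookup
        System.out.println(hostName.get()); // served from the stored value
    }
}
```

With this shape, no lookup happens at construction time, so code paths that never ask for the host name never wait on DNS.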

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-15 Thread DONG, Weike
Hi Till and community, By the way, initially I resolved the IPs several times, but the results returned rather quickly (less than 1 ms, possibly due to the DNS cache on the server), so I thought it might not be a DNS issue. However, after debugging and logging, it was found that the lookup time exhibited

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-15 Thread DONG, Weike
Hi Till and community, Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates this issue but does not eliminate it. After adding DEBUG logs to the internals of *flink-runtime*, we found that the culprit is `inetAddress.getCanonicalHostName()` in
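The `getCanonicalHostName()` call named in this message can also be timed in isolation to confirm where the delay comes from. The following is a minimal sketch that assumes nothing about Flink's internals; `LookupTimer` and `timeMillis` are hypothetical names, and the loopback address in `main` stands in for a TaskManager address.

```java
import java.net.InetAddress;
import java.util.function.Supplier;

class LookupTimer {

    /** Runs the given lookup once and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(Supplier<String> lookup) {
        long start = System.nanoTime();
        lookup.get();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Replace the loopback address with the address under investigation.
        InetAddress address = InetAddress.getLoopbackAddress();
        long elapsed = timeMillis(address::getCanonicalHostName);
        System.out.println("getCanonicalHostName() took " + elapsed + " ms");
    }
}
```

Each measurement should use a freshly created `InetAddress` instance, because repeated calls on the same instance may return a stored result rather than querying DNS again.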

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-13 Thread Till Rohrmann
Hi Weike,
could you try setting `kubernetes.jobmanager.cpu: 4` in your flink-conf.yaml? I fear that a single CPU is too low for the JobManager component.
Cheers,
Till
On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann wrote:
> Hi Weike,
>
> thanks for posting the logs. I will take a look at them.

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-13 Thread Till Rohrmann
Hi Weike,
thanks for posting the logs. I will take a look at them. My suspicion would be that there is some operation blocking the JobMaster's main thread which causes the registrations from the TMs to time out. Maybe the logs allow me to validate/falsify this suspicion.
Cheers,
Till
On Mon,

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-12 Thread DONG, Weike
Hi community, I have uploaded the log files of JobManager and TaskManager-1-1 (one of the 50 TaskManagers) with DEBUG log level and default Flink configuration, and it clearly shows that TaskManager failed to register with JobManager after 10 attempts. Here is the link: JobManager:

TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-12 Thread DONG, Weike
Hi community, Recently we have noticed strange behavior for Flink jobs in Kubernetes per-job mode: when the parallelism increases, the time it takes for the TaskManagers to register with the *JobManager* becomes abnormally long (for a task with parallelism of 50, it could take 60 ~ 120 seconds or