Hi Zhijiang,
Thanks for your reply. I also suspected the same cause at first. But when I
read the connected host's log, the netty server starts listening on the
correct port 2 seconds after the task manager starts. I compared the logs of
the connected host and the connecting host, and it seems requesting parti
Hi Till,
I will try the local test you suggested. The Uber Flink version
mainly adds code to integrate with Uber deployment and other
infra components. It does not change the original Flink code flow.
I also found that the issue can be avoided with the same setting in other
c
Hi Wenrui,
I suspect another issue that might cause the connection failure. You can check
whether the netty server has already bound and is listening on the port in time,
before the client requests the connection. If there is some time-consuming
process during TM startup which might delay netty server
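
One way to run the check suggested above from the connecting host is a plain TCP probe against the task manager's data port. The following is only a minimal sketch using the JDK socket API, not Flink code. The class name PortProbe, the default port 6121 and the 5 second probe timeout are hypothetical values chosen for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    public static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, unreachable host and SocketTimeoutException.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        // 6121 is a hypothetical port; use the port the task manager log reports.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6121;
        long start = System.nanoTime();
        boolean up = isListening(host, port, 5_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(host + ":" + port + " listening=" + up + " after " + elapsedMs + " ms");
    }
}
```

Running it repeatedly while the task manager starts gives a rough idea of when the port becomes reachable, and the elapsed time shows how long a failing attempt really takes.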
Hi Till,
This job is not on AthenaX but on a special Uber version of Flink. I tried to
ping the connected host from the connecting host, and it seems very stable. For the
connection timeout, I did set it to 20 minutes, but it still reports the timeout
after 2 minutes. Could you let me know how you test locally ab
Hi Wenrui,
I executed AutoParallelismITCase#testProgramWithAutoParallelism and set a
breakpoint in NettyClient.java:102 to see whether the configured timeout
value is correctly set. Moreover, I did the same for
AbstractNioChannel.java:207 and it looked as if the correct timeout value
was set.
What
Hi Wenrui,
the exception now occurs while finishing the connection creation. I'm not
sure whether this is so different. Could it be that your network is
overloaded or not very reliable? Have you tried running your Flink job
outside of AthenaX?
Cheers,
Till
On Tue, Jan 8, 2019 at 2:50 PM Wenrui M
Hi Till,
Thanks for your reply. Our cluster is a Yarn cluster. I found that if we
decrease the total parallelism, the timeout issue can be avoided. But we do
need that number of taskManagers to process the data. In addition, once I
increased the netty server threads to 128, the error changed to the
followi
Hi Wenrui,
the code to set the connect timeout looks ok to me [1]. I also tested it
locally and checked that the timeout is correctly registered in Netty's
AbstractNioChannel [2].
Increasing the number of threads to 128 should not be necessary. But it
could indicate that there is some long lastin
Hi Till,
Thanks for your reply and help on this issue.
I increased taskmanager.network.netty.client.connectTimeoutSec to 1200,
which is 20 minutes. But it seems the connection does not respect this timeout.
In addition, I increased both taskmanager.network.request-backoff.max
and taskmanager.registrati
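
As a sanity check on the value above, here is a small sketch that reads the key from a properties object and converts it to minutes and milliseconds. It is not how Flink parses its configuration. The class name ConnectTimeoutCheck is hypothetical, and the 120 second fallback is an assumption chosen only because the thread reports a timeout after 2 minutes.

```java
import java.time.Duration;
import java.util.Properties;

final class ConnectTimeoutCheck {
    static final String KEY = "taskmanager.network.netty.client.connectTimeoutSec";

    /** Reads the timeout in seconds; fallbackSec is used when the key is absent. */
    public static Duration connectTimeout(Properties conf, long fallbackSec) {
        String raw = conf.getProperty(KEY);
        long seconds = (raw == null) ? fallbackSec : Long.parseLong(raw.trim());
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "1200");
        // 120 is an assumed fallback for illustration, not a documented default.
        Duration timeout = connectTimeout(conf, 120);
        System.out.println(timeout.toMinutes() + " min = " + timeout.toMillis() + " ms");
        // prints "20 min = 1200000 ms"
    }
}
```

If the value seen in a debugger is 1200000 ms, the setting reached the client. A failure after about 2 minutes would then point at a different timeout or at the key not being picked up.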
Hi Wenrui,
from the logs I cannot spot anything suspicious. Which configuration
parameters have you changed exactly? Does the JobManager log contain
anything suspicious?
The current Flink version changed quite a bit wrt 1.4. Thus, it might be
worth a try to run the job with the latest Flink versi