Hi,

After 1.14.0, I think Flink should work well even at the 1000*1000 scale with a 10s akka.timeout in the deployment stage. So thank you in advance for any further feedback after you investigate.
BTW: You might want to look at https://issues.apache.org/jira/browse/FLINK-24295, which might be causing the problem.

Best,
Guowei

On Mon, Jan 24, 2022 at 4:31 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi Guowei,
>
> Thanks a lot for your reply.
>
> I’m using 1.14.0. The timeout happens at job deployment time. A subtask
> runs for a short period, the length of `akka.ask.timeout`, before failing
> due to the timeout.
>
> I noticed that the JobManager has very high CPU usage at that moment,
> around 2000%. I’m profiling it to work out the cause.
>
> Best,
> Paul Lam
>
> On Jan 21, 2022, at 09:56, Guowei Ma <guowei....@gmail.com> wrote:
>
> Hi, Paul
>
> Could you share some information, such as the Flink version you
> used and the memory of the TM and JM?
> And when does the timeout happen? For example, at the beginning of the
> job or while the job is running?
>
> Best,
> Guowei
>
>
> On Thu, Jan 20, 2022 at 4:45 PM Paul Lam <paullin3...@gmail.com> wrote:
>
>> Hi,
>>
>> I’m tuning a Flink job with 1000+ parallelism, which frequently fails
>> with Akka TimeOutException (it was fine with 200 parallelism).
>>
>> I see some posts recommend increasing `akka.ask.timeout` to 120s. I’m not
>> familiar with Akka, but that looks like a very long time for a response
>> timeout compared to the default 10s.
>>
>> So I’m wondering what a reasonable range for this option is. And why
>> would the Actor fail to respond in time (was the message dropped due to
>> pressure)?
>>
>> Any input would be appreciated! Thanks a lot.
>>
>> Best,
>> Paul Lam