Thanks for the info! It's very helpful. -chad
On Sun, Aug 11, 2019 at 4:21 AM Zhu Zhu <reed...@gmail.com> wrote:

> Hi Chad,
>
> We have (Blink) jobs each running with over 10 thousand TMs.
> In our experience, the main regression caused by large-scale TMs is in the
> TM allocation stage in the ResourceManager: it sometimes fails to allocate
> enough TMs before the allocation timeout.
> Performance does not deteriorate much once the Flink cluster has reached a
> stable state.
>
> The main load, in my mind, increases with the task scale and edge scale
> of a submitted job.
> The JM can be overwhelmed by frequent and slow GCs caused by task
> deployment if the JM memory is not fine-tuned.
> The JM can also become slower due to more RPCs hitting the JM main thread
> and the increased computational cost of handling each RPC.
>
> Thanks,
> Zhu Zhu
>
> On Sun, Aug 11, 2019 at 6:17 PM, qi luo <luoqi...@gmail.com> wrote:
>
>> Hi Chad,
>>
>> In our case, 1~2k TMs with up to ~10k TM slots are used in one job. In
>> general, the CPU/memory of the Job Manager should be increased with more
>> TMs.
>>
>> Regards,
>> Qi
>>
>> > On Aug 11, 2019, at 2:03 AM, Chad Dombrova <chad...@gmail.com> wrote:
>> >
>> > Hi,
>> > I'm still on my task management investigation, and I'm curious to know
>> how many task managers people are reliably using with Flink. We're
>> currently using AWS | Thinkbox Deadline, and we're able to easily utilize
>> over 300 workers, and I've heard from other customers who use several
>> thousand, so I'm curious how Flink compares in this regard. Also, what
>> aspects of the system begin to deteriorate at higher scales?
>> >
>> > thanks in advance!
>> >
>> > -chad
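
For anyone following this thread: the two tuning points Zhu Zhu mentions (JM memory pressure and slot-allocation timeouts) map to standard Flink configuration keys. A minimal flink-conf.yaml sketch, with illustrative values only (the right numbers depend on your job's task/edge scale and cluster size):

```yaml
# Give the JobManager more heap so task deployment at large TM counts
# doesn't trigger frequent/slow GCs (value here is only an example).
jobmanager.heap.size: 8192m

# Allow more time for slot requests to be fulfilled before they time out
# during the initial TM allocation phase (milliseconds; example value).
slot.request.timeout: 600000
```

These keys exist in the Flink 1.9-era configuration; check the configuration page for your Flink version before relying on them, since memory options in particular were reworked in later releases.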