Thanks for the info!  It's very helpful.

-chad


On Sun, Aug 11, 2019 at 4:21 AM Zhu Zhu <reed...@gmail.com> wrote:

> Hi Chad,
>
> We have (Blink) jobs each running with over 10 thousand TMs.
> In our experience, the main regression caused by a large number of TMs is
> in the TM allocation stage in the ResourceManager: sometimes it fails to
> allocate enough TMs before the allocation timeout.
> Performance does not deteriorate much once the Flink cluster has reached
> a stable state.
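>
> For illustration, a minimal flink-conf.yaml sketch of the knob involved
> (the key name and value are assumptions that depend on the Flink version
> in use, so please check the docs for your release):
>
>     # Give slot requests more time while a large number of TMs is still
>     # being brought up (milliseconds; 300000 is the usual default).
>     slot.request.timeout: 600000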
>
> The main load, in my mind, increases with the task scale and edge scale
> of a submitted job.
> The JM can be overwhelmed by frequent and slow GCs caused by task
> deployment if the JM memory is not fine-tuned.
> The JM can also become slower due to more RPCs arriving at the JM main
> thread and the increased computational complexity of handling each RPC.
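>
> As a rough sketch, the kind of flink-conf.yaml settings I mean (the sizes
> below are made-up placeholders rather than recommendations, and the keys
> apply to Flink 1.8/1.9-era configs; adjust to your job scale):
>
>     # Larger JM heap to absorb deployment churn on big job graphs.
>     jobmanager.heap.size: 8192m
>
>     # Optional: surface GC behaviour on the JM to spot long pauses
>     # (these JVM flags assume a JDK 8 runtime).
>     env.java.opts.jobmanager: "-verbose:gc -XX:+PrintGCDetails"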
>
> Thanks,
> Zhu Zhu
>
> qi luo <luoqi...@gmail.com> 于2019年8月11日周日 下午6:17写道:
>
>> Hi Chad,
>>
>> In our case, 1~2k TMs with up to ~10k TM slots are used in one job. In
>> general, the CPU/memory of the JobManager should be increased with more
>> TMs.
>>
>> Regards,
>> Qi
>>
>> > On Aug 11, 2019, at 2:03 AM, Chad Dombrova <chad...@gmail.com> wrote:
>> >
>> > Hi,
>> > I'm still on my task management investigation, and I'm curious to know
>> > how many task managers people are reliably using with Flink. We're
>> > currently using AWS | Thinkbox Deadline, where we can easily utilize
>> > over 300 workers, and I've heard from other customers who use several
>> > thousand, so I'm curious how Flink compares in this regard. Also, what
>> > aspects of the system begin to deteriorate at higher scales?
>> >
>> > thanks in advance!
>> >
>> > -chad
>> >
>>
>>
