Thanks, Tomas. We're still quite far from the 10k-20k machine limit :-)
Currently, our framework scheduler generates millions of mostly small tasks (some take ~100ms, others a few seconds). I understand that the network is the main bottleneck, but we sometimes experience lost tasks, and sometimes I see master logs indicating that the master is unable to talk to the ZooKeeper service (which is on the same host). I was wondering whether this is related to the CPU/RAM of the master machine. Is 1 CPU enough? 2? 4? 1GiB RAM? 4? 8?

On Thu, Jan 8, 2015 at 5:00 PM, Tomas Barton <[email protected]> wrote:

> Hi Itamar,
>
> there's definitely certain limit of machines which can Mesos master
> handle. This limit is between 10 000 - 20 000 (that's number
> reported by Twitter). This bottleneck is caused by event loop which
> handles communication at master.
>
> With hundreds of machines you should be fine. Only in case that your
> framework scheduler would demand too many resources for computing
> allocations you might encounter some problems.
>
>> How does the strength of the master & scheduler machines affect the
>> overall cluster performance?
>
> I would say that the network is usually the main bottleneck. Adding extra
> RAM won't improve mesos-master performance. Of course if there's high CPU
> load on master you might observe performance regression. Also this depends
> on granularity of your tasks, if you have few long running tasks or many
> short tasks (which runs just hundreds of ms).
>
> Tomas
>
> On 6 January 2015 at 10:12, Itamar Ostricher <[email protected]> wrote:
>
>> Are there recommendations regarding master / scheduler machines resources
>> as function of cluster size?
>>
>> Say I have a cluster with hundreds of slave machines and thousands of
>> CPUs, with a single framework that will schedule millions of tasks.
>> How does the strength of the master & scheduler machines affect the
>> overall cluster performance?
>>
>> Thanks,
>> - Itamar.

