Interesting. I knew I needed to look into ZooKeeper more than I did :-)

I don't know what "distributed mode" is in ZooKeeper. I can tell you we use
a single host for the master, and configure all machines with
"zk://master-host-name:2181/mesos" in /etc/mesos/zk before the mesos
services are started.
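
One way to answer the "distributed mode" question is to ask the ZooKeeper server itself. The sketch below sends the "srvr" four-letter command to the client port and reads the "Mode:" line of the reply, which is typically "standalone" for a single server and "leader" or "follower" in a replicated ensemble. It is a minimal sketch, not a definitive check: the function names are made up for illustration, the default port 2181 is taken from the zk:// URL above, and depending on the ZooKeeper version the command may have to be enabled in the server configuration.

```python
import socket


def parse_mode(reply):
    """Return the value of the 'Mode:' line in a 'srvr' reply, or None."""
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Mode":
            return value.strip()
    return None


def query_mode(host, port=2181, timeout=5.0):
    """Send the 'srvr' four-letter command and return the reported mode.

    Hypothetical helper: host and port are whatever the zk:// URL points at.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", "replace"))
```

Calling query_mode("master-host-name") against the setup described above would presumably report "standalone", since only one ZooKeeper host is configured.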

We don't assign a dedicated device to ZooKeeper, so maybe it bites us...
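
A rough way to see whether the ZooKeeper transaction log shares a device with something else is to compare the device IDs that the OS reports for the directories involved. This is only a sketch under assumptions: the paths in the usage comment are hypothetical and should be replaced by the real dataDir / dataLogDir values from the ZooKeeper config, and same_device is a made-up helper name. Different device IDs also do not prove the log has a dedicated device, because two partitions on the same physical disk report different IDs.

```python
import os


def same_device(path_a, path_b):
    """Return True if both paths live on the same device (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Hypothetical usage; substitute the directories from your own setup:
#   same_device("/var/lib/zookeeper/log", "/var/lib/mesos")
# True means the transaction log definitely shares a device with the
# other directory; False still leaves open a shared physical disk.
```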

On Thu, Jan 8, 2015 at 9:33 PM, Tomas Barton <barton.to...@gmail.com> wrote:

> Is ZooKeeper running in distributed mode?
>
> ZooKeeper periodically writes all data to disk (transaction log), so the
> bottleneck could be ZooKeeper rather than too few CPUs. ZooKeeper limits
> each key (znode) to 1MB; typically 512MB of RAM should be enough for
> ZooKeeper (though even 4GB might not be enough, depending on your use-case).
>
> From the ZooKeeper docs:
>
> ZooKeeper's transaction log must be on a dedicated device. (A dedicated
> partition is not enough.) ZooKeeper writes the log sequentially, without
> seeking. Sharing your log device with other processes can cause seeks and
> contention, which in turn can cause multi-second delays.
>
> In particular, you should not create a situation in which ZooKeeper swaps
> to disk. The disk is death to ZooKeeper. Everything is ordered, so if
> processing one request swaps the disk, all other queued requests will
> probably do the same. DON'T SWAP.
>
>
> On 8 January 2015 at 16:47, Itamar Ostricher <ita...@yowza3d.com> wrote:
>
>> Thanks Tomas.
>>
>> We're still quite far from the 10k-20k machines limit :-)
>>
>> Currently, our framework scheduler generates many (millions of) mostly
>> small tasks (some take ~100ms, some a few seconds).
>> I understand that the network is the main bottleneck, but we sometimes
>> experience lost tasks, and sometimes I see master logs indicating that the
>> master is unable to talk with the zookeeper service (which is on the same
>> host), and I was wondering if it's related to CPU/RAM of the master machine.
>> Is 1 CPU enough? 2? 4?
>> 1GiB RAM? 4? 8?
>>
>> On Thu, Jan 8, 2015 at 5:00 PM, Tomas Barton <barton.to...@gmail.com>
>> wrote:
>>
>>> Hi Itamar,
>>>
>>> there's definitely a limit to the number of machines a Mesos master can
>>> handle. This limit is between 10,000 and 20,000 (that's the number
>>> reported by Twitter). The bottleneck is the event loop that handles
>>> communication at the master.
>>>
>>> With hundreds of machines you should be fine. You might run into
>>> problems only if your framework scheduler demands too many resources
>>> for computing allocations.
>>>
>>>> How does the strength of the master & scheduler machines affect the
>>>> overall cluster performance?
>>>
>>>
>>> I would say that the network is usually the main bottleneck. Adding
>>> extra RAM won't improve mesos-master performance. Of course, if there's
>>> high CPU load on the master you might observe degraded performance.
>>> This also depends on the granularity of your tasks: whether you have a
>>> few long-running tasks or many short tasks (which run for just
>>> hundreds of ms).
>>>
>>> Tomas
>>>
>>>
>>> On 6 January 2015 at 10:12, Itamar Ostricher <ita...@yowza3d.com> wrote:
>>>
>>>> Are there recommendations regarding master / scheduler machine
>>>> resources as a function of cluster size?
>>>>
>>>> Say I have a cluster with hundreds of slave machines and thousands of
>>>> CPUs, with a single framework that will schedule millions of tasks.
>>>> How does the strength of the master & scheduler machines affect the
>>>> overall cluster performance?
>>>>
>>>> Thanks,
>>>> - Itamar.
>>>>
>>>
>>>
>>
>
