Maybe give it a try then, it might help.

Cheers,
Charles
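For what it's worth, the sysctl is just a procfs file, so "giving it a try" is cheap to script. A minimal sketch (Linux-only; the helper name is mine, not a real API, and flipping the value requires root via `sysctl -w kernel.numa_balancing=0`):

```python
from pathlib import Path

def numa_balancing_enabled(proc_root="/proc"):
    """Check kernel.numa_balancing, i.e. what `sysctl kernel.numa_balancing`
    reports.

    Returns True/False, or None if the file is absent (e.g. a kernel
    built without NUMA balancing support).
    """
    path = Path(proc_root) / "sys" / "kernel" / "numa_balancing"
    try:
        return path.read_text().strip() != "0"
    except FileNotFoundError:
        return None
```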
On Mon, Jul 6, 2020 at 9:20 PM, Milind Chabbi <mil...@uber.com> wrote:
>
> On Mon, Jul 6, 2020 at 1:18 PM Charles-François Natali
> <cf.nat...@gmail.com> wrote:
>>
>> >> Also, there are some obvious limitations with this: for example,
>> >> binding processes to a specific NUMA node means that you might not
>> >> benefit from CPU bursting (e.g. if there's some available CPU on
>> >> another NUMA node).
>> >
>> > True. I would like the burst to be limited to only the cores on a
>> > single socket. Data locality can be more important than available
>> > parallelism, sometimes.
>> >
>> >> Also, NUMA binding actually has quite a few possible settings: for
>> >> example, you might also want to bind the memory allocations, etc.,
>> >> which means a simple flag might not be enough to achieve what you
>> >> want.
>> >
>> > True. I would like to rely on the default "first touch" policy: if
>> > the container is restricted to a socket, the data will be allocated
>> > on the same NUMA node, as long as memory is available.
>>
>> Yes, so it sounds like you probably want some fine-grained control
>> over the NUMA policy, which would probably be difficult to implement
>> in the agent.
>>
>> >> One possibility I can think of might be to write your own executor -
>> >> we wrote our own executor at work for various reasons.
>> >> It's a bit of work, but it would give you unlimited flexibility in
>> >> how you start your tasks, bind them, etc.
>> >
>> > I am new to the Mesos code base; I would appreciate any pointers or
>> > examples.
>>
>> For the executor, have you read
>> http://mesos.apache.org/documentation/latest/executor-http-api/ ?
>> For code, you can have a look e.g. at the command executor:
>> https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp
>> Or a trivial example in Python:
>> https://github.com/douban/pymesos/blob/master/examples/executor.py
>>
>> >> Also, out of curiosity - is automatic NUMA balancing enabled on
>> >> your agents (kernel.numa_balancing sysctl)?
>> >
>> > Interesting. I was unaware of this sysctl flag. On looking it up, I
>> > realize that it may not work for our use case.
>> > It migrates pages to the cores used by a container. If no CPUSET was
>> > assigned to begin with, for Go and Java programs with 10s (sometimes
>> > 1000s) of CPU threads, I notice that the data gets split 50-50 on a
>> > 2-socket system.
>> > For real-time queries that last for 100s of milliseconds, I don't
>> > see the kernel's automatic migration being very effective; in fact,
>> > it may worsen the situation.
>> > Have you had success with kernel.numa_balancing? What was the
>> > scenario where it helped?
>>
>> Yes, the reason I was asking is that it might actually be causing you
>> some pain if it's enabled, depending on your workloads.
>> The only times I had to use this sysctl were actually to disable it -
>> in my experience it was causing latency spikes: I'm not talking about
>> the few usec you might expect from a soft page fault, but single-digit
>> ms latencies.
>> Obviously it depends on the workload, and it can probably help most of
>> the time, since I believe it's enabled by default on NUMA systems.
>> I guess the best way to find out is to try :).
>>
>> > I notice that the data gets split 50-50 on a 2-socket system
>>
>> Do you mean for a single process - by looking at /proc/<pid>/numa_maps?
>> Is it with or without NUMA balancing?
>
> By looking up `numastat -p pid`. NUMA balancing is off.
>
>> >> Cheers,
>> >>
>> >> Charles
>> >>
>> >> On Mon, Jul 6, 2020 at 7:36 PM, Milind Chabbi <mil...@uber.com>
>> >> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have noticed that without explicit flags, the mesos-agent does
>> >> > not restrict a container's cgroup to any CPUSET. This has quite
>> >> > deleterious consequences in our usage model, where the OS threads
>> >> > of containerized processes migrate across NUMA sockets over time
>> >> > and lose locality to the memory they allocated under the
>> >> > first-touch policy. It would take a lot of effort to specify the
>> >> > exact CPUSET at container launch time.
>> >> >
>> >> > I am wondering if the mesos agent could expose a flag (e.g.,
>> >> > --best-effort-numa-locality) so that if the requested CPU shares
>> >> > and memory demands meet the requirements, the container can be
>> >> > launched with its cgroup affinity set to a single NUMA socket,
>> >> > avoiding the deleterious effects of unrestricted CPU migration.
>> >> >
>> >> > -Milind
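[Editor's sketch] The building blocks discussed in this thread - inspecting per-node page placement (the raw data that `numastat -p pid` aggregates) and pinning a process to one socket's CPUs - are plain Linux interfaces. A rough illustration, not Mesos code: the helper names are mine, it is Linux-only, and binding memory as well would additionally need `set_mempolicy(2)` / `numactl --membind`:

```python
import os
import re
from collections import Counter

def pages_per_node(numa_maps_text):
    """Sum resident pages per NUMA node from /proc/<pid>/numa_maps content.

    Each mapping line carries N<node>=<pages> tokens; a 50-50 split across
    N0/N1 is what the thread describes on a 2-socket box.
    """
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            totals[int(node)] += int(pages)
    return totals

def parse_cpulist(text):
    """Parse a sysfs cpulist such as '0-9,20-29' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def bind_self_to_node(node):
    """Restrict the calling process (and children forked later) to one
    NUMA node's CPUs - the affinity half of what a cpuset cgroup or
    `numactl --cpunodebind=<node>` would do."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))
```

For example, a custom executor could call `bind_self_to_node(0)` before exec'ing the task, and `pages_per_node(open(f"/proc/{pid}/numa_maps").read())` to verify that first-touch allocations then land on node 0.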