Thanks for your email, Charles.

On Mon, Jul 6, 2020 at 12:03 PM Charles-François Natali <cf.nat...@gmail.com> wrote:
> Hi Milind,
>
> (I'm just a user not a developer so take what I say with a grain of salt
> :-).
>
> AFAICT the agent/containerisation code is not NUMA-aware, so it
> probably wouldn't be trivial.
>
> Also, there are some obvious limitations with this: for example
> binding processes to a specific NUMA node means that you might not
> benefit from CPU bursting (e.g. if there's some available CPU on
> another NUMA node).

True. I would like the burst to be limited to only the cores on a single
socket. Sometimes data locality can be more important than available
parallelism.

> Also NUMA binding has actually quite a few possible settings: for
> example you might also want to bind the memory allocations, etc, which
> means a simple flag might not be enough to achieve what you want.

True. I would like to rely on the default "first touch" policy: if the
container is restricted to a socket, the data will be allocated on the
same NUMA node, as long as memory is available.

> One possibility I can think of might be to write your own executor -
> we wrote our own executor at work for various reasons.
> It's a bit of work, but it would give you unlimited flexibility in how
> you start your tasks, bind them etc.

I am new to the Mesos code base; I would appreciate any pointers or
examples.

> Also out of curiosity - is automatic NUMA balancing enabled on your
> agents (kernel.numa_balancing sysctl)?

Interesting. I was unaware of this sysctl flag. After reading up on it
<https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-auto_numa_balancing>,
I realize that it may not work for our use case. It migrates pages to the
cores used by a container. If no CPUSET was assigned to begin with, then
for Go and Java programs with tens (sometimes thousands) of CPU threads, I
notice that the data gets split 50-50 across a 2-socket system. For
real-time queries that last hundreds of milliseconds, I don't see the
kernel's automatic migration being very effective; in fact, it may worsen
the situation.

Have you had success with kernel.numa_balancing? What was the scenario
where it helped?

> Cheers,
>
> Charles
>
>
> On Mon, Jul 6, 2020 at 7:36 PM Milind Chabbi <mil...@uber.com> wrote:
> >
> > Hi,
> >
> > I have noticed that without explicit flags, the mesos-agent does not
> > restrict a cgroup of a container to any CPUSET. This has quite
> > deleterious consequences in our usage model, where the OS threads in
> > containerized processes migrate across NUMA sockets over time and lose
> > locality to the memory they allocated with the first-touch policy. It
> > would take a lot of effort to specify the exact CPUSET at container
> > launch time.
> >
> > I am wondering if the mesos agent can expose a flag (e.g.,
> > --best-effort-numa-locality) so that if the requested CPU shares and
> > memory demands fit within a single NUMA socket, the container can be
> > launched with its cgroup affinity set to that socket and avoid the
> > deleterious effects of unrestricted CPU migration.
> >
> > -Milind
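
P.S. To make the "bind the container to one socket" idea concrete, here is a
rough, untested sketch (plain Python, not Mesos code) of what I imagine the
agent or a custom executor doing: check the kernel.numa_balancing sysctl, read
the CPU list of one NUMA node, and write it into the container's cpuset
cgroup. The cgroup path, the node number, and the assumption of a cgroup-v1
cpuset hierarchy are placeholders of mine, not how Mesos actually lays things
out.

    #!/usr/bin/env python3
    # Rough sketch: confine an existing container cgroup to the CPUs and
    # memory of a single NUMA node via the cgroup-v1 cpuset controller.
    # The cgroup path and node number below are placeholders/assumptions.

    from pathlib import Path

    NODE = 0  # hypothetical: the NUMA node the container should stay on
    CGROUP = Path("/sys/fs/cgroup/cpuset/mesos/CONTAINER_ID")  # placeholder

    def numa_balancing_enabled() -> bool:
        # kernel.numa_balancing sysctl; "1" means automatic page migration is on.
        return Path("/proc/sys/kernel/numa_balancing").read_text().strip() == "1"

    def pin_to_node(cgroup: Path, node: int) -> None:
        # The kernel exports each node's CPU list, e.g. "0-17,36-53" on a
        # 2-socket box.
        cpus = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text().strip()
        # Restrict both CPU placement and memory allocation to that node, so
        # first-touch allocations stay local to the threads using them.
        (cgroup / "cpuset.cpus").write_text(cpus)
        (cgroup / "cpuset.mems").write_text(str(node))

    if __name__ == "__main__":
        print("kernel.numa_balancing enabled:", numa_balancing_enabled())
        pin_to_node(CGROUP, NODE)

A --best-effort-numa-locality flag, as I imagine it, would do something along
these lines at container launch time, but only when the requested CPU shares
and memory fit on one socket.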