>> Also, there are some obvious limitations with this: for example
>> binding processes to a specific NUMA node means that you might not
>> benefit from CPU bursting (e.g. if there's some available CPU on
>> another NUMA node).
>
>
> True. I would like the burst to be limited to only the cores on a single 
> socket.
> Data locality can sometimes be more important than available parallelism.
>
>>
>> Also NUMA binding has actually quite a few possible settings: for
>> example you might also want to bind the memory allocations, etc, which
>> means a simple flag might not be enough to achieve what you want.
>>
>
> True. I would like to rely on the default "first touch" policy: if the 
> container is restricted to a socket, the data will be allocated on the same 
> NUMA node, as long as memory is available.
>

Yes, so it sounds like you want fairly fine-grained control over the
NUMA policy, which would probably be difficult to implement in the
agent.
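
For reference, binding both dimensions at the cgroup level looks roughly
like this (a Python sketch for cgroup v1; the cgroup path and the core
list are assumptions, adjust them to your agent's layout):

  # Pin an existing container cgroup to NUMA node 0 (cgroup v1 cpuset).
  CGROUP = "/sys/fs/cgroup/cpuset/mesos/<container-id>"  # hypothetical path

  def write(path, value):
      with open(path, "w") as f:
          f.write(value)

  # Both knobs matter: cpuset.cpus confines the threads, cpuset.mems
  # confines the allocations. Pages allocated before this only migrate
  # if cpuset.memory_migrate is also set.
  write(CGROUP + "/cpuset.cpus", "0-15")  # assumed: cores of socket 0
  write(CGROUP + "/cpuset.mems", "0")     # memory node 0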

>> One possibility I can think of might be to write your own executor -
>> we wrote our own executor at work for various reasons.
>> It's a bit of work, but it would give you unlimited flexibility in how
>> you start your tasks, bind them etc.
>>
>
> I am new to the mesos code base, I would appreciate any pointers or examples.

For the executor, have you read
http://mesos.apache.org/documentation/latest/executor-http-api/ ?
For code, you can have a look at e.g. the command executor:
https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp

Or a trivial example in Python:
https://github.com/douban/pymesos/blob/master/examples/executor.py
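
If it helps, here is roughly what a pymesos executor that pins its task
to one socket's cores could look like (an untested sketch; the core
list for node 0 is an assumption):

  import os
  import subprocess
  from threading import Thread

  from addict import Dict
  from pymesos import Executor, MesosExecutorDriver, decode_data

  NODE0_CPUS = set(range(0, 16))  # assumed: cores 0-15 sit on socket 0

  class PinnedExecutor(Executor):
      def launchTask(self, driver, task):
          def run(task):
              update = Dict()
              update.task_id.value = task.task_id.value
              update.state = 'TASK_RUNNING'
              driver.sendStatusUpdate(update)

              # Set the affinity in the child only, so the executor
              # itself stays unpinned.
              rc = subprocess.call(
                  decode_data(task.data).decode(), shell=True,
                  preexec_fn=lambda: os.sched_setaffinity(0, NODE0_CPUS))

              update.state = 'TASK_FINISHED' if rc == 0 else 'TASK_FAILED'
              driver.sendStatusUpdate(update)

          Thread(target=run, args=(task,)).start()

  if __name__ == '__main__':
      driver = MesosExecutorDriver(PinnedExecutor(), use_addict=True)
      driver.run()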

>> Also out of curiosity - is automatic NUMA balancing enabled on your
>> agents (kernel.numa_balancing sysctl)?
>
>
> Interesting. I was unaware of this sysctl flag. On reading up on it, I 
> realize that it may not work for our use case.
> It migrates pages to the NUMA node of the cores a container runs on. If no 
> CPUSET was assigned to begin with, then for Go and Java programs with 10s 
> (sometimes 1000s) of OS threads, I notice that the data gets split 50-50 on 
> a 2-socket system.
> For real-time queries that last 100s of milliseconds, I don't see the 
> kernel's automatic migration being very effective; in fact, it may worsen 
> the situation.
> Have you had success with kernel.numa_balancing? What was the scenario where 
> it helped?

Yes, the reason I was asking is that it might actually be causing you
some pain if it's enabled, depending on your workloads.
The only times I've had to touch this sysctl were actually to disable
it: in my experience it was causing latency spikes, and I'm not talking
about the few usec you might expect from a soft page fault, but
single-digit ms latencies.
Obviously it depends on the workload, and it probably helps most of
the time, since I believe it's enabled by default on NUMA systems.
I guess the best way to find out is to try :).
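
Checking / toggling it is just a procfs read/write, e.g. in Python
(writing requires root):

  PATH = "/proc/sys/kernel/numa_balancing"

  with open(PATH) as f:
      print("numa_balancing =", f.read().strip())

  # To disable it for an experiment:
  # with open(PATH, "w") as f:
  #     f.write("0")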

> I notice that the data gets 50-50 split on a 2-socket system

Do you mean for a single process, by looking at /proc/<pid>/numa_maps?
Is it with or without numa balancing?
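
Something like this gives a quick per-node breakdown for a pid (a
sketch; it just sums the N<node>=<pages> fields, so it mixes page sizes
if huge pages are mapped):

  import re, sys
  from collections import Counter

  pages = Counter()
  with open("/proc/%s/numa_maps" % sys.argv[1]) as f:
      for line in f:
          for node, n in re.findall(r"N(\d+)=(\d+)", line):
              pages[int(node)] += int(n)

  for node in sorted(pages):
      print("node %d: %d pages" % (node, pages[node]))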

>
>>
>>
>> Cheers,
>>
>> Charles
>>
>>
>> On Mon, Jul 6, 2020 at 19:36, Milind Chabbi <mil...@uber.com> wrote:
>> >
>> > Hi,
>> >
>> > I have noticed that without explicit flags, the mesos-agent does not 
>> > restrict a container's cgroup to any CPUSET. This has quite deleterious 
>> > consequences in our usage model, where the OS threads of containerized 
>> > processes migrate across NUMA sockets over time and lose locality to the 
>> > memory they allocated under the first-touch policy. It would take a lot 
>> > of effort to specify the exact CPUSET at container launch time.
>> >
>> > I am wondering if the mesos agent could expose a flag (e.g., 
>> > --best-effort-numa-locality) so that if the requested CPU shares and 
>> > memory demands fit on a single NUMA socket, the container can be 
>> > launched with its cgroup affinity set to that socket, avoiding the 
>> > deleterious effects of unrestricted CPU migration.
>> >
>> > -Milind
