>> Also, there are some obvious limitations with this: for example,
>> binding processes to a specific NUMA node means that you might not
>> benefit from CPU bursting (e.g. if there's some available CPU on
>> another NUMA node).
>
> True. I would like the burst to be limited to only the cores on a single
> socket. Data locality can sometimes be more important than available
> parallelism.
>
>> Also, NUMA binding has quite a few possible settings: for example, you
>> might also want to bind the memory allocations, etc., which means a
>> simple flag might not be enough to achieve what you want.
>
> True. I would like to rely on the default "first touch" policy: if the
> container is restricted to a socket, the data will be allocated on the
> same NUMA node, as long as memory is available.
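For reference, at the cgroup level that restriction boils down to writing
the node's CPU list and node ID into the cpuset controller. A minimal
sketch in Python (cgroup-v1 cpuset layout assumed; the cgroup path below is
hypothetical, not something the agent creates for you):

    # Restrict an existing cpuset cgroup to NUMA node 0.
    CGROUP = "/sys/fs/cgroup/cpuset/mesos/my-container"  # hypothetical path

    def read(path):
        with open(path) as f:
            return f.read().strip()

    # The kernel publishes each node's CPU list, e.g. "0-7,16-23".
    node0_cpus = read("/sys/devices/system/node/node0/cpulist")

    # Bind both CPU placement and memory allocation to node 0. Writing
    # cpuset.mems is what keeps first-touch allocations on that node.
    with open(CGROUP + "/cpuset.cpus", "w") as f:
        f.write(node0_cpus)
    with open(CGROUP + "/cpuset.mems", "w") as f:
        f.write("0")
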
Yes, so it sounds like you want fairly fine-grained control over the NUMA
policy, which would probably be difficult to implement in the agent.

>> One possibility I can think of might be to write your own executor -
>> we wrote our own executor at work for various reasons.
>> It's a bit of work, but it would give you unlimited flexibility in how
>> you start your tasks, bind them, etc.
>
> I am new to the Mesos code base; I would appreciate any pointers or
> examples.

For the executor, have you read
http://mesos.apache.org/documentation/latest/executor-http-api/ ?

For code, you can have a look e.g. at the command executor:
https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp

Or at a trivial example in Python:
https://github.com/douban/pymesos/blob/master/examples/executor.py

>> Also, out of curiosity - is automatic NUMA balancing enabled on your
>> agents (the kernel.numa_balancing sysctl)?
>
> Interesting. I was unaware of this sysctl flag. On looking it up, I
> realize that it may not work for our use case.
> It migrates pages to the cores used by a container. If no CPUSET was
> assigned to begin with, then for Go and Java programs with tens
> (sometimes thousands) of OS threads, I notice that the data gets split
> 50-50 on a 2-socket system.
> For real-time queries that last hundreds of milliseconds, I don't see
> the kernel's automatic migration being very effective; in fact, it may
> worsen the situation.
> Have you had success with kernel.numa_balancing? What was the scenario
> where it helped?

Yes, the reason I was asking is that it might actually be causing you some
pain if it's enabled, depending on your workloads. The only times I had to
touch this sysctl were actually to disable it: in my experience it was
causing latency spikes - not the few microseconds you might expect from a
soft page fault, but single-digit milliseconds. Obviously this depends on
the workload, and it probably helps most of the time, since I believe it's
enabled by default on NUMA systems. I guess the best way to find out is to
try :).

> I notice that the data gets 50-50 split on a 2-socket system

Do you mean for a single process, by looking at /proc/<pid>/numa_maps?
Is it with or without NUMA balancing? (See the sketch after the quoted
message below for one way to compute that split.)

>>
>> Cheers,
>>
>> Charles
>>
>> On Mon, Jul 6, 2020 at 19:36, Milind Chabbi <mil...@uber.com> wrote:
>> >
>> > Hi,
>> >
>> > I have noticed that without explicit flags, the mesos-agent does not
>> > restrict a container's cgroup to any CPUSET. This has quite
>> > deleterious consequences in our usage model: the OS threads of
>> > containerized processes migrate across NUMA sockets over time and
>> > lose locality to the memory they allocated under the first-touch
>> > policy. It would take a lot of effort to specify the exact CPUSET at
>> > container launch time.
>> >
>> > I am wondering if the mesos-agent could expose a flag (e.g.,
>> > --best-effort-numa-locality) so that, if the requested CPU shares and
>> > memory demand fit on a single NUMA socket, the container is launched
>> > with its cgroup affinity set to that socket, avoiding the deleterious
>> > effects of unrestricted CPU migration.
>> >
>> > -Milind
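Here is the numa_maps sketch mentioned above: it sums the N<node>=<pages>
fields in /proc/<pid>/numa_maps to show the per-node page split for one
process (standard library only; the function name and output format are my
own invention):

    import re
    import sys
    from collections import Counter

    def numa_page_split(pid):
        """Sum the N<node>=<pages> fields across all mappings of a pid."""
        pages = Counter()
        with open("/proc/%s/numa_maps" % pid) as f:
            for line in f:
                for node, count in re.findall(r"N(\d+)=(\d+)", line):
                    pages[int(node)] += int(count)
        return pages

    if __name__ == "__main__":
        split = numa_page_split(sys.argv[1])
        total = sum(split.values()) or 1
        for node in sorted(split):
            print("node%d: %d pages (%.1f%%)"
                  % (node, split[node], 100.0 * split[node] / total))

Running it with kernel.numa_balancing toggled on and off should tell you
whether the 50-50 split comes from the balancer migrating pages or simply
from where the threads first touched them.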