Maybe give it a try then, it might help.

Cheers,
Charles
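For what it's worth, the sysctl is just a procfs file, so "giving it a try" is cheap to script. A minimal sketch (Linux-only; the helper name is mine, not a real API, and flipping the value requires root via `sysctl -w kernel.numa_balancing=0`):

```python
from pathlib import Path

def numa_balancing_enabled(proc_root="/proc"):
    """Check kernel.numa_balancing, i.e. what `sysctl kernel.numa_balancing`
    reports.

    Returns True/False, or None if the file is absent (e.g. a kernel
    built without NUMA balancing support).
    """
    path = Path(proc_root) / "sys" / "kernel" / "numa_balancing"
    try:
        return path.read_text().strip() != "0"
    except FileNotFoundError:
        return None
```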
On Mon, Jul 6, 2020 at 9:20 PM, Milind Chabbi <mil...@uber.com> wrote:
>
> On Mon, Jul 6, 2020 at 1:18 PM Charles-François Natali
> <cf.nat...@gmail.com> wrote:
>>
>> >> Also, there are some obvious limitations with this: for example,
>> >> binding processes to a specific NUMA node means that you might not
>> >> benefit from CPU bursting (e.g. if there's some available CPU on
>> >> another NUMA node).
>> >
>> > True. I would like the burst to be limited to only the cores on a
>> > single socket. Data locality can be more important than available
>> > parallelism, sometimes.
>> >
>> >> Also, NUMA binding actually has quite a few possible settings: for
>> >> example, you might also want to bind the memory allocations, etc.,
>> >> which means a simple flag might not be enough to achieve what you
>> >> want.
>> >
>> > True. I would like to rely on the default "first touch" policy: if
>> > the container is restricted to a socket, the data will be allocated
>> > on the same NUMA node, as long as memory is available.
>>
>> Yes, so it sounds like you probably want some fine-grained control
>> over the NUMA policy, which would probably be difficult to implement
>> in the agent.
>>
>> >> One possibility I can think of might be to write your own executor -
>> >> we wrote our own executor at work for various reasons.
>> >> It's a bit of work, but it would give you unlimited flexibility in
>> >> how you start your tasks, bind them, etc.
>> >
>> > I am new to the Mesos code base; I would appreciate any pointers or
>> > examples.
>>
>> For the executor, have you read
>> http://mesos.apache.org/documentation/latest/executor-http-api/ ?
>> For code, you can have a look e.g. at the command executor:
>> https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp
>> Or a trivial example in Python:
>> https://github.com/douban/pymesos/blob/master/examples/executor.py
>>
>> >> Also, out of curiosity - is automatic NUMA balancing enabled on
>> >> your agents (kernel.numa_balancing sysctl)?
>> >
>> > Interesting. I was unaware of this sysctl flag. On looking it up, I
>> > realize that it may not work for our use case.
>> > It migrates pages to the cores used by a container. If no CPUSET was
>> > assigned to begin with, for Go and Java programs with 10s (sometimes
>> > 1000s) of CPU threads, I notice that the data gets split 50-50 on a
>> > 2-socket system.
>> > For real-time queries that last for 100s of milliseconds, I don't
>> > see the kernel's automatic migration being very effective; in fact,
>> > it may worsen the situation.
>> > Have you had success with kernel.numa_balancing? What was the
>> > scenario where it helped?
>>
>> Yes, the reason I was asking is that it might actually be causing you
>> some pain if it's enabled, depending on your workloads.
>> The only times I had to use this sysctl were actually to disable it -
>> in my experience it was causing latency spikes: I'm not talking about
>> the few usec you might expect from a soft page fault, but single-digit
>> ms latencies.
>> Obviously it depends on the workload, and it can probably help most of
>> the time, since I believe it's enabled by default on NUMA systems.
>> I guess the best way to find out is to try :).
>>
>> > I notice that the data gets split 50-50 on a 2-socket system
>>
>> Do you mean for a single process - by looking at /proc/<pid>/numa_maps?
>> Is it with or without NUMA balancing?
>
> By looking up `numastat -p pid`. NUMA balancing is off.
>
>> >> Cheers,
>> >>
>> >> Charles
>> >>
>> >> On Mon, Jul 6, 2020 at 7:36 PM, Milind Chabbi <mil...@uber.com>
>> >> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have noticed that without explicit flags, the mesos-agent does
>> >> > not restrict a container's cgroup to any CPUSET. This has quite
>> >> > deleterious consequences in our usage model, where the OS threads
>> >> > of containerized processes migrate across NUMA sockets over time
>> >> > and lose locality to the memory they allocated under the
>> >> > first-touch policy. It would take a lot of effort to specify the
>> >> > exact CPUSET at container launch time.
>> >> >
>> >> > I am wondering if the mesos agent could expose a flag (e.g.,
>> >> > --best-effort-numa-locality) so that if the requested CPU shares
>> >> > and memory demands meet the requirements, the container can be
>> >> > launched with its cgroup affinity set to a single NUMA socket,
>> >> > avoiding the deleterious effects of unrestricted CPU migration.
>> >> >
>> >> > -Milind
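[Editor's sketch] The building blocks discussed in this thread - inspecting per-node page placement (the raw data that `numastat -p pid` aggregates) and pinning a process to one socket's CPUs - are plain Linux interfaces. A rough illustration, not Mesos code: the helper names are mine, it is Linux-only, and binding memory as well would additionally need `set_mempolicy(2)` / `numactl --membind`:

```python
import os
import re
from collections import Counter

def pages_per_node(numa_maps_text):
    """Sum resident pages per NUMA node from /proc/<pid>/numa_maps content.

    Each mapping line carries N<node>=<pages> tokens; a 50-50 split across
    N0/N1 is what the thread describes on a 2-socket box.
    """
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            totals[int(node)] += int(pages)
    return totals

def parse_cpulist(text):
    """Parse a sysfs cpulist such as '0-9,20-29' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def bind_self_to_node(node):
    """Restrict the calling process (and children forked later) to one
    NUMA node's CPUs - the affinity half of what a cpuset cgroup or
    `numactl --cpunodebind=<node>` would do."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))
```

For example, a custom executor could call `bind_self_to_node(0)` before exec'ing the task, and `pages_per_node(open(f"/proc/{pid}/numa_maps").read())` to verify that first-touch allocations then land on node 0.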