Re: [BULK]Re: cgroup CPUSET for mesos agent

2021-01-14 Thread Vinod Kone
Great to hear! Thanks for the update.

On Thu, Jan 14, 2021 at 5:18 PM Charles-François Natali wrote:

> It's a bit old but in case it could help, we recently implemented this
> at work - here's how we did it:
> - the NUMA topology is exposed via agent custom resources
> - the framework does the allocation of the corresponding resources to
> the tasks according to the NUMA topology: e.g. if the task requests 2
> CPUs within the same NUMA node, the framework would allocate them
> - a custom executor then implements the CPU affinity/cpuset using the
> resources provided by the framework
>
> It works really nicely.
>
> Cheers,
>
> Charles
>
>
> On Tue, Jul 7, 2020 at 6:12 PM Milind Chabbi wrote:
> >
> > Grégoire, thanks for your reply. This is super helpful for making a
> stronger case for the affinity benefits.
> > Would you be able to share the additional details you mentioned? I am
> definitely interested.
> > Is your isolator source code publicly available?
> >
> > -Milind
> >
> > On Tue, Jul 7, 2020 at 3:14 AM Grégoire Seux  wrote:
> >>
> >> Hello,
> >>
> >> I'd like to share some feedback from experience, because we worked on
> this last year.
> >> We used CFS bandwidth isolation for several years and encountered
> many issues (lack of predictability, bugs in old Linux kernels, and
> lack of cache/memory locality). At some point, we implemented a custom
> isolator to manage cpusets (using
> https://github.com/criteo/mesos-command-modules/ as a base to write an
> isolator in a scripting language).
> >>
> >> The isolator had a very simple behavior: upon a new task, look at which
> CPUs are not already within a cpuset cgroup, select (if possible) CPUs from
> the same NUMA node, and create a cpuset cgroup for the starting task.
> >> In practice, it provided a general decrease in CPU consumption (up to
> 8% for some CPU-intensive applications) and a better ability to reason
> about the CPU isolation model.
> >> The allocation is optimistic: it tries to use CPUs from the same NUMA
> node, but if that's not possible, the task is spread across nodes. In
> practice this happens very rarely, thanks to one small optimization:
> assigning CPUs from the most loaded NUMA node (which decreases
> fragmentation of available CPUs across NUMA nodes).
> >>
> >> I'd be glad to give more details if you are interested.
> >>
> >> --
> >> Grégoire
>


Re: [BULK]Re: cgroup CPUSET for mesos agent

2021-01-14 Thread Charles-François Natali
It's a bit old but in case it could help, we recently implemented this
at work - here's how we did it:
- the NUMA topology is exposed via agent custom resources
- the framework does the allocation of the corresponding resources to
the tasks according to the NUMA topology: e.g. if the task requests 2
CPUs within the same NUMA node, the framework would allocate them
- a custom executor then implements the CPU affinity/cpuset using the
resources provided by the framework

It works really nicely.
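
For illustration, the first step could look roughly like this on the agent
command line (a sketch only: the resource names and CPU ranges are
hypothetical, not necessarily what we used):

    mesos-agent --master=<master-host>:5050 \
      --resources='[
        {"name": "numa0_cpus", "type": "RANGES",
         "ranges": {"range": [{"begin": 0, "end": 15}]}},
        {"name": "numa1_cpus", "type": "RANGES",
         "ranges": {"range": [{"begin": 16, "end": 31}]}}
      ]'

The framework then treats each range as the pool of CPU ids available on
that NUMA node, and the executor applies whichever ids it is given.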

Cheers,

Charles


On Tue, Jul 7, 2020 at 6:12 PM Milind Chabbi wrote:
>
> Grégoire, thanks for your reply. This is super helpful for making a
> stronger case for the affinity benefits.
> Would you be able to share the additional details you mentioned? I am
> definitely interested.
> Is your isolator source code publicly available?
>
> -Milind
>
> On Tue, Jul 7, 2020 at 3:14 AM Grégoire Seux  wrote:
>>
>> Hello,
>>
>> I'd like to share some feedback from experience, because we worked on
>> this last year.
>> We used CFS bandwidth isolation for several years and encountered many
>> issues (lack of predictability, bugs in old Linux kernels, and lack of
>> cache/memory locality). At some point, we implemented a custom
>> isolator to manage cpusets (using
>> https://github.com/criteo/mesos-command-modules/ as a base to write an
>> isolator in a scripting language).
>>
>> The isolator had a very simple behavior: upon a new task, look at which
>> CPUs are not already within a cpuset cgroup, select (if possible) CPUs
>> from the same NUMA node, and create a cpuset cgroup for the starting task.
>> In practice, it provided a general decrease in CPU consumption (up to 8%
>> for some CPU-intensive applications) and a better ability to reason about
>> the CPU isolation model.
>> The allocation is optimistic: it tries to use CPUs from the same NUMA
>> node, but if that's not possible, the task is spread across nodes. In
>> practice this happens very rarely, thanks to one small optimization:
>> assigning CPUs from the most loaded NUMA node (which decreases
>> fragmentation of available CPUs across NUMA nodes).
>>
>> I'd be glad to give more details if you are interested.
>>
>> --
>> Grégoire


Re: [BULK]Re: cgroup CPUSET for mesos agent

2020-07-07 Thread Milind Chabbi
Grégoire, thanks for your reply. This is super helpful for making a
stronger case for the affinity benefits.
Would you be able to share the additional details you mentioned? I am
definitely interested.
Is your isolator source code publicly available?

-Milind

On Tue, Jul 7, 2020 at 3:14 AM Grégoire Seux  wrote:

> Hello,
>
> I'd like to share some feedback from experience, because we worked on
> this last year.
> We used CFS bandwidth isolation for several years and encountered many
> issues (lack of predictability, bugs in old Linux kernels, and lack
> of cache/memory locality). At some point, we implemented a custom
> isolator to manage cpusets (using
> https://github.com/criteo/mesos-command-modules/ as a base to write an
> isolator in a scripting language).
>
> The isolator had a very simple behavior: upon a new task, look at which
> CPUs are not already within a cpuset cgroup, select (if possible) CPUs
> from the same NUMA node, and create a cpuset cgroup for the starting task.
> In practice, it provided a general decrease in CPU consumption (up to 8%
> for some CPU-intensive applications) and a better ability to reason about
> the CPU isolation model.
> The allocation is optimistic: it tries to use CPUs from the same NUMA
> node, but if that's not possible, the task is spread across nodes. In
> practice this happens very rarely, thanks to one small optimization:
> assigning CPUs from the most loaded NUMA node (which decreases
> fragmentation of available CPUs across NUMA nodes).
>
> I'd be glad to give more details if you are interested.
>
> --
> Grégoire
>


Re: [BULK]Re: cgroup CPUSET for mesos agent

2020-07-07 Thread Grégoire Seux
Hello,

I'd like to share some feedback from experience, because we worked on this
last year.
We used CFS bandwidth isolation for several years and encountered many
issues (lack of predictability, bugs in old Linux kernels, and lack of
cache/memory locality). At some point, we implemented a custom isolator to
manage cpusets (using https://github.com/criteo/mesos-command-modules/ as a
base to write an isolator in a scripting language).

The isolator had a very simple behavior: upon a new task, look at which CPUs
are not already within a cpuset cgroup, select (if possible) CPUs from the
same NUMA node, and create a cpuset cgroup for the starting task.
In practice, it provided a general decrease in CPU consumption (up to 8% for
some CPU-intensive applications) and a better ability to reason about the
CPU isolation model.
The allocation is optimistic: it tries to use CPUs from the same NUMA node,
but if that's not possible, the task is spread across nodes. In practice this
happens very rarely, thanks to one small optimization: assigning CPUs from
the most loaded NUMA node (which decreases fragmentation of available CPUs
across NUMA nodes).
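
To make the heuristic concrete, here is a rough Python sketch of the
selection logic (the helper names are made up; our actual isolator is a
script run through mesos-command-modules, so this is an illustration rather
than our code):

    import os

    CPUSET_ROOT = "/sys/fs/cgroup/cpuset"

    def read_cpus(path):
        # Parse a cpuset list such as "0-3,8,10-11" into a set of CPU ids.
        cpus = set()
        with open(path) as f:
            for part in f.read().strip().split(","):
                if "-" in part:
                    lo, hi = part.split("-")
                    cpus.update(range(int(lo), int(hi) + 1))
                elif part:
                    cpus.add(int(part))
        return cpus

    def allocate(n, numa_nodes):
        # numa_nodes maps node id -> set of CPU ids on that node.
        # First, find CPUs already claimed by existing task cgroups.
        claimed = set()
        for entry in os.listdir(CPUSET_ROOT):
            f = os.path.join(CPUSET_ROOT, entry, "cpuset.cpus")
            if os.path.isfile(f):
                claimed |= read_cpus(f)
        # Prefer the most loaded node that still fits the request, i.e.
        # the node with the fewest free CPUs, to limit fragmentation.
        best = None
        for cpus in numa_nodes.values():
            free = cpus - claimed
            if len(free) >= n and (best is None or len(free) < len(best)):
                best = free
        if best is not None:
            return sorted(best)[:n]
        # Rare fallback: spread the task across nodes.
        free = set().union(*numa_nodes.values()) - claimed
        return sorted(free)[:n]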

I'd be glad to give more details if you are interested.

--
Grégoire


Re: cgroup CPUSET for mesos agent

2020-07-06 Thread Charles-François Natali
Maybe give it a try then; it might help.

Cheers,


On Mon, Jul 6, 2020 at 9:20 PM Milind Chabbi wrote:
>
>
>
> On Mon, Jul 6, 2020 at 1:18 PM Charles-François Natali  
> wrote:
>>
>> >> Also, there are some obvious limitations with this: for example
>> >> binding processes to a specific NUMA node means that you might not
>> >> benefit from CPU bursting (e.g. if there's some available CPU on
>> >> another NUMA node).
>> >
>> >
>> > True. I would like the burst to be limited to only the cores on a single 
>> > socket.
>> > Data locality can be more important than available parallelism, sometimes.
>> >
>> >>
>> >> Also NUMA binding has actually quite a few possible settings: for
>> >> example you might also want to bind the memory allocations, etc, which
>> >> means a simple flag might not be enough to achieve what you want.
>> >>
>> >
>> > True. I would like to rely on the default "first-touch" policy: if the 
>> > container is restricted to a socket, the data will be allocated on the 
>> > same NUMA node, as long as memory is available.
>> >
>>
>> Yes, it sounds like you probably want some fine-grained control over
>> the NUMA policy, which would probably be difficult to implement in the
>> agent.
>>
>> >> One possibility I can think of might be to write your own executor -
>> >> we wrote our own executor at work for various reasons.
>> >> It's a bit of work, but it would give you unlimited flexibility in how
>> >> you start your tasks, bind them etc.
>> >>
>> >
>> > I am new to the Mesos code base; I would appreciate any pointers or 
>> > examples.
>>
>> For the executor have you read
>> http://mesos.apache.org/documentation/latest/executor-http-api/ ?
>> For code you can have a look e.g. at the command executor:
>> https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp
>>
>> Or a trivial example in Python:
>> https://github.com/douban/pymesos/blob/master/examples/executor.py
>>
>> >> Also out of curiosity - is automatic NUMA balancing enabled on your
>> >> agents (kernel.numa_balancing sysctl)?
>> >
>> >
>> > Interesting. I was unaware of this sysctl flag. On looking up more, I 
>> > realize that it may not work for our use case.
>> > It migrates pages to cores used by a container. If no CPUSET was assigned 
>> > to begin with, for Go and Java programs with 10s (sometimes 1000s) of 
>> > CPU threads, I notice that the data gets 50-50 split on a 2-socket system.
>> > For real-time queries that last for 100s of milliseconds, I don't see the 
>> > kernel's automatic migration being very effective; in fact, it may worsen 
>> > the situation.
>> > Have you had success with kernel.numa_balancing? What was the scenario 
>> > where it helped?
>>
>> Yes, the reason I was asking is that it might actually be causing you
>> some pain if it's enabled, depending on your workloads.
>> The only times I had to use this sysctl were actually to disable it -
>> in my experience it was causing some latency spikes: I'm not talking about
>> the few microseconds you might expect from a soft page fault, but
>> single-digit ms latencies.
>> Obviously it depends on the workloads and can probably help most of
>> the time, since I believe it's enabled by default on NUMA systems.
>> I guess the best way to find out is to try :).
>>
>> > I notice that the data gets 50-50 split on a 2-socket system
>>
>> Do you mean for a single process - by looking at /proc/<pid>/numa_maps?
>> Is it with or without NUMA balancing?
>>
>>
> By looking at `numastat -p <pid>`. NUMA balancing is off.
>
>>
>>
>> >
>> >>
>> >>
>> >> Cheers,
>> >>
>> >> Charles
>> >>
>> >>
>> >> On Mon, Jul 6, 2020 at 7:36 PM Milind Chabbi wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have noticed that without explicit flags, the mesos-agent does not 
>> >> > restrict a container's cgroup to any CPUSET. This has quite 
>> >> > deleterious consequences in our usage model, where the OS threads of 
>> >> > containerized processes migrate across NUMA sockets over time and 
>> >> > lose locality to the memory they allocated under the first-touch 
>> >> > policy. It would take a lot of effort to specify the exact CPUSET at 
>> >> > container launch time.
>> >> >
>> >> > I am wondering if the mesos agent can expose a flag (e.g., 
>> >> > --best-effort-numa-locality) so that, if the requested CPU shares and 
>> >> > memory demands fit within a single NUMA node, the container can be 
>> >> > launched with its cgroup affinity set to that socket, avoiding the 
>> >> > deleterious effects of unrestricted CPU migration.
>> >> >
>> >> > -Milind


Re: cgroup CPUSET for mesos agent

2020-07-06 Thread Milind Chabbi
On Mon, Jul 6, 2020 at 1:18 PM Charles-François Natali wrote:

> >> Also, there are some obvious limitations with this: for example
> >> binding processes to a specific NUMA node means that you might not
> >> benefit from CPU bursting (e.g. if there's some available CPU on
> >> another NUMA node).
> >
> >
> > True. I would like the burst to be limited to only the cores on a single
> socket.
> > Data locality can be more important than available parallelism,
> sometimes.
> >
> >>
> >> Also NUMA binding has actually quite a few possible settings: for
> >> example you might also want to bind the memory allocations, etc, which
> >> means a simple flag might not be enough to achieve what you want.
> >>
> >
> > True. I would like to rely on the default "first-touch" policy: if
> the container is restricted to a socket, the data will be allocated on the
> same NUMA node, as long as memory is available.
> >
>
> Yes, it sounds like you probably want some fine-grained control over
> the NUMA policy, which would probably be difficult to implement in the
> agent.
>
> >> One possibility I can think of might be to write your own executor -
> >> we wrote our own executor at work for various reasons.
> >> It's a bit of work, but it would give you unlimited flexibility in how
> >> you start your tasks, bind them etc.
> >>
> >
> > I am new to the Mesos code base; I would appreciate any pointers or
> examples.
>
> For the executor have you read
> http://mesos.apache.org/documentation/latest/executor-http-api/ ?
> For code you can have a look e.g. at the command executor:
> https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp
>
> Or a trivial example in Python:
> https://github.com/douban/pymesos/blob/master/examples/executor.py
>
> >> Also out of curiosity - is automatic NUMA balancing enabled on your
> >> agents (kernel.numa_balancing sysctl)?
> >
> >
> > Interesting. I was unaware of this sysctl flag. On looking up more, I
> realize that it may not work for our use case.
> > It migrates pages to cores used by a container. If no CPUSET was
> assigned to begin with, for Go and Java programs with 10s (sometimes
> 1000s) of CPU threads, I notice that the data gets 50-50 split on a
> 2-socket system.
> > For real-time queries that last for 100s of milliseconds, I don't see the
> kernel's automatic migration being very effective; in fact, it may worsen
> the situation.
> > Have you had success with kernel.numa_balancing? What was the scenario
> where it helped?
>
> Yes, the reason I was asking is that it might actually be causing you
> some pain if it's enabled, depending on your workloads.
> The only times I had to use this sysctl were actually to disable it -
> in my experience it was causing some latency spikes: I'm not talking about
> the few microseconds you might expect from a soft page fault, but
> single-digit ms latencies.
> Obviously it depends on the workloads and can probably help most of
> the time, since I believe it's enabled by default on NUMA systems.
> I guess the best way to find out is to try :).
>
> > I notice that the data gets 50-50 split on a 2-socket system
>
> Do you mean for a single process - by looking at /proc/<pid>/numa_maps?
> Is it with or without NUMA balancing?
>
>
> By looking at `numastat -p <pid>`. NUMA balancing is off.


>
> >
> >>
> >>
> >> Cheers,
> >>
> >> Charles
> >>
> >>
> >> On Mon, Jul 6, 2020 at 7:36 PM Milind Chabbi wrote:
> >> >
> >> > Hi,
> >> >
> >> > I have noticed that without explicit flags, the mesos-agent does not
> >> > restrict a container's cgroup to any CPUSET. This has quite
> >> > deleterious consequences in our usage model, where the OS threads of
> >> > containerized processes migrate across NUMA sockets over time and
> >> > lose locality to the memory they allocated under the first-touch
> >> > policy. It would take a lot of effort to specify the exact CPUSET at
> >> > container launch time.
> >> >
> >> > I am wondering if the mesos agent can expose a flag (e.g.,
> >> > --best-effort-numa-locality) so that, if the requested CPU shares and
> >> > memory demands fit within a single NUMA node, the container can be
> >> > launched with its cgroup affinity set to that socket, avoiding the
> >> > deleterious effects of unrestricted CPU migration.
> >> >
> >> > -Milind
>


Re: cgroup CPUSET for mesos agent

2020-07-06 Thread Charles-François Natali
>> Also, there are some obvious limitations with this: for example
>> binding processes to a specific NUMA node means that you might not
>> benefit from CPU bursting (e.g. if there's some available CPU on
>> another NUMA node).
>
>
> True. I would like the burst to be limited to only the cores on a single 
> socket.
> Data locality can be more important than available parallelism, sometimes.
>
>>
>> Also NUMA binding has actually quite a few possible settings: for
>> example you might also want to bind the memory allocations, etc, which
>> means a simple flag might not be enough to achieve what you want.
>>
>
> True. I would like to rely on the default "first-touch" policy: if the 
> container is restricted to a socket, the data will be allocated on the same 
> NUMA node, as long as memory is available.
>

Yes, it sounds like you probably want some fine-grained control over
the NUMA policy, which would probably be difficult to implement in the
agent.

>> One possibility I can think of might be to write your own executor -
>> we wrote our own executor at work for various reasons.
>> It's a bit of work, but it would give you unlimited flexibility in how
>> you start your tasks, bind them etc.
>>
>
> I am new to the Mesos code base; I would appreciate any pointers or examples.

For the executor have you read
http://mesos.apache.org/documentation/latest/executor-http-api/ ?
For code you can have a look e.g. at the command executor:
https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp

Or a trivial example in Python:
https://github.com/douban/pymesos/blob/master/examples/executor.py
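
To give an idea, here is a rough (untested) sketch of an executor that
applies CPU affinity before running its task, assuming the pymesos API works
roughly as in the example above; the task-data format here is made up:

    import json
    import os
    import subprocess
    import threading

    from pymesos import Executor, MesosExecutorDriver, decode_data

    class PinningExecutor(Executor):
        def launchTask(self, driver, task):
            def run():
                # task.data is assumed to carry e.g.
                # {"cpus": [0, 1], "cmd": "./serve --port 8080"}
                spec = json.loads(decode_data(task['data']))
                # Pin ourselves; the child process inherits the affinity.
                os.sched_setaffinity(0, spec['cpus'])
                driver.sendStatusUpdate(dict(task_id=task['task_id'],
                                             state='TASK_RUNNING'))
                rc = subprocess.call(spec['cmd'], shell=True)
                state = 'TASK_FINISHED' if rc == 0 else 'TASK_FAILED'
                driver.sendStatusUpdate(dict(task_id=task['task_id'],
                                             state=state))
            threading.Thread(target=run).start()

    if __name__ == '__main__':
        MesosExecutorDriver(PinningExecutor(), use_addict=True).run()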

>> Also out of curiosity - is automatic NUMA balancing enabled on your
>> agents (kernel.numa_balancing sysctl)?
>
>
> Interesting. I was unaware of this sysctl flag. On looking up more, I realize 
> that it may not work for our use case.
> It migrates pages to cores used by a container. If no CPUSET was assigned to 
> begin with, for Go and Java programs with 10s (sometimes 1000s) of CPU 
> threads, I notice that the data gets 50-50 split on a 2-socket system.
> For real-time queries that last for 100s of milliseconds, I don't see the 
> kernel's automatic migration being very effective; in fact, it may worsen the 
> situation.
> Have you had success with kernel.numa_balancing? What was the scenario where 
> it helped?

Yes, the reason I was asking is that it might actually be causing you
some pain if it's enabled, depending on your workloads.
The only times I had to use this sysctl were actually to disable it -
in my experience it was causing some latency spikes: I'm not talking about
the few microseconds you might expect from a soft page fault, but
single-digit ms latencies.
Obviously it depends on the workloads and can probably help most of
the time, since I believe it's enabled by default on NUMA systems.
I guess the best way to find out is to try :).

> I notice that the data gets 50-50 split on a 2-socket system

Do you mean for a single process - by looking at /proc/<pid>/numa_maps?
Is it with or without NUMA balancing?
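
FWIW, a quick way to check both from a script (a sketch; it only assumes the
standard Linux procfs paths, and takes the pid to inspect as an argument):

    from collections import Counter
    import sys

    def numa_balancing_enabled():
        with open('/proc/sys/kernel/numa_balancing') as f:
            return f.read().strip() != '0'

    def pages_per_node(pid):
        # numa_maps lines carry per-node page counts such as "N0=12 N1=34".
        counts = Counter()
        with open('/proc/%s/numa_maps' % pid) as f:
            for line in f:
                for token in line.split():
                    if token[:1] == 'N' and '=' in token:
                        node, pages = token[1:].split('=', 1)
                        if node.isdigit():
                            counts['N' + node] += int(pages)
        return counts

    if __name__ == '__main__':
        print('numa_balancing:', numa_balancing_enabled())
        print(pages_per_node(sys.argv[1]))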



>
>>
>>
>> Cheers,
>>
>> Charles
>>
>>
>> On Mon, Jul 6, 2020 at 7:36 PM Milind Chabbi wrote:
>> >
>> > Hi,
>> >
>> > I have noticed that without explicit flags, the mesos-agent does not 
>> > restrict a container's cgroup to any CPUSET. This has quite deleterious 
>> > consequences in our usage model, where the OS threads of containerized 
>> > processes migrate across NUMA sockets over time and lose locality to the 
>> > memory they allocated under the first-touch policy. It would take a lot 
>> > of effort to specify the exact CPUSET at container launch time.
>> >
>> > I am wondering if the mesos agent can expose a flag (e.g., 
>> > --best-effort-numa-locality) so that, if the requested CPU shares and 
>> > memory demands fit within a single NUMA node, the container can be 
>> > launched with its cgroup affinity set to that socket, avoiding the 
>> > deleterious effects of unrestricted CPU migration.
>> >
>> > -Milind


Re: cgroup CPUSET for mesos agent

2020-07-06 Thread Milind Chabbi
Thanks for your email, Charles.

On Mon, Jul 6, 2020 at 12:03 PM Charles-François Natali wrote:

> Hi Milind,
>
> (I'm just a user, not a developer, so take what I say with a grain of
> salt :-).
>
> AFAICT the agent/containerisation code is not NUMA-aware, so it
> probably wouldn't be trivial.
>
> Also, there are some obvious limitations with this: for example
> binding processes to a specific NUMA node means that you might not
> benefit from CPU bursting (e.g. if there's some available CPU on
> another NUMA node).
>

True. I would like the burst to be limited to only the cores on a single
socket.
Data locality can be more important than available parallelism, sometimes.


> Also NUMA binding has actually quite a few possible settings: for
> example you might also want to bind the memory allocations, etc, which
> means a simple flag might not be enough to achieve what you want.
>
>
True. I would like to rely on the default "first-touch" policy: if the
container is restricted to a socket, the data will be allocated on the same
NUMA node, as long as memory is available.
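
For illustration, here is roughly what I have in mind at the cgroup v1
level (a sketch; the group name and CPU layout are made up):

    import os

    def confine_to_node(group, node, cpus):
        # cpus: the CPU ids belonging to `node`, e.g. 0-15 on socket 0.
        path = '/sys/fs/cgroup/cpuset/%s' % group
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, 'cpuset.cpus'), 'w') as f:
            f.write(','.join(str(c) for c in cpus))
        # Confining cpuset.mems to the same node keeps first-touch
        # allocations local.
        with open(os.path.join(path, 'cpuset.mems'), 'w') as f:
            f.write(str(node))
        # A pid is then moved in by writing it to the group's tasks file.

    if __name__ == '__main__':
        confine_to_node('mesos/example-container', 0, range(16))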


> One possibility I can think of might be to write your own executor -
> we wrote our own executor at work for various reasons.
> It's a bit of work, but it would give you unlimited flexibility in how
> you start your tasks, bind them etc.
>
>
I am new to the Mesos code base; I would appreciate any pointers or
examples.


> Also out of curiosity - is automatic NUMA balancing enabled on your
> agents (kernel.numa_balancing sysctl)?
>

Interesting. I was unaware of this sysctl flag. On looking it up more,
I realize that it may not work for our use case.
It migrates pages to cores used by a container. If no CPUSET was assigned
to begin with, for Go and Java programs with 10s (sometimes 1000s) of
CPU threads, I notice that the data gets 50-50 split on a 2-socket system.
For real-time queries that last for 100s of milliseconds, I don't see the
kernel's automatic migration being very effective; in fact, it may worsen
the situation.
Have you had success with kernel.numa_balancing? What was the scenario
where it helped?


>
> Cheers,
>
> Charles
>
>
> On Mon, Jul 6, 2020 at 7:36 PM Milind Chabbi wrote:
> >
> > Hi,
> >
> > I have noticed that without explicit flags, the mesos-agent does not
> > restrict a container's cgroup to any CPUSET. This has quite deleterious
> > consequences in our usage model, where the OS threads of containerized
> > processes migrate across NUMA sockets over time and lose locality to the
> > memory they allocated under the first-touch policy. It would take a lot
> > of effort to specify the exact CPUSET at container launch time.
> >
> > I am wondering if the mesos agent can expose a flag (e.g.,
> > --best-effort-numa-locality) so that, if the requested CPU shares and
> > memory demands fit within a single NUMA node, the container can be
> > launched with its cgroup affinity set to that socket, avoiding the
> > deleterious effects of unrestricted CPU migration.
> >
> > -Milind
>


Re: cgroup CPUSET for mesos agent

2020-07-06 Thread Charles-François Natali
Hi Milind,

(I'm just a user, not a developer, so take what I say with a grain of salt :-).

AFAICT the agent/containerisation code is not NUMA-aware, so it
probably wouldn't be trivial.

Also, there are some obvious limitations with this: for example
binding processes to a specific NUMA node means that you might not
benefit from CPU bursting (e.g. if there's some available CPU on
another NUMA node).
Also NUMA binding has actually quite a few possible settings: for
example you might also want to bind the memory allocations, etc, which
means a simple flag might not be enough to achieve what you want.

One possibility I can think of might be to write your own executor -
we wrote our own executor at work for various reasons.
It's a bit of work, but it would give you unlimited flexibility in how
you start your tasks, bind them etc.

Also out of curiosity - is automatic NUMA balancing enabled on your
agents (kernel.numa_balancing sysctl)?

Cheers,

Charles


On Mon, Jul 6, 2020 at 7:36 PM Milind Chabbi wrote:
>
> Hi,
>
> I have noticed that without explicit flags, the mesos-agent does not
> restrict a container's cgroup to any CPUSET. This has quite deleterious
> consequences in our usage model, where the OS threads of containerized
> processes migrate across NUMA sockets over time and lose locality to the
> memory they allocated under the first-touch policy. It would take a lot of
> effort to specify the exact CPUSET at container launch time.
>
> I am wondering if the mesos agent can expose a flag (e.g.,
> --best-effort-numa-locality) so that, if the requested CPU shares and
> memory demands fit within a single NUMA node, the container can be launched
> with its cgroup affinity set to that socket, avoiding the deleterious
> effects of unrestricted CPU migration.
>
> -Milind