Re: [PATCH] util: NUMA aware memory preallocation

2022-06-08 Thread David Hildenbrand
On 11.05.22 11:34, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
 Long story short, management application has no way of learning
 TIDs of allocator threads so it can't make them run NUMA aware.
>>>
>>> This feels like the key issue. The preallocation threads are
>>> invisible to libvirt, regardless of whether we're doing coldplug
>>> or hotplug of memory-backends. Indeed the threads are invisible
>>> to all of QEMU, except the memory backend code.
>>>
>>> Conceptually we need 1 or more explicit worker threads, that we
>>> can assign CPU affinity to, and then QEMU can place jobs on them.
>>> I/O threads serve this role, but limited to blockdev work. We
>>> need a generalization of I/O threads, for arbitrary jobs that
>>> QEMU might want to farm out to specific numa nodes.
>>
>> At least the "-object iothread" thingy can already be used for actions
>> outside of blockdev. virtio-balloon uses one for free page hinting.
> 
> Ah that's good to know, so my idea probably isn't so much work as
> I thought it might be.

Looking into the details, iothreads are the wrong tool to use here.

I can imagine use cases where you'd want to perform preallocation
* Only on some specific CPUs of a NUMA node (especially, not ones where
  we pinned VCPUs)
* On CPUs that are on a different NUMA node than the actual backend
  memory ... just thinking about all of the CPU-less nodes for PMEM and
  CXL that we're starting to see.


So ideally, we'd let the user configure either
* Set of physical CPUs to use (low hanging fruit)
* Set of NUMA nodes to use (harder)
and allow libvirt to easily configure it similarly by pinning threads.

As CPU affinity is inherited when creating a new thread, here is one
IMHO reasonably simple way to get the job done and allow for such
flexibility.


Introduce a "thread-context" user-creatable object that is used to
configure CPU affinity.

Internally, thread-context creates exactly one thread called "TC $id".
That thread serves no purpose besides spawning new threads that inherit the
affinity.

Internal users (like preallocation, but we could reuse the same concept for
other non-I/O threads, such as userfaultfd, and other potential non-I/O
thread users) simply use that thread context to spawn new threads.


Spawned threads get called "TC $id/$threadname", whereby $threadname is the
ordinary name supplied by the internal user. This could be used to identify
long-running threads if needed in the future.

It's worth noting that this is not a thread pool.
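
For reference, the mechanism this relies on is plain pthread affinity
inheritance. A minimal standalone sketch (not QEMU code, CPU numbers made
up) of what "spawning new threads that inherit the affinity" means:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Workers never touch affinity themselves ... */
static void *worker(void *arg)
{
    cpu_set_t set;

    pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
    printf("worker may run on %d CPUs (inherited)\n", CPU_COUNT(&set));
    return NULL;
}

/* ... they simply inherit whatever the spawning "TC" thread is pinned to. */
static void *thread_context(void *arg)
{
    cpu_set_t set;
    pthread_t tids[2];
    int i;

    /* In the proposal QEMU or libvirt pins "TC $id"; here it pins itself. */
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    CPU_SET(1, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (i = 0; i < 2; i++) {
        pthread_create(&tids[i], NULL, worker, NULL);
    }
    for (i = 0; i < 2; i++) {
        pthread_join(tids[i], NULL);
    }
    return NULL;
}

int main(void)
{
    pthread_t tc;

    pthread_create(&tc, NULL, thread_context, NULL);
    pthread_join(tc, NULL);
    return 0;
}

The worker threads never call sched_setaffinity() themselves; only the
single long-lived TC thread ever has to be pinned.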


a) Ordinary cmdline usage:

-object thread-context,id="tc1",cpus=0-9,cpus=12

QEMU tries setting the CPU affinity and fails if that's impossible.

-object memory-backend-file,...,prealloc=on,prealloc-threads=16,prealloc-thread-context=tc1

When a context is set, preallocation code will use the thread-context to
spawn threads.


b) Libvirt, ..., usage:

-object thread-context,id="tc1"

Then, libvirt identifies and sets the affinity for the "TC tc1" thread.

-object memory-backend-file,...,prealloc=on,prealloc-threads=16,prealloc-thread-context=tc1



thread-context can be reused for successive preallocation etc., obviously.
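
How libvirt pins that thread is outside QEMU's scope, but for illustration,
here is a hedged sketch (not libvirt code) of how a management application
could locate "TC tc1" by thread name, assuming QEMU publishes the name via
the thread's comm (e.g. when started with -name debug-threads=on), and pin
it:

#define _GNU_SOURCE
#include <dirent.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Scan /proc/<pid>/task/<tid>/comm for a thread with the given name. */
static pid_t find_tid_by_name(pid_t pid, const char *name)
{
    char path[256], comm[64];
    struct dirent *de;
    DIR *dir;
    pid_t tid = -1;

    snprintf(path, sizeof(path), "/proc/%d/task", pid);
    dir = opendir(path);
    if (!dir) {
        return -1;
    }
    while ((de = readdir(dir)) != NULL) {
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/task/%s/comm", pid, de->d_name);
        f = fopen(path, "r");
        if (f && fgets(comm, sizeof(comm), f)) {
            comm[strcspn(comm, "\n")] = '\0';
            if (strcmp(comm, name) == 0) {
                tid = atoi(de->d_name);
            }
        }
        if (f) {
            fclose(f);
        }
    }
    closedir(dir);
    return tid;
}

int main(int argc, char **argv)
{
    pid_t qemu_pid, tid;
    cpu_set_t set;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <qemu-pid>\n", argv[0]);
        return 1;
    }
    qemu_pid = atoi(argv[1]);
    tid = find_tid_by_name(qemu_pid, "TC tc1");

    CPU_ZERO(&set);
    CPU_SET(8, &set);   /* host CPU 8: an arbitrary example */
    if (tid < 0 || sched_setaffinity(tid, sizeof(set), &set) != 0) {
        perror("pinning TC tc1");
        return 1;
    }
    return 0;
}

(In practice libvirt would more likely reuse its existing cgroup/cpuset
machinery; the lookup-by-name part is the piece this proposal newly enables.)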


-- 
Thanks,

David / dhildenb




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-12 Thread Daniel P . Berrangé
On Thu, May 12, 2022 at 09:41:29AM +0200, Paolo Bonzini wrote:
> On 5/11/22 18:54, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
> > > On 5/11/22 12:10, Daniel P. Berrangé wrote:
> > > > I expect creating/deleting I/O threads is cheap in comparison to
> > > > the work done for preallocation. If libvirt is using -preconfig
> > > > and object-add to create the memory backend, then we could have
> > > > option of creating the I/O threads dynamically in -preconfig mode,
> > > > create the memory backend, and then delete the I/O threads again.
> > > 
> > > I think this is very overengineered.  Michal's patch is doing the obvious
> > > thing and if it doesn't work that's because Libvirt is trying to 
> > > micromanage
> > > QEMU.
> > 
> > Calling it micromanaging is putting a very negative connotation on
> > this. What we're trying todo is enforce a host resource policy for
> > QEMU, in a way that a compromised QEMU can't escape, which is a
> > valuable protection.
> 
> I'm sorry if that was a bit exaggerated, but the negative connotation was
> intentional.
> 
> > > As mentioned on IRC, if the reason is to prevent moving around threads in
> > > realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the 
> > > kernel
> > > level.
> > 
> > We use cgroups where it is available to us, but we don't always have
> > the freedom that we'd like.
> 
> I understand.  I'm thinking of a new flag to sched_setscheduler that fixes
> the CPU affinity and policy of the thread and prevents changing it in case
> QEMU is compromised later.  The seccomp/SELinux sandboxes can prevent
> setting the SCHED_FIFO class without this flag.
> 
> In addition, my hunch is that this works only because the RT setup of QEMU
> is not safe against priority inversion.  IIRC the iothreads are set with a
> non-realtime priority, but actually they should have a _higher_ priority
> than the CPU threads, and the thread pool I/O bound workers should have an
> even higher priority; otherwise you have a priority inversion situation
> where an interrupt is pending that would wake up the CPU, but the iothreads
> cannot process it because they have a lower priority than the CPU.

At least for RHEL deployments of KVM-RT, IIUC the expectation is that
the VCPUs with RT priority never do I/O, and that there is at least 1
additional non-RT vCPU from which the OS performs I/O. IOW, the RT
VCPU works in a completely self-contained manner with no interaction
with any other QEMU threads. If that's not the case, then you would
have to make sure those other threads have priority / scheduler
adjustments to avoid priority inversion.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-12 Thread Paolo Bonzini

On 5/11/22 18:54, Daniel P. Berrangé wrote:

On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:

On 5/11/22 12:10, Daniel P. Berrangé wrote:

I expect creating/deleting I/O threads is cheap in comparison to
the work done for preallocation. If libvirt is using -preconfig
and object-add to create the memory backend, then we could have
option of creating the I/O threads dynamically in -preconfig mode,
create the memory backend, and then delete the I/O threads again.


I think this is very overengineered.  Michal's patch is doing the obvious
thing and if it doesn't work that's because Libvirt is trying to micromanage
QEMU.


Calling it micromanaging is putting a very negative connotation on
this. What we're trying todo is enforce a host resource policy for
QEMU, in a way that a compromised QEMU can't escape, which is a
valuable protection.


I'm sorry if that was a bit exaggerated, but the negative connotation 
was intentional.



As mentioned on IRC, if the reason is to prevent moving around threads in
realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the kernel
level.


We use cgroups where it is available to us, but we don't always have
the freedom that we'd like.


I understand.  I'm thinking of a new flag to sched_setscheduler that 
fixes the CPU affinity and policy of the thread and prevents changing it 
in case QEMU is compromised later.  The seccomp/SELinux sandboxes can 
prevent setting the SCHED_FIFO class without this flag.


In addition, my hunch is that this works only because the RT setup of 
QEMU is not safe against priority inversion.  IIRC the iothreads are set 
with a non-realtime priority, but actually they should have a _higher_ 
priority than the CPU threads, and the thread pool I/O bound workers 
should have an even higher priority; otherwise you have a priority 
inversion situation where an interrupt is pending that would wake up the 
CPU, but the iothreads cannot process it because they have a lower 
priority than the CPU.


So the iothread and the associated util/thread-pool.c thread pool are 
the wrong tools to solve Michal's issue; they are not meant for 
background CPU-bound work, even though they _might_ work due to their 
incorrect RT setup.


Paolo



Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Daniel P . Berrangé
On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
> On 5/11/22 12:10, Daniel P. Berrangé wrote:
> > If all we needs is NUMA affinity, not CPU affinity, then it would
> > be sufficient to create 1 I/O thread per host NUMA node that the
> > VM needs to use. The job running in the I/O can spawn further
> > threads and inherit the NUMA affinity.  This might be more clever
> > than it is needed though.
> > 
> > I expect creating/deleting I/O threads is cheap in comparison to
> > the work done for preallocation. If libvirt is using -preconfig
> > and object-add to create the memory backend, then we could have
> > option of creating the I/O threads dynamically in -preconfig mode,
> > create the memory backend, and then delete the I/O threads again.
> 
> I think this is very overengineered.  Michal's patch is doing the obvious
> thing and if it doesn't work that's because Libvirt is trying to micromanage
> QEMU.

Calling it micromanaging is putting a very negative connotation on
this. What we're trying to do is enforce a host resource policy for
QEMU, in a way that a compromised QEMU can't escape, which is a
valuable protection. 

> As mentioned on IRC, if the reason is to prevent moving around threads in
> realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the kernel
> level.

We use cgroups where it is available to us, but we don't always have
the freedom that we'd like. Sometimes the deployment scenario of
libvirt means that we're stuck with whatever cgroup libvirtd is
launched in and sched_setaffinity is our only way to confine QEMU,
so wanting to prevent use of sched_setaffinity is reasonable IMHO.
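
For readers wondering what preventing sched_setaffinity() looks like in
practice, here is an illustration only (not libvirt's or QEMU's actual
sandbox code): a libseccomp filter that makes the syscall fail with EPERM,
so a compromised QEMU cannot undo the placement chosen by the management
layer. QEMU's -sandbox option and SELinux policy can be used to similar
effect.

#include <errno.h>
#include <seccomp.h>     /* link with -lseccomp */

int deny_affinity_changes(void)
{
    /* Allow everything by default, but turn sched_setaffinity() into EPERM. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    int rc = -1;

    if (!ctx) {
        return -1;
    }
    if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                         SCMP_SYS(sched_setaffinity), 0) == 0) {
        rc = seccomp_load(ctx);   /* installed filter is inherited by
                                   * threads created afterwards */
    }
    seccomp_release(ctx);
    return rc;
}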

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread David Hildenbrand
On 11.05.22 17:08, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 03:16:55PM +0200, Michal Prívozník wrote:
>> On 5/10/22 11:12, Daniel P. Berrangé wrote:
>>> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
 When allocating large amounts of memory the task is offloaded
 onto threads. These threads then use various techniques to
 allocate the memory fully (madvise(), writing into the memory).
 However, these threads are free to run on any CPU, which becomes
 problematic on NUMA machines because it may happen that a thread
 is running on a distant node.

 Ideally, this is something that a management application would
 resolve, but we are not anywhere close to that, Firstly, memory
 allocation happens before monitor socket is even available. But
 okay, that's what -preconfig is for. But then the problem is that
 'object-add' would not return until all memory is preallocated.

 Long story short, management application has no way of learning
 TIDs of allocator threads so it can't make them run NUMA aware.
>>>
>>> So I'm wondering what the impact of this problem is for various
>>> scenarios.
>>
>> The scenario which I tested this with was no  but using
>> 'virsh emulatorpin' afterwards to pin emulator thread somewhere. For
>> those which are unfamiliar with libvirt, this is about placing the main
>> qemu TID (with the main eventloop) into a CGroup that restricts on what
>> CPUs it can run.
>>
>>>
>>> The default config for a KVM guest with libvirt is no CPU pinning
>>> at all. The kernel auto-places CPUs and decides on where RAM is to
>>> be allocated. So in this case, whether or not libvirt can talk to
>>> QMP in time to query threads is largely irrelevant, as we don't
>>> want todo placement in any case.
>>>
>>> In theory the kernel should allocate RAM on the node local to
>>> where the process is currently executing. So as long as the
>>> guest RAM fits in available free RAM on the local node, RAM
>>> should be allocated from the node that matches the CPU running
>>> the QEMU main thread.
>>>
>>> The challenge is if we spawn N more threads to do pre-alloc,
>>> these can be spread onto other nodes. I wonder if the kernel
>>> huas any preference for keeping threads within a process on
>>> the same NUMA node ?
>>
>> That's not exactly what I saw. I would have thought too that initially
>> the prealloc thread could be spawned just anywhere but after few
>> iterations the scheduler realized what NUMA node the thread is close to
>> and automatically schedule it to run there. Well, it didn't happen.
> 
> Thinking about it, this does make sense to some extent. When a
> thread is first spawned, how can the kernel know what region of
> memory it is about to start touching ? So at the very least the
> kernel schedular can get it wrong initially. It would need something
> to watch memory acces patterns to determine whether the initial
> decision was right or wrong, and fine tune it later.
> 
> Seems like the kernel typically tries todo the opposite to what
> we thought, and instead of moving CPUs, has ways to move the
> memory instead.
> 
> https://www.kernel.org/doc/html/latest/vm/page_migration.html
> 
>>> Overall, if libvirt is not applying pinning to the QEMU guest,
>>> then we're 100% reliant on the kernel todo something sensible,
>>> both for normal QEMU execution and for prealloc. Since we're
>>> not doing placement of QEMU RAM or CPUs, the logic in this
>>> patch won't do anything either.
>>>
>>>
>>> If the guest has more RAM than can fit on the local NUMA node,
>>> then we're doomed no matter what, even ignoring prealloc, there
>>> will be cross-node traffic. This scenario requires the admin to
>>> setup proper CPU /memory pinning for QEMU in libvirt.
>>>
>>> If libvirt is doing CPU pinning (as instructed by the mgmt app
>>> above us), then when we first start QEMU, the process thread
>>> leader will get given affinity by libvirt prior to exec. This
>>> affinity will be the union of affinity for all CPUs that will
>>> be later configured.
>>>
>>> The typical case for CPU pinning, is that everything fits in
>>> one NUMA node, and so in this case, we don't need todo anything
>>> more. The prealloc threads will already be constrained to the
>>> right place by the affinity of the QEMU thread leader, so the
>>> logic in this patch will run, but it won't do anything that
>>> was not already done.
>>>
>>> So we're left with the hardest case, where the guest is explicitly
>>> spread across multiple NUMA nodes. In this case the thread leader
>>> affinity will span many NUMA nodes, and so the prealloc threads
>>> will freely be placed across any CPU that is in the union of CPUs
>>> the guest is placed on. Just as with thue non-pinned case, the
>>> prealloc will be at the mercy of the kernel making sensible
>>> placement decisions.
>>
>> Indeed, but it's at least somewhat restricted. NB, in real scenario
>> users will map guest NUMA nodes 

Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Daniel P . Berrangé
On Wed, May 11, 2022 at 03:16:55PM +0200, Michal Prívozník wrote:
> On 5/10/22 11:12, Daniel P. Berrangé wrote:
> > On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
> >> When allocating large amounts of memory the task is offloaded
> >> onto threads. These threads then use various techniques to
> >> allocate the memory fully (madvise(), writing into the memory).
> >> However, these threads are free to run on any CPU, which becomes
> >> problematic on NUMA machines because it may happen that a thread
> >> is running on a distant node.
> >>
> >> Ideally, this is something that a management application would
> >> resolve, but we are not anywhere close to that, Firstly, memory
> >> allocation happens before monitor socket is even available. But
> >> okay, that's what -preconfig is for. But then the problem is that
> >> 'object-add' would not return until all memory is preallocated.
> >>
> >> Long story short, management application has no way of learning
> >> TIDs of allocator threads so it can't make them run NUMA aware.
> > 
> > So I'm wondering what the impact of this problem is for various
> > scenarios.
> 
> The scenario which I tested this with was no  but using
> 'virsh emulatorpin' afterwards to pin emulator thread somewhere. For
> those which are unfamiliar with libvirt, this is about placing the main
> qemu TID (with the main eventloop) into a CGroup that restricts on what
> CPUs it can run.
> 
> > 
> > The default config for a KVM guest with libvirt is no CPU pinning
> > at all. The kernel auto-places CPUs and decides on where RAM is to
> > be allocated. So in this case, whether or not libvirt can talk to
> > QMP in time to query threads is largely irrelevant, as we don't
> > want todo placement in any case.
> > 
> > In theory the kernel should allocate RAM on the node local to
> > where the process is currently executing. So as long as the
> > guest RAM fits in available free RAM on the local node, RAM
> > should be allocated from the node that matches the CPU running
> > the QEMU main thread.
> > 
> > The challenge is if we spawn N more threads to do pre-alloc,
> > these can be spread onto other nodes. I wonder if the kernel
> > huas any preference for keeping threads within a process on
> > the same NUMA node ?
> 
> That's not exactly what I saw. I would have thought too that initially
> the prealloc thread could be spawned just anywhere but after few
> iterations the scheduler realized what NUMA node the thread is close to
> and automatically schedule it to run there. Well, it didn't happen.

Thinking about it, this does make sense to some extent. When a
thread is first spawned, how can the kernel know what region of
memory it is about to start touching? So at the very least the
kernel scheduler can get it wrong initially. It would need something
to watch memory access patterns to determine whether the initial
decision was right or wrong, and fine-tune it later.

It seems the kernel typically tries to do the opposite of what
we thought: instead of moving CPUs, it has ways to move the
memory instead.

https://www.kernel.org/doc/html/latest/vm/page_migration.html

> > Overall, if libvirt is not applying pinning to the QEMU guest,
> > then we're 100% reliant on the kernel todo something sensible,
> > both for normal QEMU execution and for prealloc. Since we're
> > not doing placement of QEMU RAM or CPUs, the logic in this
> > patch won't do anything either.
> > 
> > 
> > If the guest has more RAM than can fit on the local NUMA node,
> > then we're doomed no matter what, even ignoring prealloc, there
> > will be cross-node traffic. This scenario requires the admin to
> > setup proper CPU /memory pinning for QEMU in libvirt.
> > 
> > If libvirt is doing CPU pinning (as instructed by the mgmt app
> > above us), then when we first start QEMU, the process thread
> > leader will get given affinity by libvirt prior to exec. This
> > affinity will be the union of affinity for all CPUs that will
> > be later configured.
> > 
> > The typical case for CPU pinning, is that everything fits in
> > one NUMA node, and so in this case, we don't need todo anything
> > more. The prealloc threads will already be constrained to the
> > right place by the affinity of the QEMU thread leader, so the
> > logic in this patch will run, but it won't do anything that
> > was not already done.
> > 
> > So we're left with the hardest case, where the guest is explicitly
> > spread across multiple NUMA nodes. In this case the thread leader
> > affinity will span many NUMA nodes, and so the prealloc threads
> > will freely be placed across any CPU that is in the union of CPUs
> > the guest is placed on. Just as with thue non-pinned case, the
> > prealloc will be at the mercy of the kernel making sensible
> > placement decisions.
> 
> Indeed, but it's at least somewhat restricted. NB, in real scenario
> users will map guest NUMA nodes onto host ones with 1:1 relationship.
> And each guest NUMA node will have its 

Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread David Hildenbrand
>>
>> The very last cases is the only one where this patch can potentially
>> be beneficial. The problem is that because libvirt is in charge of
>> enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
>> anything with CPU affinity. So AFAICT, this patch should result in
>> an error from sched_setaffinity when run under libvirt.
> 
> Yes, I had to disable capability dropping in qemu.conf.
> 
> After all, I think maybe the right place to fix this is kernel? I mean,
> why don't prealloc threads converge to the nodes they are working with?

Good question, I think the whole machinery doesn't support hugetlb,
including recording remote faults and migrating the thread to a
different node.

-- 
Thanks,

David / dhildenb




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Michal Prívozník
On 5/10/22 11:12, Daniel P. Berrangé wrote:
> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
>> When allocating large amounts of memory the task is offloaded
>> onto threads. These threads then use various techniques to
>> allocate the memory fully (madvise(), writing into the memory).
>> However, these threads are free to run on any CPU, which becomes
>> problematic on NUMA machines because it may happen that a thread
>> is running on a distant node.
>>
>> Ideally, this is something that a management application would
>> resolve, but we are not anywhere close to that, Firstly, memory
>> allocation happens before monitor socket is even available. But
>> okay, that's what -preconfig is for. But then the problem is that
>> 'object-add' would not return until all memory is preallocated.
>>
>> Long story short, management application has no way of learning
>> TIDs of allocator threads so it can't make them run NUMA aware.
> 
> So I'm wondering what the impact of this problem is for various
> scenarios.

The scenario I tested this with was no pinning configured up front, but
using 'virsh emulatorpin' afterwards to pin the emulator thread somewhere.
For those who are unfamiliar with libvirt, this is about placing the main
qemu TID (with the main event loop) into a cgroup that restricts which
CPUs it can run on.

> 
> The default config for a KVM guest with libvirt is no CPU pinning
> at all. The kernel auto-places CPUs and decides on where RAM is to
> be allocated. So in this case, whether or not libvirt can talk to
> QMP in time to query threads is largely irrelevant, as we don't
> want todo placement in any case.
> 
> In theory the kernel should allocate RAM on the node local to
> where the process is currently executing. So as long as the
> guest RAM fits in available free RAM on the local node, RAM
> should be allocated from the node that matches the CPU running
> the QEMU main thread.
> 
> The challenge is if we spawn N more threads to do pre-alloc,
> these can be spread onto other nodes. I wonder if the kernel
> huas any preference for keeping threads within a process on
> the same NUMA node ?

That's not exactly what I saw. I would have thought too that initially
the prealloc thread could be spawned just anywhere, but that after a few
iterations the scheduler would realize which NUMA node the thread is close
to and automatically schedule it to run there. Well, it didn't happen.

> 
> Overall, if libvirt is not applying pinning to the QEMU guest,
> then we're 100% reliant on the kernel todo something sensible,
> both for normal QEMU execution and for prealloc. Since we're
> not doing placement of QEMU RAM or CPUs, the logic in this
> patch won't do anything either.
> 
> 
> If the guest has more RAM than can fit on the local NUMA node,
> then we're doomed no matter what, even ignoring prealloc, there
> will be cross-node traffic. This scenario requires the admin to
> setup proper CPU /memory pinning for QEMU in libvirt.
> 
> If libvirt is doing CPU pinning (as instructed by the mgmt app
> above us), then when we first start QEMU, the process thread
> leader will get given affinity by libvirt prior to exec. This
> affinity will be the union of affinity for all CPUs that will
> be later configured.
> 
> The typical case for CPU pinning, is that everything fits in
> one NUMA node, and so in this case, we don't need todo anything
> more. The prealloc threads will already be constrained to the
> right place by the affinity of the QEMU thread leader, so the
> logic in this patch will run, but it won't do anything that
> was not already done.
> 
> So we're left with the hardest case, where the guest is explicitly
> spread across multiple NUMA nodes. In this case the thread leader
> affinity will span many NUMA nodes, and so the prealloc threads
> will freely be placed across any CPU that is in the union of CPUs
> the guest is placed on. Just as with thue non-pinned case, the
> prealloc will be at the mercy of the kernel making sensible
> placement decisions.

Indeed, but it's at least somewhat restricted. NB, in a real scenario
users will map guest NUMA nodes onto host ones with a 1:1 relationship.
And each guest NUMA node will have its own memdev=, i.e. its own set of
threads, so in the end prealloc threads won't jump between host NUMA
nodes but stay local to the node they are allocating memory on.
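
For readers unfamiliar with that layout, it corresponds to a QEMU
configuration along these lines (sizes, paths and CPU ranges below are
only illustrative):

-object memory-backend-file,id=mem0,size=64G,mem-path=/dev/hugepages,prealloc=on,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-15,memdev=mem0 \
-object memory-backend-file,id=mem1,size=64G,mem-path=/dev/hugepages,prealloc=on,host-nodes=1,policy=bind \
-numa node,nodeid=1,cpus=16-31,memdev=mem1

Each backend is bound to one host node via host-nodes=/policy=bind, so the
pages preallocated for mem0 always land on host node 0; the open question
in this thread is only which CPUs the prealloc threads run on.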

> 
> The very last cases is the only one where this patch can potentially
> be beneficial. The problem is that because libvirt is in charge of
> enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
> anything with CPU affinity. So AFAICT, this patch should result in
> an error from sched_setaffinity when run under libvirt.

Yes, I had to disable capability dropping in qemu.conf.

After all, I think maybe the right place to fix this is the kernel? I
mean, why don't prealloc threads converge to the nodes they are working
with?

Michal




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Paolo Bonzini

On 5/11/22 12:10, Daniel P. Berrangé wrote:

If all we needs is NUMA affinity, not CPU affinity, then it would
be sufficient to create 1 I/O thread per host NUMA node that the
VM needs to use. The job running in the I/O can spawn further
threads and inherit the NUMA affinity.  This might be more clever
than it is needed though.

I expect creating/deleting I/O threads is cheap in comparison to
the work done for preallocation. If libvirt is using -preconfig
and object-add to create the memory backend, then we could have
option of creating the I/O threads dynamically in -preconfig mode,
create the memory backend, and then delete the I/O threads again.


I think this is very overengineered.  Michal's patch is doing the 
obvious thing and if it doesn't work that's because Libvirt is trying to 
micromanage QEMU.


As mentioned on IRC, if the reason is to prevent moving around threads 
in realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the 
kernel level.


Paolo



Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Daniel P . Berrangé
On Wed, May 11, 2022 at 12:03:24PM +0200, David Hildenbrand wrote:
> On 11.05.22 11:34, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
>  Long story short, management application has no way of learning
>  TIDs of allocator threads so it can't make them run NUMA aware.
> >>>
> >>> This feels like the key issue. The preallocation threads are
> >>> invisible to libvirt, regardless of whether we're doing coldplug
> >>> or hotplug of memory-backends. Indeed the threads are invisible
> >>> to all of QEMU, except the memory backend code.
> >>>
> >>> Conceptually we need 1 or more explicit worker threads, that we
> >>> can assign CPU affinity to, and then QEMU can place jobs on them.
> >>> I/O threads serve this role, but limited to blockdev work. We
> >>> need a generalization of I/O threads, for arbitrary jobs that
> >>> QEMU might want to farm out to specific numa nodes.
> >>
> >> At least the "-object iothread" thingy can already be used for actions
> >> outside of blockdev. virtio-balloon uses one for free page hinting.
> > 
> > Ah that's good to know, so my idea probably isn't so much work as
> > I thought it might be.
> 
> I guess we'd have to create a bunch of iothreads on the command line and
> then feed them as an array to the memory backend we want to create. We
> could then forward the threads to a new variant of os_mem_prealloc().
> 
> We could
> 
> a) Allocate new iothreads for each memory backend we create. Hm, that
> might be suboptimal, we could end up with many iothreads.
> 
> b) Reuse iothreads and have separate sets of iothreads per NUMA node.
> Assign them to a node once.
> 
> c) Reuse iothreads and reassign them to NUMA nodes on demand.

If all we need is NUMA affinity, not CPU affinity, then it would
be sufficient to create 1 I/O thread per host NUMA node that the
VM needs to use. The job running in the I/O thread can spawn further
threads and inherit the NUMA affinity.  This might be more clever
than is needed, though.

I expect creating/deleting I/O threads is cheap in comparison to
the work done for preallocation. If libvirt is using -preconfig
and object-add to create the memory backend, then we could have
option of creating the I/O threads dynamically in -preconfig mode,
create the memory backend, and then delete the I/O threads again.

> However, I'm not sure what the semantics are when having multiple
> backends referencing the iothreads ...

Yep, we don't especially need an "ownership" relationship for what
we want to do with preallocation, especially because it is a one-off
point-in-time usage, not continuous usage as with block devices.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread David Hildenbrand
On 11.05.22 11:34, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
 Long story short, management application has no way of learning
 TIDs of allocator threads so it can't make them run NUMA aware.
>>>
>>> This feels like the key issue. The preallocation threads are
>>> invisible to libvirt, regardless of whether we're doing coldplug
>>> or hotplug of memory-backends. Indeed the threads are invisible
>>> to all of QEMU, except the memory backend code.
>>>
>>> Conceptually we need 1 or more explicit worker threads, that we
>>> can assign CPU affinity to, and then QEMU can place jobs on them.
>>> I/O threads serve this role, but limited to blockdev work. We
>>> need a generalization of I/O threads, for arbitrary jobs that
>>> QEMU might want to farm out to specific numa nodes.
>>
>> At least the "-object iothread" thingy can already be used for actions
>> outside of blockdev. virtio-balloon uses one for free page hinting.
> 
> Ah that's good to know, so my idea probably isn't so much work as
> I thought it might be.

I guess we'd have to create a bunch of iothreads on the command line and
then feed them as an array to the memory backend we want to create. We
could then forward the threads to a new variant of os_mem_prealloc().

We could

a) Allocate new iothreads for each memory backend we create. Hm, that
might be suboptimal, we could end up with many iothreads.

b) Reuse iothreads and have separate sets of iothreads per NUMA node.
Assign them to a node once.

c) Reuse iothreads and reassign them to NUMA nodes on demand.

However, I'm not sure what the semantics are when having multiple
backends referencing the iothreads ...


Regarding c), an alternative I raised is that we could simply
preallocate the threads in QEMU internally on machine creation, and let
libvirt reassign them to the fitting NUMA node whenever
coldplugging/hotplugging a new memory backend.

-- 
Thanks,

David / dhildenb




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Daniel P . Berrangé
On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
> >> Long story short, management application has no way of learning
> >> TIDs of allocator threads so it can't make them run NUMA aware.
> > 
> > This feels like the key issue. The preallocation threads are
> > invisible to libvirt, regardless of whether we're doing coldplug
> > or hotplug of memory-backends. Indeed the threads are invisible
> > to all of QEMU, except the memory backend code.
> > 
> > Conceptually we need 1 or more explicit worker threads, that we
> > can assign CPU affinity to, and then QEMU can place jobs on them.
> > I/O threads serve this role, but limited to blockdev work. We
> > need a generalization of I/O threads, for arbitrary jobs that
> > QEMU might want to farm out to specific numa nodes.
> 
> At least the "-object iothread" thingy can already be used for actions
> outside of blockdev. virtio-balloon uses one for free page hinting.

Ah that's good to know, so my idea probably isn't so much work as
I thought it might be.


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread David Hildenbrand
>> Long story short, management application has no way of learning
>> TIDs of allocator threads so it can't make them run NUMA aware.
> 
> This feels like the key issue. The preallocation threads are
> invisible to libvirt, regardless of whether we're doing coldplug
> or hotplug of memory-backends. Indeed the threads are invisible
> to all of QEMU, except the memory backend code.
> 
> Conceptually we need 1 or more explicit worker threads, that we
> can assign CPU affinity to, and then QEMU can place jobs on them.
> I/O threads serve this role, but limited to blockdev work. We
> need a generalization of I/O threads, for arbitrary jobs that
> QEMU might want to farm out to specific numa nodes.

At least the "-object iothread" thingy can already be used for actions
outside of blockdev. virtio-balloon uses one for free page hinting.

-- 
Thanks,

David / dhildenb




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Daniel P . Berrangé
On Wed, May 11, 2022 at 09:34:07AM +0100, Dr. David Alan Gilbert wrote:
> * Michal Privoznik (mpriv...@redhat.com) wrote:
> > When allocating large amounts of memory the task is offloaded
> > onto threads. These threads then use various techniques to
> > allocate the memory fully (madvise(), writing into the memory).
> > However, these threads are free to run on any CPU, which becomes
> > problematic on NUMA machines because it may happen that a thread
> > is running on a distant node.
> > 
> > Ideally, this is something that a management application would
> > resolve, but we are not anywhere close to that, Firstly, memory
> > allocation happens before monitor socket is even available. But
> > okay, that's what -preconfig is for. But then the problem is that
> > 'object-add' would not return until all memory is preallocated.
> > 
> > Long story short, management application has no way of learning
> > TIDs of allocator threads so it can't make them run NUMA aware.
> > 
> > But what we can do is to propagate the 'host-nodes' attribute of
> > MemoryBackend object down to where preallocation threads are
> > created and set their affinity according to the attribute.
> 
> Joe (cc'd) sent me some numbers for this which emphasise how useful it
> is:
>  | On systems with 4 physical numa nodes and 2-6 Tb of memory, this numa-aware
>  |preallocation provided about a 25% speedup in touching the pages.
>  |The speedup gets larger as the numa node count and memory sizes grow.
> 
>  | In a simple parallel 1Gb page-zeroing test on a very large system (32-numa
>  | nodes and 47Tb of memory), the numa-aware preallocation was 2.3X faster
>  | than letting the threads float wherever.
>  | We're working with someone whose large guest normally takes 4.5 hours to
>  | boot.  With Michal P's initial patch to parallelize the preallocation, that
>  | time dropped to about 1 hour.  Including this numa-aware preallocation
>  | would reduce the guest boot time to less than 1/2 hour.
> 
> so chopping *half an hour* off the startup time seems a worthy
> optimisation (even if most of us aren't fortunate enough to have 47T of
> ram).

I presume this test was done with bare QEMU though, not libvirt managed
QEMU, as IIUC, the latter would not be able to set its affinity and so
never see this benefit.


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Daniel P . Berrangé
On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
> When allocating large amounts of memory the task is offloaded
> onto threads. These threads then use various techniques to
> allocate the memory fully (madvise(), writing into the memory).
> However, these threads are free to run on any CPU, which becomes
> problematic on NUMA machines because it may happen that a thread
> is running on a distant node.
> 
> Ideally, this is something that a management application would
> resolve, but we are not anywhere close to that, Firstly, memory
> allocation happens before monitor socket is even available. But
> okay, that's what -preconfig is for. But then the problem is that
> 'object-add' would not return until all memory is preallocated.

Is the delay to 'object-add' actually a problem ?

Currently we're cold plugging the memory backends, so prealloc
happens before QMP is available. So we have a delay immediately
at startup. Switching to -preconfig plus 'object-add'  would
not be making the delay worse, merely moving it ever so slightly
later.

From the POV of an application using libvirt, this is the same:
virDomainCreate takes 1 hour, regardless of whether the 1-hour
allocation delay is before QMP or in -preconfig object-add
execution.

> Long story short, management application has no way of learning
> TIDs of allocator threads so it can't make them run NUMA aware.

This feels like the key issue. The preallocation threads are
invisible to libvirt, regardless of whether we're doing coldplug
or hotplug of memory-backends. Indeed the threads are invisible
to all of QEMU, except the memory backend code.

Conceptually we need 1 or more explicit worker threads, that we
can assign CPU affinity to, and then QEMU can place jobs on them.
I/O threads serve this role, but limited to blockdev work. We
need a generalization of I/O threads, for arbitrary jobs that
QEMU might want to farm out to specific numa nodes.

In a guest spanning multiple host NUMA nodes, libvirt would
have to configure 1 or more worker threads for QEMU, learn
their TIDs, then add the memory backends in -preconfig, which
would farm out preallocation to the worker threads, with
job placement matching the workers' affinity.


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] util: NUMA aware memory preallocation

2022-05-11 Thread Dr. David Alan Gilbert
* Michal Privoznik (mpriv...@redhat.com) wrote:
> When allocating large amounts of memory the task is offloaded
> onto threads. These threads then use various techniques to
> allocate the memory fully (madvise(), writing into the memory).
> However, these threads are free to run on any CPU, which becomes
> problematic on NUMA machines because it may happen that a thread
> is running on a distant node.
> 
> Ideally, this is something that a management application would
> resolve, but we are not anywhere close to that, Firstly, memory
> allocation happens before monitor socket is even available. But
> okay, that's what -preconfig is for. But then the problem is that
> 'object-add' would not return until all memory is preallocated.
> 
> Long story short, management application has no way of learning
> TIDs of allocator threads so it can't make them run NUMA aware.
> 
> But what we can do is to propagate the 'host-nodes' attribute of
> MemoryBackend object down to where preallocation threads are
> created and set their affinity according to the attribute.

Joe (cc'd) sent me some numbers for this which emphasise how useful it
is:
 | On systems with 4 physical numa nodes and 2-6 Tb of memory, this numa-aware
 | preallocation provided about a 25% speedup in touching the pages.
 | The speedup gets larger as the numa node count and memory sizes grow.

 | In a simple parallel 1Gb page-zeroing test on a very large system (32-numa
 | nodes and 47Tb of memory), the numa-aware preallocation was 2.3X faster
 | than letting the threads float wherever.
 | We're working with someone whose large guest normally takes 4.5 hours to
 | boot.  With Michal P's initial patch to parallelize the preallocation, that
 | time dropped to about 1 hour.  Including this numa-aware preallocation
 | would reduce the guest boot time to less than 1/2 hour.

so chopping *half an hour* off the startup time seems a worthy
optimisation (even if most of us aren't fortunate enough to have 47T of
ram).

Dave

> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
> Signed-off-by: Michal Privoznik 
> ---
>  backends/hostmem.c |  6 ++--
>  hw/virtio/virtio-mem.c |  2 +-
>  include/qemu/osdep.h   |  2 ++
>  util/meson.build   |  2 +-
>  util/oslib-posix.c | 74 --
>  util/oslib-win32.c |  2 ++
>  6 files changed, 82 insertions(+), 6 deletions(-)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index a7bae3d713..7373472c7e 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, 
> bool value,
>  void *ptr = memory_region_get_ram_ptr(>mr);
>  uint64_t sz = memory_region_size(>mr);
>  
> -os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, _err);
> +os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads,
> +backend->host_nodes, MAX_NODES, _err);
>  if (local_err) {
>  error_propagate(errp, local_err);
>  return;
> @@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, 
> Error **errp)
>   */
>  if (backend->prealloc) {
>  os_mem_prealloc(memory_region_get_fd(>mr), ptr, sz,
> -backend->prealloc_threads, _err);
> +backend->prealloc_threads, backend->host_nodes,
> +MAX_NODES, _err);
>  if (local_err) {
>  goto out;
>  }
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 5aca408726..48b104cdf6 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, 
> uint64_t start_gpa,
>  int fd = memory_region_get_fd(>memdev->mr);
>  Error *local_err = NULL;
>  
> -os_mem_prealloc(fd, area, size, 1, _err);
> +os_mem_prealloc(fd, area, size, 1, NULL, 0, _err);
>  if (local_err) {
>  static bool warned;
>  
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 1c1e7eca98..474cbf3b86 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -577,6 +577,8 @@ unsigned long qemu_getauxval(unsigned long type);
>  void qemu_set_tty_echo(int fd, bool echo);
>  
>  void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
> + const unsigned long *host_nodes,
> + unsigned long max_node,
>   Error **errp);
>  
>  /**
> diff --git a/util/meson.build b/util/meson.build
> index 8f16018cd4..393ff74570 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -15,7 +15,7 @@ freebsd_dep = []
>  if targetos == 'freebsd'
>freebsd_dep = util
>  endif
> -util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), 
> freebsd_dep])
> +util_ss.add(when: 'CONFIG_POSIX', if_true: 

Re: [PATCH] util: NUMA aware memory preallocation

2022-05-10 Thread Dr. David Alan Gilbert
* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
> > When allocating large amounts of memory the task is offloaded
> > onto threads. These threads then use various techniques to
> > allocate the memory fully (madvise(), writing into the memory).
> > However, these threads are free to run on any CPU, which becomes
> > problematic on NUMA machines because it may happen that a thread
> > is running on a distant node.
> > 
> > Ideally, this is something that a management application would
> > resolve, but we are not anywhere close to that, Firstly, memory
> > allocation happens before monitor socket is even available. But
> > okay, that's what -preconfig is for. But then the problem is that
> > 'object-add' would not return until all memory is preallocated.
> > 
> > Long story short, management application has no way of learning
> > TIDs of allocator threads so it can't make them run NUMA aware.
> 
> So I'm wondering what the impact of this problem is for various
> scenarios.
> 
> The default config for a KVM guest with libvirt is no CPU pinning
> at all. The kernel auto-places CPUs and decides on where RAM is to
> be allocated. So in this case, whether or not libvirt can talk to
> QMP in time to query threads is largely irrelevant, as we don't
> want todo placement in any case.
> 
> In theory the kernel should allocate RAM on the node local to
> where the process is currently executing. So as long as the
> guest RAM fits in available free RAM on the local node, RAM
> should be allocated from the node that matches the CPU running
> the QEMU main thread.
> 
> The challenge is if we spawn N more threads to do pre-alloc,
> these can be spread onto other nodes. I wonder if the kernel
> huas any preference for keeping threads within a process on
> the same NUMA node ?
> 
> Overall, if libvirt is not applying pinning to the QEMU guest,
> then we're 100% reliant on the kernel todo something sensible,
> both for normal QEMU execution and for prealloc. Since we're
> not doing placement of QEMU RAM or CPUs, the logic in this
> patch won't do anything either.
> 
> 
> If the guest has more RAM than can fit on the local NUMA node,
> then we're doomed no matter what, even ignoring prealloc, there
> will be cross-node traffic. This scenario requires the admin to
> setup proper CPU /memory pinning for QEMU in libvirt.
> 
> If libvirt is doing CPU pinning (as instructed by the mgmt app
> above us), then when we first start QEMU, the process thread
> leader will get given affinity by libvirt prior to exec. This
> affinity will be the union of affinity for all CPUs that will
> be later configured.
> 
> The typical case for CPU pinning, is that everything fits in
> one NUMA node, and so in this case, we don't need todo anything
> more. The prealloc threads will already be constrained to the
> right place by the affinity of the QEMU thread leader, so the
> logic in this patch will run, but it won't do anything that
> was not already done.
> 
> So we're left with the hardest case, where the guest is explicitly
> spread across multiple NUMA nodes. In this case the thread leader
> affinity will span many NUMA nodes, and so the prealloc threads
> will freely be placed across any CPU that is in the union of CPUs
> the guest is placed on. Just as with thue non-pinned case, the
> prealloc will be at the mercy of the kernel making sensible
> placement decisions.
> 
> The very last cases is the only one where this patch can potentially
> be beneficial. The problem is that because libvirt is in charge of
> enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
> anything with CPU affinity. So AFAICT, this patch should result in
> an error from sched_setaffinity when run under libvirt.

This problem crops up in a few places though; potentially the same trick
could be done with multifd threads in migration, if we can make some
multifd threads handle some specific NUMA nodes rather than shotgunning
them across the threads.
[Bonus: Make multifd route different NUMA nodes down NICs that are
node local]

Dave

> > But what we can do is to propagate the 'host-nodes' attribute of
> > MemoryBackend object down to where preallocation threads are
> > created and set their affinity according to the attribute.
> > 
> > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
> > Signed-off-by: Michal Privoznik 
> > ---
> >  backends/hostmem.c |  6 ++--
> >  hw/virtio/virtio-mem.c |  2 +-
> >  include/qemu/osdep.h   |  2 ++
> >  util/meson.build   |  2 +-
> >  util/oslib-posix.c | 74 --
> >  util/oslib-win32.c |  2 ++
> >  6 files changed, 82 insertions(+), 6 deletions(-)
> > 
> > diff --git a/backends/hostmem.c b/backends/hostmem.c
> > index a7bae3d713..7373472c7e 100644
> > --- a/backends/hostmem.c
> > +++ b/backends/hostmem.c
> > @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object 
> > *obj, 

Re: [PATCH] util: NUMA aware memory preallocation

2022-05-10 Thread Daniel P . Berrangé
On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
> When allocating large amounts of memory the task is offloaded
> onto threads. These threads then use various techniques to
> allocate the memory fully (madvise(), writing into the memory).
> However, these threads are free to run on any CPU, which becomes
> problematic on NUMA machines because it may happen that a thread
> is running on a distant node.
> 
> Ideally, this is something that a management application would
> resolve, but we are not anywhere close to that, Firstly, memory
> allocation happens before monitor socket is even available. But
> okay, that's what -preconfig is for. But then the problem is that
> 'object-add' would not return until all memory is preallocated.
> 
> Long story short, management application has no way of learning
> TIDs of allocator threads so it can't make them run NUMA aware.

So I'm wondering what the impact of this problem is for various
scenarios.

The default config for a KVM guest with libvirt is no CPU pinning
at all. The kernel auto-places CPUs and decides on where RAM is to
be allocated. So in this case, whether or not libvirt can talk to
QMP in time to query threads is largely irrelevant, as we don't
want to do placement in any case.

In theory the kernel should allocate RAM on the node local to
where the process is currently executing. So as long as the
guest RAM fits in available free RAM on the local node, RAM
should be allocated from the node that matches the CPU running
the QEMU main thread.

The challenge is if we spawn N more threads to do pre-alloc,
these can be spread onto other nodes. I wonder if the kernel
has any preference for keeping threads within a process on
the same NUMA node?

Overall, if libvirt is not applying pinning to the QEMU guest,
then we're 100% reliant on the kernel to do something sensible,
both for normal QEMU execution and for prealloc. Since we're
not doing placement of QEMU RAM or CPUs, the logic in this
patch won't do anything either.


If the guest has more RAM than can fit on the local NUMA node,
then we're doomed no matter what, even ignoring prealloc, there
will be cross-node traffic. This scenario requires the admin to
set up proper CPU/memory pinning for QEMU in libvirt.

If libvirt is doing CPU pinning (as instructed by the mgmt app
above us), then when we first start QEMU, the process thread
leader will get given affinity by libvirt prior to exec. This
affinity will be the union of affinity for all CPUs that will
be later configured.

The typical case for CPU pinning is that everything fits in
one NUMA node, and so in this case we don't need to do anything
more. The prealloc threads will already be constrained to the
right place by the affinity of the QEMU thread leader, so the
logic in this patch will run, but it won't do anything that
was not already done.

So we're left with the hardest case, where the guest is explicitly
spread across multiple NUMA nodes. In this case the thread leader
affinity will span many NUMA nodes, and so the prealloc threads
will freely be placed across any CPU that is in the union of CPUs
the guest is placed on. Just as with the non-pinned case, the
prealloc will be at the mercy of the kernel making sensible
placement decisions.

The very last case is the only one where this patch can potentially
be beneficial. The problem is that because libvirt is in charge of
enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
anything with CPU affinity. So AFAICT, this patch should result in
an error from sched_setaffinity when run under libvirt.

> But what we can do is to propagate the 'host-nodes' attribute of
> MemoryBackend object down to where preallocation threads are
> created and set their affinity according to the attribute.
> 
> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
> Signed-off-by: Michal Privoznik 
> ---
>  backends/hostmem.c |  6 ++--
>  hw/virtio/virtio-mem.c |  2 +-
>  include/qemu/osdep.h   |  2 ++
>  util/meson.build   |  2 +-
>  util/oslib-posix.c | 74 --
>  util/oslib-win32.c |  2 ++
>  6 files changed, 82 insertions(+), 6 deletions(-)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index a7bae3d713..7373472c7e 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, 
> bool value,
>  void *ptr = memory_region_get_ram_ptr(>mr);
>  uint64_t sz = memory_region_size(>mr);
>  
> -os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, _err);
> +os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads,
> +backend->host_nodes, MAX_NODES, _err);
>  if (local_err) {
>  error_propagate(errp, local_err);
>  return;
> @@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, 
> Error **errp)
>   */
>  if 

[PATCH] util: NUMA aware memory preallocation

2022-05-10 Thread Michal Privoznik
When allocating large amounts of memory the task is offloaded
onto threads. These threads then use various techniques to
allocate the memory fully (madvise(), writing into the memory).
However, these threads are free to run on any CPU, which becomes
problematic on NUMA machines because it may happen that a thread
is running on a distant node.

Ideally, this is something that a management application would
resolve, but we are not anywhere close to that. Firstly, memory
allocation happens before the monitor socket is even available. But
okay, that's what -preconfig is for. But then the problem is that
'object-add' would not return until all memory is preallocated.

Long story short, management application has no way of learning
TIDs of allocator threads so it can't make them run NUMA aware.

But what we can do is to propagate the 'host-nodes' attribute of
MemoryBackend object down to where preallocation threads are
created and set their affinity according to the attribute.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
Signed-off-by: Michal Privoznik 
---
 backends/hostmem.c |  6 ++--
 hw/virtio/virtio-mem.c |  2 +-
 include/qemu/osdep.h   |  2 ++
 util/meson.build   |  2 +-
 util/oslib-posix.c | 74 --
 util/oslib-win32.c |  2 ++
 6 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index a7bae3d713..7373472c7e 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, 
bool value,
void *ptr = memory_region_get_ram_ptr(&backend->mr);
uint64_t sz = memory_region_size(&backend->mr);
 
-os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err);
+os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads,
+backend->host_nodes, MAX_NODES, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 return;
@@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, 
Error **errp)
  */
 if (backend->prealloc) {
os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
-backend->prealloc_threads, &local_err);
+backend->prealloc_threads, backend->host_nodes,
+MAX_NODES, &local_err);
 if (local_err) {
 goto out;
 }
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 5aca408726..48b104cdf6 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, 
uint64_t start_gpa,
int fd = memory_region_get_fd(&vmem->memdev->mr);
 Error *local_err = NULL;
 
-os_mem_prealloc(fd, area, size, 1, &local_err);
+os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err);
 if (local_err) {
 static bool warned;
 
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 1c1e7eca98..474cbf3b86 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -577,6 +577,8 @@ unsigned long qemu_getauxval(unsigned long type);
 void qemu_set_tty_echo(int fd, bool echo);
 
 void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
+ const unsigned long *host_nodes,
+ unsigned long max_node,
  Error **errp);
 
 /**
diff --git a/util/meson.build b/util/meson.build
index 8f16018cd4..393ff74570 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -15,7 +15,7 @@ freebsd_dep = []
 if targetos == 'freebsd'
   freebsd_dep = util
 endif
-util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep])
+util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa])
 util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c'))
 util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c'))
 util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c'))
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 477990f39b..1572b9b178 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -73,6 +73,10 @@
 #include "qemu/error-report.h"
 #endif
 
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#endif
+
 #define MAX_MEM_PREALLOC_THREAD_COUNT 16
 
 struct MemsetThread;
@@ -82,6 +86,9 @@ typedef struct MemsetContext {
 bool any_thread_failed;
 struct MemsetThread *threads;
 int num_threads;
+#ifdef CONFIG_NUMA
+struct bitmask *nodemask;
+#endif
 } MemsetContext;
 
 struct MemsetThread {
@@ -420,6 +427,12 @@ static void *do_touch_pages(void *arg)
 }
qemu_mutex_unlock(&page_mutex);
 
+#ifdef CONFIG_NUMA
+if (memset_args->context->nodemask) {
+numa_run_on_node_mask(memset_args->context->nodemask);
+}
+#endif
+
 /* unblock SIGBUS */
sigemptyset(&set);
sigaddset(&set, SIGBUS);
@@ -463,6 +476,12 @@ static void *do_madv_populate_write_pages(void