Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-25 Thread Nitesh Narayan Lal
On 2/22/19 7:02 PM, Alexander Duyck wrote:
> On Mon, Feb 4, 2019 at 1:47 PM Nitesh Narayan Lal  wrote:
>> The following patch-set proposes an efficient mechanism for handing freed 
>> memory between the guest and the host. It enables the guests with no page 
>> cache to rapidly free and reclaim memory to and from the host, respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and 
>> single NUMA node with 15GB memory, we were able to successfully launch 
>> at least 5 guests
>> when page hinting was enabled and 3 without it. (Detailed explanation of the 
>> test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and 
>> scan the pages freed by the guest has been changed. The new approach is 
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this 
>> functionality. It maintains a per CPU array which is used to store the pages 
>> freed by the guest. The maximum number of entries which it can hold is 
>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it 
>> is scanned and only the pages which are available in the buddy are stored. 
>> This process continues until the array is filled with pages which are part 
>> of the buddy free list, after which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation 
>> and if the page is not reallocated and present in the buddy, the kernel 
>> thread attempts to isolate it from the buddy. If it is successfully 
>> isolated, the page is added to another per-cpu array. Once the entire 
>> scanning process is complete, all the isolated pages are reported to the 
>> host through an existing virtio-balloon driver.
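
(For illustration, the capture path described above amounts to roughly the
following. MAX_FGPT_ENTRIES is taken from the description; the structure
layout and helper names are only a sketch, not the actual patch code.)

#include <linux/mm.h>
#include <linux/percpu.h>

#define MAX_FGPT_ENTRIES        1000

struct guest_free_page {
        unsigned long pfn;
        unsigned int order;
};

struct free_page_obj {
        struct guest_free_page entries[MAX_FGPT_ENTRIES];
        int index;
};

static DEFINE_PER_CPU(struct free_page_obj, free_page_objs);

/* Called from arch_free_page() for every page freed by the guest. */
static void guest_capture_free_page(struct page *page, unsigned int order)
{
        struct free_page_obj *obj = this_cpu_ptr(&free_page_objs);

        if (obj->index < MAX_FGPT_ENTRIES) {
                obj->entries[obj->index].pfn = page_to_pfn(page);
                obj->entries[obj->index].order = order;
                obj->index++;
                return;
        }

        /*
         * Array full: drop entries that are no longer in the buddy and,
         * once the array holds only buddy pages, wake the per-CPU kernel
         * thread that isolates them and reports them via virtio-balloon.
         */
        wake_up_hinting_kthread(obj);           /* hypothetical helper */
}
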
>>
>> Known Issues:
>> * Fixed array size: The problem with having a fixed/hardcoded array 
>> size arises when the size of the guest varies. For example when the guest 
>> size increases and it starts making large allocations, the fixed size limits this 
>> solution's ability to capture all the freed pages. This will result in less 
>> guest free memory getting reported to the host.
>>
>> Known code re-work:
>> * Plan to re-use Wei's work, which communicates the poison value to 
>> the host.
>> * The nomenclature used in virtio-balloon needs to be changed so 
>> that the code can easily be distinguished from Wei's Free Page Hint code.
>> * Sorting based on zonenum, to avoid repetitive zone locks for the 
>> same zone.
>>
>> Other required work:
>> * Run other benchmarks to evaluate the performance/impact of this 
>> approach.
>>
>> Test case:
>> Setup:
>> Memory-15837 MB
>> Guest Memory Size-5 GB
>> Swap-Disabled
>> Test Program-Simple program which allocates 4GB memory via malloc, touches 
>> it via memset and exits.
>> Use case-Number of guests that can be launched completely including the 
>> successful execution of the test program.
>> Procedure:
>> The first guest is launched and once its console is up, the test allocation 
>> program is executed with 4 GB memory request (Due to this the guest occupies 
>> almost 4-5 GB of memory in the host in a system without page hinting). Once 
>> this program exits, another guest is launched in the host and 
>> the same process is followed. We continue launching the guests until a guest 
>> gets killed due to a low-memory condition in the host.
>>
>> Result:
>> Without Hinting-3 Guests
>> With Hinting-5 to 7 Guests (Based on the amount of memory freed/captured).
>>
>> [1] https://www.spinics.net/lists/kvm/msg170113.html
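
(For reference, the test program described in the setup above boils down to
a few lines of userspace C; this is a minimal sketch rather than the exact
program that was used.)

#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t size = 4UL << 30;        /* 4 GB, as in the setup above */
        char *buf = malloc(size);

        if (!buf)
                return 1;
        memset(buf, 0x5a, size);        /* touch every page */
        free(buf);
        return 0;
}
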
> So I tried reproducing your test and I am not having much luck.
> According to the sysctl in the guest I am seeing
> "vm.guest-page-hinting = 1", which is supposed to indicate that the
> hinting is enabled in both QEMU and the guest, right?
That is correct. If your guest has the balloon driver enabled, it will
also enable the hinting.
> I'm just wanting
> to verify that this is the case before I start doing any debugging.
>
> I'm assuming you never really ran any multi-threaded tests on a
> multi-CPU guest did you?
This is correct. I forgot to mention this as another todo item for me in
the cover email.
I will test with multiple vCPUs once I finalize the design changes I am
working on right now.
Thanks for pointing this out.
>  With the patches applied I am seeing
> stability issues. If I enable a VM with multiple CPUs and run
> something like the page_fault1 test from the will-it-scale suite I am
> seeing multiple traces being generated by the guest kernel and it
> ultimately just hangs.
Once I am done with the changes I am currently working on, I will
look into this as well.
>
> I have included the traces below. There end up being 3 specific
> issues: a double free that is detected, an RCU stall, and then it starts
> complaining about a soft lockup.
>
> Thanks.
>
> - Alex
>
> -- This looks like a 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-22 Thread Alexander Duyck
On Mon, Feb 4, 2019 at 1:47 PM Nitesh Narayan Lal  wrote:
>
> The following patch-set proposes an efficient mechanism for handing freed 
> memory between the guest and the host. It enables the guests with no page 
> cache to rapidly free and reclaim memory to and from the host, respectively.
>
> Benefit:
> With this patch-series, in our test-case, executed on a single system and 
> single NUMA node with 15GB memory, we were able to successfully launch 
> at least 5 guests
> when page hinting was enabled and 3 without it. (Detailed explanation of the 
> test procedure is provided at the bottom).
>
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and 
> scan the pages freed by the guest has been changed. The new approach is 
> briefly described below:
>
> The patch-set still leverages the existing arch_free_page() to add this 
> functionality. It maintains a per CPU array which is used to store the pages 
> freed by the guest. The maximum number of entries which it can hold is 
> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it is 
> scanned and only the pages which are available in the buddy are stored. This 
> process continues until the array is filled with pages which are part of the 
> buddy free list, after which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation 
> and if the page is not reallocated and present in the buddy, the kernel 
> thread attempts to isolate it from the buddy. If it is successfully isolated, 
> the page is added to another per-cpu array. Once the entire scanning process 
> is complete, all the isolated pages are reported to the host through an 
> existing virtio-balloon driver.
>
> Known Issues:
> * Fixed array size: The problem with having a fixed/hardcoded array 
> size arises when the size of the guest varies. For example when the guest 
> size increases and it starts making large allocations, the fixed size limits this 
> solution's ability to capture all the freed pages. This will result in less 
> guest free memory getting reported to the host.
>
> Known code re-work:
> * Plan to re-use Wei's work, which communicates the poison value to 
> the host.
> * The nomenclature used in virtio-balloon needs to be changed so 
> that the code can easily be distinguished from Wei's Free Page Hint code.
> * Sorting based on zonenum, to avoid repetitive zone locks for the 
> same zone.
>
> Other required work:
> * Run other benchmarks to evaluate the performance/impact of this 
> approach.
>
> Test case:
> Setup:
> Memory-15837 MB
> Guest Memory Size-5 GB
> Swap-Disabled
> Test Program-Simple program which allocates 4GB memory via malloc, touches it 
> via memset and exits.
> Use case-Number of guests that can be launched completely including the 
> successful execution of the test program.
> Procedure:
> The first guest is launched and once its console is up, the test allocation 
> program is executed with 4 GB memory request (Due to this the guest occupies 
> almost 4-5 GB of memory in the host in a system without page hinting). Once 
> this program exits, another guest is launched in the host and the 
> same process is followed. We continue launching the guests until a guest gets 
> killed due to a low-memory condition in the host.
>
> Result:
> Without Hinting-3 Guests
> With Hinting-5 to 7 Guests (Based on the amount of memory freed/captured).
>
> [1] https://www.spinics.net/lists/kvm/msg170113.html

So I tried reproducing your test and I am not having much luck.
According to the sysctl in the guest I am seeing
"vm.guest-page-hinting = 1", which is supposed to indicate that the
hinting is enabled in both QEMU and the guest, right? I'm just wanting
to verify that this is the case before I start doing any debugging.

I'm assuming you never really ran any multi-threaded tests on a
multi-CPU guest did you? With the patches applied I am seeing
stability issues. If I enable a VM with multiple CPUs and run
something like the page_fault1 test from the will-it-scale suite I am
seeing multiple traces being generated by the guest kernel and it
ultimately just hangs.

I have included the traces below. There end up being 3 specific
issues: a double free that is detected, an RCU stall, and then it starts
complaining about a soft lockup.

Thanks.

- Alex

-- This looks like a page complaining about a double add when added to
the LRU --
[   50.479635] list_add double add: new=f6448008,
prev=a000fffd50c0, next=f6448008.
[   50.481066] [ cut here ]
[   50.481753] kernel BUG at lib/list_debug.c:31!
[   50.482448] invalid opcode:  [#1] SMP PTI
[   50.483108] CPU: 1 PID: 852 Comm: hinting/1 Not tainted
5.0.0-rc7-next-20190219-baseline+ #50
[   50.486362] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[   50.487881] RIP: 0010:__list_add_valid+0x4b/0x70

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand


>> I can't follow. We are talking about something as simple as a minimum
>> page granularity here that can easily be configured. Nothing that
>> screams for different implementations. But I get your point, we could
>> tune for different architectures.
> 
> I was thinking about the guest side of things. Basically if we need to
> define different page orders for different architectures then we start
> needing to do architecture specific includes. Then if we throw in
> stuff like the fact that the first level of KVM can make use of the
> host style hints then that is another thing that will be a difference
> in the different architectures. I'm just worried this stuff is going
> to start adding up to a bunch of "#ifdef" cruft if we are trying to do
> this as a virtio driver.

I agree that something like that is to be avoided. As MST pointed out,
feature bits and config space are a nice way to solve that at least on
the virtio side.

> 
>>>
>
> As far as fragmentation my thought is that we may want to look into
> adding support to the guest for prioritizing defragmentation on pages
> lower than THP size. Then that way we could maintain the higher
> overall performance with or without the hinting since shuffling lower
> order pages around between guests would start to get expensive pretty
> quick.

 My take would be, design an interface/mechanism that allows any kind of
 granularity. You can then balance between CPU overhead and space shifting.
>>>
>>> The problem with using "any kind of granularity" is that in the case
>>> of memory we are already having problems with 4K pages being deemed
>>> too small of a granularity to be useful for anything and making
>>> operations too expensive.
>>
>> No, sorry, s390x does it. And via batch reporting it could work. Not
>> saying we should do page granularity, but "to be useful for anything" is
>> just wrong.
> 
> Yeah, I was engaging in a bit of hyperbole. I have had a headache this
> morning so I am a bit cranky.

No worries, I am very happy about this discussion. :)

> 
> So I am assuming the batching is the reason why you also have a
> arch_alloc_page then for the s390 so that you can abort the hint if a
> page is reallocated before the hint is processed then? I just want to
> confirm so that my understanding of this is correct.

s390x is very special as it actually communicates the page state to the
hypervisor via page table bits in the guest->host mapping. The reporting
is then done batched (and synchronous) via a list of PFNs. But via the
special page table bits (along with the ESSA instruction), requests for
single pages can be canceled any time. So allocation paths are *only*
blocked by a special page lock also part of the page table bits in the
guest->host mapping.

I guess now you are completely confused. The main point is: actual
reporting to _perform_ the free is synchronous + batched. Canceling any
time possible via special synchronization mechanism per page. (that's
why pages are not taken out of the buddy, because canceling is easy)

> 
> If that is the case I would be much happier with an asynchronous page
> hint setup as this doesn't deprive the guest of memory while waiting
> on the hint. The current logic in the patches from Nitesh has the
> pages unavailable to the guest while waiting on the hint and that has
> me somewhat concerned as it is going to hurt cache locality as it will
> guarantee that we cannot reuse the same page if we are doing a cycle
> of alloc and free for the same page size.

My view on things on the current plan and your question:

1. We queue up *potential* hints on kfree per VCPU. We might only queue
after merging in the buddy and if we exceed a certain page order.

2. When that list is full, we *try to* take these pages out of the
buddy. We then trigger asynchronous reporting for the ones where we
succeeded.

3. Once reporting returns, we put back the pages to the buddy.

Between 1 and 2, any page can be reallocated. 2. will simply ignore the
page if it is no longer in the buddy. Pages are removed from the buddy
only between 2 and 3. So hopefully a very short time.
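
(A rough sketch of steps 1-3 above, to make the flow concrete; every
structure and helper name is hypothetical, and locking/error handling is
omitted.)

#define HINT_QUEUE_SIZE         256     /* "magic number" of queued hints */
#define HINT_MIN_ORDER          4       /* minimum page order worth hinting */

struct hint_queue {
        struct page *pages[HINT_QUEUE_SIZE];    /* step 1: potential hints */
        struct page *isolated[HINT_QUEUE_SIZE]; /* step 2: taken off the buddy */
        unsigned int nr;
};

static DEFINE_PER_CPU(struct hint_queue, hint_queues);

static void report_queued_hints(struct hint_queue *hq);

/* 1. On free (after buddy merging), queue a potential hint per VCPU. */
static void queue_potential_hint(struct page *page, unsigned int order)
{
        struct hint_queue *hq = this_cpu_ptr(&hint_queues);

        if (order < HINT_MIN_ORDER)
                return;
        hq->pages[hq->nr++] = page;
        if (hq->nr == HINT_QUEUE_SIZE)
                report_queued_hints(hq);
}

/*
 * 2. Try to take the queued pages out of the buddy and report the survivors.
 *    Pages reallocated since step 1 simply fail the isolation and are skipped.
 */
static void report_queued_hints(struct hint_queue *hq)
{
        unsigned int i, nr_isolated = 0;

        for (i = 0; i < hq->nr; i++)
                if (try_isolate_from_buddy(hq->pages[i]))       /* hypothetical */
                        hq->isolated[nr_isolated++] = hq->pages[i];

        hq->nr = 0;
        /* Asynchronous report, e.g. via a virtio-balloon virtqueue. */
        report_hints_to_host(hq->isolated, nr_isolated);        /* hypothetical */
}

/* 3. Completion callback: the host is done, return the pages to the buddy. */
static void hint_reporting_done(struct hint_queue *hq, unsigned int nr_isolated)
{
        unsigned int i;

        for (i = 0; i < nr_isolated; i++)
                return_page_to_buddy(hq->isolated[i]);          /* hypothetical */
}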

Especially, the queuing might mitigate cache locality issues. And there
are the pcpu lists that - as far as I remember - won't be touched as
these pages are not "official buddy pages" (yet). But this is an
interesting point to keep in mind.

> 
>>>
>>> I'm open to using other page orders for other architectures. Nothing
>>> says we have to stick with THP sized pages for all architectures. I
>>> have just been focused on x86 and this seems like the best fit for the
>>> balance between CPU and freeing of memory for now on that
>>> architecture.
>>>
 I feel like repeating myself, but on s390x hinting is done on page
 granularity, and I have never heard somebody say "how can I turn it off,
 this is slowing down my system too much.". All we know is that one
 hypercall per free is most probably not acceptable. We 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Michael S. Tsirkin
On Tue, Feb 19, 2019 at 01:57:14PM -0800, Alexander Duyck wrote:
> On Tue, Feb 19, 2019 at 10:32 AM David Hildenbrand  wrote:
> >
> > >>> This essentially just ends up being another trade-off of CPU versus
> > >>> memory though. Assuming we aren't using THP we are going to take a
> > >>> penalty in terms of performance but could then free individual pages
> > >>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> > >>> much higher in general even without the hinting. I figure for x86 we
> > >>> probably don't have too many options since if I am not mistaken
> > >>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
> > >>
> > >> THP is an implementation detail in the hypervisor. Yes, it is the common
> > >> case on x86. But it is e.g. not available on s390x yet. And we also want
> > >> this mechanism to work on s390x (e.g. for nested virtualization setups
> > >> as discussed).
> > >>
> > >> If we e.g. report any granularity after merging was done in the buddy,
> > >> we could end up reporting everything from page size up to MAX_SIZE - 1,
> > >> the hypervisor could ignore hints below a certain magic number, if it
> > >> makes its life easier.
> > >
> > > For each architecture we can do a separate implementation of what to
> > > hint on. We already do that for bare metal so why would we have guests
> > > do the same type of hinting in the virtualization case when there are
> > > fundamental differences in page size and features in each
> > > architecture?
> > >
> > > This is another reason why I think the hypercall approach is a better
> > > idea since each architecture is likely going to want to handle things
> > > differently and it would be a pain to try and sort that all out in a
> > > virtio driver.
> >
> > I can't follow. We are talking about something as simple as a minimum
> > page granularity here that can easily be configured. Nothing that
> > screams for different implementations. But I get your point, we could
> > tune for different architectures.
> 
> I was thinking about the guest side of things. Basically if we need to
> define different page orders for different architectures then we start
> needing to do architecture specific includes. Then if we throw in
> stuff like the fact that the first level of KVM can make use of the
> host style hints then that is another thing that will be a difference
> in the different architectures.

Sorry didn't catch this one. What are host style hints?

> I'm just worried this stuff is going
> to start adding up to a bunch of "#ifdef" cruft if we are trying to do
> this as a virtio driver.

I agree we want to avoid that.

And by comparison, if it's up to host or if it's tied to logic within
guest (such as MAX_PAGE_ORDER as suggested by Linus) as opposed to CPU
architecture, then virtio is easier as you can re-use config space and
feature bits to negotiate host/guest capabilities. Doing hypercalls for
that would add lots of hypercalls.

I CC'd Wei Wang who implemented host-driven hints in the balloon right
now. Wei I wonder - could you try changing from MAX_PAGE_ORDER to
HUGETLB_PAGE_ORDER? Does this affect performance for you at all? Thanks!


> > >
> > >>>
> > >>> As far as fragmentation my thought is that we may want to look into
> > >>> adding support to the guest for prioritizing defragmentation on pages
> > >>> lower than THP size. Then that way we could maintain the higher
> > >>> overall performance with or without the hinting since shuffling lower
> > >>> order pages around between guests would start to get expensive pretty
> > >>> quick.
> > >>
> > >> My take would be, design an interface/mechanism that allows any kind of
> > >> granularity. You can then balance between CPU overhead and space shifting.
> > >
> > > The problem with using "any kind of granularity" is that in the case
> > > of memory we are already having problems with 4K pages being deemed
> > > too small of a granularity to be useful for anything and making
> > > operations too expensive.
> >
> > No, sorry, s390x does it. And via batch reporting it could work. Not
> > saying we should do page granularity, but "to be useful for anything" is
> > just wrong.
> 
> Yeah, I was engaging in a bit of hyperbole. I have had a headache this
> morning so I am a bit cranky.
> 
> So I am assuming the batching is the reason why you also have a
> arch_alloc_page then for the s390 so that you can abort the hint if a
> page is reallocated before the hint is processed then? I just want to
> confirm so that my understanding of this is correct.
> 
> If that is the case I would be much happier with an asynchronous page
> hint setup as this doesn't deprive the guest of memory while waiting
> on the hint. The current logic in the patches from Nitesh has the
> pages unavailable to the guest while waiting on the hint and that has
> me somewhat concerned as it is going to hurt cache locality as it will
> guarantee that we cannot reuse the same page if we are doing a cycle
> of alloc 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Alexander Duyck
On Tue, Feb 19, 2019 at 10:32 AM David Hildenbrand  wrote:
>
> >>> This essentially just ends up being another trade-off of CPU versus
> >>> memory though. Assuming we aren't using THP we are going to take a
> >>> penalty in terms of performance but could then free individual pages
> >>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> >>> much higher in general even without the hinting. I figure for x86 we
> >>> probably don't have too many options since if I am not mistaken
> >>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
> >>
> >> THP is an implementation detail in the hypervisor. Yes, it is the common
> >> case on x86. But it is e.g. not available on s390x yet. And we also want
> >> this mechanism to work on s390x (e.g. for nested virtualization setups
> >> as discussed).
> >>
> >> If we e.g. report any granularity after merging was done in the buddy,
> >> we could end up reporting everything from page size up to MAX_SIZE - 1,
> >> the hypervisor could ignore hints below a certain magic number, if it
> >> makes its life easier.
> >
> > For each architecture we can do a separate implementation of what to
> > hint on. We already do that for bare metal so why would we have guests
> > do the same type of hinting in the virtualization case when there are
> > fundamental differences in page size and features in each
> > architecture?
> >
> > This is another reason why I think the hypercall approach is a better
> > idea since each architecture is likely going to want to handle things
> > differently and it would be a pain to try and sort that all out in a
> > virtio driver.
>
> I can't follow. We are talking about something as simple as a minimum
> page granularity here that can easily be configured. Nothing that
> screams for different implementations. But I get your point, we could
> tune for different architectures.

I was thinking about the guest side of things. Basically if we need to
define different page orders for different architectures then we start
needing to do architecture specific includes. Then if we throw in
stuff like the fact that the first level of KVM can make use of the
host style hints then that is another thing that will be a difference
in the different architectures. I'm just worried this stuff is going
to start adding up to a bunch of "#ifdef" cruft if we are trying to do
this as a virtio driver.

> >
> >>>
> >>> As far as fragmentation my thought is that we may want to look into
> >>> adding support to the guest for prioritizing defragmentation on pages
> >>> lower than THP size. Then that way we could maintain the higher
> >>> overall performance with or without the hinting since shuffling lower
> >>> order pages around between guests would start to get expensive pretty
> >>> quick.
> >>
> >> My take would be, design an interface/mechanism that allows any kind of
> >> granularity. You can then balance between CPU overhead and space shifting.
> >
> > The problem with using "any kind of granularity" is that in the case
> > of memory we are already having problems with 4K pages being deemed
> > too small of a granularity to be useful for anything and making
> > operations too expensive.
>
> No, sorry, s390x does it. And via batch reporting it could work. Not
> saying we should do page granularity, but "to be useful for anything" is
> just wrong.

Yeah, I was engaging in a bit of hyperbole. I have had a headache this
morning so I am a bit cranky.

So I am assuming the batching is the reason why you also have a
arch_alloc_page then for the s390 so that you can abort the hint if a
page is reallocated before the hint is processed then? I just want to
confirm so that my understanding of this is correct.

If that is the case I would be much happier with an asynchronous page
hint setup as this doesn't deprive the guest of memory while waiting
on the hint. The current logic in the patches from Nitesh has the
pages unavailable to the guest while waiting on the hint and that has
me somewhat concerned as it is going to hurt cache locality as it will
guarantee that we cannot reuse the same page if we are doing a cycle
of alloc and free for the same page size.

> >
> > I'm open to using other page orders for other architectures. Nothing
> > says we have to stick with THP sized pages for all architectures. I
> > have just been focused on x86 and this seems like the best fit for the
> > balance between CPU and freeing of memory for now on that
> > architecture.
> >
> >> I feel like repeating myself, but on s390x hinting is done on page
> >> granularity, and I have never heard somebody say "how can I turn it off,
> >> this is slowing down my system too much.". All we know is that one
> >> hypercall per free is most probably not acceptable. We really have to
> >> play with the numbers.
> >
> > My thought was we could look at doing different implementations for
> > other architectures such as s390 and powerPC. Odds are the
> > implementations would be similar but 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Michael S. Tsirkin
On Tue, Feb 19, 2019 at 09:21:20PM +0100, David Hildenbrand wrote:
> On 19.02.19 21:17, Michael S. Tsirkin wrote:
> > On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
> >> On 19.02.19 20:58, Michael S. Tsirkin wrote:
> >>> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> > I tend to like an asynchronous reporting approach as discussed in this
> > thread, we would have to see if Nitesh could get it implemented.
> 
>  I agree it would be great if it could work. However I have concerns
>  given that work on this patch set dates back to 2017, major issues
>  such as working around device assignment have yet to be addressed,
> >>>
> >>> BTW for device assignment to work, your idea of sending
> >>> data directly to kvm won't work, will it?
> >>> You need to update userspace so it can update VFIO right?
> >>> Another blocker for assignment is the ability to make holes
> >>> in an existing mapping - supported by hardware but
> >>> not by IOMMU drivers.
> >>
> >> I had the exact same thought and then realized that we decided to block
> >> the balloon in user space until we figured out how to handle this properly.
> >>
> >> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
> >> finding pinned pages, but I doubt it. Most probably we'll have to
> >> disable hinting for device assignments as well.
> > 
> > OK but let's recognize it as a bug not a feature.
> > 
> 
> Yes, btw interesting read: https://lwn.net/Articles/198380/

There's also slideware from Rik van Riel circa 2011. His idea
was tagging pages in guest memory, freeing a page involves
- drop the EPT PTE
- check page is still free
- unpin page

This way you can hint without exits at all if you like.
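
(A purely illustrative host-side sketch of that sequence; struct guest_page
and the tag/EPT/pin helpers are all hypothetical.)

/* Reclaim a guest-tagged free page without triggering a guest exit. */
static bool try_reclaim_tagged_page(struct guest_page *gp)
{
        /* 1. Drop the EPT PTE so any guest access from now on will fault. */
        drop_ept_pte(gp);

        /*
         * 2. Re-check the guest's "free" tag after the unmap; if the guest
         *    reallocated the page in the meantime, restore the mapping.
         */
        if (!page_still_tagged_free(gp)) {
                restore_ept_pte(gp);
                return false;
        }

        /* 3. The page is really free: unpin it so the host can reuse it. */
        unpin_guest_page(gp);
        return true;
}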

> 
> "Pages which have been locked into memory pose an extra challenge here -
> they can be part of the page cache, but they still shouldn't be taken
> away by the host system. So such pages cannot be marked as "volatile."
> The problem is that figuring out if a page is locked is harder than it
> might seem; it can involve scanning a list of virtual memory area (VMA)
> structures, which is slow. So the hinting patches add a new flag to the
> address_space structure to note that somebody has locked pages from that
> address space in memory."
> 
> I assume locked here actually means pinned.

Locked seems to mean mlock there.

This seems to also resemble Xen's tmem a bit.


> -- 
> 
> Thanks,
> 
> David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand
On 19.02.19 21:17, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
>> On 19.02.19 20:58, Michael S. Tsirkin wrote:
>>> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> I tend to like an asynchronous reporting approach as discussed in this
> thread, we would have to see if Nitesh could get it implemented.

 I agree it would be great if it could work. However I have concerns
 given that work on this patch set dates back to 2017, major issues
 such as working around device assignment have yet to be addressed,
>>>
>>> BTW for device assignment to work, your idea of sending
>>> data directly to kvm won't work, will it?
>>> You need to update userspace so it can update VFIO right?
>>> Another blocker for assignment is the ability to make holes
>>> in an existing mapping - supported by hardware but
>>> not by IOMMU drivers.
>>
>> I had the exact same thought and then realized that we decided to block
>> the balloon in user space until we figured out how to handle this properly.
>>
>> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
>> finding pinned pages, but I doubt it. Most probably we'll have to
>> disable hinting for device assignments as well.
> 
> OK but let's recognize it as a bug not a feature.
> 

Yes, btw interesting read: https://lwn.net/Articles/198380/

"Pages which have been locked into memory pose an extra challenge here -
they can be part of the page cache, but they still shouldn't be taken
away by the host system. So such pages cannot be marked as "volatile."
The problem is that figuring out if a page is locked is harder than it
might seem; it can involve scanning a list of virtual memory area (VMA)
structures, which is slow. So the hinting patches add a new flag to the
address_space structure to note that somebody has locked pages from that
address space in memory."

I assume locked here actually means pinned.

-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Michael S. Tsirkin
On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
> On 19.02.19 20:58, Michael S. Tsirkin wrote:
> > On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> >>> I tend to like an asynchronous reporting approach as discussed in this
> >>> thread, we would have to see if Nitesh could get it implemented.
> >>
> >> I agree it would be great if it could work. However I have concerns
> >> given that work on this patch set dates back to 2017, major issues
> >> such as working around device assignment have yet to be addressed,
> > 
> > BTW for device assignment to work, your idea of sending
> > data directly to kvm won't work, will it?
> > You need to update userspace so it can update VFIO right?
> > Another blocker for assignment is the ability to make holes
> > in an existing mapping - supported by hardware but
> > not by IOMMU drivers.
> 
> I had the exact same thought and then realized that we decided to block
> the balloon in user space until we figured out how to handle this properly.
> 
> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
> finding pinned pages, but I doubt it. Most probably we'll have to
> disable hinting for device assignments as well.

OK but let's recognize it as a bug not a feature.

-- 
MST


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand
On 19.02.19 20:58, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
>>> I tend to like an asynchronous reporting approach as discussed in this
>>> thread, we would have to see if Nitesh could get it implemented.
>>
>> I agree it would be great if it could work. However I have concerns
>> given that work on this patch set dates back to 2017, major issues
>> such as working around device assignment have yet to be addressed,
> 
> BTW for device assignment to work, your idea of sending
> data directly to kvm won't work, will it?
> You need to update userspace so it can update VFIO right?
> Another blocker for assignment is the ability to make holes
> in an existing mapping - supported by hardware but
> not by IOMMU drivers.

I had the exact same thought and then realized that we decided to block
the balloon in user space until we figured out how to handle this properly.

I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
finding pinned pages, but I doubt it. Most probably we'll have to
disable hinting for device assignments as well.

> 
> All the issues are shared with balloon btw, so that
> could be another reason to use the balloon.

Yes, smells like it.

-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Michael S. Tsirkin
On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> > I tend to like an asynchronous reporting approach as discussed in this
> > thread, we would have to see if Nitesh could get it implemented.
> 
> I agree it would be great if it could work. However I have concerns
> given that work on this patch set dates back to 2017, major issues
> such as working around device assignment have yet to be addressed,

BTW for device assignment to work, your idea of sending
data directly to kvm won't work, will it?
You need to update userspace so it can update VFIO right?
Another blocker for assignment is the ability to make holes
in an existing mapping - supported by hardware but
not by IOMMU drivers.

All the issues are shared with balloon btw, so that
could be another reason to use the balloon.

> and
> it seems like most of the effort is being focused on things that in my
> opinion are being over-engineered for little to no benefit.
> 
> I really think that simpler would be much better in terms of design in
> this case.
> 
> Thanks.
> 
> - Alex

I personally learned a lot from Wei's work on hinting.


All this doesn't mean your approach is not the right one.

-- 
MST


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand
>>> This essentially just ends up being another trade-off of CPU versus
>>> memory though. Assuming we aren't using THP we are going to take a
>>> penalty in terms of performance but could then free individual pages
>>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
>>> much higher in general even without the hinting. I figure for x86 we
>>> probably don't have too many options since if I am not mistaken
>>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
>>
>> THP is an implementation detail in the hypervisor. Yes, it is the common
>> case on x86. But it is e.g. not available on s390x yet. And we also want
>> this mechanism to work on s390x (e.g. for nested virtualization setups
>> as discussed).
>>
>> If we e.g. report any granularity after merging was done in the buddy,
>> we could end up reporting everything from page size up to MAX_SIZE - 1,
>> the hypervisor could ignore hints below a certain magic number, if it
>> makes its life easier.
> 
> For each architecture we can do a separate implementation of what to
> hint on. We already do that for bare metal so why would we have guests
> do the same type of hinting in the virtualization case when there are
> fundamental differences in page size and features in each
> architecture?
> 
> This is another reason why I think the hypercall approach is a better
> idea since each architecture is likely going to want to handle things
> differently and it would be a pain to try and sort that all out in a
> virtio driver.

I can't follow. We are talking about something as simple as a minimum
page granularity here that can easily be configured. Nothing that
screams for different implementations. But I get your point, we could
tune for different architectures.

> 
>>>
>>> As far as fragmentation my thought is that we may want to look into
>>> adding support to the guest for prioritizing defragmentation on pages
>>> lower than THP size. Then that way we could maintain the higher
>>> overall performance with or without the hinting since shuffling lower
>>> order pages around between guests would start to get expensive pretty
>>> quick.
>>
>> My take would be, design an interface/mechanism that allows any kind of
>> granularity. You can then balance between CPU overhead and space shifting.
> 
> The problem with using "any kind of granularity" is that in the case
> of memory we are already having problems with 4K pages being deemed
> too small of a granularity to be useful for anything and making
> operations too expensive.

No, sorry, s390x does it. And via batch reporting it could work. Not
saying we should do page granularity, but "to be useful for anything" is
just wrong.

> 
> I'm open to using other page orders for other architectures. Nothing
> says we have to stick with THP sized pages for all architectures. I
> have just been focused on x86 and this seems like the best fit for the
> balance between CPU and freeing of memory for now on that
> architecture.
> 
>> I feel like repeating myself, but on s390x hinting is done on page
>> granularity, and I have never heard somebody say "how can I turn it off,
>> this is slowing down my system too much.". All we know is that one
>> hypercall per free is most probably not acceptable. We really have to
>> play with the numbers.
> 
> My thought was we could look at doing different implementations for
> other architectures such as s390 and powerPC. Odds are the
> implementations would be similar but have slight differences where
> appropriate such as what order we should start hinting on, or if we
> bypass the hypercall/virtio-balloon for a host native approach if
> available.
> 
>> I tend to like an asynchronous reporting approach as discussed in this
>> thread, we would have to see if Nitesh could get it implemented.
> 
> I agree it would be great if it could work. However I have concerns
> given that work on this patch set dates back to 2017, major issues
> such as working around device assignment have yet to be addressed, and
> it seems like most of the effort is being focused on things that in my
> opinion are being over-engineered for little to no benefit.

I can understand that you are trying to push your solution. I would do
the same. Again, I don't like a pure synchronous approach that works on
one-element-at-a-time. Period. Other people might have other opinions.
This is mine - luckily I don't have anything to say here :)

MST also voted for an asynchronous solution if we can make it work.
Nitesh has made significant improvements since 2017. Complicated stuff
needs time. No need to rush. People have been talking about free page
hinting since 2006. I talked to various people who experimented with
bitmap-based solutions two years ago.

So much to that, if you think your solution is the way to go, please
follow up on it. Nitesh seems to have decided to look into the
asynchronous approach you also called "great if it could work". As long
as we don't run into elementary blockers there, 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Alexander Duyck
On Mon, Feb 18, 2019 at 11:55 PM David Hildenbrand  wrote:
>
> On 19.02.19 01:01, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand  wrote:
> >>
> >> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> >>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>  On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> >> So I'm fine with a simple implementation but the interface needs to
> >> allow the hypervisor to process hints in parallel while guest is
> >> running.  We can then fix any issues on hypervisor without breaking
> >> guests.
> > Yes, I am fine with defining an interface that theoretically lets us
> > change the implementation in the guest later.
> > I consider this even a
> > prerequisite. IMHO the interface shouldn't be different, it will be
> > exactly the same.
> >
> > It is just "who" calls the batch freeing and waits for it. And as I
> > outlined here, doing it without additional threads at least avoids us
> > for now having to think about dynamic data structures and that we can
> > sometimes not report "because the thread is still busy reporting or
> > wasn't scheduled yet".
>  Sorry I wasn't clear. I think we need ability to change the
>  implementation in the *host* later. IOW don't rely on
>  host being synchronous.
> 
> 
> >>> I actually misread it :) . In any way, there has to be a mechanism to
> >>> synchronize.
> >>>
> >>> If we are going via a bare hypercall (like s390x, like what Alexander
> >>> proposes), it is going to be a synchronous interface either way. Just a
> >>> bare hypercall, there will not really be any blocking on the guest side.
> >> It bothers me that we are now tied to interface being synchronous. We
> >> won't be able to fix it if there's an issue as that would break guests.
> > I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> >
> > Yes, as I initially wrote, this mimics s390x. That might be good (we
> > know it has been working for years) and bad (we are inheriting the same
> > problem class, if it exists). And being synchronous is part of the
> > approach for now.
>  BTW on s390 are these hypercalls handled by Linux?
> 
> > I tend to focus on the first part (we don't know anything besides it is
> > working) while you focus on the second part (there could be a potential
> > problem). Having a real problem at hand would be great, then we would
> > know what exactly we actually have to fix. But read below.
>  If we end up doing a hypercall per THP, maybe we could at least
>  not block with interrupts disabled? Poll in guest until
>  hypervisor reports its done?  That would already be an
>  improvement IMHO. E.g. perf within guest will point you
>  in the right direction and towards disabling hinting.
> 
> 
> >>> Via virtio, I guess it is waiting for a response to a request, right?
> >> For the buffer to be used, yes. And it could mean putting some pages
> >> aside until hypervisor is done with them. Then you don't need timers or
> >> tricks like this, you can get an interrupt and start using the memory.
> > I am very open to such an approach as long as we can make it work and it
> > is not too complicated. (-> simple)
> >
> > This would mean for example
> >
> > 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> > 256/512.
> >
> > 2. Once the buffer is full, do crazy "take pages out of the balloon
> > action" and report them to the hypervisor via virtio. Let the VCPU
> > continue. This will require some memory to store the request. Small
> > hiccup for the VCPU to kick off the reporting to the hypervisor.
> >
> > 3. On interrupt/response, go over the response and put the pages back to
> > the buddy.
> >
> > (assuming that reporting a bulk of frees is better than reporting every
> > single free obviously)
> >
> > This could allow nice things like "when OOM gets triggered, see if pages
> > are currently being reported and wait until they have been put back to
> > the buddy, return "new pages available", so in a real "low on memory"
> > scenario, no OOM killer would get involved. This could address the issue
> > Wei had with reporting when low on memory.
> >
> > Is that something you have in mind?
>  Yes that seems more future proof I think.
> 
> > I assume we would have to allocate
> > memory when crafting the new requests. This is the only reason I tend to
> > prefer a synchronous interface for now. But if allocation is not a
> > problem, great.
>  There are two main ways to avoid allocation:
>  1. 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Alexander Duyck
On Mon, Feb 18, 2019 at 6:46 PM Andrea Arcangeli  wrote:
>
> Hello,
>
> On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> > essentially fragmented them. I guess khugepaged went through and
> > started trying to reassemble the huge pages and as a result there have
> > been apps that ended up consuming more memory than they would have
> > otherwise since they were using fragments of THP pages after doing an
> > MADV_DONTNEED on sections of the page.
>
> With relatively recent kernels MADV_DONTNEED doesn't necessarily free
> anything when it's applied to a THP subpage, it only splits the
> pagetables and queues the THP for deferred splitting. If there's
> memory pressure a shrinker will be invoked and the queue is scanned
> and the THPs are physically split, but to be reassembled/collapsed
> after a physical split it requires at least one young pte.
>
> If this is particularly problematic for page hinting, this behavior
> where the MADV_DONTNEED can be undoed by khugepaged (if some subpage is
> being frequently accessed), can be turned off by setting
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
> 0. Then the THP will only be collapsed if all 512 subpages are mapped
> (i.e. they've all be re-allocated by the guest).
>
> Regardless of the max_ptes_none default, keeping the smaller guest
> buddy orders as the last target for page hinting should be good for
> performance.

Okay, this is good to know. Thanks.

> > Yeah, no problem. The only thing I don't like about MADV_FREE is that
> > you have to have memory pressure before the pages really start getting
> > scrubbed, which is both a benefit and a drawback. Basically it defers
> > the freeing until you are under actual memory pressure so when you hit
> > that case things start feeling much slower, that and it limits your
> > allocations since the kernel doesn't recognize the pages as free until
> > it would have to start trying to push memory to swap.
>
> The guest allocation behavior should not be influenced by MADV_FREE vs
> MADV_DONTNEED, the guest can't see the difference anyway, so why
> should it limit the allocations?

Actually I was talking about the host. So if I have a guest that is
using MADV_FREE what I have to do is create an allocation that would
force us to have to access swap and that in turn ends up triggering
the freeing of the pages that were moved to the "Inactive(file)" list
by the MADV_FREE call.

The only reason it came up is that one of my test systems had a small
swap so I ended up having to do multiple allocations and frees in swap
sized increments to free up memory from a large guest that wasn't in
use.

> The benefit of MADV_FREE should be that when the same guest frees and
> reallocates a huge amount of RAM (i.e. guest app allocating and
> freeing lots of RAM in a loop, not so uncommon), there will be no KVM
> page fault during guest re-allocations. So in absence of memory
> pressure in the host it should be a major win. Overall it sounds like
> a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
> notifiers and forces host allocations and KVM page faults in order to
> reallocate the same RAM in the same guest.

Right, and I do see that behavior.

> When there's memory pressure it's up to the host Linux VM to notice
> there's plenty of MADV_FREE material to free at zero I/O cost before
> starting swapping.

Right.
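
(For readers comparing the two hints: both are plain madvise(2) calls issued
by the hypervisor on its mapping of guest memory, differing only in the
advice value; addr/len below are placeholders.)

#include <sys/mman.h>

static void hint_free_range(void *addr, size_t len, int lazy)
{
        if (lazy)
                /* MADV_FREE: reclaimed only under host memory pressure, so
                 * a guest re-allocation normally avoids a KVM page fault. */
                madvise(addr, len, MADV_FREE);
        else
                /* MADV_DONTNEED: dropped immediately; a guest re-allocation
                 * faults and gets a freshly zeroed page from the host. */
                madvise(addr, len, MADV_DONTNEED);
}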


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand

 Also one reason why I am not a fan of working with anything less than
 PMD order is because there have been issues in the past with false
 memory leaks being created when hints were provided on THP pages that
 essentially fragmented them. I guess khugepaged went through and
 started trying to reassemble the huge pages and as a result there have
 been apps that ended up consuming more memory than they would have
 otherwise since they were using fragments of THP pages after doing an
 MADV_DONTNEED on sections of the page.
>>>
>>> I understand your concerns, but we should not let bugs in the hypervisor
>>> dictate the design. Bugs are there to be fixed. Interesting read,
>>> though, thanks!
>>
>> Right but if we break up a huge page we are then creating
>> more work for hypervisor to reassemble it.
> 
> Yes, but the hypervisor can decide what to do. E.g. on s390x there are
> no THP, so nothing to break up.

To clarify as that might be confusing: No THP in a KVM guest mapping for
now.

> 
> It is all very complicated :)
> 


-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand
On 19.02.19 15:40, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
>> On 19.02.19 00:47, Alexander Duyck wrote:
>>> On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand  wrote:

 On 18.02.19 18:31, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand  
> wrote:
>>
>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
 It would be worth a try. My feeling is that a synchronous report after
 e.g. 512 frees should be acceptable, as it seems to be acceptable on
 s390x. (basically always enabled, nobody complains).
>>>
>>> What slips under the radar on an arch like s390 might
>>> raise issues for a popular arch like x86. My fear would be
>>> if it's only a problem e.g. for realtime. Then you get
>>> a condition that's very hard to trigger and affects
>>> worst case latencies.
>>
>> Realtime should never use free page hinting. Just like it should never
>> use ballooning. Just like it should pin all pages in the hypervisor.
>>
>>>
>>> But really what business has something that is supposedly
>>> an optimization blocking a VCPU? We are just freeing up
>>> lots of memory why is it a good idea to slow that
>>> process down?
>>
>> I first want to know that it is a problem before we declare it a
>> problem. I provided an example (s390x) where it does not seem to be a
>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>
>> Not trying to deny that it could be a problem on x86, but then I assume
>> it is only a problem in specific setups.
>>
>> I would much rather prefer a simple solution that can eventually be
>> disabled in selected setup than a complicated solution that tries to fit
>> all possible setups. Realtime is one of the examples where such stuff is
>> to be disabled either way.
>>
>> Optimization of space comes with a price (here: execution time).
>
> One thing to keep in mind though is that if you are already having to
> pull pages in and out of swap on the host in order to be able to provide
> enough memory for the guests, the free page hinting should be a
> significant win in terms of performance.

 Indeed. And also we are in a virtualized environment already, we can
 have any kind of sudden hiccups. (again, realtime has special
 requirements on the setup)

 Side note: I like your approach because it is simple. I don't like your
 approach because it cannot deal with fragmented memory. And that can
 happen easily.

 The idea I described here can similarly be an extension of your
 approach, merging in a "batched reporting" Nitesh proposed, so we can
 report on something < MAX_ORDER, similar to s390x. In the end it boils
 down to reporting via hypercall vs. reporting via virtio. The main point
 is that it is synchronous and batched. (and that we properly take care
 of the race between host freeing and guest allocation)
>>>
>>> I'd say the discussion is even simpler then that. My concern is more
>>> synchronous versus asynchronous. I honestly think the cost for a
>>> synchronous call is being overblown and we are likely to see the fault
>>> and zeroing of pages cost more than the hypercall or virtio
>>> transaction itself.
>>
>> The overhead of page faults and zeroing should be mitigated by
>> MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
>> (context switch) becomes relevant.
>>
>> We have various discussions now :) And I think they are related.
>>
>> synchronous versus asynchronous
>> batched vs. non-batched
>> MAX_ORDER - 1 vs. other/none magic number
>>
>> 1. synchronous call without batching on every kfree is bad. The
>> interface is fixed to big magic numbers, otherwise we end up having a
>> hypercall on every kfree. This is your approach.
>>
>> 2. asynchronous calls without batching would most probably have similar
>> problems with small granularities as we had when ballooning without
>> batching. Just overhead we can avoid.
>>
>> 3. synchronous and batched is what s390x does. It can deal with page
>> granularity. It is what I initially described in this sub-thread.
>>
>> 4. asynchronous and batched. This is the other approach we discussed
>> yesterday. If we can get it implemented, I would be interested in
>> performance numbers.
>>
>> As far as I understood, Michael seems to favor something like 4 (and I
>> assume eventually 2 if it is similarly fast). I am a friend of either 3
>> or 4.
> 
> Well Linus said big granularity is important for linux MM
> and not to bother with hinting small sizes.
> 

For some reason I tend to also challenge the opinions of people way
smarter than me ;) Only the numbers can tell the true story later.

> Alex said the cost of a hypercall is dwarfed by a page fault

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Michael S. Tsirkin
On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
> On 19.02.19 00:47, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand  wrote:
> >>
> >> On 18.02.19 18:31, Alexander Duyck wrote:
> >>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand  
> >>> wrote:
> 
>  On 18.02.19 17:49, Michael S. Tsirkin wrote:
> > On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >> It would be worth a try. My feeling is that a synchronous report after
> >> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >> s390x. (basically always enabled, nobody complains).
> >
> > What slips under the radar on an arch like s390 might
> > raise issues for a popular arch like x86. My fear would be
> > if it's only a problem e.g. for realtime. Then you get
> > a condition that's very hard to trigger and affects
> > worst case latencies.
> 
>  Realtime should never use free page hinting. Just like it should never
>  use ballooning. Just like it should pin all pages in the hypervisor.
> 
> >
> > But really what business has something that is supposedly
> > an optimization blocking a VCPU? We are just freeing up
> > lots of memory why is it a good idea to slow that
> > process down?
> 
>  I first want to know that it is a problem before we declare it a
>  problem. I provided an example (s390x) where it does not seem to be a
>  problem. One hypercall ~every 512 frees. As simple as it can get.
> 
>  Not trying to deny that it could be a problem on x86, but then I assume
>  it is only a problem in specific setups.
> 
>  I would much rather prefer a simple solution that can eventually be
>  disabled in selected setup than a complicated solution that tries to fit
>  all possible setups. Realtime is one of the examples where such stuff is
>  to be disabled either way.
> 
>  Optimization of space comes with a price (here: execution time).
> >>>
> >>> One thing to keep in mind though is that if you are already having to
> >>> pull pages in and out of swap on the host in order to be able to provide
> >>> enough memory for the guests, the free page hinting should be a
> >>> significant win in terms of performance.
> >>
> >> Indeed. And also we are in a virtualized environment already, we can
> >> have any kind of sudden hiccups. (again, realtime has special
> >> requirements on the setup)
> >>
> >> Side note: I like your approach because it is simple. I don't like your
> >> approach because it cannot deal with fragmented memory. And that can
> >> happen easily.
> >>
> >> The idea I described here can similarly be an extension of your
> >> approach, merging in a "batched reporting" Nitesh proposed, so we can
> >> report on something < MAX_ORDER, similar to s390x. In the end it boils
> >> down to reporting via hypercall vs. reporting via virtio. The main point
> >> is that it is synchronous and batched. (and that we properly take care
> >> of the race between host freeing and guest allocation)
> > 
> > I'd say the discussion is even simpler then that. My concern is more
> > synchronous versus asynchronous. I honestly think the cost for a
> > synchronous call is being overblown and we are likely to see the fault
> > and zeroing of pages cost more than the hypercall or virtio
> > transaction itself.
> 
> The overhead of page faults and zeroing should be mitigated by
> MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
> (context switch) becomes relevant.
> 
> We have various discussions now :) And I think they are related.
> 
> synchronous versus asynchronous
> batched vs. non-batched
> MAX_ORDER - 1 vs. other/none magic number
> 
> 1. synchronous call without batching on every kfree is bad. The
> interface is fixed to big magic numbers, otherwise we end up having a
> hypercall on every kfree. This is your approach.
> 
> 2. asynchronous calls without batching would most probably have similar
> problems with small granularities as we had when ballooning without
> batching. Just overhead we can avoid.
> 
> 3. synchronous and batched is what s390x does. It can deal with page
> granularity. It is what I initially described in this sub-thread.
> 
> 4. asynchronous and batched. This is the other approach we discussed
> yesterday. If we can get it implemented, I would be interested in
> performance numbers.
> 
> As far as I understood, Michael seems to favor something like 4 (and I
> assume eventually 2 if it is similarly fast). I am a friend of either 3
> or 4.

Well, Linus said big granularity is important for Linux MM
and not to bother with hinting small sizes.

Alex said the cost of a hypercall is dwarfed by a pagefault
after alloc. I would be curious whether async pagefault
can help things somehow though.

> > 
> > Also one reason why I am not a fan of working with anything less than
> > PMD order is because there have been 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand
On 19.02.19 15:17, Nitesh Narayan Lal wrote:
> On 2/19/19 8:03 AM, David Hildenbrand wrote:
>> There are two main ways to avoid allocation:
>> 1. do not add extra data on top of each chunk passed
> If I am not wrong then this is close to what we have right now.
 Yes, minus the kthread(s) and eventually with some sort of memory
 allocation for the request. Once you're asynchronous via a notification
 mechanism, there is no real need for a thread anymore, hopefully.
>>> Whether we should go with kthread or without it, I would like to do some
>>> performance comparison before commenting on this.
> One issue I see right now is that I am polling while host is freeing the
> memory.
> In the next version I could tie the logic which returns pages to the
> buddy and resets the per cpu array index value to 0 with the callback.
> (i.e., it happens once we receive a response from the host)
 The question is, what happens when freeing pages and the array is not
 ready to be reused yet. In that case, you want to somehow continue
 freeing pages without busy waiting or eventually not reporting pages.
>>> This is what happens right now.
>>> Having kthread or not should not affect this behavior.
>>> When the array is full the current approach simply skips collecting the
>>> free pages.
>> Well, it somehow does affect your implementation. If you have a kthread
>> you always have to synchronize against the VCPU: "Is the pcpu array
>> ready to be used again".
>>
>> Once you do it asynchronously from your VCPU without another thread
>> being involved, such synchronization is not required. Simply prepare a
>> request and send it off. Reuse the pcpu array instantly. At least that's
>> the theory :)
>>
>> If you have a guest bulk freeing a lot of memory, I guess temporarily
>> dropping free page hints could be counter-productive. It really depends
>> on how fast the thread gets scheduled and how long the hinting process
>> takes. Having another thread involved might add a lot of latency to
>> that formula. We'll have to measure, but my gut feeling is that once we
>> do stuff asynchronously, there is no need for a thread anymore.
> This is true.
>>
 The callback should put the pages back to the buddy and free the request
 eventually to have a fully asynchronous mechanism.

> Other change which I am testing right now is to only capture 'MAX_ORDER
 I am not sure if this is an arbitrary number we came up with here. We
 should really play with different orders to find a hot spot. I wouldn't
 consider this high priority, though. Getting the whole concept right to
 be able to deal with any magic number we come up with should be the ultimate
 goal. (stuff that only works with huge pages I consider not future
 proof, especially regarding fragmented guests which can happen easily)
>>> It's quite possible that when we are only capturing MAX_ORDER - 1 and run
>>> a specific workload we don't get any memory back until we re-run the
>>> program and the buddy finally starts merging pages of order MAX_ORDER - 1.
>>> This is why I think we may make this configurable from compile time and
>>> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
>> Eventually pages will never get merged. Assume you have 1 page of a
>> MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
>> kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb
>> wasted (depending on the arch). This stuff can sum up.
> 
> After the discussion, here are the changes on which I am planning to
> work next:
> 1. Get rid of the kthread and dynamically allocate a per-cpu array to
> hold the
> isolated pages. As soon as the initial per-cpu array is completely
> scanned, release it
> so that we don't end up blocking anything.
> 2. Continue capturing MAX_ORDER - 1, for now. Reduce the initial per-cpu
> array size to 256
> for now. As we are doing asynchronous reporting we should be fine with a
> lower size array.
> 3. As soon as the host responds, release the pages back to the buddy
> from the callback and free the request.
> 
> Benefits wrt current implementation:
> 1. We will not eat up performance due to a kernel thread.
> 2. We will still be doing reporting asynchronously => no blocking.
> 3. Hopefully, we will be able to free more memory.
> 

+1 to that approach. We can fine tune the numbers (array size, sizes to
report) easily later on. Let us know if you run into problems doing the
allocation for the request. If that is a blocker, all we left with is a
synchronous approach I guess. Let's cross fingers. :)

-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Nitesh Narayan Lal
On 2/19/19 8:03 AM, David Hildenbrand wrote:
> There are two main ways to avoid allocation:
> 1. do not add extra data on top of each chunk passed
 If I am not wrong then this is close to what we have right now.
>>> Yes, minus the kthread(s) and eventually with some sort of memory
>>> allocation for the request. Once you're asynchronous via a notification
>>> mechanism, there is no real need for a thread anymore, hopefully.
>> Whether we should go with kthread or without it, I would like to do some
>> performance comparison before commenting on this.
 One issue I see right now is that I am polling while host is freeing the
 memory.
 In the next version I could tie the logic which returns pages to the
 buddy and resets the per cpu array index value to 0 with the callback.
 (i.e., it happens once we receive a response from the host)
>>> The question is, what happens when freeing pages and the array is not
>>> ready to be reused yet. In that case, you want to somehow continue
>>> freeing pages without busy waiting or eventually not reporting pages.
>> This is what happens right now.
>> Having kthread or not should not affect this behavior.
>> When the array is full the current approach simply skips collecting the
>> free pages.
> Well, it somehow does affect your implementation. If you have a kthread
> you always have to synchronize against the VCPU: "Is the pcpu array
> ready to be used again".
>
> Once you do it asynchronously from your VCPU without another thread
> being involved, such synchronization is not required. Simply prepare a
> request and send it off. Reuse the pcpu array instantly. At least that's
> the theory :)
>
> If you have a guest bulk freeing a lot of memory, I guess temporarily
> dropping free page hints could be counter-productive. It really depends
> on how fast the thread gets scheduled and how long the hinting process
> takes. Having another thread involved might add a lot of latency to
> that formula. We'll have to measure, but my gut feeling is that once we
> do stuff asynchronously, there is no need for a thread anymore.
This is true.
>
>>> The callback should put the pages back to the buddy and free the request
>>> eventually to have a fully asynchronous mechanism.
>>>
 Other change which I am testing right now is to only capture 'MAX_ORDER
>>> I am not sure if this is an arbitrary number we came up with here. We
>>> should really play with different orders to find a hot spot. I wouldn't
>>> consider this high priority, though. Getting the whole concept right to
>>> be able to deal with any magic number we come up with should be the ultimate
>>> goal. (stuff that only works with huge pages I consider not future
>>> proof, especially regarding fragmented guests which can happen easily)
>> It's quite possible that when we are only capturing MAX_ORDER - 1 and run
>> a specific workload we don't get any memory back until we re-run the
>> program and the buddy finally starts merging pages of order MAX_ORDER - 1.
>> This is why I think we may make this configurable from compile time and
>> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
> Eventually pages will never get merged. Assume you have 1 page of a
> MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
> kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb
> wasted (depending on the arch). This stuff can sum up.

After the discussion, here are the changes on which I am planning to
work next:
1. Get rid of the kthread and dynamically allocate a per-cpu array to hold
the isolated pages. As soon as the initial per-cpu array is completely
scanned, release it so that we don't end up blocking anything.
2. Continue capturing MAX_ORDER - 1, for now. Reduce the initial per-cpu
array size to 256 for now. As we are doing asynchronous reporting we should
be fine with a lower size array.
3. As soon as the host responds, release the pages back to the buddy
from the callback and free the request.

Benefits wrt current implementation:
1. We will not eat up performance due to a kernel thread.
2. We will still be doing reporting asynchronously => no blocking.
3. Hopefully, we will be able to free more memory.
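
To make the planned flow concrete, here is a rough, untested sketch (all
names are made up for illustration; the real code would reuse the existing
virtio-balloon plumbing):

#define HINT_ARRAY_SIZE 256

struct hint_req {
        struct page *page[HINT_ARRAY_SIZE];     /* isolated MAX_ORDER - 1 pages */
        unsigned int nr;
};

/* VCPU context: hand the filled request to the host and keep running */
static void report_isolated_pages(struct hint_req *req)
{
        queue_hint_request(req);        /* placeholder: add to virtqueue + kick */
}

/* virtio callback: the host has processed (e.g. MADV_FREE'd) the pages */
static void hint_req_done(struct hint_req *req)
{
        unsigned int i;

        for (i = 0; i < req->nr; i++)
                return_page_to_buddy(req->page[i]);     /* placeholder */
        kfree(req);
}
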
-- 
Regards
Nitesh





Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand
 There are two main ways to avoid allocation:
 1. do not add extra data on top of each chunk passed
>>> If I am not wrong then this is close to what we have right now.
>> Yes, minus the kthread(s) and eventually with some sort of memory
>> allocation for the request. Once you're asynchronous via a notification
>> mechanism, there is no real need for a thread anymore, hopefully.
> Whether we should go with kthread or without it, I would like to do some
> performance comparison before commenting on this.
>>
>>> One issue I see right now is that I am polling while host is freeing the
>>> memory.
>>> In the next version I could tie the logic which returns pages to the
>>> buddy and resets the per cpu array index value to 0 with the callback.
>>> (i.e., it happens once we receive a response from the host)
>> The question is, what happens when freeing pages and the array is not
>> ready to be reused yet. In that case, you want to somehow continue
>> freeing pages without busy waiting or eventually not reporting pages.
> This is what happens right now.
> Having kthread or not should not affect this behavior.
> When the array is full the current approach simply skips collecting the
> free pages.

Well, it somehow does affect your implementation. If you have a kthread
you always have to synchronize against the VCPU: "Is the pcpu array
ready to be used again".

Once you do it asynchronously from your VCPU without another thread
being involved, such synchronization is not required. Simply prepare a
request and send it off. Reuse the pcpu array instantly. At least that's
the theory :)

If you have a guest bulk freeing a lot of memory, I guess temporarily
dropping free page hints could be counter-productive. It really depends
on how fast the thread gets scheduled and how long the hinting process
takes. Having another thread involved might add a lot of latency to
that formula. We'll have to measure, but my gut feeling is that once we
do stuff asynchronously, there is no need for a thread anymore.

>>
>> The callback should put the pages back to the buddy and free the request
>> eventually to have a fully asynchronous mechanism.
>>
>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>> I am not sure if this is an arbitrary number we came up with here. We
>> should really play with different orders to find a hot spot. I wouldn't
>> consider this high priority, though. Getting the whole concept right to
>> be able to deal with any magic number we come up with should be the ultimate
>> goal. (stuff that only works with huge pages I consider not future
>> proof, especially regarding fragmented guests which can happen easily)
> It's quite possible that when we are only capturing MAX_ORDER - 1 and run
> a specific workload we don't get any memory back until we re-run the
> program and the buddy finally starts merging pages of order MAX_ORDER - 1.
> This is why I think we may make this configurable from compile time and
> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.

Eventually pages will never get merged. Assume you have 1 page of a
MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
kmalloc). You skip reporting that chunk completely. Roughly 1MB/2MB/4MB
wasted (depending on the arch). This stuff can add up.
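
(For a concrete number: on x86-64 MAX_ORDER is 11, so a MAX_ORDER - 1 chunk
is 2^10 pages * 4 KiB = 4 MiB; a single pinned 4 KiB page inside it is
enough to keep the whole 4 MiB from ever being reported.)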

-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Nitesh Narayan Lal

On 2/18/19 9:46 PM, Andrea Arcangeli wrote:
> Hello,
>
> On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
>> essentially fragmented them. I guess hugepaged went through and
>> started trying to reassemble the huge pages and as a result there have
>> been apps that ended up consuming more memory than they would have
>> otherwise since they were using fragments of THP pages after doing an
>> MADV_DONTNEED on sections of the page.
> With relatively recent kernels MADV_DONTNEED doesn't necessarily free
> anything when it's applied to a THP subpage, it only splits the
> pagetables and queues the THP for deferred splitting. If there's
> memory pressure a shrinker will be invoked and the queue is scanned
> and the THPs are physically split, but to be reassembled/collapsed
> after a physical split it requires at least one young pte.
>
> If this is particularly problematic for page hinting, this behavior
> where the MADV_DONTNEED can be undone by khugepaged (if some subpage is
> being frequently accessed), can be turned off by setting
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
> 0. Then the THP will only be collapsed if all 512 subpages are mapped
> (i.e. they've all been re-allocated by the guest).
>
> Regardless of the max_ptes_none default, keeping the smaller guest
> buddy orders as the last target for page hinting should be good for
> performance.
>
>> Yeah, no problem. The only thing I don't like about MADV_FREE is that
>> you have to have memory pressure before the pages really start getting
>> scrubbed, which is both a benefit and a drawback. Basically it defers
>> the freeing until you are under actual memory pressure so when you hit
>> that case things start feeling much slower, that and it limits your
>> allocations since the kernel doesn't recognize the pages as free until
>> it would have to start trying to push memory to swap.
> The guest allocation behavior should not be influenced by MADV_FREE vs
> MADV_DONTNEED, the guest can't see the difference anyway, so why
> should it limit the allocations?
>
> The benefit of MADV_FREE should be that when the same guest frees and
> reallocates a huge amount of RAM (i.e. guest app allocating and
> freeing lots of RAM in a loop, not so uncommon), there will be no KVM
> page fault during guest re-allocations. So in absence of memory
> pressure in the host it should be a major win. Overall it sounds like
> a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
> notifiers and forces host allocations and KVM page faults in order to
> reallocate the same RAM in the same guest.
This does make sense.
Thanks for explaining this.
>
> When there's memory pressure it's up to the host Linux VM to notice
> there's plenty of MADV_FREE material to free at zero I/O cost before
> starting swapping.
>
> Thanks,
> Andrea
-- 
Regards
Nitesh





Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread Nitesh Narayan Lal

On 2/18/19 4:04 PM, David Hildenbrand wrote:
> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> So I'm fine with a simple implementation but the interface needs to
> allow the hypervisor to process hints in parallel while guest is
> running.  We can then fix any issues on hypervisor without breaking
> guests.
 Yes, I am fine with defining an interface that theoretically lets us
 change the implementation in the guest later.
 I consider this even a
 prerequisite. IMHO the interface shouldn't be different, it will be
 exactly the same.

 It is just "who" calls the batch freeing and waits for it. And as I
 outlined here, doing it without additional threads at least avoids us
 for now having to think about dynamic data structures and that we can
 sometimes not report "because the thread is still busy reporting or
 wasn't scheduled yet".
>>> Sorry I wasn't clear. I think we need ability to change the
>>> implementation in the *host* later. IOW don't rely on
>>> host being synchronous.
>>>
>>>
>> I actually misread it :) . In any way, there has to be a mechanism to
>> synchronize.
>>
>> If we are going via a bare hypercall (like s390x, like what Alexander
>> proposes), it is going to be a synchronous interface either way. Just a
>> bare hypercall, there will not really be any blocking on the guest side.
> It bothers me that we are now tied to interface being synchronous. We
> won't be able to fix it if there's an issue as that would break guests.
 I assume with "fix it" you mean "fix kfree taking longer on every X call"?

 Yes, as I initially wrote, this mimics s390x. That might be good (we
 know it has been working for years) and bad (we are inheriting the same
 problem class, if it exists). And being synchronous is part of the
 approach for now.
>>> BTW on s390 are these hypercalls handled by Linux?
>>>
 I tend to focus on the first part (we don't know anything besides it is
 working) while you focus on the second part (there could be a potential
 problem). Having a real problem at hand would be great, then we would
 know what exactly we actually have to fix. But read below.
>>> If we end up doing a hypercall per THP, maybe we could at least
>>> not block with interrupts disabled? Poll in guest until
>>> hypervisor reports it's done?  That would already be an
>>> improvement IMHO. E.g. perf within guest will point you
>>> in the right direction and towards disabling hinting.
>>>
>>>
>> Via virtio, I guess it is waiting for a response to a request, right?
> For the buffer to be used, yes. And it could mean putting some pages
> aside until hypervisor is done with them. Then you don't need timers or
> tricks like this, you can get an interrupt and start using the memory.
 I am very open to such an approach as long as we can make it work and it
 is not too complicated. (-> simple)

 This would mean for example

 1. Collect entries to be reported per VCPU in a buffer. Say magic number
 256/512.

 2. Once the buffer is full, do crazy "take pages out of the balloon
 action" and report them to the hypervisor via virtio. Let the VCPU
 continue. This will require some memory to store the request. Small
 hiccup for the VCPU to kick off the reporting to the hypervisor.

 3. On interrupt/response, go over the response and put the pages back to
 the buddy.

 (assuming that reporting a bulk of frees is better than reporting every
 single free obviously)

 This could allow nice things like "when OOM gets triggered, see if pages
 are currently being reported and wait until they have been put back to
 the buddy, return "new pages available", so in a real "low on memory"
 scenario, no OOM killer would get involved. This could address the issue
 Wei had with reporting when low on memory.

 Is that something you have in mind?
>>> Yes that seems more future proof I think.
>>>
 I assume we would have to allocate
 memory when crafting the new requests. This is the only reason I tend to
 prefer a synchronous interface for now. But if allocation is not a
 problem, great.
>>> There are two main ways to avoid allocation:
>>> 1. do not add extra data on top of each chunk passed
>> If I am not wrong then this is close to what we have right now.
> Yes, minus the kthread(s) and eventually with some sort of memory
> allocation for the request. Once you're asynchronous via a notification
> mechanism, there is no real need for a thread anymore, hopefully.
Whether we should go with kthread or without it, I would like to do some
performance comparison before commenting on this.
>
>> One 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-19 Thread David Hildenbrand
On 19.02.19 00:47, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand  wrote:
>>
>> On 18.02.19 18:31, Alexander Duyck wrote:
>>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand  wrote:

 On 18.02.19 17:49, Michael S. Tsirkin wrote:
> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>> It would be worth a try. My feeling is that a synchronous report after
>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>> s390x. (basically always enabled, nobody complains).
>
> What slips under the radar on an arch like s390 might
> raise issues for a popular arch like x86. My fear would be
> if it's only a problem e.g. for realtime. Then you get
> a condition that's very hard to trigger and affects
> worst case latencies.

 Realtime should never use free page hinting. Just like it should never
 use ballooning. Just like it should pin all pages in the hypervisor.

>
> But really what business has something that is supposedly
> an optimization blocking a VCPU? We are just freeing up
> lots of memory why is it a good idea to slow that
> process down?

 I first want to know that it is a problem before we declare it a
 problem. I provided an example (s390x) where it does not seem to be a
 problem. One hypercall ~every 512 frees. As simple as it can get.

 Not trying to deny that it could be a problem on x86, but then I assume
 it is only a problem in specific setups.

 I would much rather prefer a simple solution that can eventually be
 disabled in selected setup than a complicated solution that tries to fit
 all possible setups. Realtime is one of the examples where such stuff is
 to be disabled either way.

 Optimization of space comes with a price (here: execution time).
>>>
>>> One thing to keep in mind though is that if you are already having to
>>> pull pages in and out of swap on the host in order to be able to provide
>>> enough memory for the guests, the free page hinting should be a
>>> significant win in terms of performance.
>>
>> Indeed. And also we are in a virtualized environment already, we can
>> have any kind of sudden hiccups. (again, realtime has special
>> requirements on the setup)
>>
>> Side note: I like your approach because it is simple. I don't like your
>> approach because it cannot deal with fragmented memory. And that can
>> happen easily.
>>
>> The idea I described here can be similarly be an extension of your
>> approach, merging in a "batched reporting" Nitesh proposed, so we can
>> report on something < MAX_ORDER, similar to s390x. In the end it boils
>> down to reporting via hypercall vs. reporting via virtio. The main point
>> is that it is synchronous and batched. (and that we properly take care
>> of the race between host freeing and guest allocation)
> 
> I'd say the discussion is even simpler than that. My concern is more
> synchronous versus asynchronous. I honestly think the cost for a
> synchronous call is being overblown and we are likely to see the fault
> and zeroing of pages cost more than the hypercall or virtio
> transaction itself.

The overhead of page faults and zeroing should be mitigated by
MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
(context switch) becomes relevant.

We have various discussions now :) And I think they are related.

synchronous versus asynchronous
batched vs. non-batched
MAX_ORDER - 1 vs. other/none magic number

1. synchronous call without batching on every kfree is bad. The
interface is fixed to big magic numbers, otherwise we end up having a
hypercall on every kfree. This is your approach.

2. asynchronous calls without batching would most probably have similar
problems with small granularities as we had when ballooning without
batching. Just overhead we can avoid.

3. synchronous and batched is what s390x does. It can deal with page
granularity. It is what I initially described in this sub-thread.

4. asynchronous and batched. This is the other approach we discussed
yesterday. If we can get it implemented, I would be interested in
performance numbers.

As far as I understood, Michael seems to favor something like 4 (and I
assume eventually 2 if it is similarly fast). I am a friend of either 3
or 4.
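
To make option 3 a bit more concrete, a minimal sketch of the guest side
(names are invented, per-cpu handling and locking omitted, and
hypercall_report_pages() is only a placeholder for whatever interface we
end up with):

#define HINT_BATCH 512

struct hint_buf {                       /* one instance per CPU */
        unsigned long pfn[HINT_BATCH];
        unsigned int nr;
};

/* hooked into the freeing path, e.g. from arch_free_page() */
static void hint_on_free(struct hint_buf *buf, unsigned long pfn)
{
        buf->pfn[buf->nr++] = pfn;
        if (buf->nr == HINT_BATCH) {
                /* one synchronous exit per 512 frees, s390x-style */
                hypercall_report_pages(buf->pfn, buf->nr);
                buf->nr = 0;
        }
}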

> 
> Also one reason why I am not a fan of working with anything less than
> PMD order is because there have been issues in the past with false
> memory leaks being created when hints were provided on THP pages that
> essentially fragmented them. I guess hugepaged went through and
> started trying to reassemble the huge pages and as a result there have
> been apps that ended up consuming more memory than they would have
> otherwise since they were using fragments of THP pages after doing an
> MADV_DONTNEED on sections of the page.

I understand your concerns, but we should not let bugs in the hypervisor
dictate the design. Bugs 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
On 19.02.19 01:01, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand  wrote:
>>
>> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
>>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
 On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>> So I'm fine with a simple implementation but the interface needs to
>> allow the hypervisor to process hints in parallel while guest is
>> running.  We can then fix any issues on hypervisor without breaking
>> guests.
> Yes, I am fine with defining an interface that theoretically lets us
> change the implementation in the guest later.
> I consider this even a
> prerequisite. IMHO the interface shouldn't be different, it will be
> exactly the same.
>
> It is just "who" calls the batch freeing and waits for it. And as I
> outlined here, doing it without additional threads at least avoids us
> for now having to think about dynamic data structures and that we can
> sometimes not report "because the thread is still busy reporting or
> wasn't scheduled yet".
 Sorry I wasn't clear. I think we need ability to change the
 implementation in the *host* later. IOW don't rely on
 host being synchronous.


>>> I actually misread it :) . In any way, there has to be a mechanism to
>>> synchronize.
>>>
>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>> proposes), it is going to be a synchronous interface either way. Just a
>>> bare hypercall, there will not really be any blocking on the guest side.
>> It bothers me that we are now tied to interface being synchronous. We
>> won't be able to fix it if there's an issue as that would break guests.
> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>
> Yes, as I initially wrote, this mimics s390x. That might be good (we
> know it has been working for years) and bad (we are inheriting the same
> problem class, if it exists). And being synchronous is part of the
> approach for now.
 BTW on s390 are these hypercalls handled by Linux?

> I tend to focus on the first part (we don't know anything besides it is
> working) while you focus on the second part (there could be a potential
> problem). Having a real problem at hand would be great, then we would
> know what exactly we actually have to fix. But read below.
 If we end up doing a hypercall per THP, maybe we could at least
 not block with interrupts disabled? Poll in guest until
 hypervisor reports it's done?  That would already be an
 improvement IMHO. E.g. perf within guest will point you
 in the right direction and towards disabling hinting.


>>> Via virtio, I guess it is waiting for a response to a request, right?
>> For the buffer to be used, yes. And it could mean putting some pages
>> aside until hypervisor is done with them. Then you don't need timers or
>> tricks like this, you can get an interrupt and start using the memory.
> I am very open to such an approach as long as we can make it work and it
> is not too complicated. (-> simple)
>
> This would mean for example
>
> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> 256/512.
>
> 2. Once the buffer is full, do crazy "take pages out of the balloon
> action" and report them to the hypervisor via virtio. Let the VCPU
> continue. This will require some memory to store the request. Small
> hiccup for the VCPU to kick off the reporting to the hypervisor.
>
> 3. On interrupt/response, go over the response and put the pages back to
> the buddy.
>
> (assuming that reporting a bulk of frees is better than reporting every
> single free obviously)
>
> This could allow nice things like "when OOM gets triggered, see if pages
> are currently being reported and wait until they have been put back to
> the buddy, return "new pages available", so in a real "low on memory"
> scenario, no OOM killer would get involved. This could address the issue
> Wei had with reporting when low on memory.
>
> Is that something you have in mind?
 Yes that seems more future proof I think.

> I assume we would have to allocate
> memory when crafting the new requests. This is the only reason I tend to
> prefer a synchronous interface for now. But if allocation is not a
> problem, great.
 There are two main ways to avoid allocation:
 1. do not add extra data on top of each chunk passed
>>> If I am not wrong then this is close to what we have right now.
>>
>> Yes, minus the kthread(s) and eventually with some sort of memory
>> allocation for the request. Once you're asynchronous via a notification
>> mechanism, there is no real need for a thread 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Michael S. Tsirkin
On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> > > So far with my patch set that hints at the PMD level w/ THP enabled I
> > > am not really seeing that much overhead for the hypercalls. The bigger
> > > piece that is eating up CPU time is all the page faults and page
> > > zeroing that is going on as we are cycling the memory in and out of
> > > the guest. Some of that could probably be resolved by using MADV_FREE,
> > > but if we are under actual memory pressure I suspect it would behave
> > > similar to MADV_DONTNEED.
> > >
> >
> > MADV_FREE is certainly the better thing to do for hinting in my opinion.
> > It should result in even less overhead. Thanks for the comment about the
> > hypercall overhead.
> 
> Yeah, no problem. The only thing I don't like about MADV_FREE is that
> you have to have memory pressure before the pages really start getting
> scrubbed, which is both a benefit and a drawback. Basically it defers
> the freeing until you are under actual memory pressure so when you hit
> that case things start feeling much slower, that and it limits your
> allocations since the kernel doesn't recognize the pages as free until
> it would have to start trying to push memory to swap.

For sure if someone *wants* to spend cycles freeing memory,
we could add a system call that does exactly that. There's
no reason to force that on the same CPU while VCPU
is stopped though.

-- 
MST


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Andrea Arcangeli
Hello,

On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> essentially fragmented them. I guess hugepaged went through and
> started trying to reassemble the huge pages and as a result there have
> been apps that ended up consuming more memory than they would have
> otherwise since they were using fragments of THP pages after doing an
> MADV_DONTNEED on sections of the page.

With relatively recent kernels MADV_DONTNEED doesn't necessarily free
anything when it's applied to a THP subpage, it only splits the
pagetables and queues the THP for deferred splitting. If there's
memory pressure a shrinker will be invoked and the queue is scanned
and the THPs are physically split, but to be reassembled/collapsed
after a physical split it requires at least one young pte.

If this is particularly problematic for page hinting, this behavior
where the MADV_DONTNEED can be undone by khugepaged (if some subpage is
being frequently accessed), can be turned off by setting
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
0. Then the THP will only be collapsed if all 512 subpages are mapped
(i.e. they've all been re-allocated by the guest).

Regardless of the max_ptes_none default, keeping the smaller guest
buddy orders as the last target for page hinting should be good for
performance.

> Yeah, no problem. The only thing I don't like about MADV_FREE is that
> you have to have memory pressure before the pages really start getting
> scrubbed, which is both a benefit and a drawback. Basically it defers
> the freeing until you are under actual memory pressure so when you hit
> that case things start feeling much slower, that and it limits your
> allocations since the kernel doesn't recognize the pages as free until
> it would have to start trying to push memory to swap.

The guest allocation behavior should not be influenced by MADV_FREE vs
MADV_DONTNEED, the guest can't see the difference anyway, so why
should it limit the allocations?

The benefit of MADV_FREE should be that when the same guest frees and
reallocates a huge amount of RAM (i.e. guest app allocating and
freeing lots of RAM in a loop, not so uncommon), there will be no KVM
page fault during guest re-allocations. So in absence of memory
pressure in the host it should be a major win. Overall it sounds like
a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
notifiers and forces host allocations and KVM page faults in order to
reallocate the same RAM in the same guest.
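
For reference, the host side of such a hint essentially boils down to a
single madvise() call on the host mapping of the reported range; a minimal
illustration (not the actual QEMU code):

#include <sys/mman.h>

/* hva/len describe the host mapping of the guest range that was hinted */
static int discard_hinted_range(void *hva, size_t len, int lazy)
{
        /*
         * MADV_FREE keeps the mapping and lets reclaim take the pages
         * lazily, so guest re-allocation does not fault while there is
         * no host memory pressure; MADV_DONTNEED drops the pages right
         * away and forces a fault plus allocation on the next access.
         */
        return madvise(hva, len, lazy ? MADV_FREE : MADV_DONTNEED);
}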

When there's memory pressure it's up to the host Linux VM to notice
there's plenty of MADV_FREE material to free at zero I/O cost before
starting swapping.

Thanks,
Andrea


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Alexander Duyck
On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand  wrote:
>
> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> > On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
> >> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>  So I'm fine with a simple implementation but the interface needs to
>  allow the hypervisor to process hints in parallel while guest is
>  running.  We can then fix any issues on hypervisor without breaking
>  guests.
> >>> Yes, I am fine with defining an interface that theoretically lets us
> >>> change the implementation in the guest later.
> >>> I consider this even a
> >>> prerequisite. IMHO the interface shouldn't be different, it will be
> >>> exactly the same.
> >>>
> >>> It is just "who" calls the batch freeing and waits for it. And as I
> >>> outlined here, doing it without additional threads at least avoids us
> >>> for now having to think about dynamic data structures and that we can
> >>> sometimes not report "because the thread is still busy reporting or
> >>> wasn't scheduled yet".
> >> Sorry I wasn't clear. I think we need ability to change the
> >> implementation in the *host* later. IOW don't rely on
> >> host being synchronous.
> >>
> >>
> > I actually misread it :) . In any way, there has to be a mechanism to
> > synchronize.
> >
> > If we are going via a bare hypercall (like s390x, like what Alexander
> > proposes), it is going to be a synchronous interface either way. Just a
> > bare hypercall, there will not really be any blocking on the guest side.
>  It bothers me that we are now tied to interface being synchronous. We
>  won't be able to fix it if there's an issue as that would break guests.
> >>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> >>>
> >>> Yes, as I initially wrote, this mimics s390x. That might be good (we
> >>> know it has been working for years) and bad (we are inheriting the same
> >>> problem class, if it exists). And being synchronous is part of the
> >>> approach for now.
> >> BTW on s390 are these hypercalls handled by Linux?
> >>
> >>> I tend to focus on the first part (we don't know anything besides it is
> >>> working) while you focus on the second part (there could be a potential
> >>> problem). Having a real problem at hand would be great, then we would
> >>> know what exactly we actually have to fix. But read below.
> >> If we end up doing a hypercall per THP, maybe we could at least
> >> not block with interrupts disabled? Poll in guest until
> >> hypervisor reports it's done?  That would already be an
> >> improvement IMHO. E.g. perf within guest will point you
> >> in the right direction and towards disabling hinting.
> >>
> >>
> > Via virtio, I guess it is waiting for a response to a request, right?
>  For the buffer to be used, yes. And it could mean putting some pages
>  aside until hypervisor is done with them. Then you don't need timers or
>  tricks like this, you can get an interrupt and start using the memory.
> >>> I am very open to such an approach as long as we can make it work and it
> >>> is not too complicated. (-> simple)
> >>>
> >>> This would mean for example
> >>>
> >>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> >>> 256/512.
> >>>
> >>> 2. Once the buffer is full, do crazy "take pages out of the balloon
> >>> action" and report them to the hypervisor via virtio. Let the VCPU
> >>> continue. This will require some memory to store the request. Small
> >>> hiccup for the VCPU to kick off the reporting to the hypervisor.
> >>>
> >>> 3. On interrupt/response, go over the response and put the pages back to
> >>> the buddy.
> >>>
> >>> (assuming that reporting a bulk of frees is better than reporting every
> >>> single free obviously)
> >>>
> >>> This could allow nice things like "when OOM gets triggered, see if pages
> >>> are currently being reported and wait until they have been put back to
> >>> the buddy, return "new pages available", so in a real "low on memory"
> >>> scenario, no OOM killer would get involved. This could address the issue
> >>> Wei had with reporting when low on memory.
> >>>
> >>> Is that something you have in mind?
> >> Yes that seems more future proof I think.
> >>
> >>> I assume we would have to allocate
> >>> memory when crafting the new requests. This is the only reason I tend to
> >>> prefer a synchronous interface for now. But if allocation is not a
> >>> problem, great.
> >> There are two main ways to avoid allocation:
> >> 1. do not add extra data on top of each chunk passed
> > If I am not wrong then this is close to what we have right now.
>
> Yes, minus the kthread(s) and eventually with some sort of memory
> allocation for the request. Once you're asynchronous via a notification
> mechanism, there is no real need for a thread anymore, hopefully.
>
> > One issue I see right 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Alexander Duyck
On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand  wrote:
>
> On 18.02.19 18:31, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand  wrote:
> >>
> >> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> >>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>  It would be worth a try. My feeling is that a synchronous report after
>  e.g. 512 frees should be acceptable, as it seems to be acceptable on
>  s390x. (basically always enabled, nobody complains).
> >>>
> >>> What slips under the radar on an arch like s390 might
> >>> raise issues for a popular arch like x86. My fear would be
> >>> if it's only a problem e.g. for realtime. Then you get
> >>> a condition that's very hard to trigger and affects
> >>> worst case latencies.
> >>
> >> Realtime should never use free page hinting. Just like it should never
> >> use ballooning. Just like it should pin all pages in the hypervisor.
> >>
> >>>
> >>> But really what business has something that is supposedly
> >>> an optimization blocking a VCPU? We are just freeing up
> >>> lots of memory why is it a good idea to slow that
> >>> process down?
> >>
> >> I first want to know that it is a problem before we declare it a
> >> problem. I provided an example (s390x) where it does not seem to be a
> >> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>
> >> Not trying to deny that it could be a problem on x86, but then I assume
> >> it is only a problem in specific setups.
> >>
> >> I would much rather prefer a simple solution that can eventually be
> >> disabled in selected setup than a complicated solution that tries to fit
> >> all possible setups. Realtime is one of the examples where such stuff is
> >> to be disabled either way.
> >>
> >> Optimization of space comes with a price (here: execution time).
> >
> > One thing to keep in mind though is that if you are already having to
> > pull pages in and out of swap on the host in order to be able to provide
> > enough memory for the guests, the free page hinting should be a
> > significant win in terms of performance.
>
> Indeed. And also we are in a virtualized environment already, we can
> have any kind of sudden hiccups. (again, realtime has special
> requirements on the setup)
>
> Side note: I like your approach because it is simple. I don't like your
> approach because it cannot deal with fragmented memory. And that can
> happen easily.
>
> The idea I described here can be similarly be an extension of your
> approach, merging in a "batched reporting" Nitesh proposed, so we can
> report on something < MAX_ORDER, similar to s390x. In the end it boils
> down to reporting via hypercall vs. reporting via virtio. The main point
> is that it is synchronous and batched. (and that we properly take care
> of the race between host freeing and guest allocation)

I'd say the discussion is even simpler than that. My concern is more
synchronous versus asynchronous. I honestly think the cost for a
synchronous call is being overblown and we are likely to see the fault
and zeroing of pages cost more than the hypercall or virtio
transaction itself.

Also one reason why I am not a fan of working with anything less than
PMD order is because there have been issues in the past with false
memory leaks being created when hints were provided on THP pages that
essentially fragmented them. I guess hugepaged went through and
started trying to reassemble the huge pages and as a result there have
been apps that ended up consuming more memory than they would have
otherwise since they were using fragments of THP pages after doing an
MADV_DONTNEED on sections of the page.

> >
> > So far with my patch set that hints at the PMD level w/ THP enabled I
> > am not really seeing that much overhead for the hypercalls. The bigger
> > piece that is eating up CPU time is all the page faults and page
> > zeroing that is going on as we are cycling the memory in and out of
> > the guest. Some of that could probably be resolved by using MADV_FREE,
> > but if we are under actual memory pressure I suspect it would behave
> > similar to MADV_DONTNEED.
> >
>
> MADV_FREE is certainly the better thing to do for hinting in my opinion.
> It should result in even less overhead. Thanks for the comment about the
> hypercall overhead.

Yeah, no problem. The only thing I don't like about MADV_FREE is that
you have to have memory pressure before the pages really start getting
scrubbed, which is both a benefit and a drawback. Basically it defers
the freeing until you are under actual memory pressure so when you hit
that case things start feeling much slower, that and it limits your
allocations since the kernel doesn't recognize the pages as free until
it would have to start trying to push memory to swap.


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
 So I'm fine with a simple implementation but the interface needs to
 allow the hypervisor to process hints in parallel while guest is
 running.  We can then fix any issues on hypervisor without breaking
 guests.
>>> Yes, I am fine with defining an interface that theoretically lets us
>>> change the implementation in the guest later.
>>> I consider this even a
>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>> exactly the same.
>>>
>>> It is just "who" calls the batch freeing and waits for it. And as I
>>> outlined here, doing it without additional threads at least avoids us
>>> for now having to think about dynamic data structures and that we can
>>> sometimes not report "because the thread is still busy reporting or
>>> wasn't scheduled yet".
>> Sorry I wasn't clear. I think we need ability to change the
>> implementation in the *host* later. IOW don't rely on
>> host being synchronous.
>>
>>
> I actually misread it :) . In any way, there has to be a mechanism to
> synchronize.
>
> If we are going via a bare hypercall (like s390x, like what Alexander
> proposes), it is going to be a synchronous interface either way. Just a
> bare hypercall, there will not really be any blocking on the guest side.
 It bothers me that we are now tied to interface being synchronous. We
 won't be able to fix it if there's an issue as that would break guests.
>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>>
>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>>> know it has been working for years) and bad (we are inheriting the same
>>> problem class, if it exists). And being synchronous is part of the
>>> approach for now.
>> BTW on s390 are these hypercalls handled by Linux?
>>
>>> I tend to focus on the first part (we don't know anything besides it is
>>> working) while you focus on the second part (there could be a potential
>>> problem). Having a real problem at hand would be great, then we would
>>> know what exactly we actually have to fix. But read below.
>> If we end up doing a hypercall per THP, maybe we could at least
>> not block with interrupts disabled? Poll in guest until
>> hypervisor reports it's done?  That would already be an
>> improvement IMHO. E.g. perf within guest will point you
>> in the right direction and towards disabling hinting.
>>
>>
> Via virtio, I guess it is waiting for a response to a request, right?
 For the buffer to be used, yes. And it could mean putting some pages
 aside until hypervisor is done with them. Then you don't need timers or
 tricks like this, you can get an interrupt and start using the memory.
>>> I am very open to such an approach as long as we can make it work and it
>>> is not too complicated. (-> simple)
>>>
>>> This would mean for example
>>>
>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>>> 256/512.
>>>
>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>>> action" and report them to the hypervisor via virtio. Let the VCPU
>>> continue. This will require some memory to store the request. Small
>>> hiccup for the VCPU to kick off the reporting to the hypervisor.
>>>
>>> 3. On interrupt/response, go over the response and put the pages back to
>>> the buddy.
>>>
>>> (assuming that reporting a bulk of frees is better than reporting every
>>> single free obviously)
>>>
>>> This could allow nice things like "when OOM gets triggered, see if pages
>>> are currently being reported and wait until they have been put back to
>>> the buddy, return "new pages available", so in a real "low on memory"
>>> scenario, no OOM killer would get involved. This could address the issue
>>> Wei had with reporting when low on memory.
>>>
>>> Is that something you have in mind?
>> Yes that seems more future proof I think.
>>
>>> I assume we would have to allocate
>>> memory when crafting the new requests. This is the only reason I tend to
>>> prefer a synchronous interface for now. But if allocation is not a
>>> problem, great.
>> There are two main ways to avoid allocation:
>> 1. do not add extra data on top of each chunk passed
> If I am not wrong then this is close to what we have right now.

Yes, minus the kthread(s) and eventually with some sort of memory
allocation for the request. Once you're asynchronous via a notification
mechanism, there is no real need for a thread anymore, hopefully.

> One issue I see right now is that I am polling while host is freeing the
> memory.
> In the next version I could tie the logic which returns pages to the
> buddy and resets the per cpu array index value to 0 with the callback.
> (i.e.., it happens once we receive an 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
On 18.02.19 21:31, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>> So I'm fine with a simple implementation but the interface needs to
>>> allow the hypervisor to process hints in parallel while guest is
>>> running.  We can then fix any issues on hypervisor without breaking
>>> guests.
>>
>> Yes, I am fine with defining an interface that theoretically lets us
>> change the implementation in the guest later.
>> I consider this even a
>> prerequisite. IMHO the interface shouldn't be different, it will be
>> exactly the same.
>>
>> It is just "who" calls the batch freeing and waits for it. And as I
>> outlined here, doing it without additional threads at least avoids us
>> for now having to think about dynamic data structures and that we can
>> sometimes not report "because the thread is still busy reporting or
>> wasn't scheduled yet".
>
> Sorry I wasn't clear. I think we need ability to change the
> implementation in the *host* later. IOW don't rely on
> host being synchronous.
>
>
 I actually misread it :) . In any way, there has to be a mechanism to
 synchronize.

 If we are going via a bare hypercall (like s390x, like what Alexander
 proposes), it is going to be a synchronous interface either way. Just a
 bare hypercall, there will not really be any blocking on the guest side.
>>>
>>> It bothers me that we are now tied to interface being synchronous. We
>>> won't be able to fix it if there's an issue as that would break guests.
>>
>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>
>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>> know it has been working for years) and bad (we are inheriting the same
>> problem class, if it exists). And being synchronous is part of the
>> approach for now.
> 
> BTW on s390 are these hypercalls handled by Linux?

I assume you mean in KVM - Yes! There is a hardware assist to handle the
"queuing of 512 pfns" but once the buffer is full, the actual hypercall
intercept will be triggered.

arch/s390/kvm/priv.c:handle_essa()

The interesting part is

down_read(&gmap->mm->mmap_sem);
for (i = 0; i < entries; ++i)
	__gmap_zap(gmap, cbrlo[i]);
up_read(&gmap->mm->mmap_sem);

cbrlo is the pfn array stored in the hypervisor.

> 
>> I tend to focus on the first part (we don't know anything besides it is
>> working) while you focus on the second part (there could be a potential
>> problem). Having a real problem at hand would be great, then we would
>> know what exactly we actually have to fix. But read below.
> 
> If we end up doing a hypercall per THP, maybe we could at least
> not block with interrupts disabled? Poll in guest until
> hypervisor reports it's done?  That would already be an
> improvement IMHO. E.g. perf within guest will point you
> in the right direction and towards disabling hinting.

I think we always have the option to busy loop where we consider it more
helpful. On synchronous hypercalls, no waiting is necessary. Only on
asynchronous ones (which would the most probably be virtio based).

I don't think only reporting THP will be future proof. So with whatever
we come up, it has to be able to deal with smaller granularities. Not
saying eventually page granularity, but at least some other orders. The
only solution to avoid overhead of many hypercalls is then to report
multiple ones in one shot.
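
As a sketch of what "multiple ones in one shot" could look like on the wire
(purely illustrative, not a defined ABI), the guest would pass an array of
(pfn, order) pairs, so a single exit can cover free chunks of different
sizes instead of being tied to THP or MAX_ORDER - 1:

#include <linux/types.h>

struct hint_entry {
        __u64 pfn;              /* first pfn of a free buddy chunk */
        __u32 order;            /* buddy order of that chunk */
        __u32 reserved;
};

struct hint_batch {
        __u32 nr;               /* entries used, e.g. up to 512 */
        __u32 reserved;
        struct hint_entry entry[512];
};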

>>>
 Via virtio, I guess it is waiting for a response to a request, right?
>>>
>>> For the buffer to be used, yes. And it could mean putting some pages
>>> aside until hypervisor is done with them. Then you don't need timers or
>>> tricks like this, you can get an interrupt and start using the memory.
>>
>> I am very open to such an approach as long as we can make it work and it
>> is not too complicated. (-> simple)
>>
>> This would mean for example
>>
>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>> 256/512.
>>
>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>> action" and report them to the hypervisor via virtio. Let the VCPU
>> continue. This will require some memory to store the request. Small
>> hiccup for the VCPU to kick off the reporting to the hypervisor.
>>
>> 3. On interrupt/response, go over the response and put the pages back to
>> the buddy.
>>
>> (assuming that reporting a bulk of frees is better than reporting every
>> single free obviously)
>>
>> This could allow nice things like "when OOM gets triggered, see if pages
>> are currently being reported and wait until they have been put back to
>> the buddy, return "new pages available", so in a real "low on memory"
>> scenario, no OOM killer would get involved. This could address the issue
>> Wei had with reporting when low on memory.
>>
>> Is that something you have in mind?
> 
> Yes that seems more future proof I think.

And it would satisfy 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Nitesh Narayan Lal
On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>> So I'm fine with a simple implementation but the interface needs to
>>> allow the hypervisor to process hints in parallel while guest is
>>> running.  We can then fix any issues on hypervisor without breaking
>>> guests.
>> Yes, I am fine with defining an interface that theoretically lets us
>> change the implementation in the guest later.
>> I consider this even a
>> prerequisite. IMHO the interface shouldn't be different, it will be
>> exactly the same.
>>
>> It is just "who" calls the batch freeing and waits for it. And as I
>> outlined here, doing it without additional threads at least avoids us
>> for now having to think about dynamic data structures and that we can
>> sometimes not report "because the thread is still busy reporting or
>> wasn't scheduled yet".
> Sorry I wasn't clear. I think we need ability to change the
> implementation in the *host* later. IOW don't rely on
> host being synchronous.
>
>
 I actually misread it :) . In any way, there has to be a mechanism to
 synchronize.

 If we are going via a bare hypercall (like s390x, like what Alexander
 proposes), it is going to be a synchronous interface either way. Just a
 bare hypercall, there will not really be any blocking on the guest side.
>>> It bothers me that we are now tied to interface being synchronous. We
>>> won't be able to fix it if there's an issue as that would break guests.
>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>
>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>> know it has been working for years) and bad (we are inheriting the same
>> problem class, if it exists). And being synchronous is part of the
>> approach for now.
> BTW on s390 are these hypercalls handled by Linux?
>
>> I tend to focus on the first part (we don't know anything besides it is
>> working) while you focus on the second part (there could be a potential
>> problem). Having a real problem at hand would be great, then we would
>> know what exactly we actually have to fix. But read below.
> If we end up doing a hypercall per THP, maybe we could at least
> not block with interrupts disabled? Poll in guest until
> hypervisor reports it's done?  That would already be an
> improvement IMHO. E.g. perf within guest will point you
> in the right direction and towards disabling hinting.
>
>
 Via virtio, I guess it is waiting for a response to a request, right?
>>> For the buffer to be used, yes. And it could mean putting some pages
>>> aside until hypervisor is done with them. Then you don't need timers or
>>> tricks like this, you can get an interrupt and start using the memory.
>> I am very open to such an approach as long as we can make it work and it
>> is not too complicated. (-> simple)
>>
>> This would mean for example
>>
>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>> 256/512.
>>
>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>> action" and report them to the hypervisor via virtio. Let the VCPU
>> continue. This will require some memory to store the request. Small
>> hiccup for the VCPU to kick off the reporting to the hypervisor.
>>
>> 3. On interrupt/response, go over the response and put the pages back to
>> the buddy.
>>
>> (assuming that reporting a bulk of frees is better than reporting every
>> single free obviously)
>>
>> This could allow nice things like "when OOM gets triggered, see if pages
>> are currently being reported and wait until they have been put back to
>> the buddy, return "new pages available", so in a real "low on memory"
>> scenario, no OOM killer would get involved. This could address the issue
>> Wei had with reporting when low on memory.
>>
>> Is that something you have in mind?
> Yes that seems more future proof I think.
>
>> I assume we would have to allocate
>> memory when crafting the new requests. This is the only reason I tend to
>> prefer a synchronous interface for now. But if allocation is not a
>> problem, great.
> There are two main ways to avoid allocation:
> 1. do not add extra data on top of each chunk passed
If I am not wrong, then this is close to what we have right now.
One issue I see right now is that I am polling while the host is freeing the
memory.
In the next version I could tie the logic which returns pages to the
buddy and resets the per-cpu array index value to 0 to the callback
(i.e., it happens once we receive a response from the host).
Another change which I am testing right now is to only capture 'MAX_ORDER
- 1' pages. (A rough sketch of both changes is below.)
> 2. add extra data but pre-allocate buffers for it
>
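
To make that concrete, here is a minimal sketch of the two changes mentioned
above. The helper names (add_pfn_to_percpu_array(), return_isolated_pages_to_buddy(),
reset_percpu_array_index()) are hypothetical placeholders, not the actual patch:

void arch_free_page(struct page *page, int order)
{
	/* only capture the largest buddies, everything else is ignored */
	if (order != MAX_ORDER - 1)
		return;
	add_pfn_to_percpu_array(page_to_pfn(page));	/* hypothetical */
}

/* virtio callback: runs only once the host has acked the hint request */
static void hinting_ack(struct virtqueue *vq)
{
	return_isolated_pages_to_buddy();	/* hypothetical */
	reset_percpu_array_index();		/* idx = 0, array reusable again */
}

The important property is just that the array index is only reset from the
callback, so there is no need to poll while the host is still freeing the
memory.
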
>> -- 
>>
>> Thanks,
>>
>> David / dhildenb
-- 
Regards
Nitesh





Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Michael S. Tsirkin
On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> > So I'm fine with a simple implementation but the interface needs to
> > allow the hypervisor to process hints in parallel while guest is
> > running.  We can then fix any issues on hypervisor without breaking
> > guests.
> 
>  Yes, I am fine with defining an interface that theoretically lets us
>  change the implementation in the guest later.
>  I consider this even a
>  prerequisite. IMHO the interface shouldn't be different, it will be
>  exactly the same.
> 
>  It is just "who" calls the batch freeing and waits for it. And as I
>  outlined here, doing it without additional threads at least avoids us
>  for now having to think about dynamic data structures and that we can
>  sometimes not report "because the thread is still busy reporting or
>  wasn't scheduled yet".
> >>>
> >>> Sorry I wasn't clear. I think we need ability to change the
> >>> implementation in the *host* later. IOW don't rely on
> >>> host being synchronous.
> >>>
> >>>
> >> I actually misread it :). In any case, there has to be a mechanism to
> >> synchronize.
> >>
> >> If we are going via a bare hypercall (like s390x, like what Alexander
> >> proposes), it is going to be a synchronous interface either way. Just a
> >> bare hypercall, there will not really be any blocking on the guest side.
> > 
> > It bothers me that we are now tied to interface being synchronous. We
> > won't be able to fix it if there's an issue as that would break guests.
> 
> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> 
> Yes, as I initially wrote, this mimics s390x. That might be good (we
> know it has been working for years) and bad (we are inheriting the same
> problem class, if it exists). And being synchronous is part of the
> approach for now.

BTW on s390 are these hypercalls handled by Linux?

> I tend to focus on the first part (we don't know anything besides it is
> working) while you focus on the second part (there could be a potential
> problem). Having a real problem at hand would be great, then we would
> know what exactly we actually have to fix. But read below.

If we end up doing a hypercall per THP, maybe we could at least
not block with interrupts disabled? Poll in guest until
hypervisor reports it's done?  That would already be an
improvement IMHO. E.g. perf within guest will point you
in the right direction and towards disabling hinting.
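
Roughly something like this, as a sketch only (notify_host() and the shared
hint_desc layout are hypothetical, not a concrete proposal):

struct hint_desc {
	u64 pfn;
	u32 order;
	u32 done;	/* set by the hypervisor once the range is processed */
};

static void hint_huge_page_and_poll(struct hint_desc *desc)
{
	WRITE_ONCE(desc->done, 0);
	notify_host(desc);			/* hypothetical kick/doorbell */

	/* interrupts stay enabled, so the wait is visible and attributable */
	while (!READ_ONCE(desc->done))
		cpu_relax();
}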


> > 
> >> Via virtio, I guess it is waiting for a response to a request, right?
> > 
> > For the buffer to be used, yes. And it could mean putting some pages
> > aside until hypervisor is done with them. Then you don't need timers or
> > tricks like this, you can get an interrupt and start using the memory.
> 
> I am very open to such an approach as long as we can make it work and it
> is not too complicated. (-> simple)
> 
> This would mean for example
> 
> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> 256/512.
> 
> 2. Once the buffer is full, do crazy "take pages out of the balloon
> action" and report them to the hypervisor via virtio. Let the VCPU
> continue. This will require some memory to store the request. Small
> hiccup for the VCPU to kick off the reporting to the hypervisor.
> 
> 3. On interrupt/response, go over the response and put the pages back to
> the buddy.
> 
> (assuming that reporting a bulk of frees is better than reporting every
> single free obviously)
> 
> This could allow nice things like "when OOM gets triggered, see if pages
> are currently being reported and wait until they have been put back to
> the buddy, return "new pages available", so in a real "low on memory"
> scenario, no OOM killer would get involved. This could address the issue
> Wei had with reporting when low on memory.
> 
> Is that something you have in mind?

Yes that seems more future proof I think.

> I assume we would have to allocate
> memory when crafting the new requests. This is the only reason I tend to
> prefer a synchronous interface for now. But if allocation is not a
> problem, great.

There are two main ways to avoid allocation:
1. do not add extra data on top of each chunk passed
2. add extra data but pre-allocate buffers for it

> -- 
> 
> Thanks,
> 
> David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
> So I'm fine with a simple implementation but the interface needs to
> allow the hypervisor to process hints in parallel while guest is
> running.  We can then fix any issues on hypervisor without breaking
> guests.

 Yes, I am fine with defining an interface that theoretically lets us
 change the implementation in the guest later.
 I consider this even a
 prerequisite. IMHO the interface shouldn't be different, it will be
 exactly the same.

 It is just "who" calls the batch freeing and waits for it. And as I
 outlined here, doing it without additional threads at least avoids us
 for now having to think about dynamic data structures and that we can
 sometimes not report "because the thread is still busy reporting or
 wasn't scheduled yet".
>>>
>>> Sorry I wasn't clear. I think we need ability to change the
>>> implementation in the *host* later. IOW don't rely on
>>> host being synchronous.
>>>
>>>
>> I actually misread it :). In any case, there has to be a mechanism to
>> synchronize.
>>
>> If we are going via a bare hypercall (like s390x, like what Alexander
>> proposes), it is going to be a synchronous interface either way. Just a
>> bare hypercall, there will not really be any blocking on the guest side.
> 
> It bothers me that we are now tied to interface being synchronous. We
> won't be able to fix it if there's an issue as that would break guests.

I assume with "fix it" you mean "fix kfree taking longer on every X call"?

Yes, as I initially wrote, this mimics s390x. That might be good (we
know it has been working for years) and bad (we are inheriting the same
problem class, if it exists). And being synchronous is part of the
approach for now.

I tend to focus on the first part (we don't know anything besides it is
working) while you focus on the second part (there could be a potential
problem). Having a real problem at hand would be great, then we would
know what exactly we actually have to fix. But read below.

> 
>> Via virtio, I guess it is waiting for a response to a request, right?
> 
> For the buffer to be used, yes. And it could mean putting some pages
> aside until hypervisor is done with them. Then you don't need timers or
> tricks like this, you can get an interrupt and start using the memory.

I am very open to such an approach as long as we can make it work and it
is not too complicated. (-> simple)

This would mean for example

1. Collect entries to be reported per VCPU in a buffer. Say magic number
256/512.

2. Once the buffer is full, do crazy "take pages out of the balloon
action" and report them to the hypervisor via virtio. Let the VCPU
continue. This will require some memory to store the request. Small
hiccup for the VCPU to kick off the reporting to the hypervisor.

3. On interrupt/response, go over the response and put the pages back to
the buddy.

(assuming that reporting a bulk of frees is better than reporting every
single free obviously)

This could allow nice things like "when OOM gets triggered, see if pages
are currently being reported and wait until they have been put back to
the buddy, return "new pages available", so in a real "low on memory"
scenario, no OOM killer would get involved. This could address the issue
Wei had with reporting when low on memory.

Is that something you have in mind? I assume we would have to allocate
memory when crafting the new requests. This is the only reason I tend to
prefer a synchronous interface for now. But if allocation is not a
problem, great.
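
To make 2. and 3. a bit more concrete, a very rough sketch with virtio
(struct hint_req and the putback/free helpers are hypothetical placeholders;
the request memory mentioned above is exactly what would have to be
allocated):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

struct hint_req {
	u32 nr;
	u64 pfns[256];			/* the "magic number" buffer from 1. */
};

/* 2. buffer is full: post the request and let the VCPU continue */
static void post_hints(struct virtqueue *vq, struct hint_req *req)
{
	struct scatterlist sg;

	/* req->pfns[] only contains pages already taken out of the buddy */
	sg_init_one(&sg, req, sizeof(*req));
	if (!virtqueue_add_outbuf(vq, &sg, 1, req, GFP_ATOMIC))
		virtqueue_kick(vq);	/* small hiccup only, no waiting here */
}

/* 3. host acked: give the pages back to the buddy */
static void hint_vq_cb(struct virtqueue *vq)
{
	struct hint_req *req;
	unsigned int len, i;

	while ((req = virtqueue_get_buf(vq, &len))) {
		for (i = 0; i < req->nr; i++)
			putback_isolated_page(pfn_to_page(req->pfns[i]));	/* hypothetical */
		free_hint_req(req);	/* hypothetical: the request memory */
	}
}

Nothing blocks between the kick and the callback, which is the whole point.
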

-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Michael S. Tsirkin
On Mon, Feb 18, 2019 at 08:35:36PM +0100, David Hildenbrand wrote:
> On 18.02.19 20:16, Michael S. Tsirkin wrote:
> > On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
> >>>
> >
> > But really what business has something that is supposedly
> > an optimization blocking a VCPU? We are just freeing up
> > lots of memory why is it a good idea to slow that
> > process down?
> 
>  I first want to know that it is a problem before we declare it a
>  problem. I provided an example (s390x) where it does not seem to be a
>  problem. One hypercall ~every 512 frees. As simple as it can get.
> 
>  Not trying to deny that it could be a problem on x86, but then I assume
>  it is only a problem in specific setups.
> >>>
> >>> But which setups? How are we going to identify them?
> >>
> >> I guess it is simple (I should be careful with this word ;) ): As long as
> >> you don't isolate + pin your CPUs in the hypervisor, you can expect any
> >> kind of sudden hiccups. We're in a virtualized world. Real time is one
> >> example.
> >>
> >> Using kernel threads like Nitesh does right now? It can be scheduled
> >> anytime by the hypervisor on the exact same cpu. Unless you isolate +
> >> pin in the hypervisor. So the same problem applies.
> > 
> > Right but we know how to handle this. Many deployments already use tools
> > to detect host threads kicking VCPUs out.
> > Getting VCPU blocked by a kfree call would be something new.
> > 
> 
> Yes, and for s390x we already have some kfree's taking longer than
> others. We have to identify when it is not okay.

Right even if the problem exists elsewhere this does not make it go away
or ensure that someone will work to address it :)


> > 
> >>> So I'm fine with a simple implementation but the interface needs to
> >>> allow the hypervisor to process hints in parallel while guest is
> >>> running.  We can then fix any issues on hypervisor without breaking
> >>> guests.
> >>
> >> Yes, I am fine with defining an interface that theoretically lets us
> >> change the implementation in the guest later.
> >> I consider this even a
> >> prerequisite. IMHO the interface shouldn't be different, it will be
> >> exactly the same.
> >>
> >> It is just "who" calls the batch freeing and waits for it. And as I
> >> outlined here, doing it without additional threads at least avoids us
> >> for now having to think about dynamic data structures and that we can
> >> sometimes not report "because the thread is still busy reporting or
> >> wasn't scheduled yet".
> > 
> > Sorry I wasn't clear. I think we need ability to change the
> > implementation in the *host* later. IOW don't rely on
> > host being synchronous.
> > 
> > 
> I actually misread it :). In any case, there has to be a mechanism to
> synchronize.
> 
> If we are going via a bare hypercall (like s390x, like what Alexander
> proposes), it is going to be a synchronous interface either way. Just a
> bare hypercall, there will not really be any blocking on the guest side.

It bothers me that we are now tied to interface being synchronous. We
won't be able to fix it if there's an issue as that would break guests.

> Via virtio, I guess it is waiting for a response to a request, right?

For the buffer to be used, yes. And it could mean putting some pages
aside until hypervisor is done with them. Then you don't need timers or
tricks like this, you can get an interrupt and start using the memory.


> -- 
> 
> Thanks,
> 
> David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
On 18.02.19 20:16, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
>>>
>
> But really what business has something that is supposedly
> an optimization blocking a VCPU? We are just freeing up
> lots of memory why is it a good idea to slow that
> process down?

 I first want to know that it is a problem before we declare it a
 problem. I provided an example (s390x) where it does not seem to be a
 problem. One hypercall ~every 512 frees. As simple as it can get.

 Not trying to deny that it could be a problem on x86, but then I assume
 it is only a problem in specific setups.
>>>
>>> But which setups? How are we going to identify them?
>>
>> I guess it is simple (I should be careful with this word ;) ): As long as
>> you don't isolate + pin your CPUs in the hypervisor, you can expect any
>> kind of sudden hiccups. We're in a virtualized world. Real time is one
>> example.
>>
>> Using kernel threads like Nitesh does right now? It can be scheduled
>> anytime by the hypervisor on the exact same cpu. Unless you isolate +
>> pin in the hypervisor. So the same problem applies.
> 
> Right but we know how to handle this. Many deployments already use tools
> to detect host threads kicking VCPUs out.
> Getting VCPU blocked by a kfree call would be something new.
> 

Yes, and for s390x we already have some kfree's taking longer than
others. We have to identify when it is not okay.

> 
>>> So I'm fine with a simple implementation but the interface needs to
>>> allow the hypervisor to process hints in parallel while guest is
>>> running.  We can then fix any issues on hypervisor without breaking
>>> guests.
>>
>> Yes, I am fine with defining an interface that theoretically lets us
>> change the implementation in the guest later.
>> I consider this even a
>> prerequisite. IMHO the interface shouldn't be different, it will be
>> exactly the same.
>>
>> It is just "who" calls the batch freeing and waits for it. And as I
>> outlined here, doing it without additional threads at least avoids us
>> for now having to think about dynamic data structures and that we can
>> sometimes not report "because the thread is still busy reporting or
>> wasn't scheduled yet".
> 
> Sorry I wasn't clear. I think we need ability to change the
> implementation in the *host* later. IOW don't rely on
> host being synchronous.
> 
> 
I actually misread it :). In any case, there has to be a mechanism to
synchronize.

If we are going via a bare hypercall (like s390x, like what Alexander
proposes), it is going to be a synchronous interface either way. Just a
bare hypercall, there will not really be any blocking on the guest side.

Via virtio, I guess it is waiting for a response to a request, right?

-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Michael S. Tsirkin
On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
> > 
> >>>
> >>> But really what business has something that is supposedly
> >>> an optimization blocking a VCPU? We are just freeing up
> >>> lots of memory why is it a good idea to slow that
> >>> process down?
> >>
> >> I first want to know that it is a problem before we declare it a
> >> problem. I provided an example (s390x) where it does not seem to be a
> >> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>
> >> Not trying to deny that it could be a problem on x86, but then I assume
> >> it is only a problem in specific setups.
> > 
> > But which setups? How are we going to identify them?
> 
> I guess it is simple (I should be careful with this word ;) ): As long as
> you don't isolate + pin your CPUs in the hypervisor, you can expect any
> kind of sudden hiccups. We're in a virtualized world. Real time is one
> example.
> 
> Using kernel threads like Nitesh does right now? It can be scheduled
> anytime by the hypervisor on the exact same cpu. Unless you isolate +
> pin in the hypervisor. So the same problem applies.

Right but we know how to handle this. Many deployments already use tools
to detect host threads kicking VCPUs out.
Getting VCPU blocked by a kfree call would be something new.


...

> > So I'm fine with a simple implementation but the interface needs to
> > allow the hypervisor to process hints in parallel while guest is
> > running.  We can then fix any issues on hypervisor without breaking
> > guests.
> 
> Yes, I am fine with defining an interface that theoretically lets us
> change the implementation in the guest later.
> I consider this even a
> prerequisite. IMHO the interface shouldn't be different, it will be
> exactly the same.
> 
> It is just "who" calls the batch freeing and waits for it. And as I
> outlined here, doing it without additional threads at least avoids us
> for now having to think about dynamic data structures and that we can
> sometimes not report "because the thread is still busy reporting or
> wasn't scheduled yet".

Sorry I wasn't clear. I think we need ability to change the
implementation in the *host* later. IOW don't rely on
host being synchronous.


-- 
MST


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
On 18.02.19 18:54, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 05:59:06PM +0100, David Hildenbrand wrote:
>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
 It would be worth a try. My feeling is that a synchronous report after
 e.g. 512 frees should be acceptable, as it seems to be acceptable on
 s390x. (basically always enabled, nobody complains).
>>>
>>> What slips under the radar on an arch like s390 might
>>> raise issues for a popular arch like x86. My fear would be
>>> if it's only a problem e.g. for realtime. Then you get
>>> a condition that's very hard to trigger and affects
>>> worst case latencies.
>>
>> Realtime should never use free page hinting.
> 
> OK maybe document this in commit log. RT project has
> enough work as it is without need to untangle
> complex dependencies with other features.

We most certainly should!!

> 
>> Just like it should never
>> use ballooning.
> 
> Well, it's an aside, but why not ballooning? As long as the hypervisor does not
> touch the balloon,
> and you don't touch the (weird, not really documented properly)
> deflate on oom, you are fine.
> Real time is violated when you reconfigure balloon,
> but  after you are done guest is real time again.
> And management certainly knows that it did something
> with the balloon at the exact same time there was a latency spike.

Fair enough, this is a potential use case. But it goes hand in hand with
pinning/unpinning pages. So yes, while this would be possible - modify the
balloon in a "no real time period" - I doubt this is a real-life scenario.
As always, I like to be taught differently :)

Similar to "start reporting on !RT activity" and "stop reporting on RT
activity"

> 
> 
> I think this might not work well right now, but generally
> I think it should be fine. No?
> 
> 
>> Just like it should pin all pages in the hypervisor.
> 
> BTW all this is absolutely interesting to fix.
> But I agree wrt hinting being kind of like pinning.

Yes, this is all interesting stuff :)

> 
> 
>>>
>>> But really what business has something that is supposedly
>>> an optimization blocking a VCPU? We are just freeing up
>>> lots of memory why is it a good idea to slow that
>>> process down?
>>
>> I first want to know that it is a problem before we declare it a
>> problem. I provided an example (s390x) where it does not seem to be a
>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>
>> Not trying to deny that it could be a problem on x86, but then I assume
>> it is only a problem in specific setups.
> 
> But which setups? How are we going to identify them?

I guess it is simple (I should be careful with this word ;) ): As long as
you don't isolate + pin your CPUs in the hypervisor, you can expect any
kind of sudden hiccups. We're in a virtualized world. Real time is one
example.

Using kernel threads like Nitesh does right now? It can be scheduled
anytime by the hypervisor on the exact same cpu. Unless you isolate +
pin in the hypervisor. So the same problem applies.

> 
>> I would much rather prefer a simple solution that can eventually be
>> disabled in selected setup than a complicated solution that tries to fit
>> all possible setups.
> 
> Well I am not sure just disabling it is reasonable.  E.g. Alex shows
> drastic boot time speedups.  You won't be able to come to people later
> and say oh you need to disable this feature yes you will stop getting
> packet loss once in a while but you also won't be able to boot your VMs
> quickly enough.

The guest is always free to disable once up. Yes, these are nice
details, but I consider these improvements we can work on later.

> 
> So I'm fine with a simple implementation but the interface needs to
> allow the hypervisor to process hints in parallel while guest is
> running.  We can then fix any issues on hypervisor without breaking
> guests.

Yes, I am fine with defining an interface that theoretically lets us
change the implementation in the guest later. I consider this even a
prerequisite. IMHO the interface shouldn't be different, it will be
exactly the same.

It is just "who" calls the batch freeing and waits for it. And as I
outlined here, doing it without additional threads at least avoids us
for now having to think about dynamic data structures and that we can
sometimes not report "because the thread is still busy reporting or
wasn't scheduled yet".

> 
> 
>> Realtime is one of the examples where such stuff is
>> to be disabled either way.
> 
> OK so we have identified realtime. Nice even though it wasn't documented
> anywhere. Are there other workloads? What are they?

As stated above, I think these environments are easy to spot. As long as
you don't isolate and pin, surprises can happen anytime. Can you think
of others?

(this stuff really has to be documented)

> 
> 
>> Optimization of space comes with a price (here: execution time).
> 
> I am not sure I agree. If hinting patches just 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Michael S. Tsirkin
On Mon, Feb 18, 2019 at 09:31:13AM -0800, Alexander Duyck wrote:
> > Optimization of space comes with a price (here: execution time).
> 
> One thing to keep in mind though is that if you are already having to
> pull pages in and out of swap on the host in order to be able to provide
> enough memory for the guests, the free page hinting should be a
> significant win in terms of performance.

Absolutely and I think that's the point of the hinting right?
To cut out swap/IO on the host.

-- 
MST


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Michael S. Tsirkin
On Mon, Feb 18, 2019 at 05:59:06PM +0100, David Hildenbrand wrote:
> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> > On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >> It would be worth a try. My feeling is that a synchronous report after
> >> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >> s390x. (basically always enabled, nobody complains).
> > 
> > What slips under the radar on an arch like s390 might
> > raise issues for a popular arch like x86. My fear would be
> > if it's only a problem e.g. for realtime. Then you get
> > a condition that's very hard to trigger and affects
> > worst case latencies.
> 
> Realtime should never use free page hinting.

OK maybe document this in commit log. RT project has
enough work as it is without need to untangle
complex dependencies with other features.

> Just like it should never
> use ballooning.

Well, it's an aside, but why not ballooning? As long as the hypervisor does not touch
the balloon,
and you don't touch the (weird, not really documented properly)
deflate on oom, you are fine.
Real time is violated when you reconfigure balloon,
but  after you are done guest is real time again.
And management certainly knows that it did something
with the balloon at the exact same time there was a latency spike.


I think this might not work well right now, but generally
I think it should be fine. No?


> Just like it should pin all pages in the hypervisor.

BTW all this is absolutely interesting to fix.
But I agree wrt hinting being kind of like pinning.


> > 
> > But really what business has something that is supposedly
> > an optimization blocking a VCPU? We are just freeing up
> > lots of memory why is it a good idea to slow that
> > process down?
> 
> I first want to know that it is a problem before we declare it a
> problem. I provided an example (s390x) where it does not seem to be a
> problem. One hypercall ~every 512 frees. As simple as it can get.
> 
> Not trying to deny that it could be a problem on x86, but then I assume
> it is only a problem in specific setups.

But which setups? How are we going to identify them?

> I would much rather prefer a simple solution that can eventually be
> disabled in selected setup than a complicated solution that tries to fit
> all possible setups.

Well I am not sure just disabling it is reasonable.  E.g. Alex shows
drastic boot time speedups.  You won't be able to come to people later
and say oh you need to disable this feature yes you will stop getting
packet loss once in a while but you also won't be able to boot your VMs
quickly enough.

So I'm fine with a simple implementation but the interface needs to
allow the hypervisor to process hints in parallel while guest is
running.  We can then fix any issues on hypervisor without breaking
guests.


> Realtime is one of the examples where such stuff is
> to be disabled either way.

OK so we have identified realtime. Nice even though it wasn't documented
anywhere. Are there other workloads? What are they?


> Optimization of space comes with a price (here: execution time).

I am not sure I agree. If hinting patches just slowed everyone down they
would be useless. Note how Alex show-cased this by demonstrating
faster boot times.

Unlike regular ballooning, this doesn't do much to optimize space. There
are no promises so host must still have enough swap to fit guest memory
anyway.

All free page hinting does is reduce IO on the hypervisor.

So it's a tradeoff.

> -- 
> 
> Thanks,
> 
> David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
On 18.02.19 18:31, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand  wrote:
>>
>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
 It would be worth a try. My feeling is that a synchronous report after
 e.g. 512 frees should be acceptable, as it seems to be acceptable on
 s390x. (basically always enabled, nobody complains).
>>>
>>> What slips under the radar on an arch like s390 might
>>> raise issues for a popular arch like x86. My fear would be
>>> if it's only a problem e.g. for realtime. Then you get
>>> a condition that's very hard to trigger and affects
>>> worst case latencies.
>>
>> Realtime should never use free page hinting. Just like it should never
>> use ballooning. Just like it should pin all pages in the hypervisor.
>>
>>>
>>> But really what business has something that is supposedly
>>> an optimization blocking a VCPU? We are just freeing up
>>> lots of memory why is it a good idea to slow that
>>> process down?
>>
>> I first want to know that it is a problem before we declare it a
>> problem. I provided an example (s390x) where it does not seem to be a
>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>
>> Not trying to deny that it could be a problem on x86, but then I assume
>> it is only a problem in specific setups.
>>
>> I would much rather prefer a simple solution that can eventually be
>> disabled in selected setup than a complicated solution that tries to fit
>> all possible setups. Realtime is one of the examples where such stuff is
>> to be disabled either way.
>>
>> Optimization of space comes with a price (here: execution time).
> 
> One thing to keep in mind though is that if you are already having to
> pull pages in and out of swap on the host in order to be able to provide
> enough memory for the guests, the free page hinting should be a
> significant win in terms of performance.

Indeed. And also we are in a virtualized environment already, we can
have any kind of sudden hiccups. (again, realtime has special
requirements on the setup)

Side note: I like your approach because it is simple. I don't like your
approach because it cannot deal with fragmented memory. And that can
happen easily.

The idea I described here can similarly be an extension of your
approach, merging in the "batched reporting" Nitesh proposed, so we can
report on something < MAX_ORDER, similar to s390x. In the end it boils
down to reporting via hypercall vs. reporting via virtio. The main point
is that it is synchronous and batched. (and that we properly take care
of the race between host freeing and guest allocation)

> 
> So far with my patch set that hints at the PMD level w/ THP enabled I
> am not really seeing that much overhead for the hypercalls. The bigger
> piece that is eating up CPU time is all the page faults and page
> zeroing that is going on as we are cycling the memory in and out of
> the guest. Some of that could probably be resolved by using MADV_FREE,
> but if we are under actual memory pressure I suspect it would behave
> similar to MADV_DONTNEED.
> 

MADV_FREE is certainly the better thing to do for hinting in my opinion.
It should result in even less overhead. Thanks for the comment about the
hypercall overhead.


-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Alexander Duyck
On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand  wrote:
>
> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> > On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >> It would be worth a try. My feeling is that a synchronous report after
> >> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >> s390x. (basically always enabled, nobody complains).
> >
> > What slips under the radar on an arch like s390 might
> > raise issues for a popular arch like x86. My fear would be
> > if it's only a problem e.g. for realtime. Then you get
> > a condition that's very hard to trigger and affects
> > worst case latencies.
>
> Realtime should never use free page hinting. Just like it should never
> use ballooning. Just like it should pin all pages in the hypervisor.
>
> >
> > But really what business has something that is supposedly
> > an optimization blocking a VCPU? We are just freeing up
> > lots of memory why is it a good idea to slow that
> > process down?
>
> I first want to know that it is a problem before we declare it a
> problem. I provided an example (s390x) where it does not seem to be a
> problem. One hypercall ~every 512 frees. As simple as it can get.
>
> Not trying to deny that it could be a problem on x86, but then I assume
> it is only a problem in specific setups.
>
> I would much rather prefer a simple solution that can eventually be
> disabled in selected setup than a complicated solution that tries to fit
> all possible setups. Realtime is one of the examples where such stuff is
> to be disabled either way.
>
> Optimization of space comes with a price (here: execution time).

One thing to keep in mind though is that if you are already having to
pull pages in and out of swap on the host in order to be able to provide
enough memory for the guests, the free page hinting should be a
significant win in terms of performance.

So far with my patch set that hints at the PMD level w/ THP enabled I
am not really seeing that much overhead for the hypercalls. The bigger
piece that is eating up CPU time is all the page faults and page
zeroing that is going on as we are cycling the memory in and out of
the guest. Some of that could probably be resolved by using MADV_FREE,
but if we are under actual memory pressure I suspect it would behave
similar to MADV_DONTNEED.
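
For reference, the host side of a hint essentially boils down to a madvise()
on the backing range; a minimal sketch (not the actual QEMU/KVM code) of
preferring MADV_FREE with a MADV_DONTNEED fallback would be:

#include <errno.h>
#include <sys/mman.h>

static int discard_hinted_range(void *hva, size_t len)
{
	int ret = -1;

#ifdef MADV_FREE
	/* lazy: pages are only reclaimed under memory pressure */
	ret = madvise(hva, len, MADV_FREE);
#endif
	if (ret)
		/* eager: drops the pages now, refault costs a zero-fill */
		ret = madvise(hva, len, MADV_DONTNEED);

	return ret ? -errno : 0;
}

Under real memory pressure both variants end up paying the page fault and
zeroing cost on re-access, which is roughly the behaviour suspected above.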


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
On 18.02.19 17:49, Michael S. Tsirkin wrote:
> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>> It would be worth a try. My feeling is that a synchronous report after
>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>> s390x. (basically always enabled, nobody complains).
> 
> What slips under the radar on an arch like s390 might
> raise issues for a popular arch like x86. My fear would be
> if it's only a problem e.g. for realtime. Then you get
> a condition that's very hard to trigger and affects
> worst case latencies.

Realtime should never use free page hinting. Just like it should never
use ballooning. Just like it should pin all pages in the hypervisor.

> 
> But really what business has something that is supposedly
> an optimization blocking a VCPU? We are just freeing up
> lots of memory why is it a good idea to slow that
> process down?

I first want to know that it is a problem before we declare it a
problem. I provided an example (s390x) where it does not seem to be a
problem. One hypercall ~every 512 frees. As simple as it can get.

Not trying to deny that it could be a problem on x86, but then I assume
it is only a problem in specific setups.

I would much rather prefer a simple solution that can eventually be
disabled in selected setup than a complicated solution that tries to fit
all possible setups. Realtime is one of the examples where such stuff is
to be disabled either way.

Optimization of space comes with a price (here: execution time).

-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Michael S. Tsirkin
On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x. (basically always enabled, nobody complains).

What slips under the radar on an arch like s390 might
raise issues for a popular arch like x86. My fear would be
if it's only a problem e.g. for realtime. Then you get
a condition that's very hard to trigger and affects
worst case latencies.

But really what business has something that is supposedly
an optimization blocking a VCPU? We are just freeing up
lots of memory why is it a good idea to slow that
process down?

-- 
MST


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread David Hildenbrand
On 18.02.19 16:50, Nitesh Narayan Lal wrote:
> 
> On 2/16/19 4:40 AM, David Hildenbrand wrote:
>> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>>
>> Hi Nitesh,
>>
>> I thought again about how s390x handles free page hinting. As that seems
>> to work just fine, I guess sticking to a similar model makes sense.
>>
>>
>> I already explained in this thread how it works on s390x, a short summary:
>>
>> 1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
>> am not wrong, it contains 512 entries, so is exactly 1 page big. This
>> buffer is stored in the hypervisor and is on page granularity.
>>
>> 2. This page buffer is managed via the ESSA instruction. In addition, to
>> synchronize with the guest ("page reused when freeing in the
>> hypervisor"), special bits in the host->guest page table can be
>> set/locked via the ESSA instruction by the guest and similarly accessed
>> by the hypervisor.
>>
>> 3. Once the buffer is full, the guest does a synchronous hypercall,
>> going over all 512 entries and zapping them (== similar to MADV_DONTNEED)
>>
>>
>> To mimic that, we
>>
>> 1. Have a static buffer per VCPU in the guest with 512 entries. You
>> basically have that already.
>>
>> 2. On every free, add the page _or_ the page after merging by the buddy
>> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
>> than s390x). You basically have that already.
>>
>> 3. If the buffer is full, try to isolate all pages and do a synchronous
>> report to the hypervisor. You have the first part already. The second
>> part would require a change (don't use a separate/global thread to do
>> the hinting, just do it synchronously).
>>
>> 4. Once hinting is done, put back all isolated pages to the buddy. You
>> basically have that already.
>>
>>
>> For 3. we can try what you have right now, using virtio. If we detect
>> that's a problem, we can do it similar to what Alexander proposes and
>> just do a bare hypercall. It's just a different way of carrying out the
>> same task.
>>
>>
>> This approach
>> 1. Mimics what s390x does, besides supporting different granularities.
>> To synchronize guest->host we simply take the pages off the buddy.
>>
>> 2. Is basically what Alexander does, however his design limitation is
>> that doing any hinting on smaller granularities will not work because
>> there will be too many synchronous hints. Bad on fragmented guests.
>>
>> 3. Does not require any dynamic data structures in the guest.
>>
>> 4. Does not block allocation paths.
>>
>> 5. Blocks on e.g. every 512th free. It seems to work on s390x, why
>> shouldn't it work for us? We have to measure.
>> 
>> 6. We are free to decide which granularity we report.
>> 
>> 7. Potentially works even if the guest memory is fragmented (few
>> MAX_ORDER - 1 pages).
>>
>> It would be worth a try. My feeling is that a synchronous report after
>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>> s390x. (basically always enabled, nobody complains).
> 
> The reason I like the current approach of reporting via separate kernel
> thread is that it doesn't block any regular allocation/freeing code path
> in any way.

Well, that is partially true. The work has to be done "somewhere", so
once you kick a separate kernel thread, it can easily be scheduled on
the very same VCPU in the very near future. So depending on the user,
the "hickup" is similarly visible.

Having separate kernel threads seems to result in other questions not
easy to answer (do we need dynamic data structures, how to size these
data structures, how many threads do we want (e.g. a big number of vcpus))
that seem to be avoidable by keeping it simple and not having separate
threads. Initially I also thought that separate threads were the natural
thing to do, but now I have the feeling that it tends to overcomplicate
the problem. (and I don't want to repeat myself, but on s390x it seems
to work this way just fine, if we want to mimic that). Especially
without us knowing if "don't do a hypercall every X free calls" is
really a problem.

>>
>> We would have to play with how to enable/disable reporting and when to
>> not report because it's not worth it in the guest (e.g. low on memory).
>>
>>
>> Do you think something like this would be easy to change/implement and
>> measure?
> 
> I can do that as I figure out a real world guest workload using which
> the two approaches can be compared.




-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-18 Thread Nitesh Narayan Lal

On 2/16/19 4:40 AM, David Hildenbrand wrote:
> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>
> Hi Nitesh,
>
> I thought again about how s390x handles free page hinting. As that seems
> to work just fine, I guess sticking to a similar model makes sense.
>
>
> I already explained in this thread how it works on s390x, a short summary:
>
> 1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
> am not wrong, it contains 512 entries, so is exactly 1 page big. This
> buffer is stored in the hypervisor and is on page granularity.
>
> 2. This page buffer is managed via the ESSA instruction. In addition, to
> synchronize with the guest ("page reused when freeing in the
> hypervisor"), special bits in the host->guest page table can be
> set/locked via the ESSA instruction by the guest and similarly accessed
> by the hypervisor.
>
> 3. Once the buffer is full, the guest does a synchronous hypercall,
> going over all 512 entries and zapping them (== similar to MADV_DONTNEED)
>
>
> To mimic that, we
>
> 1. Have a static buffer per VCPU in the guest with 512 entries. You
> basically have that already.
>
> 2. On every free, add the page _or_ the page after merging by the buddy
> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
> than s390x). You basically have that already.
>
> 3. If the buffer is full, try to isolate all pages and do a synchronous
> report to the hypervisor. You have the first part already. The second
> part would require a change (don't use a separate/global thread to do
> the hinting, just do it synchronously).
>
> 4. Once hinting is done, put back all isolated pages to the buddy. You
> basically have that already.
>
>
> For 3. we can try what you have right now, using virtio. If we detect
> that's a problem, we can do it similar to what Alexander proposes and
> just do a bare hypercall. It's just a different way of carrying out the
> same task.
>
>
> This approach
> 1. Mimics what s390x does, besides supporting different granularities.
> To synchronize guest->host we simply take the pages off the buddy.
>
> 2. Is basically what Alexander does, however his design limitation is
> that doing any hinting on smaller granularities will not work because
> there will be too many synchronous hints. Bad on fragmented guests.
>
> 3. Does not require any dynamic data structures in the guest.
>
> 4. Does not block allocation paths.
>
> 5. Blocks on e.g. every 512th free. It seems to work on s390x, why
> shouldn't it work for us? We have to measure.
>
> 6. We are free to decide which granularity we report.
>
> 7. Potentially works even if the guest memory is fragmented (few
> MAX_ORDER - 1 pages).
>
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x. (basically always enabled, nobody complains).

The reason I like the current approach of reporting via separate kernel
thread is that it doesn't block any regular allocation/freeing code path
in any way.
>
> We would have to play with how to enable/disable reporting and when to
> not report because it's not worth it in the guest (e.g. low on memory).
>
>
> Do you think something like this would be easy to change/implement and
> measure?

I can do that as I figure out a real world guest workload using which
the two approaches can be compared.

> Thanks!
>
>> The following patch-set proposes an efficient mechanism for handing freed 
>> memory between the guest and the host. It enables the guests with no page 
>> cache to rapidly free and reclaims memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and 
>> single NUMA node with 15GB memory, we were able to successfully launch 
>> atleast 5 guests 
>> when page hinting was enabled and 3 without it. (Detailed explanation of the 
>> test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and 
>> scan the pages freed by the guest has been changed. The new approach is 
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this 
>> functionality. It maintains a per CPU array which is used to store the pages 
>> freed by the guest. The maximum number of entries which it can hold is 
>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it 
>> is scanned and only the pages which are available in the buddy are stored. 
>> This process continues until the array is filled with pages which are part 
>> of the buddy free list. After which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation 
>> and if the page is not reallocated and present in the buddy, the kernel 
>> thread attempts to isolate it from the buddy. If it is successfully 
>> isolated, the page is added to another per-cpu array. Once the entire 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-17 Thread Wei Wang

On 02/18/2019 10:36 AM, Wei Wang wrote:

On 02/15/2019 05:41 PM, David Hildenbrand wrote:

On 15.02.19 10:05, Wang, Wei W wrote:

On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
Yes indeed, that is the important bit. They must not be put back to the
buddy before they have been processed by the hypervisor. But as the pages
are not in the buddy, no one allocating a page will stumble over such a page
and try to allocate it. Threads trying to allocate memory will simply pick
another buddy page instead of "busy waiting" for that page to be finished
reporting.

What if a guest thread tries to allocate some pages but the buddy cannot
satisfy the request because all the pages are isolated? Would it be the same
case that the guest thread gets blocked waiting for all the isolated pages to
get madvised by the host and returned to the guest buddy, or even worse, some
guest threads get killed due to oom?

Your question targets low memory situations in the guest. I think Nitesh
already answered parts of that question somewhere and I'll let him
answer it in detail, only a short comment from my side :)

I can imagine techniques where the OOM killer can be avoided, but the
OOM handler will eventually kick in and handle it.

In general your question is valid and we will have to think about a way
to avoid that from happening. However, in contrast to your approach
blocking on potentially every page that is being hinted, in Nitesh's
approach this would only happen when the guest is really low on memory.
And the question is in general, if a guest wants to hint if low on
memory ("safety buffer").


I think we should forget that the guest is low on memory because

%s/should/shouldn't

this approach takes all the pages off the list, not because the guest really
uses up the free memory.

Guest allocating one page could also potentially be blocked until all the
pages (as opposed to one page) have been madvised and returned to the guest
buddy.

Best,
Wei




Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-17 Thread Wei Wang

On 02/15/2019 05:41 PM, David Hildenbrand wrote:

On 15.02.19 10:05, Wang, Wei W wrote:

On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:

Yes indeed, that is the important bit. They must not be put back to the
buddy before they have been processed by the hypervisor. But as the pages
are not in the buddy, no one allocating a page will stumble over such a page
and try to allocate it. Threads trying to allocate memory will simply pick
another buddy page instead of "busy waiting" for that page to be finished
reporting.

What if a guest thread tries to allocate some pages but the buddy cannot
satisfy the request because all the pages are isolated? Would it be the same
case that the guest thread gets blocked waiting for all the isolated pages to
get madvised by the host and returned to the guest buddy, or even worse, some
guest threads get killed due to oom?

Your question targets low memory situations in the guest. I think Nitesh
already answered parts of that question somewhere and I'll let him
answer it in detail, only a short comment from my side :)

I can imagine techniques where the OOM killer can be avoided, but the
OOM handler will eventually kick in and handle it.

In general your question is valid and we will have to think about a way
to avoid that from happening. However, in contrast to your approach
blocking on potentially every page that is being hinted, in Nitesh's
approach this would only happen when the guest is really low on memory.
And the question is in general, if a guest wants to hint if low on
memory ("safety buffer").


I think we should forget that the guest is low on memory because
this approach takes all the pages off the list, not because the guest really
uses up the free memory.

Guest allocating one page could also potentially be blocked until all the
pages (as opposed to one page) have been madvised and returned to the guest
buddy.

Best,
Wei


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-16 Thread David Hildenbrand
On 04.02.19 21:18, Nitesh Narayan Lal wrote:

Hi Nitesh,

I thought again about how s390x handles free page hinting. As that seems
to work just fine, I guess sticking to a similar model makes sense.


I already explained in this thread how it works on s390x, a short summary:

1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
am not wrong, it contains 512 entries, so is exactly 1 page big. This
buffer is stored in the hypervisor and is on page granularity.

2. This page buffer is managed via the ESSA instruction. In addition, to
synchronize with the guest ("page reused when freeing in the
hypervisor"), special bits in the host->guest page table can be
set/locked via the ESSA instruction by the guest and similarly accessed
by the hypervisor.

3. Once the buffer is full, the guest does a synchronous hypercall,
going over all 512 entries and zapping them (== similar to MADV_DONTNEED)


To mimic that, we

1. Have a static buffer per VCPU in the guest with 512 entries. You
basically have that already.

2. On every free, add the page _or_ the page after merging by the buddy
(e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
than s390x). You basically have that already.

3. If the buffer is full, try to isolate all pages and do a synchronous
report to the hypervisor. You have the first part already. The second
part would require a change (don't use a separate/global thread to do
the hinting, just do it synchronously).

4. Once hinting is done, put back all isolated pages to the buddy. You
basically have that already.


For 3. we can try what you have right now, using virtio. If we detect
that's a problem, we can do it similar to what Alexander proposes and
just do a bare hypercall. It's just a different way of carrying out the
same task.


This approach
1. Mimics what s390x does, besides supporting different granularities.
To synchronize guest->host we simply take the pages off the buddy.

2. Is basically what Alexander does, however his design limitation is
that doing any hinting on smaller granularities will not work because
there will be too many synchronous hints. Bad on fragmented guests.

3. Does not require any dynamic data structures in the guest.

4. Does not block allocation paths.

5. Blocks on e.g. every 512th free. It seems to work on s390x, why
shouldn't it work for us? We have to measure.

6. We are free to decide which granularity we report.

7. Potentially works even if the guest memory is fragmented (few
MAX_ORDER - 1 pages).

It would be worth a try. My feeling is that a synchronous report after
e.g. 512 frees should be acceptable, as it seems to be acceptable on
s390x. (basically always enabled, nobody complains).
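
To illustrate 1.-4. above, a very rough sketch (buffer size, granularity and
the helper are placeholders; isolate_and_report_sync() stands for "isolate
what is still free, report synchronously, put the pages back"):

#define HINT_BUF_ENTRIES 512

struct hint_entry {
	unsigned long pfn;
	unsigned int order;
};

struct hint_buf {
	struct hint_entry e[HINT_BUF_ENTRIES];
	unsigned int nr;
};
static DEFINE_PER_CPU(struct hint_buf, hint_buf);

void arch_free_page(struct page *page, int order)
{
	struct hint_buf *buf = &get_cpu_var(hint_buf);

	buf->e[buf->nr].pfn = page_to_pfn(page);
	buf->e[buf->nr].order = order;
	if (++buf->nr == HINT_BUF_ENTRIES) {
		/* the freeing VCPU itself blocks here, like ESSA on s390x */
		isolate_and_report_sync(buf);	/* hypothetical */
		buf->nr = 0;
	}
	put_cpu_var(hint_buf);
}

That is roughly one report every 512 frees: no extra threads, no dynamic
data structures.
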

We would have to play with how to enable/disable reporting and when to
not report because it's not worth it in the guest (e.g. low on memory).


Do you think something like this would be easy to change/implement and
measure?

Thanks!

> The following patch-set proposes an efficient mechanism for handing freed 
> memory between the guest and the host. It enables the guests with no page 
> cache to rapidly free and reclaims memory to and from the host respectively.
> 
> Benefit:
> With this patch-series, in our test-case, executed on a single system and 
> single NUMA node with 15GB memory, we were able to successfully launch 
> atleast 5 guests 
> when page hinting was enabled and 3 without it. (Detailed explanation of the 
> test procedure is provided at the bottom).
> 
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and 
> scan the pages freed by the guest has been changed. The new approach is 
> briefly described below:
> 
> The patch-set still leverages the existing arch_free_page() to add this 
> functionality. It maintains a per CPU array which is used to store the pages 
> freed by the guest. The maximum number of entries which it can hold is 
> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it is 
> scanned and only the pages which are available in the buddy are stored. This 
> process continues until the array is filled with pages which are part of the 
> buddy free list. After which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation 
> and if the page is not reallocated and present in the buddy, the kernel 
> thread attempts to isolate it from the buddy. If it is successfully isolated, 
> the page is added to another per-cpu array. Once the entire scanning process 
> is complete, all the isolated pages are reported to the host through an 
> existing virtio-balloon driver.
> 
> Known Issues:
>   * Fixed array size: The problem with having a fixed/hardcoded array 
> size arises when the size of the guest varies. For example when the guest 
> size increases and it starts making large allocations fixed size limits this 
> solution's ability to capture all the freed pages. This will result in less 
> guest 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-15 Thread Nitesh Narayan Lal

On 2/15/19 4:05 AM, Wang, Wei W wrote:
> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>> Yes indeed, that is the important bit. They must not be put back to the
>> buddy before they have been processed by the hypervisor. But as the pages
>> are not in the buddy, no one allocating a page will stumble over such a page
>> and try to allocate it. Threads trying to allocate memory will simply pick
>> another buddy page instead of "busy waiting" for that page to be finished
>> reporting.
> What if a guest thread tries to allocate some pages but the buddy cannot
> satisfy the request because all the pages are isolated? Would it be the same
> case that the guest thread gets blocked waiting for all the isolated pages to
> get madvised by the host and returned to the guest buddy, or even worse, some
> guest threads get killed due to oom?

If you are referring to a situation where the guest is under memory
pressure, then the isolation request will fail and the memory will not be
reported/isolated. However, there can always be corner cases and we can
definitely take those into consideration later.
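
To illustrate the kind of bail-out I mean (a rough sketch only; the helper
name is made up, the real series ties this check to the isolation path):

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/*
 * Sketch: refuse to isolate a free page for hinting if the zone is
 * already tight on memory, so hinting never competes with allocations
 * while the guest itself is under pressure.
 */
static bool ok_to_isolate_for_hinting(struct zone *zone, unsigned int order)
{
	unsigned long free = zone_page_state(zone, NR_FREE_PAGES);

	/* Keep a safety buffer above the high watermark. */
	return free > high_wmark_pages(zone) + (1UL << order);
}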

>
> Best,
> Wei
-- 
Regards
Nitesh





Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-15 Thread David Hildenbrand
On 15.02.19 10:05, Wang, Wei W wrote:
> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>> Yes indeed, that is the important bit. They must not be put pack to the
>> buddy before they have been processed by the hypervisor. But as the pages
>> are not in the buddy, no one allocating a page will stumble over such a page
>> and try to allocate it. Threads trying to allocate memory will simply pick
>> another buddy page instead of "busy waiting" for that page to be finished
>> reporting.
> 
> What if a guest thread try to allocate some pages but the buddy cannot satisfy
> because all the pages are isolated? Would it be the same case that the guest 
> thread
> gets blocked by waiting all the isolated pages to get madvised by the host and
> returned to the guest buddy, or even worse, some guest threads get killed due 
> to oom?

Your question targets low memory situations in the guest. I think Nitesh
already answered parts of that question somewhere and I'll let him
answer it in detail, only a short comment from my side :)

I can imagine techniques where the OOM killer can be avoided, but the
OOM handler will eventually kick in and handle it.

In general your question is valid and we will have to think about a way
to avoid that from happening. However, in contrast to your approach,
which can block on potentially every page that is being hinted, in
Nitesh's approach this would only happen when the guest is really low on
memory. And the general question is whether a guest should hint at all
when it is low on memory ("safety buffer").

> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-15 Thread David Hildenbrand
On 15.02.19 10:15, Wang, Wei W wrote:
> On Thursday, February 14, 2019 6:01 PM, David Hildenbrand wrote:
>> And how to preload without locking?
> 
> The memory is preload per-CPU. It's usually called outside the lock.

Right, that works as long as only a fixed number of pages is needed. I
remember that working for some radix tree operations.
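
For reference, the radix tree pattern I have in mind looks roughly like this
(standard kernel API, nothing new here): preallocate the worst-case node
memory outside the lock, then insert with the preload still held.

#include <linux/radix-tree.h>
#include <linux/spinlock.h>

static RADIX_TREE(my_tree, GFP_ATOMIC);
static DEFINE_SPINLOCK(my_lock);

static int insert_item(unsigned long index, void *item)
{
	int err;

	/* Preallocate nodes while we may still sleep. */
	err = radix_tree_preload(GFP_KERNEL);
	if (err)
		return err;

	spin_lock(&my_lock);
	/* Uses the preloaded per-CPU nodes, so no allocation under the lock. */
	err = radix_tree_insert(&my_tree, index, item);
	spin_unlock(&my_lock);

	radix_tree_preload_end();
	return err;
}

Something like xb_preload() would have to provide the same
preload-outside-the-lock guarantee.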

> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb


RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-15 Thread Wang, Wei W
On Thursday, February 14, 2019 6:01 PM, David Hildenbrand wrote:
> And how to preload without locking?

The memory is preloaded per-CPU. The preload is usually called outside the lock.

Best,
Wei


RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-15 Thread Wang, Wei W
On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
> Yes indeed, that is the important bit. They must not be put pack to the
> buddy before they have been processed by the hypervisor. But as the pages
> are not in the buddy, no one allocating a page will stumble over such a page
> and try to allocate it. Threads trying to allocate memory will simply pick
> another buddy page instead of "busy waiting" for that page to be finished
> reporting.

What if a guest thread tries to allocate some pages but the buddy cannot
satisfy the request because all the pages are isolated? Would it be the same
case, where the guest thread gets blocked waiting for all the isolated pages
to be madvised by the host and returned to the guest buddy, or even worse,
where some guest threads get killed due to OOM?

Best,
Wei


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-14 Thread Nitesh Narayan Lal

On 2/14/19 3:48 AM, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
>> Once the host free the pages. All the isolated pages are returned back 
>> to the buddy. (This is implemented in hyperlist_ready())
> This actually has the same issue: the isolated pages have to wait to return 
> to the buddy
> after the host has done with madvise(DONTNEED). Otherwise, a page could be 
> used by
> a guest thread and the next moment the host takes it to other host threads.
I don't think that this will be a blocking case. Let's say there are
pages from the Normal zone which are isolated and are currently being freed
by the host; at this point, even if the Normal zone runs out of free pages,
any allocation request could be served from another zone which still has
free pages.
>
> Best,
> Wei
-- 
Regards
Nitesh





Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-14 Thread David Hildenbrand
On 14.02.19 11:00, David Hildenbrand wrote:
> On 14.02.19 10:08, Wang, Wei W wrote:
>> On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
>>> If you have to resize/alloc/coordinate who will report, you will need 
>>> locking.
>>> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).
>>
>> Yes, we need change xbitmap to support it.
>>
>> Just thought of another option, which would be better:
>> - xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
>> - xb_set/clear the bit under the zone->lock, i.e. in rmqueue and 
>> free_one_page
> 
> And how to preload without locking?
> 
>>
>> will not be concurrently called to race on the same bitmap.
>> And we don't add any new locks to generate new doubts.
>> Also, we can probably remove the arch_alloc/free_page part.
>>
>> For the first step, we could optimize VIRTIO_BALLOON_F_FREE_PAGE_HINT for 
>> the live migration optimization:
>> - just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
>>VIRTIO_BALLOON_FREE_PAGE_ORDER)
>> with get_free_page_hints()
>>
>> get_free_page_hints() was designed to clear the bit, and need 
>> put_free_page_hints() to set it later after host finishes madvise. For the 
>> live migration usage, as host doesn't free the backing host pages, so we can 
>> give get_free_page_hints a parameter option to not clear the bit for this 
>> usage. It will be simpler and faster.
>>
>> I think get_free_page_hints() to read hints via bitmaps should be much 
>> faster than that allocation function, which takes around 15us to get a 4MB 
>> block. Another big bonus is that we don't need free_pages() to return all 
>> the pages back to buddy (it's a quite expensive operation too) when 
>> migration is done.
>>
>> For the second step, we can improve ballooning, e.g. a new feature 
>> VIRTIO_BALLOON_F_ADVANCED_BALLOON to use the same get_free_page_hints() and 
>> another put_free_page_hints(), along with the virtio-balloon's report_vq and 
>> ack_vq to wait for the host's ack before making the free page ready.
>> (I think waiting for the host ack is the overhead that the guest has to 
>> suffer for enabling memory overcommitment, and even with this v8 patch 
>> series it also needs to do that. The optimization method was described 
>> yesterday)
>>
> 
> As I already said, I don't like that approach, because it has the
> fundamental issue of page allocs getting blocked. That does not mean
> that it is bad, but that I think what Nitesh has is superior in that
> sense. Of course, things like "how to enable/disable", and much more
> needs to be clarified.
> 
> If you believe in your approach, feel free to come up with a prototype.
>  Especially the "no global locking" could be tricky in my opinion :)


I want to add that your approach makes sense if we expect that the
hypervisor will ask for free memory very rarely. Then, blocking during
page alloc is most probably acceptable. Depending on the setup, this
might or might not be the case. If you have some guests that are
allocating/freeing memory continuously, you might want to get back free
pages fairly often to move them to other guests.

In case the hypervisor asks for free pages, as we are not reporting
continuously, you would have to somehow report all pages currently free
to the hypervisor, making sure via the bitmap that they cannot be allocated.

You certainly don't want to track free pages in a bitmap if the
hypervisor is not asking for free pages, otherwise you will eventually
waste a big amount of memory tracking page states nobody cares
about in an xbitmap. So you would have to use another way to initially
fill the bitmap with free pages (when the hypervisor requests it), while
making sure to avoid races with pages getting allocated just while you
are creating the bitmap.
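
Just to sketch what that initial fill could look like, with a preallocated
bitmap and made-up indexing: holding zone->lock keeps the free lists stable
while scanning, and only the head pfn of each buddy block is marked here.

#include <linux/mmzone.h>
#include <linux/list.h>
#include <linux/bitops.h>

/*
 * Sketch: seed a (preallocated, hypothetical) free-page bitmap on demand.
 * Under zone->lock a page cannot be allocated between "seen on the free
 * list" and "bit set".
 */
static void fill_free_page_bitmap(struct zone *zone, unsigned long *bitmap,
				  unsigned long base_pfn)
{
	unsigned int order, t;
	struct page *page;
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	for (order = 0; order < MAX_ORDER; order++) {
		for (t = 0; t < MIGRATE_TYPES; t++) {
			list_for_each_entry(page,
					    &zone->free_area[order].free_list[t],
					    lru)
				set_bit(page_to_pfn(page) - base_pfn, bitmap);
		}
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}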

-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-14 Thread David Hildenbrand
On 14.02.19 10:08, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
>> If you have to resize/alloc/coordinate who will report, you will need 
>> locking.
>> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).
> 
> Yes, we need change xbitmap to support it.
> 
> Just thought of another option, which would be better:
> - xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
> - xb_set/clear the bit under the zone->lock, i.e. in rmqueue and free_one_page

And how to preload without locking?

> 
> will not be concurrently called to race on the same bitmap.
> And we don't add any new locks to generate new doubts.
> Also, we can probably remove the arch_alloc/free_page part.
> 
> For the first step, we could optimize VIRTIO_BALLOON_F_FREE_PAGE_HINT for the 
> live migration optimization:
> - just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
>VIRTIO_BALLOON_FREE_PAGE_ORDER)
> with get_free_page_hints()
> 
> get_free_page_hints() was designed to clear the bit, and need 
> put_free_page_hints() to set it later after host finishes madvise. For the 
> live migration usage, as host doesn't free the backing host pages, so we can 
> give get_free_page_hints a parameter option to not clear the bit for this 
> usage. It will be simpler and faster.
> 
> I think get_free_page_hints() to read hints via bitmaps should be much faster 
> than that allocation function, which takes around 15us to get a 4MB block. 
> Another big bonus is that we don't need free_pages() to return all the pages 
> back to buddy (it's a quite expensive operation too) when migration is done.
> 
> For the second step, we can improve ballooning, e.g. a new feature 
> VIRTIO_BALLOON_F_ADVANCED_BALLOON to use the same get_free_page_hints() and 
> another put_free_page_hints(), along with the virtio-balloon's report_vq and 
> ack_vq to wait for the host's ack before making the free page ready.
> (I think waiting for the host ack is the overhead that the guest has to 
> suffer for enabling memory overcommitment, and even with this v8 patch series 
> it also needs to do that. The optimization method was described yesterday)
> 

As I already said, I don't like that approach, because it has the
fundamental issue of page allocs getting blocked. That does not mean
that it is bad, but that I think what Nitesh has is superior in that
sense. Of course, things like "how to enable/disable", and much more
needs to be clarified.

If you believe in your approach, feel free to come up with a prototype.
 Especially the "no global locking" could be tricky in my opinion :)

>> Yes, but as I mentioned this has other drawbacks. Relying on a a guest to 
>> free
>> up memory when you really need it is not going to work. 
> 
> why not working? Host can ask at any time (including when not urgently need 
> it) depending on the admin's configuration.

Because any heuristic like "I am running out of memory, quickly ask
someone who will need time to respond" is prone to fail in some
scenarios. It might work for many, but it is not an "I am running out of
memory, oh look, this page has been flagged via madv(FREE), let's just
take that."

> 
>> It might work for
>> some scenarios but should not dictate the design. It is a good start though 
>> if
>> it makes things easier.
>  > Enabling/disabling free page hintning by the hypervisor via some
>> mechanism is on the other hand a good idea. "I have plenty of free space,
>> don't worry".
> 
> Also guests are not treated identically, host can decide whom to offer the 
> free pages first (offering free pages will cause the guest some performance 
> drop).

Yes, it should definitely be configurable somehow. You don't want free
page hinting always and in any setup.


-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-14 Thread David Hildenbrand
On 14.02.19 09:48, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
>> Once the host free the pages. All the isolated pages are returned back 
>> to the buddy. (This is implemented in hyperlist_ready())
> 
> This actually has the same issue: the isolated pages have to wait to return 
> to the buddy
> after the host has done with madvise(DONTNEED). Otherwise, a page could be 
> used by
> a guest thread and the next moment the host takes it to other host threads.

Yes indeed, that is the important bit. They must not be put back to the
buddy before they have been processed by the hypervisor. But as the
pages are not in the buddy, no one allocating a page will stumble over
such a page and try to allocate it. Threads trying to allocate memory
will simply pick another buddy page instead of "busy waiting" for that
page to finish being reported.

> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-14 Thread David Hildenbrand
On 14.02.19 10:12, Wang, Wei W wrote:
> On Thursday, February 14, 2019 1:22 AM, Nitesh Narayan Lal wrote:
>> In normal condition yes we would not like to report any memory when the
>> guest is already under memory pressure.
>>
>> I am not sure about the scenario where both guest and the host are under
>> memory pressure, who will be given priority? Is it something per-decided or
>> it depends on the use case?
>>
> 
> That's one of the reasons that I would vote for "host to ask for free pages".

As I already said, there are scenarios where this does not work reliably.

When we

1. let the guest report free pages, and
2. allow the hypervisor to enable/disable it,

we have a mechanism that is superior to what you describe.

"Host asks for free pages" can be emulated using that.

> 
> Host can have a global view of all the guest's memory states, so better to 
> have the
> memory overcommitment policy defined on the host.
> 
> For example, the host can know guest1 is under memory pressure (thus not 
> asking him)
> and guest2 has a huge amount of free memory. When host lacks memory to run 
> the 3rd
> guest, it can asks guest2 to offer some free memory. 
> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb


RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-14 Thread Wang, Wei W
On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
> If you have to resize/alloc/coordinate who will report, you will need locking.
> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).

Yes, we need to change xbitmap to support it.

Just thought of another option, which would be better:
- xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
- xb_set/clear the bit under the zone->lock, i.e. in rmqueue and free_one_page

So we use the existing zone->lock to guarantee that the xb ops
will not be called concurrently and race on the same bitmap.
And we don't add any new locks that would generate new doubts.
Also, we can probably remove the arch_alloc/free_page part.

For the first step, we could optimize VIRTIO_BALLOON_F_FREE_PAGE_HINT for the 
live migration optimization:
- just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
   VIRTIO_BALLOON_FREE_PAGE_ORDER)
with get_free_page_hints()

get_free_page_hints() was designed to clear the bit, and needs
put_free_page_hints() to set it again later, after the host finishes the
madvise. For the live migration usage, as the host doesn't free the backing
host pages, we can give get_free_page_hints() a parameter option to not clear
the bit for this usage. It will be simpler and faster.
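
To make the intended semantics a bit more concrete, here is a rough sketch
using a plain per-zone bitmap instead of the xbitmap (names and layout are
illustrative only, not actual patch code):

#include <linux/mmzone.h>
#include <linux/bitops.h>

/*
 * Illustration only: a per-zone bitmap of free-page hints, protected by
 * zone->lock (set in free_one_page(), cleared in rmqueue(), as proposed).
 * "Getting" a hint clears the bit, so the page counts as not-ready until
 * the host has acked it and put_free_page_hints() sets the bit again.
 */
struct free_hint_bitmap {
	unsigned long *bits;		/* one bit per pfn in the zone */
	unsigned long  base_pfn;	/* pfn corresponding to bit 0  */
};

/* Caller holds zone->lock: hand out up to 'max' hinted pfns. */
static unsigned int get_free_page_hints(struct zone *zone,
					struct free_hint_bitmap *fb,
					unsigned long *pfns, unsigned int max)
{
	unsigned long size = zone->spanned_pages;
	unsigned long bit = 0;
	unsigned int n = 0;

	while (n < max) {
		bit = find_next_bit(fb->bits, size, bit);
		if (bit >= size)
			break;
		clear_bit(bit, fb->bits);	/* page is now "not ready" */
		pfns[n++] = fb->base_pfn + bit;
		bit++;
	}
	return n;
}

/* Caller holds zone->lock, called once the host has acked the page. */
static void put_free_page_hints(struct free_hint_bitmap *fb, unsigned long pfn)
{
	set_bit(pfn - fb->base_pfn, fb->bits);	/* page is usable again */
}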

I think using get_free_page_hints() to read hints via bitmaps should be much
faster than that allocation function, which takes around 15us to get a 4MB
block. Another big bonus is that we don't need free_pages() to return all the
pages back to the buddy (quite an expensive operation too) when migration is
done.

For the second step, we can improve ballooning, e.g. a new feature 
VIRTIO_BALLOON_F_ADVANCED_BALLOON to use the same get_free_page_hints() and 
another put_free_page_hints(), along with the virtio-balloon's report_vq and 
ack_vq to wait for the host's ack before making the free page ready.
(I think waiting for the host ack is the overhead that the guest has to suffer 
for enabling memory overcommitment, and even with this v8 patch series it also 
needs to do that. The optimization method was described yesterday)

> Yes, but as I mentioned this has other drawbacks. Relying on a a guest to free
> up memory when you really need it is not going to work. 

Why wouldn't it work? The host can ask at any time (including when it doesn't
urgently need memory), depending on the admin's configuration.

>It might work for
> some scenarios but should not dictate the design. It is a good start though if
> it makes things easier.
 > Enabling/disabling free page hintning by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free space,
> don't worry".

Also, guests are not treated identically; the host can decide which guest to
ask to offer free pages first (offering free pages will cause that guest some
performance drop).

Best,
Wei


RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-14 Thread Wang, Wei W
On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
> Once the host free the pages. All the isolated pages are returned back 
> to the buddy. (This is implemented in hyperlist_ready())

This actually has the same issue: the isolated pages have to wait to return to
the buddy until the host is done with madvise(DONTNEED). Otherwise, a page
could be used by a guest thread and the next moment the host takes it away to
give to other host threads.

Best,
Wei


Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread Michael S. Tsirkin
On Wed, Feb 13, 2019 at 06:59:24PM +0100, David Hildenbrand wrote:
> >>>
>  Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages 
>  as
>  candidates for removal and if the host is low on memory, only scanning 
>  the
>  guest page tables is sufficient to free up memory.
> 
>  But both points might just be an implementation detail in the example you
>  describe.
> >>>
> >>> Yes, it is an implementation detail. I think DONTNEED would be easier
> >>> for the first step.
> >>>
> 
> >
> > In above 2), get_free_page_hints clears the bits which indicates that 
> > those
>  pages are not ready to be used by the guest yet. Why?
> > This is because 3) will unmap the underlying physical pages from EPT.
>  Normally, when guest re-visits those pages, EPT violations and QEMU page
>  faults will get a new host page to set up the related EPT entry. If 
>  guest uses
>  that page before the page gets unmapped (i.e. right before step 3), no 
>  EPT
>  violation happens and the guest will use the same physical page that 
>  will be
>  unmapped and given to other host threads. So we need to make sure that
>  the guest free page is usable only after step 3 finishes.
> >
> > Back to arch_alloc_page(), it needs to check if the allocated pages
> > have "1" set in the bitmap, if that's true, just clear the bits. 
> > Otherwise, it
>  means step 2) above has happened and step 4) hasn't been reached. In this
>  case, we can either have arch_alloc_page() busywaiting a bit till 4) is 
>  done
>  for that page Or better to have a balloon callback which prioritize 3) 
>  and 4)
>  to make this page usable by the guest.
> 
>  Regarding the latter, the VCPU allocating a page cannot do anything if 
>  the
>  page (along with other pages) is just being freed by the hypervisor.
>  It has to busy-wait, no chance to prioritize.
> >>>
> >>> I meant this:
> >>> With this approach, essentially the free pages have 2 states:
> >>> ready free page: the page is on the free list and it has "1" in the bitmap
> >>> non-ready free page: the page is on the free list and it has "0" in the 
> >>> bitmap
> >>> Ready free pages are those who can be allocated to use.
> >>> Non-ready free pages are those who are in progress of being reported to
> >>> host and the related EPT mapping is about to be zapped. 
> >>>
> >>> The non-ready pages are inserted into the report_vq and waiting for the
> >>> host to zap the mappings one by one. After the mapping gets zapped
> >>> (which means the backing host page has been taken away), host acks to
> >>> the guest to mark the free page as ready free page (set the bit to 1 in 
> >>> the bitmap).
> >>
> >> Yes, that's how I understood your approach. The interesting part is
> >> where somebody finds a buddy page and wants to allocate it.
> >>
> >>>
> >>> So the non-ready free page may happen to be used when they are waiting in
> >>> the report_vq to be handled by the host to zap the mapping, balloon could
> >>> have a fast path to notify the host:
> >>> "page 0x1000 is about to be used, don’t zap the mapping when you get
> >>> 0x1000 from the report_vq"  /*option [1] */
> >>
> >> This requires coordination and in any case there will be a scenario
> >> where you have to wait for the hypervisor to eventually finish a madv
> >> call. You can just try to make that scenario less likely.
> >>
> >> What you propose is synchronous in the worst case. Getting pages of the
> >> buddy makes it possible to have it done completely asynchronous. Nobody
> >> allocating a page has to wait.
> >>
> >>>
> >>> Or
> >>>
> >>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) 
> >>> and 4) above,
> >>> so that the free page will be marked as ready free page and the guest can 
> >>> use it".
> >>> This option will generate an extra EPT violation and QEMU page fault to 
> >>> get a new host
> >>> page to back the guest ready free page.
> >>
> >> Again, coordination with the hypervisor while allocating a page. That is
> >> to be avoided in any case.
> >>
> >>>
> 
> >
> > Using bitmaps to record free page hints don't need to take the free 
> > pages
>  off the buddy list and return them later, which needs to go through the 
>  long
>  allocation/free code path.
> >
> 
>  Yes, but it means that any process is able to get stuck on such a page 
>  for as
>  long as it takes to report the free pages to the hypervisor and for it 
>  to call
>  madvise(pfn_start, DONTNEED) on any such page.
> >>>
> >>> This only happens when the guest thread happens to get allocated on a 
> >>> page which is
> >>> being reported to the host. Using option [1] above will avoid this.
> >>
> >> I think getting pages out of the buddy system temporarily is the only
> >> way we can avoid somebody else stumbling over a page currently 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread David Hildenbrand
>>>
 Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
 candidates for removal and if the host is low on memory, only scanning the
 guest page tables is sufficient to free up memory.

 But both points might just be an implementation detail in the example you
 describe.
>>>
>>> Yes, it is an implementation detail. I think DONTNEED would be easier
>>> for the first step.
>>>

>
> In above 2), get_free_page_hints clears the bits which indicates that 
> those
 pages are not ready to be used by the guest yet. Why?
> This is because 3) will unmap the underlying physical pages from EPT.
 Normally, when guest re-visits those pages, EPT violations and QEMU page
 faults will get a new host page to set up the related EPT entry. If guest 
 uses
 that page before the page gets unmapped (i.e. right before step 3), no EPT
 violation happens and the guest will use the same physical page that will 
 be
 unmapped and given to other host threads. So we need to make sure that
 the guest free page is usable only after step 3 finishes.
>
> Back to arch_alloc_page(), it needs to check if the allocated pages
> have "1" set in the bitmap, if that's true, just clear the bits. 
> Otherwise, it
 means step 2) above has happened and step 4) hasn't been reached. In this
 case, we can either have arch_alloc_page() busywaiting a bit till 4) is 
 done
 for that page Or better to have a balloon callback which prioritize 3) and 
 4)
 to make this page usable by the guest.

 Regarding the latter, the VCPU allocating a page cannot do anything if the
 page (along with other pages) is just being freed by the hypervisor.
 It has to busy-wait, no chance to prioritize.
>>>
>>> I meant this:
>>> With this approach, essentially the free pages have 2 states:
>>> ready free page: the page is on the free list and it has "1" in the bitmap
>>> non-ready free page: the page is on the free list and it has "0" in the 
>>> bitmap
>>> Ready free pages are those who can be allocated to use.
>>> Non-ready free pages are those who are in progress of being reported to
>>> host and the related EPT mapping is about to be zapped. 
>>>
>>> The non-ready pages are inserted into the report_vq and waiting for the
>>> host to zap the mappings one by one. After the mapping gets zapped
>>> (which means the backing host page has been taken away), host acks to
>>> the guest to mark the free page as ready free page (set the bit to 1 in the 
>>> bitmap).
>>
>> Yes, that's how I understood your approach. The interesting part is
>> where somebody finds a buddy page and wants to allocate it.
>>
>>>
>>> So the non-ready free page may happen to be used when they are waiting in
>>> the report_vq to be handled by the host to zap the mapping, balloon could
>>> have a fast path to notify the host:
>>> "page 0x1000 is about to be used, don’t zap the mapping when you get
>>> 0x1000 from the report_vq"  /*option [1] */
>>
>> This requires coordination and in any case there will be a scenario
>> where you have to wait for the hypervisor to eventually finish a madv
>> call. You can just try to make that scenario less likely.
>>
>> What you propose is synchronous in the worst case. Getting pages of the
>> buddy makes it possible to have it done completely asynchronous. Nobody
>> allocating a page has to wait.
>>
>>>
>>> Or
>>>
>>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) 
>>> and 4) above,
>>> so that the free page will be marked as ready free page and the guest can 
>>> use it".
>>> This option will generate an extra EPT violation and QEMU page fault to get 
>>> a new host
>>> page to back the guest ready free page.
>>
>> Again, coordination with the hypervisor while allocating a page. That is
>> to be avoided in any case.
>>
>>>

>
> Using bitmaps to record free page hints don't need to take the free pages
 off the buddy list and return them later, which needs to go through the 
 long
 allocation/free code path.
>

 Yes, but it means that any process is able to get stuck on such a page for 
 as
 long as it takes to report the free pages to the hypervisor and for it to 
 call
 madvise(pfn_start, DONTNEED) on any such page.
>>>
>>> This only happens when the guest thread happens to get allocated on a page 
>>> which is
>>> being reported to the host. Using option [1] above will avoid this.
>>
>> I think getting pages out of the buddy system temporarily is the only
>> way we can avoid somebody else stumbling over a page currently getting
>> reported by the hypervisor. Otherwise, as I said, there are scenarios
>> where a allocating VCPU has to wait for the hypervisor to finish the
>> "freeing" task. While you can try to "speedup" that scenario -
>> "hypervisor please prioritize" you cannot avoid it. There will be busy
>> waiting.
> 
> Right - there has 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread Nitesh Narayan Lal

On 2/13/19 12:09 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 13, 2019 at 07:17:13AM -0500, Nitesh Narayan Lal wrote:
>> On 2/13/19 4:19 AM, David Hildenbrand wrote:
>>> On 13.02.19 09:55, Wang, Wei W wrote:
 On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> Global means all VCPUs will be competing potentially for a single lock 
> when
> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> memory like crazy?
 I think the key point is that the 64 vcpus won't allocate/free on the same 
 page simultaneously, so no need to have a global big lock, isn’t it?
 I think atomic operations on the bitmap would be enough.
>>> If you have to resize/alloc/coordinate who will report, you will need
>>> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
>>> wrong :) ).
>>>
> (I assume some kind of locking is required even if the bitmap would be
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page ?)
 arch_free_pages is in free_pages_prepare, why can't we have memory 
 allocation there?
>>> I remember we were stumbling over some issues that were non-trivial. I
>>> am not 100% sure yet anymore, but allocating memory while deep down in
>>> the freeing part of MM core smells like "be careful".
>>>
 It would also be doable to find a preferred place to preallocate some 
 amount of memory for the bitmap.
>>> That makes things very ugly. Especially, preallocation will most likely
>>> require locking.
>>>
> That's the big benefit of taking the pages of the buddy free list. Other 
> VCPUs
> won't stumble over them, waiting for them to get freed in the hypervisor.
 As also mentioned above, I think other vcpus will not allocate/free on the 
 same page that is in progress of being allocated/freed.
>>> If a page is in the buddy but stuck in some other bitmap, there is
>>> nothing stopping another VCPU from trying to allocate it. Nitesh has
>>> been fighting with this problem already :)
>>>
> This sounds more like "the host requests to get free pages once in a 
> while"
> compared to "the host is always informed about free pages". At the time
> where the host actually has to ask the guest (e.g. because the host is 
> low on
> memory), it might be to late to wait for guest action.
 Option 1: Host asks for free pages:
 Not necessary to ask only when the host has been in memory pressure.
 This could be the orchestration layer's job to monitor the host memory 
 usage.
 For example, people could set the condition "when 50% of the host memory
 has been used, start to ask a guest for some amount of free pages" 

 Option 2: Guest actively offers free pages:
 Add a balloon callback to arch_free_page so that whenever a page gets 
 freed its gfn
 will be filled into the balloon's report_vq and the host will take away 
 the backing
 host page.

 Both options can be implemented. But I think option 1 would be more
 efficient as the guest free pages are offered on demand.  
>>> Yes, but as I mentioned this has other drawbacks. Relying on a a guest
>>> to free up memory when you really need it is not going to work. It might
>>> work for some scenarios but should not dictate the design. It is a good
>>> start though if it makes things easier.
>>>
>>> Enabling/disabling free page hintning by the hypervisor via some
>>> mechanism is on the other hand a good idea. "I have plenty of free
>>> space, don't worry".
>>>
> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> candidates for removal and if the host is low on memory, only scanning the
> guest page tables is sufficient to free up memory.
>
> But both points might just be an implementation detail in the example you
> describe.
 Yes, it is an implementation detail. I think DONTNEED would be easier
 for the first step.

>> In above 2), get_free_page_hints clears the bits which indicates that 
>> those
> pages are not ready to be used by the guest yet. Why?
>> This is because 3) will unmap the underlying physical pages from EPT.
> Normally, when guest re-visits those pages, EPT violations and QEMU page
> faults will get a new host page to set up the related EPT entry. If guest 
> uses
> that page before the page gets unmapped (i.e. right before step 3), no EPT
> violation happens and the guest will use the same physical page that will 
> be
> unmapped and given to other host threads. So we need to make sure that
> the guest free page is usable only after step 3 finishes.
>> Back to arch_alloc_page(), it needs to check if the allocated pages
>> have "1" set in the bitmap, if that's true, just clear the bits. 
>> Otherwise, it
> means step 2) above has happened and step 4) 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread Michael S. Tsirkin
On Wed, Feb 13, 2019 at 10:19:05AM +0100, David Hildenbrand wrote:
> On 13.02.19 09:55, Wang, Wei W wrote:
> > On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> >> Global means all VCPUs will be competing potentially for a single lock when
> >> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> >> memory like crazy?
> > 
> > I think the key point is that the 64 vcpus won't allocate/free on the same 
> > page simultaneously, so no need to have a global big lock, isn’t it?
> > I think atomic operations on the bitmap would be enough.
> 
> If you have to resize/alloc/coordinate who will report, you will need
> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> wrong :) ).
> 
> > 
> >> (I assume some kind of locking is required even if the bitmap would be
> >> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> >> memory at places where we don't want to - e.g. from arch_free_page ?)
> > 
> > arch_free_pages is in free_pages_prepare, why can't we have memory 
> > allocation there?
> 
> I remember we were stumbling over some issues that were non-trivial. I
> am not 100% sure yet anymore, but allocating memory while deep down in
> the freeing part of MM core smells like "be careful".
> 
> > 
> > It would also be doable to find a preferred place to preallocate some 
> > amount of memory for the bitmap.
> 
> That makes things very ugly. Especially, preallocation will most likely
> require locking.
> 
> > 
> >>
> >> That's the big benefit of taking the pages of the buddy free list. Other 
> >> VCPUs
> >> won't stumble over them, waiting for them to get freed in the hypervisor.
> > 
> > As also mentioned above, I think other vcpus will not allocate/free on the 
> > same page that is in progress of being allocated/freed.
> 
> If a page is in the buddy but stuck in some other bitmap, there is
> nothing stopping another VCPU from trying to allocate it. Nitesh has
> been fighting with this problem already :)
> 
> > 
> >> This sounds more like "the host requests to get free pages once in a while"
> >> compared to "the host is always informed about free pages". At the time
> >> where the host actually has to ask the guest (e.g. because the host is low 
> >> on
> >> memory), it might be to late to wait for guest action.
> > 
> > Option 1: Host asks for free pages:
> > Not necessary to ask only when the host has been in memory pressure.
> > This could be the orchestration layer's job to monitor the host memory 
> > usage.
> > For example, people could set the condition "when 50% of the host memory
> > has been used, start to ask a guest for some amount of free pages" 
> > 
> > Option 2: Guest actively offers free pages:
> > Add a balloon callback to arch_free_page so that whenever a page gets freed 
> > its gfn
> > will be filled into the balloon's report_vq and the host will take away the 
> > backing
> > host page.
> > 
> > Both options can be implemented. But I think option 1 would be more
> > efficient as the guest free pages are offered on demand.  
> 
> Yes, but as I mentioned this has other drawbacks. Relying on a a guest
> to free up memory when you really need it is not going to work. It might
> work for some scenarios but should not dictate the design. It is a good
> start though if it makes things easier.

Besides, it has already been implemented in Linux.

> Enabling/disabling free page hintning by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free
> space, don't worry".

The existing mechanism includes the ability to cancel reporting.

> > 
> >> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >> candidates for removal and if the host is low on memory, only scanning the
> >> guest page tables is sufficient to free up memory.
> >>
> >> But both points might just be an implementation detail in the example you
> >> describe.
> > 
> > Yes, it is an implementation detail. I think DONTNEED would be easier
> > for the first step.
> > 
> >>
> >>>
> >>> In above 2), get_free_page_hints clears the bits which indicates that 
> >>> those
> >> pages are not ready to be used by the guest yet. Why?
> >>> This is because 3) will unmap the underlying physical pages from EPT.
> >> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >> faults will get a new host page to set up the related EPT entry. If guest 
> >> uses
> >> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >> violation happens and the guest will use the same physical page that will 
> >> be
> >> unmapped and given to other host threads. So we need to make sure that
> >> the guest free page is usable only after step 3 finishes.
> >>>
> >>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>> have "1" set in the bitmap, if that's true, just clear the bits. 
> >>> Otherwise, it
> >> means step 2) above has happened and step 4) hasn't been reached. In 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread Michael S. Tsirkin
On Wed, Feb 13, 2019 at 07:17:13AM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/13/19 4:19 AM, David Hildenbrand wrote:
> > On 13.02.19 09:55, Wang, Wei W wrote:
> >> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> >>> Global means all VCPUs will be competing potentially for a single lock 
> >>> when
> >>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> >>> memory like crazy?
> >> I think the key point is that the 64 vcpus won't allocate/free on the same 
> >> page simultaneously, so no need to have a global big lock, isn’t it?
> >> I think atomic operations on the bitmap would be enough.
> > If you have to resize/alloc/coordinate who will report, you will need
> > locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> > wrong :) ).
> >
> >>> (I assume some kind of locking is required even if the bitmap would be
> >>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> >>> memory at places where we don't want to - e.g. from arch_free_page ?)
> >> arch_free_pages is in free_pages_prepare, why can't we have memory 
> >> allocation there?
> > I remember we were stumbling over some issues that were non-trivial. I
> > am not 100% sure yet anymore, but allocating memory while deep down in
> > the freeing part of MM core smells like "be careful".
> >
> >> It would also be doable to find a preferred place to preallocate some 
> >> amount of memory for the bitmap.
> > That makes things very ugly. Especially, preallocation will most likely
> > require locking.
> >
> >>> That's the big benefit of taking the pages of the buddy free list. Other 
> >>> VCPUs
> >>> won't stumble over them, waiting for them to get freed in the hypervisor.
> >> As also mentioned above, I think other vcpus will not allocate/free on the 
> >> same page that is in progress of being allocated/freed.
> > If a page is in the buddy but stuck in some other bitmap, there is
> > nothing stopping another VCPU from trying to allocate it. Nitesh has
> > been fighting with this problem already :)
> >
> >>> This sounds more like "the host requests to get free pages once in a 
> >>> while"
> >>> compared to "the host is always informed about free pages". At the time
> >>> where the host actually has to ask the guest (e.g. because the host is 
> >>> low on
> >>> memory), it might be to late to wait for guest action.
> >> Option 1: Host asks for free pages:
> >> Not necessary to ask only when the host has been in memory pressure.
> >> This could be the orchestration layer's job to monitor the host memory 
> >> usage.
> >> For example, people could set the condition "when 50% of the host memory
> >> has been used, start to ask a guest for some amount of free pages" 
> >>
> >> Option 2: Guest actively offers free pages:
> >> Add a balloon callback to arch_free_page so that whenever a page gets 
> >> freed its gfn
> >> will be filled into the balloon's report_vq and the host will take away 
> >> the backing
> >> host page.
> >>
> >> Both options can be implemented. But I think option 1 would be more
> >> efficient as the guest free pages are offered on demand.  
> > Yes, but as I mentioned this has other drawbacks. Relying on a a guest
> > to free up memory when you really need it is not going to work. It might
> > work for some scenarios but should not dictate the design. It is a good
> > start though if it makes things easier.
> >
> > Enabling/disabling free page hintning by the hypervisor via some
> > mechanism is on the other hand a good idea. "I have plenty of free
> > space, don't worry".
> >
> >>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >>> candidates for removal and if the host is low on memory, only scanning the
> >>> guest page tables is sufficient to free up memory.
> >>>
> >>> But both points might just be an implementation detail in the example you
> >>> describe.
> >> Yes, it is an implementation detail. I think DONTNEED would be easier
> >> for the first step.
> >>
>  In above 2), get_free_page_hints clears the bits which indicates that 
>  those
> >>> pages are not ready to be used by the guest yet. Why?
>  This is because 3) will unmap the underlying physical pages from EPT.
> >>> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >>> faults will get a new host page to set up the related EPT entry. If guest 
> >>> uses
> >>> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >>> violation happens and the guest will use the same physical page that will 
> >>> be
> >>> unmapped and given to other host threads. So we need to make sure that
> >>> the guest free page is usable only after step 3 finishes.
>  Back to arch_alloc_page(), it needs to check if the allocated pages
>  have "1" set in the bitmap, if that's true, just clear the bits. 
>  Otherwise, it
> >>> means step 2) above has happened and step 4) hasn't been reached. In this
> >>> case, we can 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread Nitesh Narayan Lal

On 2/13/19 4:19 AM, David Hildenbrand wrote:
> On 13.02.19 09:55, Wang, Wei W wrote:
>> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
>>> Global means all VCPUs will be competing potentially for a single lock when
>>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
>>> memory like crazy?
>> I think the key point is that the 64 vcpus won't allocate/free on the same 
>> page simultaneously, so no need to have a global big lock, isn’t it?
>> I think atomic operations on the bitmap would be enough.
> If you have to resize/alloc/coordinate who will report, you will need
> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> wrong :) ).
>
>>> (I assume some kind of locking is required even if the bitmap would be
>>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>>> memory at places where we don't want to - e.g. from arch_free_page ?)
>> arch_free_pages is in free_pages_prepare, why can't we have memory 
>> allocation there?
> I remember we were stumbling over some issues that were non-trivial. I
> am not 100% sure yet anymore, but allocating memory while deep down in
> the freeing part of MM core smells like "be careful".
>
>> It would also be doable to find a preferred place to preallocate some amount 
>> of memory for the bitmap.
> That makes things very ugly. Especially, preallocation will most likely
> require locking.
>
>>> That's the big benefit of taking the pages of the buddy free list. Other 
>>> VCPUs
>>> won't stumble over them, waiting for them to get freed in the hypervisor.
>> As also mentioned above, I think other vcpus will not allocate/free on the 
>> same page that is in progress of being allocated/freed.
> If a page is in the buddy but stuck in some other bitmap, there is
> nothing stopping another VCPU from trying to allocate it. Nitesh has
> been fighting with this problem already :)
>
>>> This sounds more like "the host requests to get free pages once in a while"
>>> compared to "the host is always informed about free pages". At the time
>>> where the host actually has to ask the guest (e.g. because the host is low 
>>> on
>>> memory), it might be to late to wait for guest action.
>> Option 1: Host asks for free pages:
>> Not necessary to ask only when the host has been in memory pressure.
>> This could be the orchestration layer's job to monitor the host memory usage.
>> For example, people could set the condition "when 50% of the host memory
>> has been used, start to ask a guest for some amount of free pages" 
>>
>> Option 2: Guest actively offers free pages:
>> Add a balloon callback to arch_free_page so that whenever a page gets freed 
>> its gfn
>> will be filled into the balloon's report_vq and the host will take away the 
>> backing
>> host page.
>>
>> Both options can be implemented. But I think option 1 would be more
>> efficient as the guest free pages are offered on demand.  
> Yes, but as I mentioned this has other drawbacks. Relying on a a guest
> to free up memory when you really need it is not going to work. It might
> work for some scenarios but should not dictate the design. It is a good
> start though if it makes things easier.
>
> Enabling/disabling free page hintning by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free
> space, don't worry".
>
>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>>> candidates for removal and if the host is low on memory, only scanning the
>>> guest page tables is sufficient to free up memory.
>>>
>>> But both points might just be an implementation detail in the example you
>>> describe.
>> Yes, it is an implementation detail. I think DONTNEED would be easier
>> for the first step.
>>
 In above 2), get_free_page_hints clears the bits which indicates that those
>>> pages are not ready to be used by the guest yet. Why?
 This is because 3) will unmap the underlying physical pages from EPT.
>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>>> faults will get a new host page to set up the related EPT entry. If guest 
>>> uses
>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>>> violation happens and the guest will use the same physical page that will be
>>> unmapped and given to other host threads. So we need to make sure that
>>> the guest free page is usable only after step 3 finishes.
 Back to arch_alloc_page(), it needs to check if the allocated pages
 have "1" set in the bitmap, if that's true, just clear the bits. 
 Otherwise, it
>>> means step 2) above has happened and step 4) hasn't been reached. In this
>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>>> for that page Or better to have a balloon callback which prioritize 3) and 
>>> 4)
>>> to make this page usable by the guest.
>>>
>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>>> 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread Nitesh Narayan Lal

On 2/13/19 4:00 AM, Wang, Wei W wrote:
> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing freed
>> memory between the guest and the host. It enables the guests with no page
>> cache to rapidly free and reclaims memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and
>> single NUMA node with 15GB memory, we were able to successfully launch
>> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
>> explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and
>> scan the pages freed by the guest has been changed. The new approach is
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this
>> functionality. It maintains a per CPU array which is used to store the pages
>> freed by the guest. The maximum number of entries which it can hold is
>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>> is scanned and only the pages which are available in the buddy are stored.
>> This process continues until the array is filled with pages which are part of
>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>> and if the page is not reallocated and present in the buddy, the kernel
>> thread attempts to isolate it from the buddy. If it is successfully 
>> isolated, the
>> page is added to another per-cpu array. Once the entire scanning process is
>> complete, all the isolated pages are reported to the host through an existing
>> virtio-balloon driver.
> The free page is removed from the buddy list here. When will they get 
> returned to the buddy list so that the guest threads can use them normally?
Once the host frees the pages, all the isolated pages are returned back
to the buddy. (This is implemented in hyperlist_ready().)
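(Roughly, the return path amounts to something like the sketch below; the
actual hyperlist_ready() differs in how the isolated pages are tracked, the
entry structure here is made up for illustration.)

#include <linux/mm.h>
#include <linux/gfp.h>

/* Hypothetical record of one isolated block awaiting the host's ack. */
struct isolated_page_entry {
	struct page	*page;
	unsigned int	order;
};

/*
 * Sketch of the "host is done, give everything back" step: each page was
 * taken off the free list before being reported, so returning it is just
 * a normal free of the same order.
 */
static void isolated_pages_ready(struct isolated_page_entry *ents,
				 unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		__free_pages(ents[i].page, ents[i].order);
}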
>
> Best,
> Wei
-- 
Regards
Nitesh





Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread David Hildenbrand
On 13.02.19 09:55, Wang, Wei W wrote:
> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
>> Global means all VCPUs will be competing potentially for a single lock when
>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
>> memory like crazy?
> 
> I think the key point is that the 64 vcpus won't allocate/free on the same 
> page simultaneously, so no need to have a global big lock, isn’t it?
> I think atomic operations on the bitmap would be enough.

If you have to resize/alloc/coordinate who will report, you will need
locking. Especially, I doubt that there is an atomic xbitmap  (prove me
wrong :) ).

> 
>> (I assume some kind of locking is required even if the bitmap would be
>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>> memory at places where we don't want to - e.g. from arch_free_page ?)
> 
> arch_free_pages is in free_pages_prepare, why can't we have memory allocation 
> there?

I remember we were stumbling over some issues that were non-trivial. I
am not 100% sure anymore, but allocating memory while deep down in
the freeing part of MM core smells like "be careful".

> 
> It would also be doable to find a preferred place to preallocate some amount 
> of memory for the bitmap.

That makes things very ugly. Especially, preallocation will most likely
require locking.

> 
>>
>> That's the big benefit of taking the pages of the buddy free list. Other 
>> VCPUs
>> won't stumble over them, waiting for them to get freed in the hypervisor.
> 
> As also mentioned above, I think other vcpus will not allocate/free on the 
> same page that is in progress of being allocated/freed.

If a page is in the buddy but stuck in some other bitmap, there is
nothing stopping another VCPU from trying to allocate it. Nitesh has
been fighting with this problem already :)

> 
>> This sounds more like "the host requests to get free pages once in a while"
>> compared to "the host is always informed about free pages". At the time
>> where the host actually has to ask the guest (e.g. because the host is low on
>> memory), it might be to late to wait for guest action.
> 
> Option 1: Host asks for free pages:
> Not necessary to ask only when the host has been in memory pressure.
> This could be the orchestration layer's job to monitor the host memory usage.
> For example, people could set the condition "when 50% of the host memory
> has been used, start to ask a guest for some amount of free pages" 
> 
> Option 2: Guest actively offers free pages:
> Add a balloon callback to arch_free_page so that whenever a page gets freed 
> its gfn
> will be filled into the balloon's report_vq and the host will take away the 
> backing
> host page.
> 
> Both options can be implemented. But I think option 1 would be more
> efficient as the guest free pages are offered on demand.  

Yes, but as I mentioned this has other drawbacks. Relying on a guest
to free up memory when you really need it is not going to work. It might
work for some scenarios but should not dictate the design. It is a good
start though if it makes things easier.

Enabling/disabling free page hinting by the hypervisor via some
mechanism is on the other hand a good idea. "I have plenty of free
space, don't worry".

> 
>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>> candidates for removal and if the host is low on memory, only scanning the
>> guest page tables is sufficient to free up memory.
>>
>> But both points might just be an implementation detail in the example you
>> describe.
> 
> Yes, it is an implementation detail. I think DONTNEED would be easier
> for the first step.
> 
>>
>>>
>>> In above 2), get_free_page_hints clears the bits which indicates that those
>> pages are not ready to be used by the guest yet. Why?
>>> This is because 3) will unmap the underlying physical pages from EPT.
>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>> faults will get a new host page to set up the related EPT entry. If guest 
>> uses
>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>> violation happens and the guest will use the same physical page that will be
>> unmapped and given to other host threads. So we need to make sure that
>> the guest free page is usable only after step 3 finishes.
>>>
>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, 
>>> it
>> means step 2) above has happened and step 4) hasn't been reached. In this
>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>> to make this page usable by the guest.
>>
>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>> page (along with other pages) is just being freed by the hypervisor.
>> It has to busy-wait, no chance to 

RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread Wang, Wei W
On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
> The following patch-set proposes an efficient mechanism for handing freed
> memory between the guest and the host. It enables the guests with no page
> cache to rapidly free and reclaims memory to and from the host respectively.
> 
> Benefit:
> With this patch-series, in our test-case, executed on a single system and
> single NUMA node with 15GB memory, we were able to successfully launch
> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
> explanation of the test procedure is provided at the bottom).
> 
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and
> scan the pages freed by the guest has been changed. The new approach is
> briefly described below:
> 
> The patch-set still leverages the existing arch_free_page() to add this
> functionality. It maintains a per CPU array which is used to store the pages
> freed by the guest. The maximum number of entries which it can hold is
> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
> is scanned and only the pages which are available in the buddy are stored.
> This process continues until the array is filled with pages which are part of
> the buddy free list. After which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
> and if the page is not reallocated and present in the buddy, the kernel
> thread attempts to isolate it from the buddy. If it is successfully isolated, 
> the
> page is added to another per-cpu array. Once the entire scanning process is
> complete, all the isolated pages are reported to the host through an existing
> virtio-balloon driver.

The free pages are removed from the buddy list here. When will they be returned
to the buddy list so that the guest threads can use them normally?

Best,
Wei


RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-13 Thread Wang, Wei W
On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> Global means all VCPUs will be competing potentially for a single lock when
> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> memory like crazy?

I think the key point is that the 64 vCPUs won't allocate/free the same page
simultaneously, so there is no need for a global big lock, is there?
I think atomic operations on the bitmap would be enough.

> (I assume some kind of locking is required even if the bitmap would be
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page ?)

arch_free_page() is called from free_pages_prepare(); why can't we allocate
memory there?

It would also be doable to find a suitable place to preallocate the memory for
the bitmap.
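
As a minimal sketch of both points (placeholder names, not taken from any posted
patch): preallocate a plain, fixed-size bitmap once, e.g. when the balloon driver
initializes, and afterwards touch it only with atomic bitops from arch_free_page(),
so the free path needs neither a global lock nor a memory allocation:

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/errno.h>
#include <linux/mm.h>

static unsigned long *free_page_bitmap;   /* one bit per guest pfn */
static unsigned long max_hint_pfn;        /* number of pfns covered */

/* called once, e.g. from the balloon driver's init/probe path */
static int free_page_hinting_init(unsigned long nr_pfns)
{
        max_hint_pfn = nr_pfns;           /* e.g. derived from max_pfn */
        free_page_bitmap = bitmap_zalloc(nr_pfns, GFP_KERNEL);
        return free_page_bitmap ? 0 : -ENOMEM;
}

void arch_free_page(struct page *page, int order)
{
        unsigned long pfn = page_to_pfn(page);
        unsigned long i;

        if (!free_page_bitmap || pfn + (1UL << order) > max_hint_pfn)
                return;

        for (i = 0; i < (1UL << order); i++)
                set_bit(pfn + i, free_page_bitmap);     /* atomic RMW, no lock */
}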

> 
> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
> won't stumble over them, waiting for them to get freed in the hypervisor.

As also mentioned above, I think other vCPUs will not allocate/free a page that
is in the process of being allocated/freed.

> This sounds more like "the host requests to get free pages once in a while"
> compared to "the host is always informed about free pages". At the time
> where the host actually has to ask the guest (e.g. because the host is low on
> memory), it might be to late to wait for guest action.

Option 1: Host asks for free pages:
It is not necessary to ask only when the host is under memory pressure.
Monitoring host memory usage could be the orchestration layer's job.
For example, people could set the condition "when 50% of the host memory
has been used, start to ask a guest for some amount of free pages".

Option 2: Guest actively offers free pages:
Add a balloon callback to arch_free_page() so that whenever a page gets freed,
its gfn is filled into the balloon's report_vq and the host takes away the
backing host page.

Both options can be implemented. But I think option 1 would be more
efficient as the guest free pages are offered on demand.  
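
Just to illustrate option 1 (a sketch under assumptions, not part of this series):
the orchestration layer could poll /proc/meminfo and trigger a hint request once
the "50% of host memory used" condition from the example above holds;
request_free_page_hints_from_guests() is a stand-in for whatever QMP/libvirt
mechanism would actually issue the request:

#include <stdio.h>
#include <unistd.h>

/* hypothetical: issue the "please report free pages" request, e.g. via QMP */
static void request_free_page_hints_from_guests(void)
{
}

int main(void)
{
        char line[128];

        for (;;) {
                unsigned long total_kb = 0, avail_kb = 0;
                FILE *f = fopen("/proc/meminfo", "r");

                if (!f)
                        return 1;
                while (fgets(line, sizeof(line), f)) {
                        sscanf(line, "MemTotal: %lu kB", &total_kb);
                        sscanf(line, "MemAvailable: %lu kB", &avail_kb);
                }
                fclose(f);

                /* the "50% of host memory used" condition from the example above */
                if (total_kb && (total_kb - avail_kb) * 2 > total_kb)
                        request_free_page_hints_from_guests();

                sleep(5);
        }
}

Whether the threshold, polling interval, and request mechanism live in libvirt, a
management daemon, or elsewhere is of course a policy decision for the
orchestration layer.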

> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> candidates for removal and if the host is low on memory, only scanning the
> guest page tables is sufficient to free up memory.
> 
> But both points might just be an implementation detail in the example you
> describe.

Yes, it is an implementation detail. I think DONTNEED would be easier
for the first step.

> 
> >
> > In above 2), get_free_page_hints clears the bits which indicates that those
> pages are not ready to be used by the guest yet. Why?
> > This is because 3) will unmap the underlying physical pages from EPT.
> Normally, when guest re-visits those pages, EPT violations and QEMU page
> faults will get a new host page to set up the related EPT entry. If guest uses
> that page before the page gets unmapped (i.e. right before step 3), no EPT
> violation happens and the guest will use the same physical page that will be
> unmapped and given to other host threads. So we need to make sure that
> the guest free page is usable only after step 3 finishes.
> >
> > Back to arch_alloc_page(), it needs to check if the allocated pages
> > have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, 
> > it
> means step 2) above has happened and step 4) hasn't been reached. In this
> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> for that page Or better to have a balloon callback which prioritize 3) and 4)
> to make this page usable by the guest.
> 
> Regarding the latter, the VCPU allocating a page cannot do anything if the
> page (along with other pages) is just being freed by the hypervisor.
> It has to busy-wait, no chance to prioritize.

I meant this:
With this approach, the free pages essentially have 2 states:
ready free page: the page is on the free list and has "1" in the bitmap
non-ready free page: the page is on the free list and has "0" in the bitmap
Ready free pages are those that can be allocated for use.
Non-ready free pages are those that are in the process of being reported to
the host and whose related EPT mapping is about to be zapped.

The non-ready pages are inserted into the report_vq and wait for the
host to zap the mappings one by one. After a mapping gets zapped
(which means the backing host page has been taken away), the host acks to
the guest to mark the free page as a ready free page (i.e. set its bit to 1
in the bitmap).

So a non-ready free page may happen to be needed while it is still waiting
in the report_vq to be handled by the host. For this case, the balloon could
have a fast path to notify the host:
"page 0x1000 is about to be used, don't zap the mapping when you get
0x1000 from the report_vq"  /* option [1] */

Or

"page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4)
above, so that the free page will be marked as a ready free page and the guest
can use it."

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-12 Thread David Hildenbrand
On 12.02.19 18:24, Nitesh Narayan Lal wrote:
> 
> On 2/12/19 4:24 AM, David Hildenbrand wrote:
>> On 12.02.19 10:03, Wang, Wei W wrote:
>>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
 The following patch-set proposes an efficient mechanism for handing freed
 memory between the guest and the host. It enables the guests with no page
 cache to rapidly free and reclaims memory to and from the host 
 respectively.

 Benefit:
 With this patch-series, in our test-case, executed on a single system and
 single NUMA node with 15GB memory, we were able to successfully launch
 atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
 explanation of the test procedure is provided at the bottom).

 Changelog in V8:
 In this patch-series, the earlier approach [1] which was used to capture 
 and
 scan the pages freed by the guest has been changed. The new approach is
 briefly described below:

 The patch-set still leverages the existing arch_free_page() to add this
 functionality. It maintains a per CPU array which is used to store the 
 pages
 freed by the guest. The maximum number of entries which it can hold is
 defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
 is scanned and only the pages which are available in the buddy are stored.
 This process continues until the array is filled with pages which are part 
 of
 the buddy free list. After which it wakes up a kernel per-cpu-thread.
 This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
 and if the page is not reallocated and present in the buddy, the kernel
 thread attempts to isolate it from the buddy. If it is successfully 
 isolated, the
 page is added to another per-cpu array. Once the entire scanning process is
 complete, all the isolated pages are reported to the host through an 
 existing
 virtio-balloon driver.
>>>  Hi Nitesh,
>>>
>>> Have you guys thought about something like below, which would be simpler:
>> Responding because I'm the first to stumble over this mail, hah! :)
>>
>>> - use bitmaps to record free pages, e.g. xbitmap: 
>>> https://lkml.org/lkml/2018/1/9/304.
>>>   The bitmap can be indexed by the guest pfn, and it's globally accessed by 
>>> all the CPUs;
>> Global means all VCPUs will be competing potentially for a single lock
>> when freeing/allocating a page, no? What if you have 64VCPUs
>> allocating/freeing memory like crazy?
>>
>> (I assume some kind of locking is required even if the bitmap would be
>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>> memory at places where we don't want to - e.g. from arch_free_page ?)
>>
>> That's the big benefit of taking the pages of the buddy free list. Other
>> VCPUs won't stumble over them, waiting for them to get freed in the
>> hypervisor.
>>
>>> - arch_free_page(): set the bits of the freed pages from the bitmap
>>>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning 
>>> thread)
>>> - arch_alloc_page(): clear the related bits from the bitmap
>>> - expose 2 APIs for the callers:
>>>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned 
>>> int nr); 
>>>  This API searches for the next free page chunk (@nr of pages), 
>>> starting from @pfn_start.
>>>  Bits of those free pages will be cleared after this function returns.
>>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>>  This API sets the @nr continuous bits starting from pfn_start.
>>>
>>> Usage example with balloon:
>>> 1) host requests to start ballooning;
>>> 2) balloon driver get_free_page_hints and report the hints to host via 
>>> report_vq;
>>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free 
>>> pages and put back pfn_start to ack_vq;
>>> 4) balloon driver receives pfn_start and calls 
>>> put_free_page_hints(pfn_start) to have the related bits from the bitmap to 
>>> be set, indicating that those free pages are ready to be allocated.
>> This sounds more like "the host requests to get free pages once in a
>> while" compared to "the host is always informed about free pages". At
>> the time where the host actually has to ask the guest (e.g. because the
>> host is low on memory), it might be to late to wait for guest action.
>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
>> as candidates for removal and if the host is low on memory, only
>> scanning the guest page tables is sufficient to free up memory.
>>
>> But both points might just be an implementation detail in the example
>> you describe.
>>
>>> In above 2), get_free_page_hints clears the bits which indicates that those 
>>> pages are not ready to be used by the guest yet. Why?
>>> This is because 3) will unmap the underlying physical pages from EPT. 
>>> Normally, when guest re-visits those pages, EPT 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-12 Thread Nitesh Narayan Lal

On 2/12/19 4:24 AM, David Hildenbrand wrote:
> On 12.02.19 10:03, Wang, Wei W wrote:
>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>> The following patch-set proposes an efficient mechanism for handing freed
>>> memory between the guest and the host. It enables the guests with no page
>>> cache to rapidly free and reclaims memory to and from the host respectively.
>>>
>>> Benefit:
>>> With this patch-series, in our test-case, executed on a single system and
>>> single NUMA node with 15GB memory, we were able to successfully launch
>>> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
>>> explanation of the test procedure is provided at the bottom).
>>>
>>> Changelog in V8:
>>> In this patch-series, the earlier approach [1] which was used to capture and
>>> scan the pages freed by the guest has been changed. The new approach is
>>> briefly described below:
>>>
>>> The patch-set still leverages the existing arch_free_page() to add this
>>> functionality. It maintains a per CPU array which is used to store the pages
>>> freed by the guest. The maximum number of entries which it can hold is
>>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>>> is scanned and only the pages which are available in the buddy are stored.
>>> This process continues until the array is filled with pages which are part 
>>> of
>>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>>> and if the page is not reallocated and present in the buddy, the kernel
>>> thread attempts to isolate it from the buddy. If it is successfully 
>>> isolated, the
>>> page is added to another per-cpu array. Once the entire scanning process is
>>> complete, all the isolated pages are reported to the host through an 
>>> existing
>>> virtio-balloon driver.
>>  Hi Nitesh,
>>
>> Have you guys thought about something like below, which would be simpler:
> Responding because I'm the first to stumble over this mail, hah! :)
>
>> - use bitmaps to record free pages, e.g. xbitmap: 
>> https://lkml.org/lkml/2018/1/9/304.
>>   The bitmap can be indexed by the guest pfn, and it's globally accessed by 
>> all the CPUs;
> Global means all VCPUs will be competing potentially for a single lock
> when freeing/allocating a page, no? What if you have 64VCPUs
> allocating/freeing memory like crazy?
>
> (I assume some kind of locking is required even if the bitmap would be
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page ?)
>
> That's the big benefit of taking the pages of the buddy free list. Other
> VCPUs won't stumble over them, waiting for them to get freed in the
> hypervisor.
>
>> - arch_free_page(): set the bits of the freed pages from the bitmap
>>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning 
>> thread)
>> - arch_alloc_page(): clear the related bits from the bitmap
>> - expose 2 APIs for the callers:
>>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int 
>> nr); 
>>  This API searches for the next free page chunk (@nr of pages), starting 
>> from @pfn_start.
>>  Bits of those free pages will be cleared after this function returns.
>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>  This API sets the @nr continuous bits starting from pfn_start.
>>
>> Usage example with balloon:
>> 1) host requests to start ballooning;
>> 2) balloon driver get_free_page_hints and report the hints to host via 
>> report_vq;
>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free 
>> pages and put back pfn_start to ack_vq;
>> 4) balloon driver receives pfn_start and calls 
>> put_free_page_hints(pfn_start) to have the related bits from the bitmap to 
>> be set, indicating that those free pages are ready to be allocated.
> This sounds more like "the host requests to get free pages once in a
> while" compared to "the host is always informed about free pages". At
> the time where the host actually has to ask the guest (e.g. because the
> host is low on memory), it might be to late to wait for guest action.
> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
> as candidates for removal and if the host is low on memory, only
> scanning the guest page tables is sufficient to free up memory.
>
> But both points might just be an implementation detail in the example
> you describe.
>
>> In above 2), get_free_page_hints clears the bits which indicates that those 
>> pages are not ready to be used by the guest yet. Why?
>> This is because 3) will unmap the underlying physical pages from EPT. 
>> Normally, when guest re-visits those pages, EPT violations and QEMU page 
>> faults will get a new host page to set up the related EPT entry. If guest 
>> uses that page before the page gets unmapped (i.e. 

Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-12 Thread David Hildenbrand
On 12.02.19 10:03, Wang, Wei W wrote:
> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing freed
>> memory between the guest and the host. It enables the guests with no page
>> cache to rapidly free and reclaims memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and
>> single NUMA node with 15GB memory, we were able to successfully launch
>> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
>> explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and
>> scan the pages freed by the guest has been changed. The new approach is
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this
>> functionality. It maintains a per CPU array which is used to store the pages
>> freed by the guest. The maximum number of entries which it can hold is
>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>> is scanned and only the pages which are available in the buddy are stored.
>> This process continues until the array is filled with pages which are part of
>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>> and if the page is not reallocated and present in the buddy, the kernel
>> thread attempts to isolate it from the buddy. If it is successfully 
>> isolated, the
>> page is added to another per-cpu array. Once the entire scanning process is
>> complete, all the isolated pages are reported to the host through an existing
>> virtio-balloon driver.
> 
>  Hi Nitesh,
> 
> Have you guys thought about something like below, which would be simpler:

Responding because I'm the first to stumble over this mail, hah! :)

> 
> - use bitmaps to record free pages, e.g. xbitmap: 
> https://lkml.org/lkml/2018/1/9/304.
>   The bitmap can be indexed by the guest pfn, and it's globally accessed by 
> all the CPUs;

Global means all VCPUs will be competing potentially for a single lock
when freeing/allocating a page, no? What if you have 64 VCPUs
allocating/freeing memory like crazy?

(I assume some kind of locking is required even if the bitmap would be
atomic. Also, doesn't xbitmap mean that we eventually have to allocate
memory at places where we don't want to - e.g. from arch_free_page ?)

That's the big benefit of taking the pages off the buddy free list. Other
VCPUs won't stumble over them, waiting for them to get freed in the
hypervisor.

> - arch_free_page(): set the bits of the freed pages from the bitmap
>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
> - arch_alloc_page(): clear the related bits from the bitmap
> - expose 2 APIs for the callers:
>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int 
> nr); 
>  This API searches for the next free page chunk (@nr of pages), starting 
> from @pfn_start.
>  Bits of those free pages will be cleared after this function returns.
>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>  This API sets the @nr continuous bits starting from pfn_start.
> 
> Usage example with balloon:
> 1) host requests to start ballooning;
> 2) balloon driver get_free_page_hints and report the hints to host via 
> report_vq;
> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free 
> pages and put back pfn_start to ack_vq;
> 4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) 
> to have the related bits from the bitmap to be set, indicating that those 
> free pages are ready to be allocated.

This sounds more like "the host requests to get free pages once in a
while" compared to "the host is always informed about free pages". At
the time where the host actually has to ask the guest (e.g. because the
host is low on memory), it might be too late to wait for guest action.
Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
as candidates for removal and if the host is low on memory, only
scanning the guest page tables is sufficient to free up memory.

But both points might just be an implementation detail in the example
you describe.
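
For reference, on the host side the difference between the two boils down to the
madvise() flag; a sketch, where addr/len would be the host virtual range backing
the reported guest pages:

#include <sys/mman.h>

/*
 * MADV_DONTNEED drops the backing host pages immediately; MADV_FREE only
 * marks them as reclaimable, so they are actually freed lazily, when the
 * host comes under memory pressure.
 */
static int discard_guest_range(void *addr, size_t len, int lazy)
{
        return madvise(addr, len, lazy ? MADV_FREE : MADV_DONTNEED);
}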

> 
> In above 2), get_free_page_hints clears the bits which indicates that those 
> pages are not ready to be used by the guest yet. Why?
> This is because 3) will unmap the underlying physical pages from EPT. 
> Normally, when guest re-visits those pages, EPT violations and QEMU page 
> faults will get a new host page to set up the related EPT entry. If guest 
> uses that page before the page gets unmapped (i.e. right before step 3), no 
> EPT violation happens and the guest will use the same physical page that will 
> be unmapped and given to other host threads. So we 

RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-12 Thread Wang, Wei W
On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
> The following patch-set proposes an efficient mechanism for handing freed
> memory between the guest and the host. It enables the guests with no page
> cache to rapidly free and reclaims memory to and from the host respectively.
> 
> Benefit:
> With this patch-series, in our test-case, executed on a single system and
> single NUMA node with 15GB memory, we were able to successfully launch
> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
> explanation of the test procedure is provided at the bottom).
> 
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and
> scan the pages freed by the guest has been changed. The new approach is
> briefly described below:
> 
> The patch-set still leverages the existing arch_free_page() to add this
> functionality. It maintains a per CPU array which is used to store the pages
> freed by the guest. The maximum number of entries which it can hold is
> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
> is scanned and only the pages which are available in the buddy are stored.
> This process continues until the array is filled with pages which are part of
> the buddy free list. After which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
> and if the page is not reallocated and present in the buddy, the kernel
> thread attempts to isolate it from the buddy. If it is successfully isolated, 
> the
> page is added to another per-cpu array. Once the entire scanning process is
> complete, all the isolated pages are reported to the host through an existing
> virtio-balloon driver.

 Hi Nitesh,

Have you guys thought about something like below, which would be simpler:

- use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
  The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;
- arch_free_page(): set the bits of the freed pages in the bitmap
  (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
- arch_alloc_page(): clear the related bits from the bitmap
- expose 2 APIs for the callers:
  -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr);
     This API searches for the next free page chunk (@nr of pages), starting from
     @pfn_start. Bits of those free pages will be cleared after this function returns.
  -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
     This API sets the @nr continuous bits starting from pfn_start.

Usage example with balloon:
1) host requests to start ballooning;
2) balloon driver get_free_page_hints and report the hints to host via 
report_vq;
3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free 
pages and put back pfn_start to ack_vq;
4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) 
to have the related bits from the bitmap to be set, indicating that those free 
pages are ready to be allocated.
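
A guest-side sketch of how steps 2) and 4) could use the two proposed APIs (none
of this exists today; report_hint_to_host() and balloon_ack_handler() are
placeholders for the report_vq/ack_vq plumbing, and a return value of 0 is assumed
to mean "no further free chunk found"):

/* the two proposed APIs from above (not in the kernel today) */
unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr);
void put_free_page_hints(unsigned long pfn_start, unsigned int nr);

/* placeholder: push one free chunk onto the balloon's report_vq */
void report_hint_to_host(unsigned long pfn_start, unsigned int nr);

/* step 2): scan for free chunks and report them to the host */
static void balloon_report_free_pages(unsigned long start_pfn,
                                      unsigned long end_pfn,
                                      unsigned int chunk)
{
        unsigned long pfn = start_pfn;

        while (pfn < end_pfn) {
                unsigned long hint = get_free_page_hints(pfn, chunk);

                if (!hint)              /* assumption: 0 means nothing found */
                        break;
                report_hint_to_host(hint, chunk);
                pfn = hint + chunk;
        }
}

/* step 4): ack_vq handler, the host has zapped the EPT mapping for this chunk */
static void balloon_ack_handler(unsigned long pfn_start, unsigned int nr)
{
        put_free_page_hints(pfn_start, nr);     /* pages become ready again */
}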

In above 2), get_free_page_hints clears the bits which indicates that those 
pages are not ready to be used by the guest yet. Why?
This is because 3) will unmap the underlying physical pages from EPT. Normally, 
when guest re-visits those pages, EPT violations and QEMU page faults will get 
a new host page to set up the related EPT entry. If guest uses that page before 
the page gets unmapped (i.e. right before step 3), no EPT violation happens and 
the guest will use the same physical page that will be unmapped and given to 
other host threads. So we need to make sure that the guest free page is usable 
only after step 3 finishes.

Back to arch_alloc_page(): it needs to check if the allocated pages have "1"
set in the bitmap; if that's true, just clear the bits. Otherwise, it means
step 2) above has happened and step 4) hasn't been reached. In this case, we
can either have arch_alloc_page() busy-waiting a bit till 4) is done for that
page, or better, have a balloon callback which prioritizes 3) and 4) to make
this page usable by the guest.

Using bitmaps to record free page hints doesn't require taking the free pages
off the buddy list and returning them later, which would have to go through the
long allocation/free code path.

Best,
Wei


[RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

2019-02-04 Thread Nitesh Narayan Lal
The following patch-set proposes an efficient mechanism for handing freed 
memory between the guest and the host. It enables the guests with no page cache 
to rapidly free and reclaims memory to and from the host respectively.

Benefit:
With this patch-series, in our test-case, executed on a single system and 
single NUMA node with 15GB memory, we were able to successfully launch atleast 
5 guests 
when page hinting was enabled and 3 without it. (Detailed explanation of the 
test procedure is provided at the bottom).

Changelog in V8:
In this patch-series, the earlier approach [1] which was used to capture and 
scan the pages freed by the guest has been changed. The new approach is briefly 
described below:

The patch-set still leverages the existing arch_free_page() to add this 
functionality. It maintains a per CPU array which is used to store the pages 
freed by the guest. The maximum number of entries which it can hold is defined 
by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it is scanned 
and only the pages which are available in the buddy are stored. This process 
continues until the array is filled with pages which are part of the buddy free 
list. After which it wakes up a kernel per-cpu-thread.
This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation and 
if the page is not reallocated and present in the buddy, the kernel thread 
attempts to isolate it from the buddy. If it is successfully isolated, the page 
is added to another per-cpu array. Once the entire scanning process is 
complete, all the isolated pages are reported to the host through an existing 
virtio-balloon driver.
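
For illustration only, a rough sketch of the capture path just described;
MAX_FGPT_ENTRIES comes from the description, while the array layout and
wake_up_hinting_thread() are placeholders, and the buddy re-check and isolation
done by the per-cpu thread are not shown:

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/smp.h>

#define MAX_FGPT_ENTRIES 1000

struct hint_entry {
        unsigned long pfn;
        unsigned int order;
};

struct hint_array {
        struct hint_entry entry[MAX_FGPT_ENTRIES];
        unsigned int nr;
};

static DEFINE_PER_CPU(struct hint_array, free_hints);

/* placeholder: wakes the per-cpu thread that rescans and isolates pages */
void wake_up_hinting_thread(int cpu);

void arch_free_page(struct page *page, int order)
{
        struct hint_array *ha = get_cpu_ptr(&free_hints);

        if (ha->nr < MAX_FGPT_ENTRIES) {
                ha->entry[ha->nr].pfn = page_to_pfn(page);
                ha->entry[ha->nr].order = order;
                ha->nr++;
        }

        /* array full: hand the captured entries over to the scanning thread */
        if (ha->nr == MAX_FGPT_ENTRIES)
                wake_up_hinting_thread(smp_processor_id());

        put_cpu_ptr(&free_hints);
}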

Known Issues:
* Fixed array size: The problem with having a fixed/hardcoded array 
size arises when the size of the guest varies. For example when the guest size 
increases and it starts making large allocations fixed size limits this 
solution's ability to capture all the freed pages. This will result in less 
guest free memory getting reported to the host.

Known code re-work:
* Plan to re-use Wei's work, which communicates the poison value to the 
host.
* The nomenclatures used in virtio-balloon needs to be changed so that 
the code can easily be distinguished from Wei's Free Page Hint code.
* Sorting based on zonenum, to avoid repetitive zone locks for the same 
zone.

Other required work:
* Run other benchmarks to evaluate the performance/impact of this 
approach.

Test case:
Setup:
Memory-15837 MB
Guest Memory Size-5 GB
Swap-Disabled
Test Program-Simple program which allocates 4GB memory via malloc, touches it 
via memset and exits.
Use case-Number of guests that can be launched completely including the 
successful execution of the test program.
Procedure: 
The first guest is launched and once its console is up, the test allocation 
program is executed with 4 GB memory request (Due to this the guest occupies 
almost 4-5 GB of memory in the host in a system without page hinting). Once 
this program exits at that time another guest is launched in the host and the 
same process is followed. We continue launching the guests until a guest gets 
killed due to low memory condition in the host.

Result:
Without Hinting-3 Guests
With Hinting-5 to 7 Guests(Based on the amount of memory freed/captured).

[1] https://www.spinics.net/lists/kvm/msg170113.html