Re: [virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-09-04 Thread David Hildenbrand
>>> For some reason, I am not seeing this work as I would have expected
>>> but I don't have solid reasoning to share yet. It could be simply
>>> because I am putting my hook at the wrong place. I will continue
>>> investigating this.
>>>
>>> In any case, I may be over complicating things here, so please let me
>>> if there is a better way to do this.
>> I have already been demonstrating the "better way" I think there is to
>> do this. I will push v7 of it early next week unless there is some
>> other feedback. By putting the bit in the page and controlling what
>> comes into and out of the lists it makes most of this quite a bit
>> easier. The only limitation is you have to modify where things get
>> placed in the lists so you don't create a "vapor lock" that would
>> stall the feed of pages into the reporting engine.
>>
>>> If this overhead is not significant we can probably live with it.
>> You have bigger issues you still have to overcome as I recall. Didn't
>> you still need to sort out hotplu
> 
> For memory hotplug, my impression is that it should
> not be a blocker for taking the first step (in case we do decide to
> go ahead with this approach). Another reason why I am considering
> this as future work is that memory hot(un)plug is still under
> development and requires fixing. (Specifically, issue such as zone
> shrinking which will directly impact the bitmap approach is still
> under discussion).

Memory hotplug is way more reliable than memory unplug - however, but
both are used in production. E.g. the zone shrinking you mention is
something that is not "optimal", but works in most scenarios. So both
features are rather in a "production/fixing" stage than a pure
"development" stage.

However, I do agree that memory hot(un)plug is not something
high-priority for free page hinting in the first shot. If it works out
of the box (Alexander's approach) - great - if not it is just one
additional scenario where free page hinting might not be optimal yet
(like vfio where we can't use it completely right now). After all, most
virtual environment (under KVM) I am aware of don't use DIMM-base memory
hot(un)plug at all.

The important part is that it will not break when memory is added/removed.

> 
>> g and a sparse map with a wide span
>> in a zone? Without those resolved the bitmap approach is still a no-go
>> regardless of performance.
> 
> For sparsity, the memory wastage should not be significant as I
> am tracking pages on the granularity of (MAX_ORDER - 2) and maintaining
> the bitmaps on a per-zone basis (which was not the case earlier).

To handle sparsity one would have to have separate bitmaps for each
section. Roughly 64 bit per 128MB section. Scanning the scattered bitmap
would get more expensive. Of course, one could have a hierarchical
bitmap to speed up the search etc.

But again, I think we should have a look how much of an issue sparse
zones/nodes actually are in virtual machines (KVM). The setups I am
aware of minimize sparsity (e.g., QEMU will place DIMMs consecutively in
memory, only aligning to e.g, 128MB on x86 if required). Usually all
memory in VMs is onlined to the NORMAL zone (except special cases of
course). The only "issue" would be if you start mixing DIMMs of
different nodes when hotplugging memory. The overhead for 1bit per 2MB
could be almost neglectable in most setups.

But I do agree that for the bitmap-based approach there might be more
scenarios where you'd rather not want to use free page hinting and
instead disable it.

> 
> However, if you do consider this as a block I will think about it and try to 
> fix it.
> In the worst case, if I don't find a solution I will add this as a known 
> limitation
> for this approach in my cover.
> 
>> - Alex


-- 

Thanks,

David / dhildenb

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-30 Thread Nitesh Narayan Lal


On 8/30/19 11:31 AM, Alexander Duyck wrote:
> On Fri, Aug 30, 2019 at 8:15 AM Nitesh Narayan Lal  wrote:
>>
>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  
>>> wrote:
 This patch introduces the core infrastructure for free page reporting in
 virtual environments. It enables the kernel to track the free pages which
 can be reported to its hypervisor so that the hypervisor could
 free and reuse that memory as per its requirement.

 While the pages are getting processed in the hypervisor (e.g.,
 via MADV_DONTNEED), the guest must not use them, otherwise, data loss
 would be possible. To avoid such a situation, these pages are
 temporarily removed from the buddy. The amount of pages removed
 temporarily from the buddy is governed by the backend(virtio-balloon
 in our case).

 To efficiently identify free pages that can to be reported to the
 hypervisor, bitmaps in a coarse granularity are used. Only fairly big
 chunks are reported to the hypervisor - especially, to not break up THP
 in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
 in the bitmap are an indication whether a page *might* be free, not a
 guarantee. A new hook after buddy merging sets the bits.

 Bitmaps are stored per zone, protected by the zone lock. A workqueue
 asynchronously processes the bitmaps, trying to isolate and report pages
 that are still free. The backend (virtio-balloon) is responsible for
 reporting these batched pages to the host synchronously. Once reporting/
 freeing is complete, isolated pages are returned back to the buddy.

 Signed-off-by: Nitesh Narayan Lal 
>> [...]
 +static void scan_zone_bitmap(struct page_reporting_config *phconf,
 +struct zone *zone)
 +{
 +   unsigned long setbit;
 +   struct page *page;
 +   int count = 0;
 +
 +   sg_init_table(phconf->sg, phconf->max_pages);
 +
 +   for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
 +   /* Process only if the page is still online */
 +   page = pfn_to_online_page((setbit << 
 PAGE_REPORTING_MIN_ORDER) +
 + zone->base_pfn);
 +   if (!page)
 +   continue;
 +
>>> Shouldn't you be clearing the bit and dropping the reference to
>>> free_pages before you move on to the next bit? Otherwise you are going
>>> to be stuck with those aren't you?
>>>
 +   spin_lock(>lock);
 +
 +   /* Ensure page is still free and can be processed */
 +   if (PageBuddy(page) && page_private(page) >=
 +   PAGE_REPORTING_MIN_ORDER)
 +   count = process_free_page(page, phconf, count);
 +
 +   spin_unlock(>lock);
>>> So I kind of wonder just how much overhead you are taking for bouncing
>>> the zone lock once per page here. Especially since it can result in
>>> you not actually making any progress since the page may have already
>>> been reallocated.
>>>
>> I am wondering if there is a way to measure this overhead?
>> After thinking about this, I do understand your point.
>> One possible way which I can think of to address this is by having a
>> page_reporting_dequeue() hook somewhere in the allocation path.
> Really in order to stress this you probably need to have a lot of
> CPUs, a lot of memory, and something that forces a lot of pages to get
> hit such as the memory shuffling feature.

I will think about it, thanks for the suggestion.

>
>> For some reason, I am not seeing this work as I would have expected
>> but I don't have solid reasoning to share yet. It could be simply
>> because I am putting my hook at the wrong place. I will continue
>> investigating this.
>>
>> In any case, I may be over complicating things here, so please let me
>> if there is a better way to do this.
> I have already been demonstrating the "better way" I think there is to
> do this. I will push v7 of it early next week unless there is some
> other feedback. By putting the bit in the page and controlling what
> comes into and out of the lists it makes most of this quite a bit
> easier. The only limitation is you have to modify where things get
> placed in the lists so you don't create a "vapor lock" that would
> stall the feed of pages into the reporting engine.
>
>> If this overhead is not significant we can probably live with it.
> You have bigger issues you still have to overcome as I recall. Didn't
> you still need to sort out hotplu

For memory hotplug, my impression is that it should
not be a blocker for taking the first step (in case we do decide to
go ahead with this approach). Another reason why I am considering
this as future work is that memory hot(un)plug is still under
development and requires 

[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-30 Thread Alexander Duyck
On Fri, Aug 30, 2019 at 8:15 AM Nitesh Narayan Lal  wrote:
>
>
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> > On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  
> > wrote:
> >> This patch introduces the core infrastructure for free page reporting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can to be reported to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> Signed-off-by: Nitesh Narayan Lal 
> [...]
> >> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
> >> +struct zone *zone)
> >> +{
> >> +   unsigned long setbit;
> >> +   struct page *page;
> >> +   int count = 0;
> >> +
> >> +   sg_init_table(phconf->sg, phconf->max_pages);
> >> +
> >> +   for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
> >> +   /* Process only if the page is still online */
> >> +   page = pfn_to_online_page((setbit << 
> >> PAGE_REPORTING_MIN_ORDER) +
> >> + zone->base_pfn);
> >> +   if (!page)
> >> +   continue;
> >> +
> > Shouldn't you be clearing the bit and dropping the reference to
> > free_pages before you move on to the next bit? Otherwise you are going
> > to be stuck with those aren't you?
> >
> >> +   spin_lock(>lock);
> >> +
> >> +   /* Ensure page is still free and can be processed */
> >> +   if (PageBuddy(page) && page_private(page) >=
> >> +   PAGE_REPORTING_MIN_ORDER)
> >> +   count = process_free_page(page, phconf, count);
> >> +
> >> +   spin_unlock(>lock);
> > So I kind of wonder just how much overhead you are taking for bouncing
> > the zone lock once per page here. Especially since it can result in
> > you not actually making any progress since the page may have already
> > been reallocated.
> >
>
> I am wondering if there is a way to measure this overhead?
> After thinking about this, I do understand your point.
> One possible way which I can think of to address this is by having a
> page_reporting_dequeue() hook somewhere in the allocation path.

Really in order to stress this you probably need to have a lot of
CPUs, a lot of memory, and something that forces a lot of pages to get
hit such as the memory shuffling feature.

> For some reason, I am not seeing this work as I would have expected
> but I don't have solid reasoning to share yet. It could be simply
> because I am putting my hook at the wrong place. I will continue
> investigating this.
>
> In any case, I may be over complicating things here, so please let me
> if there is a better way to do this.

I have already been demonstrating the "better way" I think there is to
do this. I will push v7 of it early next week unless there is some
other feedback. By putting the bit in the page and controlling what
comes into and out of the lists it makes most of this quite a bit
easier. The only limitation is you have to modify where things get
placed in the lists so you don't create a "vapor lock" that would
stall the feed of pages into the reporting engine.

> If this overhead is not significant we can probably live with it.

You have bigger issues you still have to overcome as I recall. Didn't
you still need to sort out hotplug and a sparse map with a wide span
in a zone? Without those resolved the bitmap approach is still a no-go
regardless of performance.

- Alex

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-30 Thread Nitesh Narayan Lal


On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal 
[...]
>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +struct zone *zone)
>> +{
>> +   unsigned long setbit;
>> +   struct page *page;
>> +   int count = 0;
>> +
>> +   sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +   for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +   /* Process only if the page is still online */
>> +   page = pfn_to_online_page((setbit << 
>> PAGE_REPORTING_MIN_ORDER) +
>> + zone->base_pfn);
>> +   if (!page)
>> +   continue;
>> +
> Shouldn't you be clearing the bit and dropping the reference to
> free_pages before you move on to the next bit? Otherwise you are going
> to be stuck with those aren't you?
>
>> +   spin_lock(>lock);
>> +
>> +   /* Ensure page is still free and can be processed */
>> +   if (PageBuddy(page) && page_private(page) >=
>> +   PAGE_REPORTING_MIN_ORDER)
>> +   count = process_free_page(page, phconf, count);
>> +
>> +   spin_unlock(>lock);
> So I kind of wonder just how much overhead you are taking for bouncing
> the zone lock once per page here. Especially since it can result in
> you not actually making any progress since the page may have already
> been reallocated.
>

I am wondering if there is a way to measure this overhead?
After thinking about this, I do understand your point.
One possible way which I can think of to address this is by having a
page_reporting_dequeue() hook somewhere in the allocation path.

For some reason, I am not seeing this work as I would have expected
but I don't have solid reasoning to share yet. It could be simply
because I am putting my hook at the wrong place. I will continue
investigating this.

In any case, I may be over complicating things here, so please let me
if there is a better way to do this.

If this overhead is not significant we can probably live with it.

-- 
Thanks
Nitesh


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-20 Thread Nitesh Narayan Lal


On 8/12/19 4:04 PM, Nitesh Narayan Lal wrote:
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  wrote:
>>> This patch introduces the core infrastructure for free page reporting in
>>> virtual environments. It enables the kernel to track the free pages which
>>> can be reported to its hypervisor so that the hypervisor could
>>> free and reuse that memory as per its requirement.
>>>
>>> While the pages are getting processed in the hypervisor (e.g.,
>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>> would be possible. To avoid such a situation, these pages are
>>> temporarily removed from the buddy. The amount of pages removed
>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>> in our case).
>>>
>>> To efficiently identify free pages that can to be reported to the
>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>> chunks are reported to the hypervisor - especially, to not break up THP
>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>> in the bitmap are an indication whether a page *might* be free, not a
>>> guarantee. A new hook after buddy merging sets the bits.
>>>
>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>> that are still free. The backend (virtio-balloon) is responsible for
>>> reporting these batched pages to the host synchronously. Once reporting/
>>> freeing is complete, isolated pages are returned back to the buddy.
>>>
>>> Signed-off-by: Nitesh Narayan Lal 
>> So if I understand correctly the hotplug support for this is still not
>> included correct? 
>
> That is correct, I have it as an ongoing-item in my cover-email.
> In case, we decide to go with this approach do you think it is a blocker?

I am planning to defer memory hotplug/hotremove support for the future. Due to
following reasons:
* I would like to first get a basic framework ready and merged (in case we
  decide to go ahead with this approach) and then build on top of it.
* Memory hotplug/hotremove is not a primary use case in our mind right now and
  hence I am not considering this as a blocker for the first step.

Following are the items which I intend to address before my next submission:
* Use zone flag and reference counter to track the number of zones requesting
  page reporting.
* Move the bitmap and its respective fields into a structure whose rcu-protected
  pointer object is maintained on a per-zone basis.
* Pick Alexander's patch for page poisoning support and test them with my patch
  set. (@Alexander: I will keep your signed-off for these patches to indicate
  you are the original author, please do let me know there is a better way to
  give credit).
* Address other suggestions/comments received on v12.

Looking forward to any suggestions/comments.


[...]
> +
> +   /* assign the configuration object provided by the backend */
> +   rcu_assign_pointer(page_reporting_conf, phconf);
> +
> +out:
> +   mutex_unlock(_reporting_mutex);
> +   return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_enable);
> --
> 2.21.0
>
-- 
Thanks
Nitesh


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-16 Thread Nitesh Narayan Lal


On 8/15/19 7:00 PM, Alexander Duyck wrote:
> On Thu, Aug 15, 2019 at 12:23 PM Nitesh Narayan Lal  wrote:
[...]
>>> +}
>>> +
>>> +/**
>>> + * __page_reporting_enqueue - tracks the freed page in the respective 
>>> zone's
>>> + * bitmap and enqueues a new page reporting job to the workqueue if 
>>> possible.
>>> + */
>>> +void __page_reporting_enqueue(struct page *page)
>>> +{
>>> +   struct page_reporting_config *phconf;
>>> +   struct zone *zone;
>>> +
>>> +   rcu_read_lock();
>>> +   /*
>>> +* We should not process this page if either page reporting is 
>>> not
>>> +* yet completely enabled or it has been disabled by the 
>>> backend.
>>> +*/
>>> +   phconf = rcu_dereference(page_reporting_conf);
>>> +   if (!phconf)
>>> +   return;
>>> +
>>> +   zone = page_zone(page);
>>> +   bitmap_set_bit(page, zone);
>>> +
>>> +   /*
>>> +* We should not enqueue a job if a previously enqueued 
>>> reporting work
>>> +* is in progress or we don't have enough free pages in the 
>>> zone.
>>> +*/
>>> +   if (atomic_read(>free_pages) >= phconf->max_pages &&
>>> +   !atomic_cmpxchg(>refcnt, 0, 1))
>> This doesn't make any sense to me. Why are you only incrementing the
>> refcount if it is zero? Combining this with the assignment above, this
>> isn't really a refcnt. It is just an oversized bitflag.
> The intent for having an extra variable was to ensure that at a time only 
> one
> reporting job is enqueued. I do agree that for that purpose I really 
> don't need
> a reference counter and I should have used something like bool
> 'page_hinting_active'. But with bool, I think there could be a possible 
> chance
> of race. Maybe I should rename this variable and keep it as atomic.
> Any thoughts?
 You could just use a bitflag to achieve what you are doing here. That
 is the primary use case for many of the test_and_set_bit type
 operations. However one issue with doing it as a bitflag is that you
 have no way of telling that you took care of all requesters.
>>> I think you are right, I might end up missing on certain reporting
>>> opportunities in some special cases. Specifically when the pages which are
>>> part of this new reporting request belongs to a section of the bitmap which
>>> has already been scanned. Although, I have failed to reproduce this kind of
>>> situation in an actual setup.
>>>
  That is
 where having an actual reference count comes in handy as you know
 exactly how many zones are requesting to be reported on.
>>> True.
>>>
>> Also I am pretty sure this results in the opportunity to miss pages
>> because there is nothing to prevent you from possibly missing a ton of
>> pages you could hint on if a large number of pages are pushed out all
>> at once and then the system goes idle in terms of memory allocation
>> and freeing.
> I was looking at how you are enqueuing/processing reporting jobs for each 
> zone.
> I am wondering if I should also consider something on similar lines as 
> having
> that I might be able to address the concern which you have raised above. 
> But it
> would also mean that I have to add an additional flag in the zone_flags. 
> :)
 You could do it either in the zone or outside the zone as yet another
 bitmap. I decided to put the flags inside the zone because there was a
 number of free bits there and it should be faster since we were
 already using the zone structure.
>>> There are two possibilities which could happen while I am reporting:
>>> 1. Another request might come in for a different zone.
>>> 2. Another request could come in for the same zone and the pages belong to a
>>> section of the bitmap which has already been scanned.
>>>
>>> Having a per zone flag to indicate reporting status will solve the first
>>> issue and to an extent the second as well. Having refcnt will possibly solve
>>> both of them. What I am wondering about is that in my case I could easily
>>> impact the performance negatively by performing more bitmap scanning.
>>>
>>>
>> I realized that it may not be possible for me to directly adopt either refcnt
>> or zone flags just because of the way I have page reporting setup right now.
>>
>> For now, I will just replace the refcnt with a bitflag as that should work
>> for most of the cases.  Nevertheless, I will also keep looking for a better 
>> way.
> If nothing else something you could consider is a refcnt for the
> number of bits you have set in your bitfield. Then all you would need
> to be doing is replace the cmpxchg with just a atomic_fetch_inc and
> what you would need to do is have your worker thread track how many
> bits it has cleared and subtract that from 

[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-15 Thread Alexander Duyck
On Thu, Aug 15, 2019 at 12:23 PM Nitesh Narayan Lal  wrote:
>
>
> On 8/15/19 9:15 AM, Nitesh Narayan Lal wrote:
> > On 8/14/19 12:11 PM, Alexander Duyck wrote:
> >> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal  
> >> wrote:
> >>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>  On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  
>  wrote:
> > This patch introduces the core infrastructure for free page reporting in
> > virtual environments. It enables the kernel to track the free pages 
> > which
> > can be reported to its hypervisor so that the hypervisor could
> > free and reuse that memory as per its requirement.
> >
> > While the pages are getting processed in the hypervisor (e.g.,
> > via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> > would be possible. To avoid such a situation, these pages are
> > temporarily removed from the buddy. The amount of pages removed
> > temporarily from the buddy is governed by the backend(virtio-balloon
> > in our case).
> >
> > To efficiently identify free pages that can to be reported to the
> > hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> > chunks are reported to the hypervisor - especially, to not break up THP
> > in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> > in the bitmap are an indication whether a page *might* be free, not a
> > guarantee. A new hook after buddy merging sets the bits.
> >
> > Bitmaps are stored per zone, protected by the zone lock. A workqueue
> > asynchronously processes the bitmaps, trying to isolate and report pages
> > that are still free. The backend (virtio-balloon) is responsible for
> > reporting these batched pages to the host synchronously. Once reporting/
> > freeing is complete, isolated pages are returned back to the buddy.
> >
> > Signed-off-by: Nitesh Narayan Lal 
> >>> [...]
> > +}
> > +
> > +/**
> > + * __page_reporting_enqueue - tracks the freed page in the respective 
> > zone's
> > + * bitmap and enqueues a new page reporting job to the workqueue if 
> > possible.
> > + */
> > +void __page_reporting_enqueue(struct page *page)
> > +{
> > +   struct page_reporting_config *phconf;
> > +   struct zone *zone;
> > +
> > +   rcu_read_lock();
> > +   /*
> > +* We should not process this page if either page reporting is 
> > not
> > +* yet completely enabled or it has been disabled by the 
> > backend.
> > +*/
> > +   phconf = rcu_dereference(page_reporting_conf);
> > +   if (!phconf)
> > +   return;
> > +
> > +   zone = page_zone(page);
> > +   bitmap_set_bit(page, zone);
> > +
> > +   /*
> > +* We should not enqueue a job if a previously enqueued 
> > reporting work
> > +* is in progress or we don't have enough free pages in the 
> > zone.
> > +*/
> > +   if (atomic_read(>free_pages) >= phconf->max_pages &&
> > +   !atomic_cmpxchg(>refcnt, 0, 1))
>  This doesn't make any sense to me. Why are you only incrementing the
>  refcount if it is zero? Combining this with the assignment above, this
>  isn't really a refcnt. It is just an oversized bitflag.
> >>> The intent for having an extra variable was to ensure that at a time only 
> >>> one
> >>> reporting job is enqueued. I do agree that for that purpose I really 
> >>> don't need
> >>> a reference counter and I should have used something like bool
> >>> 'page_hinting_active'. But with bool, I think there could be a possible 
> >>> chance
> >>> of race. Maybe I should rename this variable and keep it as atomic.
> >>> Any thoughts?
> >> You could just use a bitflag to achieve what you are doing here. That
> >> is the primary use case for many of the test_and_set_bit type
> >> operations. However one issue with doing it as a bitflag is that you
> >> have no way of telling that you took care of all requesters.
> > I think you are right, I might end up missing on certain reporting
> > opportunities in some special cases. Specifically when the pages which are
> > part of this new reporting request belongs to a section of the bitmap which
> > has already been scanned. Although, I have failed to reproduce this kind of
> > situation in an actual setup.
> >
> >>  That is
> >> where having an actual reference count comes in handy as you know
> >> exactly how many zones are requesting to be reported on.
> >
> > True.
> >
>  Also I am pretty sure this results in the opportunity to miss pages
>  because there is nothing to prevent you from possibly missing a ton of
>  pages you could hint on if a large number of pages are pushed out all
>  at once and then the system goes idle in terms of memory allocation
>  and freeing.
> >>> I 

[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-15 Thread Nitesh Narayan Lal


On 8/15/19 9:15 AM, Nitesh Narayan Lal wrote:
> On 8/14/19 12:11 PM, Alexander Duyck wrote:
>> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal  wrote:
>>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
 On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  
 wrote:
> This patch introduces the core infrastructure for free page reporting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
>
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend(virtio-balloon
> in our case).
>
> To efficiently identify free pages that can to be reported to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
>
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
>
> Signed-off-by: Nitesh Narayan Lal 
>>> [...]
> +}
> +
> +/**
> + * __page_reporting_enqueue - tracks the freed page in the respective 
> zone's
> + * bitmap and enqueues a new page reporting job to the workqueue if 
> possible.
> + */
> +void __page_reporting_enqueue(struct page *page)
> +{
> +   struct page_reporting_config *phconf;
> +   struct zone *zone;
> +
> +   rcu_read_lock();
> +   /*
> +* We should not process this page if either page reporting is not
> +* yet completely enabled or it has been disabled by the backend.
> +*/
> +   phconf = rcu_dereference(page_reporting_conf);
> +   if (!phconf)
> +   return;
> +
> +   zone = page_zone(page);
> +   bitmap_set_bit(page, zone);
> +
> +   /*
> +* We should not enqueue a job if a previously enqueued reporting 
> work
> +* is in progress or we don't have enough free pages in the zone.
> +*/
> +   if (atomic_read(>free_pages) >= phconf->max_pages &&
> +   !atomic_cmpxchg(>refcnt, 0, 1))
 This doesn't make any sense to me. Why are you only incrementing the
 refcount if it is zero? Combining this with the assignment above, this
 isn't really a refcnt. It is just an oversized bitflag.
>>> The intent for having an extra variable was to ensure that at a time only 
>>> one
>>> reporting job is enqueued. I do agree that for that purpose I really don't 
>>> need
>>> a reference counter and I should have used something like bool
>>> 'page_hinting_active'. But with bool, I think there could be a possible 
>>> chance
>>> of race. Maybe I should rename this variable and keep it as atomic.
>>> Any thoughts?
>> You could just use a bitflag to achieve what you are doing here. That
>> is the primary use case for many of the test_and_set_bit type
>> operations. However one issue with doing it as a bitflag is that you
>> have no way of telling that you took care of all requesters.
> I think you are right, I might end up missing on certain reporting
> opportunities in some special cases. Specifically when the pages which are
> part of this new reporting request belongs to a section of the bitmap which
> has already been scanned. Although, I have failed to reproduce this kind of
> situation in an actual setup.
>
>>  That is
>> where having an actual reference count comes in handy as you know
>> exactly how many zones are requesting to be reported on.
>
> True.
>
 Also I am pretty sure this results in the opportunity to miss pages
 because there is nothing to prevent you from possibly missing a ton of
 pages you could hint on if a large number of pages are pushed out all
 at once and then the system goes idle in terms of memory allocation
 and freeing.
>>> I was looking at how you are enqueuing/processing reporting jobs for each 
>>> zone.
>>> I am wondering if I should also consider something on similar lines as 
>>> having
>>> that I might be able to address the concern which you have raised above. 
>>> But it
>>> would also mean that I have to add an 

[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-15 Thread Nitesh Narayan Lal


On 8/14/19 12:11 PM, Alexander Duyck wrote:
> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal  wrote:
>>
>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  
>>> wrote:
 This patch introduces the core infrastructure for free page reporting in
 virtual environments. It enables the kernel to track the free pages which
 can be reported to its hypervisor so that the hypervisor could
 free and reuse that memory as per its requirement.

 While the pages are getting processed in the hypervisor (e.g.,
 via MADV_DONTNEED), the guest must not use them, otherwise, data loss
 would be possible. To avoid such a situation, these pages are
 temporarily removed from the buddy. The amount of pages removed
 temporarily from the buddy is governed by the backend(virtio-balloon
 in our case).

 To efficiently identify free pages that can to be reported to the
 hypervisor, bitmaps in a coarse granularity are used. Only fairly big
 chunks are reported to the hypervisor - especially, to not break up THP
 in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
 in the bitmap are an indication whether a page *might* be free, not a
 guarantee. A new hook after buddy merging sets the bits.

 Bitmaps are stored per zone, protected by the zone lock. A workqueue
 asynchronously processes the bitmaps, trying to isolate and report pages
 that are still free. The backend (virtio-balloon) is responsible for
 reporting these batched pages to the host synchronously. Once reporting/
 freeing is complete, isolated pages are returned back to the buddy.

 Signed-off-by: Nitesh Narayan Lal 
>> [...]
 +}
 +
 +/**
 + * __page_reporting_enqueue - tracks the freed page in the respective 
 zone's
 + * bitmap and enqueues a new page reporting job to the workqueue if 
 possible.
 + */
 +void __page_reporting_enqueue(struct page *page)
 +{
 +   struct page_reporting_config *phconf;
 +   struct zone *zone;
 +
 +   rcu_read_lock();
 +   /*
 +* We should not process this page if either page reporting is not
 +* yet completely enabled or it has been disabled by the backend.
 +*/
 +   phconf = rcu_dereference(page_reporting_conf);
 +   if (!phconf)
 +   return;
 +
 +   zone = page_zone(page);
 +   bitmap_set_bit(page, zone);
 +
 +   /*
 +* We should not enqueue a job if a previously enqueued reporting 
 work
 +* is in progress or we don't have enough free pages in the zone.
 +*/
 +   if (atomic_read(>free_pages) >= phconf->max_pages &&
 +   !atomic_cmpxchg(>refcnt, 0, 1))
>>> This doesn't make any sense to me. Why are you only incrementing the
>>> refcount if it is zero? Combining this with the assignment above, this
>>> isn't really a refcnt. It is just an oversized bitflag.
>>
>> The intent for having an extra variable was to ensure that at a time only one
>> reporting job is enqueued. I do agree that for that purpose I really don't 
>> need
>> a reference counter and I should have used something like bool
>> 'page_hinting_active'. But with bool, I think there could be a possible 
>> chance
>> of race. Maybe I should rename this variable and keep it as atomic.
>> Any thoughts?
> You could just use a bitflag to achieve what you are doing here. That
> is the primary use case for many of the test_and_set_bit type
> operations. However one issue with doing it as a bitflag is that you
> have no way of telling that you took care of all requesters.

I think you are right, I might end up missing on certain reporting
opportunities in some special cases. Specifically when the pages which are
part of this new reporting request belongs to a section of the bitmap which
has already been scanned. Although, I have failed to reproduce this kind of
situation in an actual setup.

>  That is
> where having an actual reference count comes in handy as you know
> exactly how many zones are requesting to be reported on.


True.

>
>>> Also I am pretty sure this results in the opportunity to miss pages
>>> because there is nothing to prevent you from possibly missing a ton of
>>> pages you could hint on if a large number of pages are pushed out all
>>> at once and then the system goes idle in terms of memory allocation
>>> and freeing.
>>
>> I was looking at how you are enqueuing/processing reporting jobs for each 
>> zone.
>> I am wondering if I should also consider something on similar lines as having
>> that I might be able to address the concern which you have raised above. But 
>> it
>> would also mean that I have to add an additional flag in the zone_flags. :)
> You could do it either in the zone or outside the zone as yet another
> bitmap. I decided to put the flags 

[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-14 Thread Alexander Duyck
On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal  wrote:
>
>
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> > On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  
> > wrote:
> >> This patch introduces the core infrastructure for free page reporting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can to be reported to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> Signed-off-by: Nitesh Narayan Lal 
> >
> [...]
> >> +}
> >> +
> >> +/**
> >> + * __page_reporting_enqueue - tracks the freed page in the respective 
> >> zone's
> >> + * bitmap and enqueues a new page reporting job to the workqueue if 
> >> possible.
> >> + */
> >> +void __page_reporting_enqueue(struct page *page)
> >> +{
> >> +   struct page_reporting_config *phconf;
> >> +   struct zone *zone;
> >> +
> >> +   rcu_read_lock();
> >> +   /*
> >> +* We should not process this page if either page reporting is not
> >> +* yet completely enabled or it has been disabled by the backend.
> >> +*/
> >> +   phconf = rcu_dereference(page_reporting_conf);
> >> +   if (!phconf)
> >> +   return;
> >> +
> >> +   zone = page_zone(page);
> >> +   bitmap_set_bit(page, zone);
> >> +
> >> +   /*
> >> +* We should not enqueue a job if a previously enqueued reporting 
> >> work
> >> +* is in progress or we don't have enough free pages in the zone.
> >> +*/
> >> +   if (atomic_read(>free_pages) >= phconf->max_pages &&
> >> +   !atomic_cmpxchg(>refcnt, 0, 1))
> > This doesn't make any sense to me. Why are you only incrementing the
> > refcount if it is zero? Combining this with the assignment above, this
> > isn't really a refcnt. It is just an oversized bitflag.
>
>
> The intent for having an extra variable was to ensure that at a time only one
> reporting job is enqueued. I do agree that for that purpose I really don't 
> need
> a reference counter and I should have used something like bool
> 'page_hinting_active'. But with bool, I think there could be a possible chance
> of race. Maybe I should rename this variable and keep it as atomic.
> Any thoughts?

You could just use a bitflag to achieve what you are doing here. That
is the primary use case for many of the test_and_set_bit type
operations. However one issue with doing it as a bitflag is that you
have no way of telling that you took care of all requesters. That is
where having an actual reference count comes in handy as you know
exactly how many zones are requesting to be reported on.

> >
> > Also I am pretty sure this results in the opportunity to miss pages
> > because there is nothing to prevent you from possibly missing a ton of
> > pages you could hint on if a large number of pages are pushed out all
> > at once and then the system goes idle in terms of memory allocation
> > and freeing.
>
>
> I was looking at how you are enqueuing/processing reporting jobs for each 
> zone.
> I am wondering if I should also consider something on similar lines as having
> that I might be able to address the concern which you have raised above. But 
> it
> would also mean that I have to add an additional flag in the zone_flags. :)

You could do it either in the zone or outside the zone as yet another
bitmap. I decided to put the flags inside the zone because there was a
number of free bits there and it should be faster since we were
already using the zone structure.

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-14 Thread Nitesh Narayan Lal


On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal 
>
[...]
>> +}
>> +
>> +/**
>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>> + * bitmap and enqueues a new page reporting job to the workqueue if 
>> possible.
>> + */
>> +void __page_reporting_enqueue(struct page *page)
>> +{
>> +   struct page_reporting_config *phconf;
>> +   struct zone *zone;
>> +
>> +   rcu_read_lock();
>> +   /*
>> +* We should not process this page if either page reporting is not
>> +* yet completely enabled or it has been disabled by the backend.
>> +*/
>> +   phconf = rcu_dereference(page_reporting_conf);
>> +   if (!phconf)
>> +   return;
>> +
>> +   zone = page_zone(page);
>> +   bitmap_set_bit(page, zone);
>> +
>> +   /*
>> +* We should not enqueue a job if a previously enqueued reporting 
>> work
>> +* is in progress or we don't have enough free pages in the zone.
>> +*/
>> +   if (atomic_read(>free_pages) >= phconf->max_pages &&
>> +   !atomic_cmpxchg(>refcnt, 0, 1))
> This doesn't make any sense to me. Why are you only incrementing the
> refcount if it is zero? Combining this with the assignment above, this
> isn't really a refcnt. It is just an oversized bitflag.


The intent for having an extra variable was to ensure that at a time only one
reporting job is enqueued. I do agree that for that purpose I really don't need
a reference counter and I should have used something like bool
'page_hinting_active'. But with bool, I think there could be a possible chance
of race. Maybe I should rename this variable and keep it as atomic.
Any thoughts?


>
> Also I am pretty sure this results in the opportunity to miss pages
> because there is nothing to prevent you from possibly missing a ton of
> pages you could hint on if a large number of pages are pushed out all
> at once and then the system goes idle in terms of memory allocation
> and freeing.


I was looking at how you are enqueuing/processing reporting jobs for each zone.
I am wondering if I should also consider something on similar lines as having
that I might be able to address the concern which you have raised above. But it
would also mean that I have to add an additional flag in the zone_flags. :)

>
[...]
>
>> +EXPORT_SYMBOL_GPL(page_reporting_enable);
>> --
>> 2.21.0
>>
-- 
Thanks
Nitesh

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-14 Thread Nitesh Narayan Lal


On 8/14/19 3:07 AM, David Hildenbrand wrote:
> On 14.08.19 01:14, Alexander Duyck wrote:
>> On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand  wrote:
>>> +static int process_free_page(struct page *page,
>>> +struct page_reporting_config *phconf, int 
>>> count)
>>> +{
>>> +   int mt, order, ret = 0;
>>> +
>>> +   mt = get_pageblock_migratetype(page);
>>> +   order = page_private(page);
>>> +   ret = __isolate_free_page(page, order);
>>> +
> I just started looking into the wonderful world of
> isolation/compaction/migration.
>
> I don't think saving/restoring the migratetype is correct here. AFAIK,
> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> MOVABLE. So that shouldn't be an issue - I guess.
>
> 1. You should never allocate something that is no
> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> CMA here. There should at least be a !is_migrate_isolate_page() check
> somewhere
>
> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> with isolation code, you have to hold the zone lock. Your code seems to
> do that, so at least you cannot race against isolation.
>
> 3. You could end up temporarily allocating something in the
> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> would have to be a way to make alloc_contig_range()/offlining code
> properly wait until the pages have been processed. Not sure about the
> real implications, though - too many details in the code (I wonder if
> Alex' series has a way of dealing with that)
>
> When you restore the migratetype, you could suddenly overwrite e.g.,
> ISOLATE, which feels wrong.

 I was triggering an occasional CPU stall bug earlier, with saving and 
 restoring
 the migratetype I was able to fix it.
 But I will further look into this to figure out if it is really required.

>>> You should especially look into handling isolated/cma pages. Maybe that
>>> was the original issue. Alex seems to have added that in his latest
>>> series (skipping isolated/cma pageblocks completely) as well.
>> So as far as skipping isolated pageblocks, I get the reason for
>> skipping isolated, but why would we need to skip CMA? I had made the
>> change I did based on comments you had made earlier. But while working
>> on some of the changes to address isolation better and looking over
>> several spots in the code it seems like CMA is already being used as
>> an allocation fallback for MIGRATE_MOVABLE. If that is the case
>> wouldn't it make sense to allow pulling pages and reporting them while
>> they are in the free_list?
> I was assuming that CMA is also to be skipped because "static int
> fallbacks[MIGRATE_TYPES][4]" in mm/page_alloc.c doesn't handle CMA at
> all, meaning we should never fallback to CMA or from CMA to another type
> - at least when stealing pages from another migratetype. So it smells
> like MIGRATE_CMA is static -> the area is marked once and will never be
> converted to something else (except MIGRATE_ISOLATE temporarily).
>
> I assume you are talking about gfp_to_alloc_flags()/prepare_alloc_pages():


I am also trying to look into this to get more understanding of it.
Another thing which I am looking into right now is the difference between
get/set_pcppage_migratetype() and ge/set_pageblock_migratetype().
To an extent, I do understand what is the benefit of using
get/set_pcppage_migratetype() by reading the comments. However, I am not sure
how it gets along with MIGRATE_CMA.
Hopefully, I will have an understanding of it soon.

> #ifdef CONFIG_CMA
>   if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
>   alloc_flags |= ALLOC_CMA;
> #endif
>
> Yeah, this looks like MOVABLE allocations can fallback to CMA
> pageblocks. And from what I read, "CMA may use its own migratetype
> (MIGRATE_CMA) which behaves similarly to ZONE_MOVABLE but can be put in
> arbitrary places."
>
> So I think you are right, it could be that it is safe to temporarily
> pull out CMA pages (in contrast to isolated pages) - assuming it is fine
> to have temporary unmovable allocations on them (different discussion).
>
> (I am learning about the details as we discuss :) )
>
> The important part would then be to never allocate from the isolated
> pageblocks and to never overwrite MIGRATE_ISOLATE.


Agreed. I think I should just avoid isolating pages with migratetype
MIGRATE_ISOLATE.
Adding a check with is_migrate_isolate_page() before isolating the page should
do it.


>
-- 
Thanks
Nitesh


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-14 Thread David Hildenbrand
On 14.08.19 01:14, Alexander Duyck wrote:
> On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand  wrote:
>>
>> +static int process_free_page(struct page *page,
>> +struct page_reporting_config *phconf, int 
>> count)
>> +{
>> +   int mt, order, ret = 0;
>> +
>> +   mt = get_pageblock_migratetype(page);
>> +   order = page_private(page);
>> +   ret = __isolate_free_page(page, order);
>> +
 I just started looking into the wonderful world of
 isolation/compaction/migration.

 I don't think saving/restoring the migratetype is correct here. AFAIK,
 MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
 movable pages and up in UNMOVABLE or ordinary kernel allocations on
 MOVABLE. So that shouldn't be an issue - I guess.

 1. You should never allocate something that is no
 MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
 CMA here. There should at least be a !is_migrate_isolate_page() check
 somewhere

 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
 with isolation code, you have to hold the zone lock. Your code seems to
 do that, so at least you cannot race against isolation.

 3. You could end up temporarily allocating something in the
 ZONE_MOVABLE. The pages you allocate are, however, not movable. There
 would have to be a way to make alloc_contig_range()/offlining code
 properly wait until the pages have been processed. Not sure about the
 real implications, though - too many details in the code (I wonder if
 Alex' series has a way of dealing with that)

 When you restore the migratetype, you could suddenly overwrite e.g.,
 ISOLATE, which feels wrong.
>>>
>>>
>>> I was triggering an occasional CPU stall bug earlier, with saving and 
>>> restoring
>>> the migratetype I was able to fix it.
>>> But I will further look into this to figure out if it is really required.
>>>
>>
>> You should especially look into handling isolated/cma pages. Maybe that
>> was the original issue. Alex seems to have added that in his latest
>> series (skipping isolated/cma pageblocks completely) as well.
> 
> So as far as skipping isolated pageblocks, I get the reason for
> skipping isolated, but why would we need to skip CMA? I had made the
> change I did based on comments you had made earlier. But while working
> on some of the changes to address isolation better and looking over
> several spots in the code it seems like CMA is already being used as
> an allocation fallback for MIGRATE_MOVABLE. If that is the case
> wouldn't it make sense to allow pulling pages and reporting them while
> they are in the free_list?

I was assuming that CMA is also to be skipped because "static int
fallbacks[MIGRATE_TYPES][4]" in mm/page_alloc.c doesn't handle CMA at
all, meaning we should never fallback to CMA or from CMA to another type
- at least when stealing pages from another migratetype. So it smells
like MIGRATE_CMA is static -> the area is marked once and will never be
converted to something else (except MIGRATE_ISOLATE temporarily).

I assume you are talking about gfp_to_alloc_flags()/prepare_alloc_pages():

#ifdef CONFIG_CMA
if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif

Yeah, this looks like MOVABLE allocations can fallback to CMA
pageblocks. And from what I read, "CMA may use its own migratetype
(MIGRATE_CMA) which behaves similarly to ZONE_MOVABLE but can be put in
arbitrary places."

So I think you are right, it could be that it is safe to temporarily
pull out CMA pages (in contrast to isolated pages) - assuming it is fine
to have temporary unmovable allocations on them (different discussion).

(I am learning about the details as we discuss :) )

The important part would then be to never allocate from the isolated
pageblocks and to never overwrite MIGRATE_ISOLATE.

-- 

Thanks,

David / dhildenb

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread Alexander Duyck
On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand  wrote:
>
>  +static int process_free_page(struct page *page,
>  +struct page_reporting_config *phconf, int 
>  count)
>  +{
>  +   int mt, order, ret = 0;
>  +
>  +   mt = get_pageblock_migratetype(page);
>  +   order = page_private(page);
>  +   ret = __isolate_free_page(page, order);
>  +
> >> I just started looking into the wonderful world of
> >> isolation/compaction/migration.
> >>
> >> I don't think saving/restoring the migratetype is correct here. AFAIK,
> >> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> >> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> >> MOVABLE. So that shouldn't be an issue - I guess.
> >>
> >> 1. You should never allocate something that is no
> >> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> >> CMA here. There should at least be a !is_migrate_isolate_page() check
> >> somewhere
> >>
> >> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> >> with isolation code, you have to hold the zone lock. Your code seems to
> >> do that, so at least you cannot race against isolation.
> >>
> >> 3. You could end up temporarily allocating something in the
> >> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> >> would have to be a way to make alloc_contig_range()/offlining code
> >> properly wait until the pages have been processed. Not sure about the
> >> real implications, though - too many details in the code (I wonder if
> >> Alex' series has a way of dealing with that)
> >>
> >> When you restore the migratetype, you could suddenly overwrite e.g.,
> >> ISOLATE, which feels wrong.
> >
> >
> > I was triggering an occasional CPU stall bug earlier, with saving and 
> > restoring
> > the migratetype I was able to fix it.
> > But I will further look into this to figure out if it is really required.
> >
>
> You should especially look into handling isolated/cma pages. Maybe that
> was the original issue. Alex seems to have added that in his latest
> series (skipping isolated/cma pageblocks completely) as well.

So as far as skipping isolated pageblocks, I get the reason for
skipping isolated, but why would we need to skip CMA? I had made the
change I did based on comments you had made earlier. But while working
on some of the changes to address isolation better and looking over
several spots in the code it seems like CMA is already being used as
an allocation fallback for MIGRATE_MOVABLE. If that is the case
wouldn't it make sense to allow pulling pages and reporting them while
they are in the free_list?

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread David Hildenbrand
On 13.08.19 12:42, Nitesh Narayan Lal wrote:
> 
> On 8/13/19 6:34 AM, David Hildenbrand wrote:
>> +static int process_free_page(struct page *page,
>> +struct page_reporting_config *phconf, int 
>> count)
>> +{
>> +   int mt, order, ret = 0;
> [...]
>> +/**
>> + * zone_reporting_init - For each zone initializes the page reporting 
>> fields
>> + * and allocates the respective bitmap.
>> + *
>> + * This function returns 0 on successful initialization, -ENOMEM 
>> otherwise.
>> + */
>> +static int zone_reporting_init(void)
>> +{
>> +   struct zone *zone;
>> +   int ret;
>> +
>> +   for_each_populated_zone(zone) {
>> +#ifdef CONFIG_ZONE_DEVICE
>> +   /* we can not report pages which are not in the buddy */
>> +   if (zone_idx(zone) == ZONE_DEVICE)
>> +   continue;
>> +#endif
> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
> zone will be considered "populated".
>
 I think you are right (although it's confusing, we will have present
 sections part of a zone but the zone has no present_pages - screams like
 a re factoring - leftover from ZONE_DEVICE introduction).
>>>
>>> I think in that case it is safe to have this check here.
>>> What do you guys suggest?
>> If it's not needed, I'd say drop it (eventually add a comment).
> 
> 
> Comment to mention that we do not expect a zone with non-buddy page to be
> initialized here?

Something along these lines, or something like

/* ZONE_DEVICE is never considered populated */

-- 

Thanks,

David / dhildenb

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread Nitesh Narayan Lal


On 8/13/19 6:34 AM, David Hildenbrand wrote:
> +static int process_free_page(struct page *page,
> +struct page_reporting_config *phconf, int 
> count)
> +{
> +   int mt, order, ret = 0;
[...]
> +/**
> + * zone_reporting_init - For each zone initializes the page reporting 
> fields
> + * and allocates the respective bitmap.
> + *
> + * This function returns 0 on successful initialization, -ENOMEM 
> otherwise.
> + */
> +static int zone_reporting_init(void)
> +{
> +   struct zone *zone;
> +   int ret;
> +
> +   for_each_populated_zone(zone) {
> +#ifdef CONFIG_ZONE_DEVICE
> +   /* we can not report pages which are not in the buddy */
> +   if (zone_idx(zone) == ZONE_DEVICE)
> +   continue;
> +#endif
 I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
 zone will be considered "populated".

>>> I think you are right (although it's confusing, we will have present
>>> sections part of a zone but the zone has no present_pages - screams like
>>> a re factoring - leftover from ZONE_DEVICE introduction).
>>
>> I think in that case it is safe to have this check here.
>> What do you guys suggest?
> If it's not needed, I'd say drop it (eventually add a comment).


Comment to mention that we do not expect a zone with non-buddy page to be
initialized here?

>
>
-- 
Thanks
Nitesh

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread David Hildenbrand
 +static int process_free_page(struct page *page,
 +struct page_reporting_config *phconf, int 
 count)
 +{
 +   int mt, order, ret = 0;
 +
 +   mt = get_pageblock_migratetype(page);
 +   order = page_private(page);
 +   ret = __isolate_free_page(page, order);
 +
>> I just started looking into the wonderful world of
>> isolation/compaction/migration.
>>
>> I don't think saving/restoring the migratetype is correct here. AFAIK,
>> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
>> movable pages and up in UNMOVABLE or ordinary kernel allocations on
>> MOVABLE. So that shouldn't be an issue - I guess.
>>
>> 1. You should never allocate something that is no
>> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
>> CMA here. There should at least be a !is_migrate_isolate_page() check
>> somewhere
>>
>> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
>> with isolation code, you have to hold the zone lock. Your code seems to
>> do that, so at least you cannot race against isolation.
>>
>> 3. You could end up temporarily allocating something in the
>> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
>> would have to be a way to make alloc_contig_range()/offlining code
>> properly wait until the pages have been processed. Not sure about the
>> real implications, though - too many details in the code (I wonder if
>> Alex' series has a way of dealing with that)
>>
>> When you restore the migratetype, you could suddenly overwrite e.g.,
>> ISOLATE, which feels wrong.
> 
> 
> I was triggering an occasional CPU stall bug earlier, with saving and 
> restoring
> the migratetype I was able to fix it.
> But I will further look into this to figure out if it is really required.
> 

You should especially look into handling isolated/cma pages. Maybe that
was the original issue. Alex seems to have added that in his latest
series (skipping isolated/cma pageblocks completely) as well.

>> [...]
>>> So as per your comments in the cover page, the two functions above
>>> should also probably be plugged into the zone resizing logic somewhere
>>> so if a zone is resized the bitmap is adjusted.
>>>
 +/**
 + * zone_reporting_init - For each zone initializes the page reporting 
 fields
 + * and allocates the respective bitmap.
 + *
 + * This function returns 0 on successful initialization, -ENOMEM 
 otherwise.
 + */
 +static int zone_reporting_init(void)
 +{
 +   struct zone *zone;
 +   int ret;
 +
 +   for_each_populated_zone(zone) {
 +#ifdef CONFIG_ZONE_DEVICE
 +   /* we can not report pages which are not in the buddy */
 +   if (zone_idx(zone) == ZONE_DEVICE)
 +   continue;
 +#endif
>>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>>> zone will be considered "populated".
>>>
>> I think you are right (although it's confusing, we will have present
>> sections part of a zone but the zone has no present_pages - screams like
>> a re factoring - leftover from ZONE_DEVICE introduction).
> 
> 
> I think in that case it is safe to have this check here.
> What do you guys suggest?

If it's not needed, I'd say drop it (eventually add a comment).


-- 

Thanks,

David / dhildenb

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread Nitesh Narayan Lal


On 8/12/19 4:05 PM, David Hildenbrand wrote:
>>> ---
>>>  include/linux/mmzone.h |  11 ++
>>>  include/linux/page_reporting.h |  63 +++
>>>  mm/Kconfig |   6 +
>>>  mm/Makefile|   1 +
>>>  mm/page_alloc.c|  42 -
>>>  mm/page_reporting.c| 332 +
>>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>>  create mode 100644 include/linux/page_reporting.h
>>>  create mode 100644 mm/page_reporting.c
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index d77d717c620c..ba5f5b508f25 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -559,6 +559,17 @@ struct zone {
>>> /* Zone statistics */
>>> atomic_long_t   vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>> atomic_long_t   vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>>> +#ifdef CONFIG_PAGE_REPORTING
>>> +   /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>>> +   unsigned long *bitmap;
>>> +   /* Preserve start and end PFN in case they change due to hotplug */
>>> +   unsigned long base_pfn;
>>> +   unsigned long end_pfn;
>>> +   /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>>> +   atomic_t free_pages;
>>> +   /* Number of bits required in the bitmap */
>>> +   unsigned long nbits;
>>> +#endif
>>>  } cacheline_internodealigned_in_smp;
>> Okay, so the original thing this patch set had going for it was that
>> it was non-invasive. However, now you are adding a bunch of stuff to
>> the zone. That kind of loses the non-invasive argument for this patch
>> set compared to mine.
>>
> Adding something to "struct zone" is certainly less invasive than core
> buddy modifications, just saying (I agree that this is suboptimal. I
> would have guessed that all that's needed is a pointer to some private
> structure here). 


I think having just a pointer to a private structure makes sense here.
If I am not wrong then I can probably make an allocation for it for each
populated zone at the time I enable page reporting.

> However, the migratetype thingy below looks fishy to me.
>
>> If we are going to continue further with this patch set then it might
>> be worth looking into dynamically allocating the space you need for
>> this block. At a minimum you could probably look at making the bitmap
>> an RCU based setup so you could define the base and end along with the
>> bitmap. It would probably help to resolve the hotplug issues you still
>> need to address.
> Yeah, I guess that makes sense.
>
> [...]
>>> +
>>> +static int process_free_page(struct page *page,
>>> +struct page_reporting_config *phconf, int 
>>> count)
>>> +{
>>> +   int mt, order, ret = 0;
>>> +
>>> +   mt = get_pageblock_migratetype(page);
>>> +   order = page_private(page);
>>> +   ret = __isolate_free_page(page, order);
>>> +
> I just started looking into the wonderful world of
> isolation/compaction/migration.
>
> I don't think saving/restoring the migratetype is correct here. AFAIK,
> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> MOVABLE. So that shouldn't be an issue - I guess.
>
> 1. You should never allocate something that is no
> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> CMA here. There should at least be a !is_migrate_isolate_page() check
> somewhere
>
> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> with isolation code, you have to hold the zone lock. Your code seems to
> do that, so at least you cannot race against isolation.
>
> 3. You could end up temporarily allocating something in the
> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> would have to be a way to make alloc_contig_range()/offlining code
> properly wait until the pages have been processed. Not sure about the
> real implications, though - too many details in the code (I wonder if
> Alex' series has a way of dealing with that)
>
> When you restore the migratetype, you could suddenly overwrite e.g.,
> ISOLATE, which feels wrong.


I was triggering an occasional CPU stall bug earlier, with saving and restoring
the migratetype I was able to fix it.
But I will further look into this to figure out if it is really required.

> [...]
>> So as per your comments in the cover page, the two functions above
>> should also probably be plugged into the zone resizing logic somewhere
>> so if a zone is resized the bitmap is adjusted.
>>
>>> +/**
>>> + * zone_reporting_init - For each zone initializes the page reporting 
>>> fields
>>> + * and allocates the respective bitmap.
>>> + *
>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>> + */
>>> +static int zone_reporting_init(void)
>>> +{
>>> +   struct zone *zone;
>>> +   int ret;
>>> +
>>> +

[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-12 Thread Nitesh Narayan Lal


On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal 
> So if I understand correctly the hotplug support for this is still not
> included correct? 


That is correct, I have it as an ongoing-item in my cover-email.
In case, we decide to go with this approach do you think it is a blocker?


> I assume that is the case since I don't see any
> logic for zone resizing.
>
> Also I don't see how this dealt with the sparse issue that was pointed
> out earlier. Specifically how would you deal with a zone that has a
> wide range between the base and the end and a huge gap somewhere
> in-between?

It doesn't. However, considering we are tracking page on order of (MAX_ORDER -
2) I don't think the loss will be significant.
In any case, this is certainly a drawback of this approach and I should add this
in my cover.
The one thing which I did change in this version is that now I started
maintaining bitmap for each zone.

>
>> ---
>>  include/linux/mmzone.h |  11 ++
>>  include/linux/page_reporting.h |  63 +++
>>  mm/Kconfig |   6 +
>>  mm/Makefile|   1 +
>>  mm/page_alloc.c|  42 -
>>  mm/page_reporting.c| 332 +
>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_reporting.h
>>  create mode 100644 mm/page_reporting.c
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d77d717c620c..ba5f5b508f25 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -559,6 +559,17 @@ struct zone {
>> /* Zone statistics */
>> atomic_long_t   vm_stat[NR_VM_ZONE_STAT_ITEMS];
>> atomic_long_t   vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> +#ifdef CONFIG_PAGE_REPORTING
>> +   /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>> +   unsigned long *bitmap;
>> +   /* Preserve start and end PFN in case they change due to hotplug */
>> +   unsigned long base_pfn;
>> +   unsigned long end_pfn;
>> +   /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>> +   atomic_t free_pages;
>> +   /* Number of bits required in the bitmap */
>> +   unsigned long nbits;
>> +#endif
>>  } cacheline_internodealigned_in_smp;
> Okay, so the original thing this patch set had going for it was that
> it was non-invasive. However, now you are adding a bunch of stuff to
> the zone. That kind of loses the non-invasive argument for this patch
> set compared to mine.

I think it is fair to add that it not as invasive as yours. :) (But that has its
own pros and cons)
In any case, I do understand your point.

>
>
> If we are going to continue further with this patch set then it might
> be worth looking into dynamically allocating the space you need for
> this block.

Not sure if I understood this part. Dynamic allocation for the structure which
you are suggesting below?


>  At a minimum you could probably look at making the bitmap
> an RCU based setup so you could define the base and end along with the
> bitmap. It would probably help to resolve the hotplug issues you still
> need to address.


Thanks for the suggestion. I will look into it.


>
>>  enum pgdat_flags {
>> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
>> new file mode 100644
>> index ..37a39589939d
>> --- /dev/null
>> +++ b/include/linux/page_reporting.h
>> @@ -0,0 +1,63 

[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-12 Thread David Hildenbrand
>> ---
>>  include/linux/mmzone.h |  11 ++
>>  include/linux/page_reporting.h |  63 +++
>>  mm/Kconfig |   6 +
>>  mm/Makefile|   1 +
>>  mm/page_alloc.c|  42 -
>>  mm/page_reporting.c| 332 +
>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_reporting.h
>>  create mode 100644 mm/page_reporting.c
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d77d717c620c..ba5f5b508f25 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -559,6 +559,17 @@ struct zone {
>> /* Zone statistics */
>> atomic_long_t   vm_stat[NR_VM_ZONE_STAT_ITEMS];
>> atomic_long_t   vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> +#ifdef CONFIG_PAGE_REPORTING
>> +   /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>> +   unsigned long *bitmap;
>> +   /* Preserve start and end PFN in case they change due to hotplug */
>> +   unsigned long base_pfn;
>> +   unsigned long end_pfn;
>> +   /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>> +   atomic_t free_pages;
>> +   /* Number of bits required in the bitmap */
>> +   unsigned long nbits;
>> +#endif
>>  } cacheline_internodealigned_in_smp;
> 
> Okay, so the original thing this patch set had going for it was that
> it was non-invasive. However, now you are adding a bunch of stuff to
> the zone. That kind of loses the non-invasive argument for this patch
> set compared to mine.
> 

Adding something to "struct zone" is certainly less invasive than core
buddy modifications, just saying (I agree that this is suboptimal. I
would have guessed that all that's needed is a pointer to some private
structure here). However, the migratetype thingy below looks fishy to me.

> If we are going to continue further with this patch set then it might
> be worth looking into dynamically allocating the space you need for
> this block. At a minimum you could probably look at making the bitmap
> an RCU based setup so you could define the base and end along with the
> bitmap. It would probably help to resolve the hotplug issues you still
> need to address.

Yeah, I guess that makes sense.

[...]
>> +
>> +static int process_free_page(struct page *page,
>> +struct page_reporting_config *phconf, int count)
>> +{
>> +   int mt, order, ret = 0;
>> +
>> +   mt = get_pageblock_migratetype(page);
>> +   order = page_private(page);
>> +   ret = __isolate_free_page(page, order);
>> +

I just started looking into the wonderful world of
isolation/compaction/migration.

I don't think saving/restoring the migratetype is correct here. AFAIK,
MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
movable pages and up in UNMOVABLE or ordinary kernel allocations on
MOVABLE. So that shouldn't be an issue - I guess.

1. You should never allocate something that is no
MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
CMA here. There should at least be a !is_migrate_isolate_page() check
somewhere

2. set_migratetype_isolate() takes the zone lock, so to avoid racing
with isolation code, you have to hold the zone lock. Your code seems to
do that, so at least you cannot race against isolation.

3. You could end up temporarily allocating something in the
ZONE_MOVABLE. The pages you allocate are, however, not movable. There
would have to be a way to make alloc_contig_range()/offlining code
properly wait until the pages have been processed. Not sure about the
real implications, though - too many details in the code (I wonder if
Alex' series has a way of dealing with that)

When you restore the migratetype, you could suddenly overwrite e.g.,
ISOLATE, which feels wrong.

[...]
> So as per your comments in the cover page, the two functions above
> should also probably be plugged into the zone resizing logic somewhere
> so if a zone is resized the bitmap is adjusted.
> 
>> +/**
>> + * zone_reporting_init - For each zone initializes the page reporting fields
>> + * and allocates the respective bitmap.
>> + *
>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>> + */
>> +static int zone_reporting_init(void)
>> +{
>> +   struct zone *zone;
>> +   int ret;
>> +
>> +   for_each_populated_zone(zone) {
>> +#ifdef CONFIG_ZONE_DEVICE
>> +   /* we can not report pages which are not in the buddy */
>> +   if (zone_idx(zone) == ZONE_DEVICE)
>> +   continue;
>> +#endif
> 
> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
> zone will be considered "populated".
> 
I think you are right (although it's confusing, we will have present
sections part of a zone but the zone has no present_pages - screams like
a re factoring - leftover from ZONE_DEVICE introduction).

-- 

Thanks,


[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-12 Thread Alexander Duyck
On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal  wrote:
>
> This patch introduces the core infrastructure for free page reporting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
>
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend(virtio-balloon
> in our case).
>
> To efficiently identify free pages that can to be reported to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
>
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
>
> Signed-off-by: Nitesh Narayan Lal 

So if I understand correctly the hotplug support for this is still not
included correct? I assume that is the case since I don't see any
logic for zone resizing.

Also I don't see how this dealt with the sparse issue that was pointed
out earlier. Specifically how would you deal with a zone that has a
wide range between the base and the end and a huge gap somewhere
in-between?

> ---
>  include/linux/mmzone.h |  11 ++
>  include/linux/page_reporting.h |  63 +++
>  mm/Kconfig |   6 +
>  mm/Makefile|   1 +
>  mm/page_alloc.c|  42 -
>  mm/page_reporting.c| 332 +
>  6 files changed, 448 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_reporting.h
>  create mode 100644 mm/page_reporting.c
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d77d717c620c..ba5f5b508f25 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -559,6 +559,17 @@ struct zone {
> /* Zone statistics */
> atomic_long_t   vm_stat[NR_VM_ZONE_STAT_ITEMS];
> atomic_long_t   vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
> +#ifdef CONFIG_PAGE_REPORTING
> +   /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
> +   unsigned long *bitmap;
> +   /* Preserve start and end PFN in case they change due to hotplug */
> +   unsigned long base_pfn;
> +   unsigned long end_pfn;
> +   /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
> +   atomic_t free_pages;
> +   /* Number of bits required in the bitmap */
> +   unsigned long nbits;
> +#endif
>  } cacheline_internodealigned_in_smp;

Okay, so the original thing this patch set had going for it was that
it was non-invasive. However, now you are adding a bunch of stuff to
the zone. That kind of loses the non-invasive argument for this patch
set compared to mine.

If we are going to continue further with this patch set then it might
be worth looking into dynamically allocating the space you need for
this block. At a minimum you could probably look at making the bitmap
an RCU based setup so you could define the base and end along with the
bitmap. It would probably help to resolve the hotplug issues you still
need to address.

>  enum pgdat_flags {
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> new file mode 100644
> index ..37a39589939d
> --- /dev/null
> +++ b/include/linux/page_reporting.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PAGE_REPORTING_H
> +#define _LINUX_PAGE_REPORTING_H
> +
> +#define PAGE_REPORTING_MIN_ORDER   (MAX_ORDER - 2)
> +#define PAGE_REPORTING_MAX_PAGES   16
> +
> +#ifdef CONFIG_PAGE_REPORTING
> +struct page_reporting_config {
> +   /* function to hint batch of isolated pages */
> +   void (*report)(struct page_reporting_config *phconf,
> +  unsigned int num_pages);
> +
> +   /* scatterlist to hold the isolated pages to be hinted */
> +   struct scatterlist *sg;
> +
> +   /*
> +* Maxmimum pages that are going to be hinted to the hypervisor at a
> +* time of granularity >= PAGE_REPORTING_MIN_ORDER.
> +*/
> +   int max_pages;
> +
> +   /* work object to process page reporting rqeuests */
> +   struct work_struct reporting_work;
> +
> +   /* tracks the number