Re: [PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-10 Thread Liang Li
> > > Please don't use this email address for me anymore. Either use
> > > alexander.du...@gmail.com or alexanderdu...@fb.com. I am getting
> > > bounces when I reply to this thread because of the old address.
> >
> > No problem.
> >
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index eb533995cb49..0fccd5f96954 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct 
> > > > vm_area_struct *vma,
> > > > goto out_uncharge_cgroup_reservation;
> > > >
> > > > spin_lock(&hugetlb_lock);
> > > > +   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
> > > > +   spin_unlock(&hugetlb_lock);
> > > > +   mutex_lock(&h->mtx_prezero);
> > > > +   mutex_unlock(&h->mtx_prezero);
> > > > +   spin_lock(&hugetlb_lock);
> > > > +   }
> > >
> > > This seems like a bad idea. It kind of defeats the whole point of
> > > doing the page zeroing outside of the hugetlb_lock. Also it is
> > > operating on the assumption that the only way you might get a page is
> > > from the page zeroing logic.
> > >
> > > With the page reporting code we wouldn't drop the count to zero. We
> > > had checks that were going through and monitoring the watermarks and
> > > if we started to hit the low watermark we would stop page reporting
> > > and just assume there aren't enough pages to report. You might need to
> > > look at doing something similar here so that you can avoid colliding
> > > with the allocator.
> >
> > For hugetlb, things are a little different, just as Mike points out:
> >  "On some systems, hugetlb pages are a precious resource and
> >   the sysadmin carefully configures the number needed by
> >   applications.  Removing a hugetlb page (even for a very short
> >   period of time) could cause serious application failure."
> >
> > Just keeping some pages in the freelist is not enough to prevent that from
> > happening, because those pages may be allocated while zeroing is ongoing,
> > and an application may still find itself with no free pages available.
>
> I get what you are saying. However I don't know if it is acceptable
> for the allocating thread to be put to sleep in this situation. There
> are two scenarios where I can see this being problematic.
>
> One is a setup where you put the page allocator to sleep and while it
> is sleeping another thread is then freeing a page and your thread
> cannot respond to that newly freed page and is stuck waiting on the
> zeroed page.
>
> The second issue is that users may want a different option of just
> breaking up the request into smaller pages rather than waiting on the
> page zeroing, or to do something else while waiting on the page. So
> instead of sitting on the request and waiting it might make more sense
> to return an error pointer like EAGAIN or EBUSY to indicate that there
> is a page there, but it is momentarily tied up.

It seems returning EAGAIN or EBUSY would still change the application's
behavior; I am not sure that is acceptable.

Thanks
Liang
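
For illustration, the non-blocking behaviour Alexander describes could replace the posted wait loop in alloc_huge_page() with an early exit along these lines. This is only a sketch against the context of patch 4/6; the -EBUSY convention is an assumption and the cgroup/subpool error unwinding is omitted:

	spin_lock(&hugetlb_lock);
	if (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
		/*
		 * A free page exists but is temporarily isolated by the
		 * reporting / pre-zero worker.  Instead of sleeping on
		 * h->mtx_prezero, tell the caller the page is momentarily
		 * tied up so it can retry, split the request, or fall back.
		 */
		spin_unlock(&hugetlb_lock);
		return ERR_PTR(-EBUSY);	/* unwinding omitted in this sketch */
	}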


Re: [PATCH 3/6] hugetlb: add free page reporting support

2021-01-10 Thread Liang Li
On Fri, Jan 8, 2021 at 6:04 AM Mike Kravetz  wrote:
>
> On 1/5/21 7:49 PM, Liang Li wrote:
> > hugetlb manages its pages in the hstate's free page list, not in the buddy
> > system; this patch tries to make free page reporting work for hugetlbfs. It
> > can be used for memory overcommit in virtualization and hugetlb pre zero
> > out.
>
> I am not looking closely at the hugetlb changes yet.  There seem to be
> higher level questions about page reporting/etc.  Once those are sorted,
> I will be happy to take a closer look.  One quick question below.
>
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -41,6 +41,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include "page_reporting.h"
> >  #include "internal.h"
> >
> >  int hugetlb_max_hstate __read_mostly;
> > @@ -1028,6 +1029,9 @@ static void enqueue_huge_page(struct hstate *h, 
> > struct page *page)
> >   list_move(&page->lru, &h->hugepage_freelists[nid]);
> >   h->free_huge_pages++;
> >   h->free_huge_pages_node[nid]++;
> > + if (hugepage_reported(page))
> > + __ClearPageReported(page);
> > + hugepage_reporting_notify_free(h->order);
> >  }
> >
> >  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> > @@ -5531,6 +5535,21 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long 
> > address, pgd_t *pgd, int fla
> >   return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
> > PAGE_SHIFT);
> >  }
> >
> > +void isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
> > +{
> > + VM_BUG_ON_PAGE(!PageHead(page), page);
> > +
> > > + list_move(&page->lru, &h->hugepage_activelist);
> > + set_page_refcounted(page);
> > +}
> > +
> > +void putback_isolate_huge_page(struct hstate *h, struct page *page)
> > +{
> > + int nid = page_to_nid(page);
> > +
> > > + list_move(&page->lru, &h->hugepage_freelists[nid]);
> > +}
>
> The above routines move pages between the free and active lists without any
> update to free page counts.  How does that work?  Will the number of entries
> on the free list get out of sync with the free_huge_pages counters?
> --
> Mike Kravetz
>
Yes, the free_huge_pages counters will get out of sync with the free list.
There are two reasons for the above code: 1. hide free page reporting
from the user; 2. simplify the logic for keeping 'free_huge_pages' and
'resv_huge_pages' in sync. I am not sure whether it will break something else.

Thanks
Liang
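
For comparison, a variant of the two helpers that keeps the counters in sync with the list contents would adjust them on both transitions. Whether the reservation accounting tolerates this is exactly the open question above, so treat it as a sketch only:

void isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
{
	VM_BUG_ON_PAGE(!PageHead(page), page);

	list_move(&page->lru, &h->hugepage_activelist);
	set_page_refcounted(page);
	h->free_huge_pages--;		/* keep the counters in step ... */
	h->free_huge_pages_node[nid]--;	/* ... with the free list */
}

void putback_isolate_huge_page(struct hstate *h, struct page *page)
{
	int nid = page_to_nid(page);

	list_move(&page->lru, &h->hugepage_freelists[nid]);
	h->free_huge_pages++;
	h->free_huge_pages_node[nid]++;
}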


Re: [PATCH 3/6] hugetlb: add free page reporting support

2021-01-10 Thread Liang Li
> >> On Tue 05-01-21 22:49:21, Liang Li wrote:
> >>> hugetlb manages its pages in the hstate's free page list, not in the buddy
> >>> system; this patch tries to make free page reporting work for hugetlbfs. It
> >>> can be used for memory overcommit in virtualization and hugetlb pre zero
> >>> out.
> >>
> >> David has laid down some more fundamental questions in the reply to the
> >> cover letter (btw. can you fix your scripts to send patches and make all
> >> the patches to be in reply to the cover letter please?). But I would
> >
> > Do you mean attaching the patches to the cover letter email?
>
> You should be using "git format-patch --cover-letter . .." followed by
> "git send-email ...", so the end result is a nicely structured thread.
>

I got it.  Thanks!

Liang


Re: [PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-06 Thread Liang Li
> > Page reporting isolates free pages temporarily while reporting
> > free page information. This reduces the number of actually free pages
> > and may cause applications to fail because not enough memory is available.
> > This patch tries to solve the issue: when there is no free page
> > and page reporting is ongoing, wait until it is done.
> >
> > Cc: Alexander Duyck 
>
> Please don't use this email address for me anymore. Either use
> alexander.du...@gmail.com or alexanderdu...@fb.com. I am getting
> bounces when I reply to this thread because of the old address.

No problem.

> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index eb533995cb49..0fccd5f96954 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct vm_area_struct 
> > *vma,
> > goto out_uncharge_cgroup_reservation;
> >
> > spin_lock(&hugetlb_lock);
> > +   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
> > +   spin_unlock(&hugetlb_lock);
> > +   mutex_lock(&h->mtx_prezero);
> > +   mutex_unlock(&h->mtx_prezero);
> > +   spin_lock(&hugetlb_lock);
> > +   }
>
> This seems like a bad idea. It kind of defeats the whole point of
> doing the page zeroing outside of the hugetlb_lock. Also it is
> operating on the assumption that the only way you might get a page is
> from the page zeroing logic.
>
> With the page reporting code we wouldn't drop the count to zero. We
> had checks that were going through and monitoring the watermarks and
> if we started to hit the low watermark we would stop page reporting
> and just assume there aren't enough pages to report. You might need to
> look at doing something similar here so that you can avoid colliding
> with the allocator.

For hugetlb, things are a little different, just as Mike points out:
 "On some systems, hugetlb pages are a precious resource and
  the sysadmin carefully configures the number needed by
  applications.  Removing a hugetlb page (even for a very short
  period of time) could cause serious application failure."

Just keeping some pages in the freelist is not enough to prevent that from
happening, because those pages may be allocated while zeroing is ongoing,
and an application may still find itself with no free pages available.

Thanks
Liang
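
The watermark approach Alexander describes above would translate to hugetlb as a check that stops the reporting worker from isolating more pages once the pool of truly free huge pages gets low, rather than making the allocator wait. A minimal sketch, assuming the HUGEPAGE_REPORTING_CAPACITY constant from the RFC reporting code and an illustrative floor value; reserved entries are subtracted because they also sit on the free lists:

/*
 * Sketch only: bail out of the hugetlb reporting cycle when the number
 * of unreserved free huge pages drops below a small floor.
 */
static bool hugepage_reporting_low(struct hstate *h)
{
	long floor = 2 * HUGEPAGE_REPORTING_CAPACITY;	/* illustrative */
	long avail = (long)h->free_huge_pages - (long)h->resv_huge_pages;

	return avail < floor;
}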


Re: [PATCH 3/6] hugetlb: add free page reporting support

2021-01-06 Thread Liang Li
On Thu, Jan 7, 2021 at 12:08 AM Michal Hocko  wrote:
>
> On Tue 05-01-21 22:49:21, Liang Li wrote:
> > hugetlb manages its pages in the hstate's free page list, not in the buddy
> > system; this patch tries to make free page reporting work for hugetlbfs. It
> > can be used for memory overcommit in virtualization and hugetlb pre zero
> > out.
>
> David has laid down some more fundamental questions in the reply to the
> cover letter (btw. can you fix your scripts to send patches and make all
> the patches to be in reply to the cover letter please?). But I would

Do you mean attaching the patches to the cover letter email?

> like to point out that this changelog would need to change a lot as
> well. It doesn't explain really what, why and how. E.g. what would any
> guest gain by being able to report free huge pages? What would guarantee
> that the pool is replenished when there is a demand? Can this make the
> fault fail or it just takes more time to be satisfied? Why did you
> decide that the reporting infrastructure should be abused to do the
> zeroing? I do remember Alexander pushing back against that, and so you
> had better have very strong arguments to proceed that way.
>
> I am pretty sure there are more questions to come when more details are
> uncovered.
> --
> Michal Hocko
> SUSE Labs

I will try to add more detail about the aspect you referred to. Thanks!

Liang


Re: [PATCH 2/6] mm: let user decide page reporting option

2021-01-06 Thread Liang Li
> >  enum {
> > PAGE_REPORTING_IDLE = 0,
> > @@ -44,7 +45,7 @@ __page_reporting_request(struct page_reporting_dev_info 
> > *prdev)
> >  * now we are limiting this to running no more than once every
> >  * couple of seconds.
> >  */
> > -   schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
> > +   schedule_delayed_work(&prdev->work, prdev->delay_jiffies);
> >  }
> >
>
> So this ends up being the reason why you needed to add the batch size
> value. However I don't really see it working as expected since you
> could essentially have 1 page freed 4M times that could trigger your
> page zeroing logic. So for example if a NIC is processing frames and
> ends up freeing and then reallocating some small batch of pages this
> code would be running often even though there isn't really all that
> many pages that needed zeroing.

Good catch; it does not behave the way a batch size should.

> >  /* notify prdev of free page reporting request */
> > @@ -230,7 +231,7 @@ page_reporting_process_zone(struct 
> > page_reporting_dev_info *prdev,
> >
> > /* Generate minimum watermark to be able to guarantee progress */
> > watermark = low_wmark_pages(zone) +
> > -   (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
> > +   (PAGE_REPORTING_CAPACITY << prdev->mini_order);
> >
> > /*
> >  * Cancel request if insufficient free memory or if we failed
>
> With the page order being able to be greatly reduced this could have a
> significant impact on if this code really has any value. Previously we
> were able to guarantee a pretty significant number of higher order
> pages free. With this we might only be guaranteeing something like 32
> 4K pages which is pretty small compared to what can end up being
> pulled out at the higher end.

I have dropped the 'buddy free page pre zero' patch, so the mini order will
not be changed to a small value.

> > @@ -240,7 +241,7 @@ page_reporting_process_zone(struct 
> > page_reporting_dev_info *prdev,
> > return err;
> >
> > /* Process each free list starting from lowest order/mt */
> > -   for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
> > +   for (order = prdev->mini_order; order < MAX_ORDER; order++) {
> > for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > /* We do not pull pages from the isolate free list 
> > */
> > if (is_migrate_isolate(mt))
> > @@ -307,7 +308,7 @@ static void page_reporting_process(struct work_struct 
> > *work)
> >  */
> > state = atomic_cmpxchg(&prdev->state, state, PAGE_REPORTING_IDLE);
> > if (state == PAGE_REPORTING_REQUESTED)
> > -   schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
> > +   schedule_delayed_work(&prdev->work, prdev->delay_jiffies);
> >  }
> >
> >  static DEFINE_MUTEX(page_reporting_mutex);
> > @@ -335,6 +336,8 @@ int page_reporting_register(struct 
> > page_reporting_dev_info *prdev)
> > /* Assign device to allow notifications */
> > rcu_assign_pointer(pr_dev_info, prdev);
> >
> > +   page_report_mini_order = prdev->mini_order;
> > +   page_report_batch_size = prdev->batch_size;
> > /* enable page reporting notification */
> > if (!static_key_enabled(&page_reporting_enabled)) {
> > static_branch_enable(&page_reporting_enabled);
> > @@ -352,6 +355,8 @@ void page_reporting_unregister(struct 
> > page_reporting_dev_info *prdev)
> > mutex_lock(&page_reporting_mutex);
> >
> > if (rcu_access_pointer(pr_dev_info) == prdev) {
> > +   if (static_key_enabled(&page_reporting_enabled))
> > +   static_branch_disable(&page_reporting_enabled);
> > /* Disable page reporting notification */
> > RCU_INIT_POINTER(pr_dev_info, NULL);
> > synchronize_rcu();
>
> If we are going to use this we are using it. Once we NULL out the
> prdev that should stop page reporting from running. We shouldn't be
> relying on the static key.

The benefit of this is that the call to '__page_reporting_notify' from
'page_reporting_notify_free' can be skipped, which saves some cycles.

> > diff --git a/mm/page_reporting.h b/mm/page_reporting.h
> > index b8fb3bbb345f..86ac6ffad970 100644
> > --- a/mm/page_reporting.h
> > +++ b/mm/page_reporting.h
> > @@ -9,9 +9,9 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> > -#define PAGE_REPORTING_MIN_ORDER   pageblock_order
> > -
> > +extern int page_report_mini_order;
> >  extern unsigned long page_report_batch_size;
> >
> >  #ifdef CONFIG_PAGE_REPORTING
> > @@ -42,7 +42,7 @@ static inline void page_reporting_notify_free(unsigned 
> > int order)
> > return;
> >
> > /* Determine if we have crossed reporting threshold */
> > -   if (order < PAGE_REPORTING_MIN_ORDER)
> > +   if (order < page_report_mini_order)
> > 

Re: [PATCH 1/6] mm: Add batch size for free page reporting

2021-01-06 Thread Liang Li
> So you are going to need a lot more explanation for this. Page
> reporting already had the concept of batching as you could only scan
> once every 2 seconds as I recall. Thus the "PAGE_REPORTING_DELAY". The
> change you are making doesn't make any sense without additional
> context.

The reason for adding a batch size is mainly page pre-zeroing; I just want to
make it configurable to limit 'cache pollution'. For that use case, the
reporting thread should not be woken up too frequently.

> > ---
> >  mm/page_reporting.c |  1 +
> >  mm/page_reporting.h | 12 ++--
> >  2 files changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> > index cd8e13d41df4..694df981ddd2 100644
> > --- a/mm/page_reporting.c
> > +++ b/mm/page_reporting.c
> > @@ -12,6 +12,7 @@
> >
> >  #define PAGE_REPORTING_DELAY   (2 * HZ)
> >  static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
> > +unsigned long page_report_batch_size  __read_mostly = 16 * 1024 * 1024UL;
> >
> >  enum {
> > PAGE_REPORTING_IDLE = 0,
> > diff --git a/mm/page_reporting.h b/mm/page_reporting.h
> > index 2c385dd4ddbd..b8fb3bbb345f 100644
> > --- a/mm/page_reporting.h
> > +++ b/mm/page_reporting.h
> > @@ -12,6 +12,8 @@
> >
> >  #define PAGE_REPORTING_MIN_ORDER   pageblock_order
> >
> > +extern unsigned long page_report_batch_size;
> > +
> >  #ifdef CONFIG_PAGE_REPORTING
> >  DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
> >  void __page_reporting_notify(void);
> > @@ -33,6 +35,8 @@ static inline bool page_reported(struct page *page)
> >   */
> >  static inline void page_reporting_notify_free(unsigned int order)
> >  {
> > +   static long batch_size;
> > +
>
> I'm not sure it makes a ton of sense to place the value in an
> inline function. It might make more sense to put this new code in
> __page_reporting_notify so that all callers would be referring to the
> same batch_size value and you don't have to bother with the export of
> the page_report_batch_size value.

you are right, will change.

> > /* Called from hot path in __free_one_page() */
> > if (!static_branch_unlikely(&page_reporting_enabled))
> > return;
> > @@ -41,8 +45,12 @@ static inline void page_reporting_notify_free(unsigned 
> > int order)
> > if (order < PAGE_REPORTING_MIN_ORDER)
> > return;
> >
> > -   /* This will add a few cycles, but should be called infrequently */
> > -   __page_reporting_notify();
> > +   batch_size += (1 << order) << PAGE_SHIFT;
> > +   if (batch_size >= page_report_batch_size) {
> > +   batch_size = 0;
>
> I would probably run this in the opposite direction. Rather than
> running batch_size to zero I would look at adding a "batch_remaining"
> and then when it is < 0 you could then reset it back to
> page_report_batch_size. Doing that you only have to read one variable
> most of the time instead of doing a comparison against two.

You are right again.

Thanks
Liang
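
Putting the two suggestions above together, the byte counter could live in __page_reporting_notify() itself (one shared copy for every call site, so the batch size no longer needs to be visible in the header) and count down so the fast path reads a single variable. A sketch, assuming the notify hook gains an order parameter, which the posted series does not do, and ignoring concurrent updates of the counter:

void __page_reporting_notify(unsigned int order)
{
	static long batch_remaining;	/* shared by all callers */
	struct page_reporting_dev_info *prdev;

	/* Count freed bytes down; one variable read on the common path. */
	batch_remaining -= (1 << order) << PAGE_SHIFT;
	if (batch_remaining >= 0)
		return;
	batch_remaining = page_report_batch_size;

	rcu_read_lock();
	prdev = rcu_dereference(pr_dev_info);
	if (likely(prdev))
		__page_reporting_request(prdev);
	rcu_read_unlock();
}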


Re: [PATCH 0/6] hugetlbfs: support free page reporting

2021-01-06 Thread Liang Li
On Wed, Jan 6, 2021 at 5:41 PM David Hildenbrand  wrote:
>
> On 06.01.21 04:46, Liang Li wrote:
> > A typical usage of hugetlbfs is to reserve an amount of memory
> > during the kernel boot stage, and the reserved pages are
> > unlikely to return to the buddy system. When an application needs
> > hugepages, the kernel will allocate them from the reserved pool.
> > When the application terminates, the huge pages return to the
> > reserved pool and are kept in the free list for hugetlbfs;
> > these free pages will not return to the buddy freelist unless the
> > size of the reserved pool is changed.
> > Free page reporting only supports buddy pages; it can't report
> > the free pages reserved for hugetlbfs. On the other hand,
> > hugetlbfs is a good choice for a system with a huge amount of RAM,
> > because it can help to reduce the memory management overhead and
> > improve system performance.
> > This patch set adds support for reporting hugepages in the free
> > list of hugetlbfs; it can be used by the virtio_balloon driver for
> > memory overcommit and to pre zero out free pages, speeding up
> > memory population and page fault handling.
>
> You should lay out the use case + measurements. Further you should
> describe what this patch set actually does, how behavior can be tuned,
> pros and cons, etc... And you should most probably keep this RFC.
>
> >
> > Most of the code is 'copied' from free page reporting because
> > it works in the same way, so the code can be refined to
> > remove duplication. That can be done later.
>
> Nothing speaks about getting it right from the beginning. Otherwise it
> will most likely never happen.
>
> >
> > Since some people have concerns about the side effects the 'buddy
> > free page pre zero out' feature brings, I removed it from this
> > series.
>
> You should really point out what changed since the last version. I
> remember Alex and Mike had some pretty solid points of what they don't
> want to see (especially: don't use free page reporting infrastructure
> and don't temporarily allocate huge pages for processing them).
>
> I am not convinced that we want to use the free page reporting
> infrastructure for this (pre-zeroing huge pages). What speaks about a
> thread simply iterating over huge pages one at a time, zeroing them? The
> whole free page reporting infrastructure was invented because we have to
> do expensive coordination (+ locking) when going via the hypervisor. For
> the main use case of zeroing huge pages in the background, I don't see a
> real need for that. If you believe this is the right thing to do, please
> add a discussion regarding this.
>
> --
> Thanks,
>
> David / dhildenb
>
>
I will take all your advice and give more detail in the next revision.
Thanks for your comments!

Liang
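
A rough sketch of the simpler approach David describes, a kernel thread that zeroes free huge pages one at a time, reusing isolate_free_huge_page()/putback_isolate_huge_page() and the PG_zero flag from this series. The thread setup, node walk and locking are simplified assumptions, and reservations are ignored:

static int hpage_prezero_thread(void *data)
{
	struct hstate *h = data;
	struct page *page, *p;
	unsigned long i;
	int nid;

	while (!kthread_should_stop()) {
		page = NULL;
		spin_lock(&hugetlb_lock);
		for_each_node_state(nid, N_MEMORY) {
			list_for_each_entry(p, &h->hugepage_freelists[nid], lru) {
				if (!PageZero(p)) {
					page = p;
					isolate_free_huge_page(page, h, nid);
					break;
				}
			}
			if (page)
				break;
		}
		spin_unlock(&hugetlb_lock);

		if (!page) {
			/* Nothing left to zero; look again in a second. */
			schedule_timeout_idle(HZ);
			continue;
		}

		/* Zero outside the lock, one huge page at a time. */
		for (i = 0; i < pages_per_huge_page(h); i++) {
			cond_resched();
			clear_highpage(page + i);
		}

		spin_lock(&hugetlb_lock);
		__SetPageZero(page);
		putback_isolate_huge_page(h, page);
		spin_unlock(&hugetlb_lock);
	}
	return 0;
}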


[PATCH 6/6] hugetlb: support free hugepage pre zero out

2021-01-05 Thread Liang Li
This patch adds support for pre zero out of free hugepages; we can use
this feature to speed up page population and page fault handling.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 include/linux/page-flags.h |  12 ++
 mm/Kconfig |  10 ++
 mm/huge_memory.c   |   3 +-
 mm/hugetlb.c   | 243 +
 mm/memory.c|   4 +
 5 files changed, 271 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index ec5d0290e0ee..f177c5e85632 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -173,6 +173,9 @@ enum pageflags {
 
/* Only valid for buddy pages. Used to track pages that are reported */
PG_reported = PG_uptodate,
+
+   /* Only valid for hugetlb pages. Used to mark zero pages */
+   PG_zero = PG_slab,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -451,6 +454,15 @@ PAGEFLAG(Idle, idle, PF_ANY)
  */
 __PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
 
+/*
+ * PageZero() is used to track hugetlb free pages within the free list
+ * of hugetlbfs. We can use the non-atomic version of the test and set
+ * operations as both should be shielded with the hugetlb lock to prevent
+ * any possible races on the setting or clearing of the bit.
+ */
+__PAGEFLAG(Zero, zero, PF_ONLY_HEAD)
+
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/mm/Kconfig b/mm/Kconfig
index 630cde982186..1d91e182825d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -254,6 +254,16 @@ config PAGE_REPORTING
  those pages to another entity, such as a hypervisor, so that the
  memory can be freed within the host for other uses.
 
+#
+# support for pre zero out hugetlbfs free page
+config PREZERO_HPAGE
+   bool "Pre zero out hugetlbfs free page"
+   def_bool n
+   depends on PAGE_REPORTING
+   help
+ Allows pre zero out hugetlbfs free pages in freelist based on free
+ page reporting
+
 #
 # support for page migration
 #
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9237976abe72..4ff99724d669 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2407,7 +2407,8 @@ static void __split_huge_page_tail(struct page *head, int 
tail,
 #ifdef CONFIG_64BIT
 (1L << PG_arch_2) |
 #endif
-(1L << PG_dirty)));
+(1L << PG_dirty) |
+(1L << PG_zero)));
 
/* ->mapping in first tail page is compound_mapcount */
VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0fccd5f96954..2029668a0864 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1029,6 +1029,7 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
+   __ClearPageZero(page);
if (hugepage_reported(page))
__ClearPageReported(page);
hugepage_reporting_notify_free(h->order);
@@ -1315,6 +1316,7 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
set_page_refcounted(page);
+   __ClearPageZero(page);
if (hstate_is_gigantic(h)) {
/*
 * Temporarily drop the hugetlb_lock, because
@@ -2963,6 +2965,237 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, 
struct kobject *parent,
return retval;
 }
 
+#ifdef CONFIG_PREZERO_HPAGE
+
+#define PRE_ZERO_STOP  0
+#define PRE_ZERO_RUN   1
+
+static int mini_page_order;
+static unsigned long batch_size = 16 * 1024 * 1024;
+static unsigned long delay_millisecs = 1000;
+static unsigned long prezero_enable __read_mostly;
+static DEFINE_MUTEX(kprezerod_mutex);
+static struct page_reporting_dev_info pre_zero_hpage_dev_info;
+
+static int zero_out_pages(struct page_reporting_dev_info *pr_dev_info,
+  struct scatterlist *sgl, unsigned int nents)
+{
+   struct scatterlist *sg = sgl;
+
+   might_sleep();
+   do {
+   struct page *page = sg_page(sg);
+   unsigned int i, order = get_order(sg->length);
+
+   VM_BUG_ON(PageBuddy(page) || buddy_order(page));
+
+   if (PageZero(page))
+   continue;
+   for (i = 0; i < (1 << order); i++) {
+   cond_resched();
+   clear_highpage(page + i);
+  

[PATCH 5/6] virtio-balloon: reporting hugetlb free page to host

2021-01-05 Thread Liang Li
Free page reporting only supports buddy pages; it can't report the
free pages reserved for the hugetlbfs case. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.  This patch adds support for reporting free hugepages to the
host when the guest uses hugetlbfs.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 drivers/virtio/virtio_balloon.c | 55 +++--
 1 file changed, 53 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 684bcc39ef5a..7bd7fcacee8c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -126,6 +126,10 @@ struct virtio_balloon {
/* Free page reporting device */
struct virtqueue *reporting_vq;
struct page_reporting_dev_info pr_dev_info;
+
+   /* Free hugepage reporting device */
+   struct page_reporting_dev_info hpr_dev_info;
+   struct mutex mtx_report;
 };
 
 static const struct virtio_device_id id_table[] = {
@@ -173,6 +177,38 @@ static int virtballoon_free_page_report(struct 
page_reporting_dev_info *pr_dev_i
struct virtqueue *vq = vb->reporting_vq;
unsigned int unused, err;
 
+   mutex_lock(&vb->mtx_report);
+   /* We should always be able to add these buffers to an empty queue. */
+   err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
+
+   /*
+* In the extremely unlikely case that something has occurred and we
+* are able to trigger an error we will simply display a warning
+* and exit without actually processing the pages.
+*/
+   if (WARN_ON_ONCE(err)) {
+   mutex_unlock(&vb->mtx_report);
+   return err;
+   }
+
+   virtqueue_kick(vq);
+
+   /* When host has read buffer, this completes via balloon_ack */
+   wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+   mutex_unlock(&vb->mtx_report);
+
+   return 0;
+}
+
+static int virtballoon_free_hugepage_report(struct page_reporting_dev_info 
*hpr_dev_info,
+  struct scatterlist *sg, unsigned int nents)
+{
+   struct virtio_balloon *vb =
+   container_of(hpr_dev_info, struct virtio_balloon, hpr_dev_info);
+   struct virtqueue *vq = vb->reporting_vq;
+   unsigned int unused, err;
+
+   mutex_lock(&vb->mtx_report);
/* We should always be able to add these buffers to an empty queue. */
err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
 
@@ -181,13 +217,16 @@ static int virtballoon_free_page_report(struct 
page_reporting_dev_info *pr_dev_i
 * are able to trigger an error we will simply display a warning
 * and exit without actually processing the pages.
 */
-   if (WARN_ON_ONCE(err))
+   if (WARN_ON_ONCE(err)) {
+   mutex_unlock(&vb->mtx_report);
return err;
+   }
 
virtqueue_kick(vq);
 
/* When host has read buffer, this completes via balloon_ack */
wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+   mutex_unlock(&vb->mtx_report);
 
return 0;
 }
@@ -984,9 +1023,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
}
 
vb->pr_dev_info.report = virtballoon_free_page_report;
+   vb->hpr_dev_info.report = virtballoon_free_hugepage_report;
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
unsigned int capacity;
 
+   mutex_init(&vb->mtx_report);
capacity = virtqueue_get_vring_size(vb->reporting_vq);
if (capacity < PAGE_REPORTING_CAPACITY) {
err = -ENOSPC;
@@ -999,6 +1040,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
err = page_reporting_register(&vb->pr_dev_info);
if (err)
goto out_unregister_oom;
+
+   vb->hpr_dev_info.mini_order = MAX_ORDER - 1;
+   vb->hpr_dev_info.batch_size = 16 * 1024 * 1024; /* 16M */
+   vb->hpr_dev_info.delay_jiffies = 2 * HZ; /* 2 seconds */
+   err = hugepage_reporting_register(&vb->hpr_dev_info);
+   if (err)
+   goto out_unregister_oom;
+
}
 
virtio_device_ready(vdev);
@@ -1051,8 +1100,10 @@ static void virtballoon_remove(struct virtio_device 
*vdev)
 {
struct virtio_balloon *vb = vdev->priv;
 
-   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
page_reporting_unregister(&g

[PATCH 3/6] hugetlb: add free page reporting support

2021-01-05 Thread Liang Li
hugetlb manages its pages in the hstate's free page list, not in the buddy
system; this patch tries to make free page reporting work for hugetlbfs. It
can be used for memory overcommit in virtualization and hugetlb pre zero
out.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 include/linux/hugetlb.h|   3 +
 include/linux/page_reporting.h |   4 +
 mm/Kconfig |   1 +
 mm/hugetlb.c   |  19 +++
 mm/page_reporting.c| 297 +
 mm/page_reporting.h|  34 
 6 files changed, 358 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..d55e6a00b3dc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct ctl_table;
 struct user_struct;
@@ -114,6 +115,8 @@ int hugetlb_treat_movable_handler(struct ctl_table *, int, 
void *, size_t *,
 int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
 
+void isolate_free_huge_page(struct page *page, struct hstate *h, int nid);
+void putback_isolate_huge_page(struct hstate *h, struct page *page);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct 
vm_area_struct *);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 struct page **, struct vm_area_struct **,
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 63e1e9fbcaa2..884afa2ad70c 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -26,4 +26,8 @@ struct page_reporting_dev_info {
 /* Tear-down and bring-up for page reporting devices */
 void page_reporting_unregister(struct page_reporting_dev_info *prdev);
 int page_reporting_register(struct page_reporting_dev_info *prdev);
+
+/* Tear-down and bring-up for hugepage reporting devices */
+void hugepage_reporting_unregister(struct page_reporting_dev_info *prdev);
+int hugepage_reporting_register(struct page_reporting_dev_info *prdev);
 #endif /*_LINUX_PAGE_REPORTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4275c25b5d8a..630cde982186 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -247,6 +247,7 @@ config COMPACTION
 config PAGE_REPORTING
bool "Free page reporting"
def_bool n
+   select HUGETLBFS
help
  Free page reporting allows for the incremental acquisition of
  free pages from the buddy allocator for the purpose of reporting
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cbf32d2824fd..eb533995cb49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include "page_reporting.h"
 #include "internal.h"
 
 int hugetlb_max_hstate __read_mostly;
@@ -1028,6 +1029,9 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
+   if (hugepage_reported(page))
+   __ClearPageReported(page);
+   hugepage_reporting_notify_free(h->order);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -5531,6 +5535,21 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long 
address, pgd_t *pgd, int fla
return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
PAGE_SHIFT);
 }
 
+void isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
+{
+   VM_BUG_ON_PAGE(!PageHead(page), page);
+
+   list_move(&page->lru, &h->hugepage_activelist);
+   set_page_refcounted(page);
+}
+
+void putback_isolate_huge_page(struct hstate *h, struct page *page)
+{
+   int nid = page_to_nid(page);
+
+   list_move(&page->lru, &h->hugepage_freelists[nid]);
+}
+
 bool isolate_huge_page(struct page *page, struct list_head *list)
 {
bool ret = true;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 39bc6a9d7b73..cc31696225bb 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -6,15 +6,22 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "page_reporting.h"
 #include "internal.h"
 
 #define PAGE_REPORTING_DELAY   (2 * HZ)
+#define MAX_SCAN_NUM   1024
+
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
 unsigned long page_report_batch_size  __read_mostly = 16 * 1024 * 1024UL;
 int page_report_mini_order = pageblock_order;
 
+static struct page_reporting_dev_info __rcu *hgpr_dev_info __read_mostly;
+int hugepage_report_mini_order = pageblock_order;
+unsigned long hugepage_report_batch_size = 16 * 1024 * 1024;
+
 enum {
PAGE_REPORTING_IDLE = 0,
PAGE_REPORTING_REQUESTED,
@@ -66,6 +73,24 @@ void __page_r
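
For orientation, the hugetlb hooks in this patch plug into the same small contract as the existing buddy free page reporting: a driver fills in a page_reporting_dev_info and its report() callback receives batches of already-isolated pages as a scatterlist. A simplified consumer sketch; the example_* names are illustrative and the tuning fields come from patches 2/6 and 5/6:

static int example_hugepage_report(struct page_reporting_dev_info *prdev,
				   struct scatterlist *sgl, unsigned int nents)
{
	/*
	 * The pages behind sgl are isolated from the hstate free list
	 * while this runs, so the consumer (e.g. a hypervisor) can
	 * safely discard or zero the backing memory.  Returning 0 lets
	 * the core mark the pages reported and put them back.
	 */
	return 0;
}

static struct page_reporting_dev_info example_dev = {
	.report		= example_hugepage_report,
	.mini_order	= MAX_ORDER - 1,
	.batch_size	= 16 * 1024 * 1024,	/* 16M, as in patch 5/6 */
	.delay_jiffies	= 2 * HZ,
};

/* registered with hugepage_reporting_register(&example_dev); */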

[PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-05 Thread Liang Li
Page reporting isolates free pages temporarily while reporting
free page information. This reduces the number of actually free pages
and may cause applications to fail because not enough memory is available.
This patch tries to solve the issue: when there is no free page
and page reporting is ongoing, wait until it is done.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 include/linux/hugetlb.h | 2 ++
 mm/hugetlb.c| 9 +
 mm/page_reporting.c | 6 +-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d55e6a00b3dc..73b2934ba91c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -490,6 +490,7 @@ struct hstate {
unsigned long resv_huge_pages;
unsigned long surplus_huge_pages;
unsigned long nr_overcommit_huge_pages;
+   unsigned long isolated_huge_pages;
struct list_head hugepage_activelist;
struct list_head hugepage_freelists[MAX_NUMNODES];
unsigned int nr_huge_pages_node[MAX_NUMNODES];
@@ -500,6 +501,7 @@ struct hstate {
struct cftype cgroup_files_dfl[7];
struct cftype cgroup_files_legacy[9];
 #endif
+   struct mutex mtx_prezero;
char name[HSTATE_NAME_LEN];
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eb533995cb49..0fccd5f96954 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
goto out_uncharge_cgroup_reservation;
 
spin_lock(&hugetlb_lock);
+   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
+   spin_unlock(&hugetlb_lock);
+   mutex_lock(&h->mtx_prezero);
+   mutex_unlock(&h->mtx_prezero);
+   spin_lock(&hugetlb_lock);
+   }
/*
 * glb_chg is passed to indicate whether or not a page must be taken
 * from the global free pool (global change).  gbl_chg == 0 indicates
@@ -3208,6 +3214,7 @@ void __init hugetlb_add_hstate(unsigned int order)
INIT_LIST_HEAD(&h->hugepage_activelist);
h->next_nid_to_alloc = first_memory_node;
h->next_nid_to_free = first_memory_node;
+   mutex_init(&h->mtx_prezero);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
 
@@ -5541,6 +5548,7 @@ void isolate_free_huge_page(struct page *page, struct 
hstate *h, int nid)
 
list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
+   h->isolated_huge_pages++;
 }
 
 void putback_isolate_huge_page(struct hstate *h, struct page *page)
@@ -5548,6 +5556,7 @@ void putback_isolate_huge_page(struct hstate *h, struct 
page *page)
int nid = page_to_nid(page);
 
list_move(&page->lru, &h->hugepage_freelists[nid]);
+   h->isolated_huge_pages--;
 }
 
 bool isolate_huge_page(struct page *page, struct list_head *list)
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index cc31696225bb..99e1e688d7c1 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -272,12 +272,15 @@ hugepage_reporting_process_hstate(struct 
page_reporting_dev_info *prdev,
int ret = 0, nid;
 
offset = max_items;
+   mutex_lock(&h->mtx_prezero);
for (nid = 0; nid < MAX_NUMNODES; nid++) {
ret = hugepage_reporting_cycle(prdev, h, nid, sgl, &offset,
   max_items);
 
-   if (ret < 0)
+   if (ret < 0) {
+   mutex_unlock(&h->mtx_prezero);
return ret;
+   }
}
 
/* report the leftover pages before going idle */
@@ -291,6 +294,7 @@ hugepage_reporting_process_hstate(struct 
page_reporting_dev_info *prdev,
hugepage_reporting_drain(prdev, h, sgl, leftover, !ret);
spin_unlock(&hugetlb_lock);
}
+   mutex_unlock(&h->mtx_prezero);
 
return ret;
 }
-- 
2.18.2



[PATCH 2/6] mm: let user decide page reporting option

2021-01-05 Thread Liang Li
Some key parameters for page reporting are currently hard coded; different
users of the framework may have their own requirements. Make
these parameters configurable and let the user decide them.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 drivers/virtio/virtio_balloon.c |  3 +++
 include/linux/page_reporting.h  |  3 +++
 mm/page_reporting.c | 13 +
 mm/page_reporting.h |  6 +++---
 4 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8985fc2cea86..684bcc39ef5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -993,6 +993,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_unregister_oom;
}
 
+   vb->pr_dev_info.mini_order = pageblock_order;
+   vb->pr_dev_info.batch_size = 16 * 1024 * 1024; /* 16M */
+   vb->pr_dev_info.delay_jiffies = 2 * HZ; /* 2 seconds */
err = page_reporting_register(&vb->pr_dev_info);
if (err)
goto out_unregister_oom;
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 3b99e0ec24f2..63e1e9fbcaa2 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -13,6 +13,9 @@ struct page_reporting_dev_info {
int (*report)(struct page_reporting_dev_info *prdev,
  struct scatterlist *sg, unsigned int nents);
 
+   unsigned long batch_size;
+   unsigned long delay_jiffies;
+   int mini_order;
/* work struct for processing reports */
struct delayed_work work;
 
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 694df981ddd2..39bc6a9d7b73 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -13,6 +13,7 @@
 #define PAGE_REPORTING_DELAY   (2 * HZ)
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
 unsigned long page_report_batch_size  __read_mostly = 16 * 1024 * 1024UL;
+int page_report_mini_order = pageblock_order;
 
 enum {
PAGE_REPORTING_IDLE = 0,
@@ -44,7 +45,7 @@ __page_reporting_request(struct page_reporting_dev_info 
*prdev)
 * now we are limiting this to running no more than once every
 * couple of seconds.
 */
-   schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+   schedule_delayed_work(&prdev->work, prdev->delay_jiffies);
 }
 
 /* notify prdev of free page reporting request */
@@ -230,7 +231,7 @@ page_reporting_process_zone(struct page_reporting_dev_info 
*prdev,
 
/* Generate minimum watermark to be able to guarantee progress */
watermark = low_wmark_pages(zone) +
-   (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
+   (PAGE_REPORTING_CAPACITY << prdev->mini_order);
 
/*
 * Cancel request if insufficient free memory or if we failed
@@ -240,7 +241,7 @@ page_reporting_process_zone(struct page_reporting_dev_info 
*prdev,
return err;
 
/* Process each free list starting from lowest order/mt */
-   for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+   for (order = prdev->mini_order; order < MAX_ORDER; order++) {
for (mt = 0; mt < MIGRATE_TYPES; mt++) {
/* We do not pull pages from the isolate free list */
if (is_migrate_isolate(mt))
@@ -307,7 +308,7 @@ static void page_reporting_process(struct work_struct *work)
 */
state = atomic_cmpxchg(&prdev->state, state, PAGE_REPORTING_IDLE);
if (state == PAGE_REPORTING_REQUESTED)
-   schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+   schedule_delayed_work(&prdev->work, prdev->delay_jiffies);
 }
 
 static DEFINE_MUTEX(page_reporting_mutex);
@@ -335,6 +336,8 @@ int page_reporting_register(struct page_reporting_dev_info 
*prdev)
/* Assign device to allow notifications */
rcu_assign_pointer(pr_dev_info, prdev);
 
+   page_report_mini_order = prdev->mini_order;
+   page_report_batch_size = prdev->batch_size;
/* enable page reporting notification */
if (!static_key_enabled(&page_reporting_enabled)) {
static_branch_enable(&page_reporting_enabled);
@@ -352,6 +355,8 @@ void page_reporting_unregister(struct 
page_reporting_dev_info *prdev)
mutex_lock(&page_reporting_mutex);
 
if (rcu_access_pointer(pr_dev_info) == prdev) {
+   if (static_key_enabled(&page_reporting_enabled))
+   static_branch_disable(&page_reporting_enabled);
/* Disable page reporting notification */
RCU_INIT_POINTER(pr_d

[PATCH 1/6] mm: Add batch size for free page reporting

2021-01-05 Thread Liang Li
Using the page order as the only threshold for page reporting
is not flexible and has some flaws. Because scanning a long free
list is not cheap, it's better to wake up the page reporting
worker when there are more pages; waking it up for a single page
may not be worth it.
This patch adds a batch size as another threshold to control the
waking up of the reporting worker.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 mm/page_reporting.c |  1 +
 mm/page_reporting.h | 12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index cd8e13d41df4..694df981ddd2 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -12,6 +12,7 @@
 
 #define PAGE_REPORTING_DELAY   (2 * HZ)
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
+unsigned long page_report_batch_size  __read_mostly = 16 * 1024 * 1024UL;
 
 enum {
PAGE_REPORTING_IDLE = 0,
diff --git a/mm/page_reporting.h b/mm/page_reporting.h
index 2c385dd4ddbd..b8fb3bbb345f 100644
--- a/mm/page_reporting.h
+++ b/mm/page_reporting.h
@@ -12,6 +12,8 @@
 
 #define PAGE_REPORTING_MIN_ORDER   pageblock_order
 
+extern unsigned long page_report_batch_size;
+
 #ifdef CONFIG_PAGE_REPORTING
 DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
 void __page_reporting_notify(void);
@@ -33,6 +35,8 @@ static inline bool page_reported(struct page *page)
  */
 static inline void page_reporting_notify_free(unsigned int order)
 {
+   static long batch_size;
+
/* Called from hot path in __free_one_page() */
if (!static_branch_unlikely(&page_reporting_enabled))
return;
@@ -41,8 +45,12 @@ static inline void page_reporting_notify_free(unsigned int 
order)
if (order < PAGE_REPORTING_MIN_ORDER)
return;
 
-   /* This will add a few cycles, but should be called infrequently */
-   __page_reporting_notify();
+   batch_size += (1 << order) << PAGE_SHIFT;
+   if (batch_size >= page_report_batch_size) {
+   batch_size = 0;
+   /* This add a few cycles, but should be called infrequently */
+   __page_reporting_notify();
+   }
 }
 #else /* CONFIG_PAGE_REPORTING */
 #define page_reported(_page)   false
-- 
2.18.2



[PATCH 0/6] hugetlbfs: support free page reporting

2021-01-05 Thread Liang Li
A typical usage of hugetlbfs is to reserve an amount of memory
during the kernel boot stage, and the reserved pages are
unlikely to return to the buddy system. When an application needs
hugepages, the kernel will allocate them from the reserved pool.
When the application terminates, the huge pages return to the
reserved pool and are kept in the free list for hugetlbfs;
these free pages will not return to the buddy freelist unless the
size of the reserved pool is changed.
Free page reporting only supports buddy pages; it can't report
the free pages reserved for hugetlbfs. On the other hand,
hugetlbfs is a good choice for a system with a huge amount of RAM,
because it can help to reduce the memory management overhead and
improve system performance.
This patch set adds support for reporting hugepages in the free
list of hugetlbfs; it can be used by the virtio_balloon driver for
memory overcommit and to pre zero out free pages, speeding up
memory population and page fault handling.

Most of the code is 'copied' from free page reporting because
it works in the same way, so the code can be refined to
remove duplication. That can be done later.

Since some people have concerns about the side effects the 'buddy
free page pre zero out' feature brings, I removed it from this
series.

Liang Li (6):
  mm: Add batch size for free page reporting
  mm: let user decide page reporting option
  hugetlb: add free page reporting support
  hugetlb: avoid allocation failed when page reporting is on going
  virtio-balloon: reporting hugetlb free page to host
  hugetlb: support free hugepage pre zero out

 drivers/virtio/virtio_balloon.c |  58 +-
 include/linux/hugetlb.h |   5 +
 include/linux/page-flags.h  |  12 ++
 include/linux/page_reporting.h  |   7 +
 mm/Kconfig  |  11 ++
 mm/huge_memory.c|   3 +-
 mm/hugetlb.c| 271 +++
 mm/memory.c |   4 +
 mm/page_reporting.c | 315 +++-
 mm/page_reporting.h |  50 -
 10 files changed, 725 insertions(+), 11 deletions(-)

-- 
2.18.2



Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2021-01-05 Thread Liang Li
> >> That's mostly already existing scheduling logic, no? (How many vms can I 
> >> put onto a specific machine eventually)
> >
> > It depends on how the scheduling component is designed. Yes, you can put
> > 10 VMs with 4C8G (4 CPUs, 8G RAM) on a host and 20 VMs with 2C4G on
> > another one. But if one type of them, e.g. 4C8G, is sold out, customers
> > can't buy more 4C8G VMs while there are still free 2C4G VMs; the resources
> > reserved for those could instead be provided as 4C8G VMs.
> >
>
> 1. You can, just the startup time will be a little slower? E.g., grow
> pre-allocated 4G file to 8G.
>
> 2. Or let's be creative: teach QEMU to construct a single
> RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you
> don't go crazy on different VM sizes / size differences.
>
> 3. In your example above, you can dynamically rebalance as VMs are
> getting sold, to make sure you always have "big ones" lying around you
> can shrink on demand.
>
Yes, we can always come up with some way to make things work,
but it will drive the developers of the upper-layer components crazy :)

> >
> > You must know there are a lot of functions in the kernel which can
> > be done in userspace. e.g. Some of the device emulations like APIC,
> > vhost-net backend which has userspace implementation.   :)
> > Bad or not depends on the benefits the solution brings.
> > From the viewpoint of a user space application, the kernel should
> > provide high performance memory management service. That's why
> > I think it should be done in the kernel.
>
> As I expressed a couple of times already, I don't see why using
> hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient.

Did I miss something before? I thought you doubted the need for
hugetlbfs free page pre-zeroing. Hugetlbfs is a good choice and is
sufficient.

> We really don't *want* complicated things deep down in the mm core if
> there are reasonable alternatives.
>
I understand your concern; we should have a sufficient reason to add a new
feature to the kernel. For this one, its main value is to make the
application's life easier, and implementing it in hugetlbfs avoids
adding more complexity to the core MM.
I will send out a new revision and drop the 'buddy free pages pre
zero out' part. Thanks for your suggestion!

Liang


Re: [RFC v2 PATCH 4/4] mm: pre zero out free pages to speed up page allocation for __GFP_ZERO

2021-01-05 Thread Liang Li
On Tue, Jan 5, 2021 at 5:30 PM David Hildenbrand  wrote:
>
> On 05.01.21 10:20, Michal Hocko wrote:
> > On Mon 04-01-21 15:00:31, Dave Hansen wrote:
> >> On 1/4/21 12:11 PM, David Hildenbrand wrote:
>  Yeah, it certainly can't be the default, but it *is* useful for
>  thing where we know that there are no cache benefits to zeroing
>  close to where the memory is allocated.
> 
>  The trick is opting into it somehow, either in a process or a VMA.
> 
> >>> The patch set is mostly trying to optimize starting a new process. So
> >>> process/vma doesn‘t really work.
> >>
> >> Let's say you have a system-wide tunable that says: pre-zero pages and
> >> keep 10GB of them around.  Then, you opt-in a process to being allowed
> >> to dip into that pool with a process-wide flag or an madvise() call.
> >> You could even have the flag be inherited across execve() if you wanted
> >> to have helper apps be able to set the policy and access the pool like
> >> how numactl works.
> >
> > While possible, it sounds quite heavy weight to me. Page allocator would
> > have to somehow maintain those pre-zeroed pages. This pool will also
> > become a very scarce resource very soon because everybody just want to
> > run faster. So this would open many more interesting questions.
>
> Agreed.
>
> >
> > A global knob with all or nothing sounds like an easier to use and
> > maintain solution to me.
>
> I mean, that brings me back to my original suggestion: just use
> hugetlbfs and implement some sort of pre-zeroing there (worker thread,
> whatsoever). Most vfio users should already be better of using hugepages.
>
> It's a "pool of pages" already. Selected users use it. I really don't
> see a need to extend the buddy with something like that.
>

OK, since most people prefer hugetlbfs, I will send another revision for this.

Thanks
Liang


Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2021-01-04 Thread Liang Li
> >>> In our production environment, there are three main applications with such a
> >>> requirement: one is QEMU [creating a VM with an SR-IOV passthrough device],
> >>> and the other two are DPDK-related applications, DPDK OVS and SPDK vhost;
> >>> for best performance, they populate memory when starting up. For SPDK vhost,
> >>> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for
> >>> vhost 'live' upgrade, which is done by killing the old process and starting
> >>> a new one with the new binary. In this case, we want the new process to start
> >>> as quickly as possible to shorten the service downtime. We really enable this
> >>> feature to speed up startup time for them  :)
>
> Am I wrong or does using hugetlbfs/tmpfs ... i.e., a file not-deleted between 
> shutting down the old instances and firing up the new instance just solve 
> this issue?

You are right, it works for the SPDK vhost upgrade case.

>
> >>
> >> Thanks for info on the use case!
> >>
> >> All of these use cases either already use, or could use, huge pages
> >> IMHO. It's not your ordinary proprietary gaming app :) This is where
> >> pre-zeroing of huge pages could already help.
> >
> > You are welcome.  For some historical reason, some of our services are
> > not using hugetlbfs, that is why I didn't start with hugetlbfs.
> >
> >> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
> >> creating a file and pre-zeroing it from another process, or am I missing
> >> something important? At least for QEMU this should work AFAIK, where you
> >> can just pass the file to be use using memory-backend-file.
> >>
> > If we use another process to create the file, we can offload the overhead to
> > that process, and there is no need to pre-zero its content; just
> > populating the memory is enough.
>
> Right, if non-zero memory can be tolerated (e.g., for vms usually has to).

I mean there is obviously no need to pre-zero the file content in user space;
the kernel will do it when populating the memory.

> > If we do it that way, then how to determine the size of the file? it depends
> > on the RAM size of the VM the customer buys.
> > Maybe we can create a file
> > large enough in advance and truncate it to the right size just before the
> > VM is created. Then, how many large files should be created on a host?
>
> That's mostly already existing scheduling logic, no? (How many vms can I put 
> onto a specific machine eventually)

It depends on how the scheduling component is designed. Yes, you can put
10 VMs with 4C8G (4 CPUs, 8G RAM) on a host and 20 VMs with 2C4G on
another one. But if one type of them, e.g. 4C8G, is sold out, customers
can't buy more 4C8G VMs while there are still free 2C4G VMs; the resources
reserved for those could instead be provided as 4C8G VMs.

> > You will find there are a lot of things that have to be handled properly.
> > I think it's possible to make it work well, but we will transfer the
> > management complexity to upper-layer components. It's a bad practice to let
> > upper-layer components process such low-level details, which should be
> > handled in the OS layer.
>
> It's bad practice to squeeze things into the kernel that can just be handled 
> on upper layers ;)
>

You must know there are a lot of functions in the kernel which could
be done in userspace, e.g. some of the device emulations like the APIC,
or the vhost-net backend, which has a userspace implementation.   :)
Whether that is bad or not depends on the benefits the solution brings.
From the viewpoint of a user space application, the kernel should
provide a high performance memory management service. That's why
I think it should be done in the kernel.

Thanks
Liang


Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2021-01-04 Thread Liang Li
On Mon, Jan 4, 2021 at 8:56 PM Michal Hocko  wrote:
>
> On Mon 21-12-20 11:25:22, Liang Li wrote:
> [...]
> > Security
> > 
> > This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> > boot options", which zeroes out pages in an asynchronous way. For users who
> > can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1' brings,
> > this feature provides another choice.
>
> Most of the use cases are about the start up time improvements IIUC. Have
> you tried to use init_on_free or this would be prohibitive for your
> workloads?
>
I have not tried it yet. 'init_on_free' may help to shorten the start up time. In
our use case, we care about both the VM creation time and the VM reboot
time [terminate the QEMU process first and launch a new one]; 'init_on_free'
will slow down the termination process and is not helpful for VM reboot.
Our aim is to speed up 'VM start up' without slowing down 'VM shut down'.

Thanks
Liang


Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2021-01-04 Thread Liang Li
> > Whether it is a win depends on its effect. For our case, it solves the issue
> > we faced, so it can be thought of as a win for us. If others don't
> > have the issue we faced, the result will be different; maybe they will
> > be affected by the side effects of this feature. I think this is your
> > concern behind the question, right? I will try to do more tests and
> > provide more benchmark performance data.
>
> Yes, zeroing memory does have a noticeable overhead but we cannot
> simply allow tasks to spill this overhead over to all other users by
> default. So if anything this would need to be an opt-in feature
> configurable by administrator.
> --
> Michal Hocko
> SUSE Labs

I know about the overhead, so I added a switch in /sys/ to enable or disable
it dynamically.

Thanks
Liang
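
For reference, such a switch can be a plain sysfs attribute guarded by a mutex, much like the prezero_enable / kprezerod_mutex pair that appears in patch 6/6 of the hugetlb series. The store helper below is a sketch with assumed names, not the posted code:

static ssize_t prezero_enable_store(struct kobject *kobj,
				    struct kobj_attribute *attr,
				    const char *buf, size_t count)
{
	unsigned long val;
	int err;

	err = kstrtoul(buf, 10, &val);
	if (err)
		return err;

	mutex_lock(&kprezerod_mutex);
	/* Flip the global switch; the worker checks it before each pass. */
	prezero_enable = !!val;
	mutex_unlock(&kprezerod_mutex);

	return count;
}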


Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-23 Thread Liang Li
> >>> +static int
> >>> +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> >>> +  struct hstate *h, unsigned int nid,
> >>> +  struct scatterlist *sgl, unsigned int *offset)
> >>> +{
> >>> + struct list_head *list = >hugepage_freelists[nid];
> >>> + unsigned int page_len = PAGE_SIZE << h->order;
> >>> + struct page *page, *next;
> >>> + long budget;
> >>> + int ret = 0, scan_cnt = 0;
> >>> +
> >>> + /*
> >>> +  * Perform early check, if free area is empty there is
> >>> +  * nothing to process so we can skip this free_list.
> >>> +  */
> >>> + if (list_empty(list))
> >>> + return ret;
> >>
> >> Do note that not all entries on the hugetlb free lists are free.  Reserved
> >> entries are also on the free list.  The actual number of free entries is
> >> 'h->free_huge_pages - h->resv_huge_pages'.
> >> Is the intention to process reserved pages as well as free pages?
> >
> > Yes, Reserved pages was treated as 'free pages'
>
> If that is true, then this code breaks hugetlb.  hugetlb code assumes that
> h->free_huge_pages is ALWAYS >= h->resv_huge_pages.  This code would break
> that assumption.  If you really want to add support for hugetlb pages, then
> you will need to take reserved pages into account.

I didn't know that. Thanks!

> P.S. There might be some confusion about 'reservations' based on the
> commit message.  My comments are directed at hugetlb reservations described
> in Documentation/vm/hugetlbfs_reserv.rst.
>
> >>> + /* Attempt to pull page from list and place in scatterlist 
> >>> */
> >>> + if (*offset) {
> >>> + isolate_free_huge_page(page, h, nid);
> >>
> >> Once a hugetlb page is isolated, it can not be used and applications that
> >> depend on hugetlb pages can start to fail.
> >> I assume that is acceptable/expected behavior.  Correct?
> >> On some systems, hugetlb pages are a precious resource and the sysadmin
> >> carefully configures the number needed by applications.  Removing a hugetlb
> >> page (even for a very short period of time) could cause serious application
> >> failure.
> >
> > That' true, especially for 1G pages. Any suggestions?
> > Let the hugepage allocator be aware of this situation and retry ?
>
> I would hate to add that complexity to the allocator.
>
> This question is likely based on my lack of understanding of virtio-balloon
> usage and this reporting mechanism.  But, why do the hugetlb pages have to
> be 'temporarily' allocated for reporting purposes?

The link here will give you more detail about how page reporting
works: https://www.kernel.org/doc/html/latest//vm/free_page_reporting.html
The virtio-balloon driver is based on this framework and reports the
free page information to QEMU; the host can unmap the memory
regions corresponding to the reported free pages and reclaim the memory
for other use, which is useful for memory overcommit.
Allocating the pages 'temporarily' before reporting is necessary: it makes
sure the guest will not use a page while the host side unmaps the region,
otherwise it would break the guest.

Now I realize we should solve this issue first; it seems adding a lock
will help.

Thanks


Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-23 Thread Liang Li
> > > > +   spin_lock_irq(_lock);
> > > > +
> > > > +   if (huge_page_order(h) > MAX_ORDER)
> > > > +   budget = HUGEPAGE_REPORTING_CAPACITY;
> > > > +   else
> > > > +   budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> > >
> > > Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> > > don't even really need budget since this should probably be pulling
> > > out no more than one hugepage at a time.
> >
> > I want to disting a 2M page and 1GB page here. The order of 1GB page is 
> > greater
> > than MAX_ORDER while 2M page's order is less than MAX_ORDER.
>
> The budget here is broken. When I put the budget in page reporting it
> was so that we wouldn't try to report all of the memory in a given
> region. It is meant to hold us to no more than one pass through 1/16
> of the free memory. So essentially we will be slowly processing all of
> memory and it will take 16 calls (32 seconds) for us to process a
> system that is sitting completely idle. It is meant to pace us so we
> don't spend a ton of time doing work that will be undone, not to
> prevent us from burying a CPU which is what seems to be implied here.
>
> Using HUGEPAGE_REPORTING_CAPACITY makes no sense here. I was using it
> in the original definition because it was how many pages we could
> scoop out at a time and then I was aiming for a 16th of that. Here you
> are arbitrarily squaring HUGEPAGE_REPORTING_CAPACITY in terms of the
> amount of work you will doo since you are using it as a multiple
> instead of a divisor.
>
> > >
> > > > +   /* loop through free list adding unreported pages to sg list */
> > > > +   list_for_each_entry_safe(page, next, list, lru) {
> > > > +   /* We are going to skip over the reported pages. */
> > > > +   if (PageReported(page)) {
> > > > +   if (++scan_cnt >= MAX_SCAN_NUM) {
> > > > +   ret = scan_cnt;
> > > > +   break;
> > > > +   }
> > > > +   continue;
> > > > +   }
> > > > +
> > >
> > > It would probably have been better to place this set before your new
> > > set. I don't see your new set necessarily being the best use for page
> > > reporting.
> >
> > I haven't really latched on to what you mean, could you explain it again?
>
> It would be better for you to spend time understanding how this patch
> set works before you go about expanding it to do other things.
> Mistakes like the budget one above kind of point out the fact that you
> don't understand how this code was supposed to work and just kind of
> shoehorned you page zeroing code onto it.
>
> It would be better to look at trying to understand this code first
> before you extend it to support your zeroing use case. So adding huge
> pages first might make more sense than trying to zero and push the
> order down. The fact is the page reporting extension should be minimal
> for huge pages since they are just passed as a scatterlist so you
> should only need to add a small bit to page_reporting.c to extend it
> to support this use case.
>
> > >
> > > > +   /*
> > > > +* If we fully consumed our budget then update our
> > > > +* state to indicate that we are requesting additional
> > > > +* processing and exit this list.
> > > > +*/
> > > > +   if (budget < 0) {
> > > > +   atomic_set(>state, 
> > > > PAGE_REPORTING_REQUESTED);
> > > > +   next = page;
> > > > +   break;
> > > > +   }
> > > > +
> > >
> > > If budget is only ever going to be 1 then we probably could just look
> > > at making this the default case for any time we find a non-reported
> > > page.
> >
> > and here again.
>
> It comes down to the fact that the changes you made have a significant
> impact on how this is supposed to function. Reducing the scatterlist
> to a size of one makes the whole point of doing batching kind of
> pointless. Basically the code should be rewritten with the assumption
> that if you find a page you report it.
>
> The old code would batch things up because there is significant
> overhead to be addressed when going to the hypervisor to report said
> memory. Your code doesn't seem to really take anything like that into
> account and instead is using an arbitrary budget value based on the
> page size.
>
> > > > +   /* Attempt to pull page from list and place in 
> > > > scatterlist */
> > > > +   if (*offset) {
> > > > +   isolate_free_huge_page(page, h, nid);
> > > > +   /* Add page to scatter list */
> > > > +   --(*offset);
> > > > +   sg_set_page([*offset], page, page_len, 0);
> > > > +
> > > > +   continue;
> > > > +   }
> > > > +
> > >
> > > There is no point in the 

Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-23 Thread Liang Li
On Wed, Dec 23, 2020 at 4:41 PM David Hildenbrand  wrote:
>
> [...]
>
> >> I was rather saying that for security it's of little use IMHO.
> >> Application/VM start up time might be improved by using huge pages (and
> >> pre-zeroing these). Free page reporting might be improved by using
> >> MADV_FREE instead of MADV_DONTNEED in the hypervisor.
> >>
> >>> this feature, above all of them, which one is likely to become the
> >>> most strong one?  From the implementation, you will find it is
> >>> configurable, users don't want to use it can turn it off.  This is not
> >>> an option?
> >>
> >> Well, we have to maintain the feature and sacrifice a page flag. For
> >> example, do we expect someone explicitly enabling the feature just to
> >> speed up startup time of an app that consumes a lot of memory? I highly
> >> doubt it.
> >
> > In our production environment, there are three main applications have such
> > requirement, one is QEMU [creating a VM with SR-IOV passthrough device],
> > anther other two are DPDK related applications, DPDK OVS and SPDK vhost,
> > for best performance, they populate memory when starting up. For SPDK vhost,
> > we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for
> > vhost 'live' upgrade, which is done by killing the old process and
> > starting a new
> > one with the new binary. In this case, we want the new process started as 
> > quick
> > as possible to shorten the service downtime. We really enable this feature
> > to speed up startup time for them  :)
>
> Thanks for info on the use case!
>
> All of these use cases either already use, or could use, huge pages
> IMHO. It's not your ordinary proprietary gaming app :) This is where
> pre-zeroing of huge pages could already help.

You are welcome.  For some historical reasons, some of our services are
not using hugetlbfs; that is why I didn't start with hugetlbfs.

> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
> creating a file and pre-zeroing it from another process, or am I missing
> something important? At least for QEMU this should work AFAIK, where you
> can just pass the file to be use using memory-backend-file.
>
If we use another process to create a file, we can offload the overhead to
that process, and there is no need to pre-zero its content; just
populating the memory is enough.
If we do it that way, then how do we determine the size of the file? It depends
on the RAM size of the VM the customer buys. Maybe we can create a file
large enough in advance and truncate it to the right size just before the
VM is created. Then, how many large files should be created on a host?
You will find there are a lot of things that have to be handled properly.
I think it's possible to make it work well, but we will transfer the
management complexity to upper-layer components. It's a bad practice to let
upper-layer components process such low-level details which should be
handled in the OS layer.

> >
> >> I'd love to hear opinions of other people. (a lot of people are offline
> >> until beginning of January, including, well, actually me :) )
> >
> > OK. I will wait some time for others' feedback. Happy holidays!
>
> To you too, cheers!
>

I have to work for at least two months before the vacation. :(

Liang


Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-22 Thread Liang Li
> On 12/21/20 11:46 PM, Liang Li wrote:
> > Free page reporting only supports buddy pages, it can't report the
> > free pages reserved for hugetlbfs case. On the other hand, hugetlbfs
> > is a good choice for a system with a huge amount of RAM, because it
> > can help to reduce the memory management overhead and improve system
> > performance.
> > This patch add the support for reporting hugepages in the free list
> > of hugetlb, it canbe used by virtio_balloon driver for memory
> > overcommit and pre zero out free pages for speeding up memory population.
>
> My apologies as I do not follow virtio_balloon driver.  Comments from
> the hugetlb perspective.

Any comments are welcome.


> >  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> > @@ -5531,6 +5537,29 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long 
> > address, pgd_t *pgd, int fla
> >   return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
> > PAGE_SHIFT);
> >  }
> >
> > +bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
>
> Looks like this always returns true.  Should it be type void?

Will change it in the next revision.
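
Roughly like this (a sketch of the void variant, with the body taken from the
patch as posted):

void isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
{
	VM_BUG_ON_PAGE(!PageHead(page), page);

	/* Move the page off the free list so it can't be allocated while
	 * it is being reported/zeroed. */
	list_move(&page->lru, &h->hugepage_activelist);
	set_page_refcounted(page);
	h->free_huge_pages--;
	h->free_huge_pages_node[nid]--;
}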

> > +{
> > + bool ret = true;
> > +
> > + VM_BUG_ON_PAGE(!PageHead(page), page);
> > +
> > + list_move(>lru, >hugepage_activelist);
> > + set_page_refcounted(page);
> > + h->free_huge_pages--;
> > + h->free_huge_pages_node[nid]--;
> > +
> > + return ret;
> > +}
> > +
>
> ...

> > +static void
> > +hugepage_reporting_drain(struct page_reporting_dev_info *prdev,
> > +  struct hstate *h, struct scatterlist *sgl,
> > +  unsigned int nents, bool reported)
> > +{
> > + struct scatterlist *sg = sgl;
> > +
> > + /*
> > +  * Drain the now reported pages back into their respective
> > +  * free lists/areas. We assume at least one page is populated.
> > +  */
> > + do {
> > + struct page *page = sg_page(sg);
> > +
> > + putback_isolate_huge_page(h, page);
> > +
> > + /* If the pages were not reported due to error skip flagging 
> > */
> > + if (!reported)
> > + continue;
> > +
> > + __SetPageReported(page);
> > + } while ((sg = sg_next(sg)));
> > +
> > + /* reinitialize scatterlist now that it is empty */
> > + sg_init_table(sgl, nents);
> > +}
> > +
> > +/*
> > + * The page reporting cycle consists of 4 stages, fill, report, drain, and
> > + * idle. We will cycle through the first 3 stages until we cannot obtain a
> > + * full scatterlist of pages, in that case we will switch to idle.
> > + */
>
> As mentioned, I am not familiar with virtio_balloon and the overall design.
> So, some of this does not make sense to me.
>
> > +static int
> > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> > +  struct hstate *h, unsigned int nid,
> > +  struct scatterlist *sgl, unsigned int *offset)
> > +{
> > + struct list_head *list = >hugepage_freelists[nid];
> > + unsigned int page_len = PAGE_SIZE << h->order;
> > + struct page *page, *next;
> > + long budget;
> > + int ret = 0, scan_cnt = 0;
> > +
> > + /*
> > +  * Perform early check, if free area is empty there is
> > +  * nothing to process so we can skip this free_list.
> > +  */
> > + if (list_empty(list))
> > + return ret;
>
> Do note that not all entries on the hugetlb free lists are free.  Reserved
> entries are also on the free list.  The actual number of free entries is
> 'h->free_huge_pages - h->resv_huge_pages'.
> Is the intention to process reserved pages as well as free pages?

Yes, reserved pages were treated as 'free pages'.

> > +
> > + spin_lock_irq(_lock);
> > +
> > + if (huge_page_order(h) > MAX_ORDER)
> > + budget = HUGEPAGE_REPORTING_CAPACITY;
> > + else
> > + budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> > +
> > + /* loop through free list adding unreported pages to sg list */
> > + list_for_each_entry_safe(page, next, list, lru) {
> > + /* We are going to skip over the reported pages. */
> > + if (PageReported(page)) {
> > + if (++scan_cnt >= MAX_SCAN_NUM) {
> > + 

Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-22 Thread Liang Li
> On 12/22/20 11:59 AM, Alexander Duyck wrote:
> > On Mon, Dec 21, 2020 at 11:47 PM Liang Li  
> > wrote:
> >> +
> >> +   if (huge_page_order(h) > MAX_ORDER)
> >> +   budget = HUGEPAGE_REPORTING_CAPACITY;
> >> +   else
> >> +   budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> >
> > Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> > don't even really need budget since this should probably be pulling
> > out no more than one hugepage at a time.
>
> On standard x86_64 configs, 2MB huge pages are of order 9 < MAX_ORDER (11).
> What is important for hugetlb is the largest order that can be allocated
> from buddy.  Anything bigger is considered a gigantic page and has to be
> allocated differently.
>
> If the code above is trying to distinguish between huge and gigantic pages,
> it is off by 1.  The largest order that can be allocated from the buddy is
> (MAX_ORDER - 1).  So, the check should be '>='.
>
> --
> Mike Kravetz

Yes, you're right!  Thanks.
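
For reference, a sketch of the corrected comparison (the rest of the hunk
unchanged); hugetlb already has a helper for exactly this distinction:

	/* Gigantic pages can't come from the buddy allocator, whose largest
	 * order is MAX_ORDER - 1, so the check must be '>=' (or simply use
	 * the existing helper). */
	if (hstate_is_gigantic(h))	/* huge_page_order(h) >= MAX_ORDER */
		budget = HUGEPAGE_REPORTING_CAPACITY;
	else
		budget = HUGEPAGE_REPORTING_CAPACITY * 32;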

Liang


Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-22 Thread Liang Li
> > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> > +struct hstate *h, unsigned int nid,
> > +struct scatterlist *sgl, unsigned int *offset)
> > +{
> > +   struct list_head *list = >hugepage_freelists[nid];
> > +   unsigned int page_len = PAGE_SIZE << h->order;
> > +   struct page *page, *next;
> > +   long budget;
> > +   int ret = 0, scan_cnt = 0;
> > +
> > +   /*
> > +* Perform early check, if free area is empty there is
> > +* nothing to process so we can skip this free_list.
> > +*/
> > +   if (list_empty(list))
> > +   return ret;
> > +
> > +   spin_lock_irq(_lock);
> > +
> > +   if (huge_page_order(h) > MAX_ORDER)
> > +   budget = HUGEPAGE_REPORTING_CAPACITY;
> > +   else
> > +   budget = HUGEPAGE_REPORTING_CAPACITY * 32;
>
> Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> don't even really need budget since this should probably be pulling
> out no more than one hugepage at a time.

I want to distinguish between a 2M page and a 1GB page here. The order of a 1GB page
is greater than MAX_ORDER while a 2M page's order is less than MAX_ORDER.

>
> > +   /* loop through free list adding unreported pages to sg list */
> > +   list_for_each_entry_safe(page, next, list, lru) {
> > +   /* We are going to skip over the reported pages. */
> > +   if (PageReported(page)) {
> > +   if (++scan_cnt >= MAX_SCAN_NUM) {
> > +   ret = scan_cnt;
> > +   break;
> > +   }
> > +   continue;
> > +   }
> > +
>
> It would probably have been better to place this set before your new
> set. I don't see your new set necessarily being the best use for page
> reporting.

I haven't really latched on to what you mean, could you explain it again?

>
> > +   /*
> > +* If we fully consumed our budget then update our
> > +* state to indicate that we are requesting additional
> > +* processing and exit this list.
> > +*/
> > +   if (budget < 0) {
> > +   atomic_set(>state, PAGE_REPORTING_REQUESTED);
> > +   next = page;
> > +   break;
> > +   }
> > +
>
> If budget is only ever going to be 1 then we probably could just look
> at making this the default case for any time we find a non-reported
> page.

and here again.

> > +   /* Attempt to pull page from list and place in scatterlist 
> > */
> > +   if (*offset) {
> > +   isolate_free_huge_page(page, h, nid);
> > +   /* Add page to scatter list */
> > +   --(*offset);
> > +   sg_set_page([*offset], page, page_len, 0);
> > +
> > +   continue;
> > +   }
> > +
>
> There is no point in the continue case if we only have a budget of 1.
> We should probably just tighten up the loop so that all it does is
> search until it finds the 1 page it can pull, pull it, and then return
> it. The scatterlist doesn't serve much purpose and could be reduced to
> just a single entry.

I will think about it more.

> > +static int
> > +hugepage_reporting_process_hstate(struct page_reporting_dev_info *prdev,
> > +   struct scatterlist *sgl, struct hstate *h)
> > +{
> > +   unsigned int leftover, offset = HUGEPAGE_REPORTING_CAPACITY;
> > +   int ret = 0, nid;
> > +
> > +   for (nid = 0; nid < MAX_NUMNODES; nid++) {
> > +   ret = hugepage_reporting_cycle(prdev, h, nid, sgl, );
> > +
> > +   if (ret < 0)
> > +   return ret;
> > +   }
> > +
> > +   /* report the leftover pages before going idle */
> > +   leftover = HUGEPAGE_REPORTING_CAPACITY - offset;
> > +   if (leftover) {
> > +   sgl = [offset];
> > +   ret = prdev->report(prdev, sgl, leftover);
> > +
> > +   /* flush any remaining pages out from the last report */
> > +   spin_lock_irq(_lock);
> > +   hugepage_reporting_drain(prdev, h, sgl, leftover, !ret);
> > +   spin_unlock_irq(_lock);
> > +   }
> > +
> > +   return ret;
> > +}
> > +
>
> If HUGEPAGE_REPORTING_CAPACITY is 1 it would make more sense to
> rewrite this code to just optimize for a find and process a page
> approach rather than trying to batch pages.

Yes, I will make a change. Thanks for your comments!
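
Something along these lines, as a rough sketch of the simplified cycle with
HUGEPAGE_REPORTING_CAPACITY == 1 (reserved-page accounting, which Mike pointed
out, is still left out here):

static int hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
				    struct hstate *h, unsigned int nid,
				    struct scatterlist *sgl)
{
	struct list_head *list = &h->hugepage_freelists[nid];
	unsigned int page_len = PAGE_SIZE << h->order;
	struct page *page;
	int err = 0;

	spin_lock_irq(&hugetlb_lock);
	list_for_each_entry(page, list, lru) {
		if (PageReported(page))
			continue;

		/* Pull the page off the free list so the guest can't use it
		 * while the host side is unmapping it. */
		isolate_free_huge_page(page, h, nid);
		sg_set_page(sgl, page, page_len, 0);
		spin_unlock_irq(&hugetlb_lock);

		err = prdev->report(prdev, sgl, 1);

		spin_lock_irq(&hugetlb_lock);
		putback_isolate_huge_page(h, page);
		if (!err)
			__SetPageReported(page);
		break;
	}
	spin_unlock_irq(&hugetlb_lock);

	return err;
}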

Liang


Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-22 Thread Liang Li
> > =
> > QEMU use 4K pages, THP is off
> >   round1  round2  round3
> > w/o this patch:23.5s   24.7s   24.6s
> > w/ this patch: 10.2s   10.3s   11.2s
> >
> > QEMU use 4K pages, THP is on
> >   round1  round2  round3
> > w/o this patch:17.9s   14.8s   14.9s
> > w/ this patch: 1.9s1.8s1.9s
> > =
>
> The cost of zeroing pages has to be paid somewhere.  You've successfully
> moved it out of this path that you can measure.  So now you've put it
> somewhere that you're not measuring.  Why is this a win?

Whether it is a win or not depends on its effect. For our case, it solves the
issue that we faced, so it can be thought of as a win for us.
If others don't have the issue we faced, the result will be different;
maybe they will be affected by the side effects of this feature. I think
this is your concern behind the question, right? I will try to do more
tests and provide more benchmark performance data.

> > Speed up kernel routine
> > ===
> > This can’t be guaranteed because we don’t pre zero out all the free pages,
> > but is true for most case. It can help to speed up some important system
> > call just like fork, which will allocate zero pages for building page
> > table. And speed up the process of page fault, especially for huge page
> > fault. The POC of Hugetlb free page pre zero out has been done.
>
> Try kernbench with and without your patch.

OK. Thanks for your suggestion!

Liang


Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-22 Thread Liang Li
https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf
> >
> > and the flowing link is mine:
> > https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf
>
> Thanks for the pointers! I actually did watch your presentation.

You're welcome!  And thanks for your time! :)
> >>
> >>>
> >>> Improve guest performance when use VIRTIO_BALLOON_F_REPORTING for memory
> >>> overcommit. The VIRTIO_BALLOON_F_REPORTING feature will report guest page
> >>> to the VMM, VMM will unmap the corresponding host page for reclaim,
> >>> when guest allocate a page just reclaimed, host will allocate a new page
> >>> and zero it out for guest, in this case pre zero out free page will help
> >>> to speed up the proccess of fault in and reduce the performance impaction.
> >>
> >> Such faults in the VMM are no different to other faults, when first
> >> accessing a page to be populated. Again, I wonder how much of a
> >> difference it actually makes.
> >>
> >
> > I am not just referring to faults in the VMM, I mean the whole process
> > that handles guest page faults.
> > without VIRTIO_BALLOON_F_REPORTING, pages used by guests will be zero
> > out only once by host. With VIRTIO_BALLOON_F_REPORTING, free pages are
> > reclaimed by the host and may return to the host buddy
> > free list. When the pages are given back to the guest, the host kernel
> > needs to zero out it again. It means
> > with VIRTIO_BALLOON_F_REPORTING, guest memory performance will be
> > degraded for frequently
> > zero out operation on host side. The performance degradation will be
> > obvious for huge page case. Free
> > page pre zero out can help to make guest memory performance almost the
> > same as without
> > VIRTIO_BALLOON_F_REPORTING.
>
> Yes, what I am saying is that this fault handling is no different to
> ordinary faults when accessing a virtual memory location the first time
> and populating a page. The only difference is that it happens
> continuously, not only the first time we touch a page.
>
> And we might be able to improve handling in the hypervisor in the
> future. We have been discussing using MADV_FREE instead of MADV_DONTNEED
> in QEMU for handling free page reporting. Then, guest reported pages
> will only get reclaimed by the hypervisor when there is actual memory
> pressure in the hypervisor (e.g., when about to swap). And zeroing a
> page is an obvious improvement over going to swap. The price for zeroing
> pages has to be paid at one point.
>
> Also note that we've been discussing cache-related things already. If
> you zero out before giving the page to the guest, the page will already
> be in the cache - where the guest directly wants to access it.
>

OK, that's very reasonable and much better. Looking forward to your work.

> >>>
> >>> Security
> >>> 
> >>> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> >>> boot options", which zero out page in a asynchronous way. For users can't
> >>> tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings,
> >>> this feauture provide another choice.
> >> "we don’t pre zero out all the free pages" so this is of little actual use.
> >
> > OK. It seems none of the reasons listed above is strong enough for
>
> I was rather saying that for security it's of little use IMHO.
> Application/VM start up time might be improved by using huge pages (and
> pre-zeroing these). Free page reporting might be improved by using
> MADV_FREE instead of MADV_DONTNEED in the hypervisor.
>
> > this feature, above all of them, which one is likely to become the
> > most strong one?  From the implementation, you will find it is
> > configurable, users don't want to use it can turn it off.  This is not
> > an option?
>
> Well, we have to maintain the feature and sacrifice a page flag. For
> example, do we expect someone explicitly enabling the feature just to
> speed up startup time of an app that consumes a lot of memory? I highly
> doubt it.

In our production environment, there are three main applications with such a
requirement: one is QEMU [creating a VM with an SR-IOV passthrough device],
the other two are DPDK-related applications, DPDK OVS and SPDK vhost;
for best performance, they populate memory when starting up. For SPDK vhost,
we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for
vhost 'live' upgrade, which is done by killing the old process and
starting a new one with the new binary. In this case, we want the new
process to start as quickly as possible to shorten the service downtime.
We really enable this feature to speed up startup time for them  :)

> I'd love to hear opinions of other people. (a lot of people are offline
> until beginning of January, including, well, actually me :) )

OK. I will wait some time for others' feedback. Happy holidays!

thanks!

Liang


Re: [RFC PATCH 3/3] mm: support free hugepage pre zero out

2020-12-22 Thread Liang Li
> > Free page reporting in virtio-balloon doesn't give you any guarantees
> > regarding zeroing of pages. Take a look at the QEMU implementation -
> > e.g., with vfio all reports are simply ignored.
> >
> > Also, I am not sure if mangling such details ("zeroing of pages") into
> > the page reporting infrastructure is a good idea.
> >
>
> Oh, now I get what you are doing here, you rely on zero_free_pages of
> your other patch series and are not relying on virtio-balloon free page
> reporting to do the zeroing.
>
> You really should have mentioned that this patch series relies on the
> other one and in which way.

I am sorry for that. After I sent out the patch, I realized I should
mention that, so I sent out an updated version which added the
information you mentioned :)

Thanks !
Liang


Re: [RFC PATCH 2/3] virtio-balloon: add support for providing free huge page reports to host

2020-12-22 Thread Liang Li
On Tue, Dec 22, 2020 at 4:28 PM David Hildenbrand  wrote:
>
> On 22.12.20 08:48, Liang Li wrote:
> > Free page reporting only supports buddy pages, it can't report the
> > free pages reserved for hugetlbfs case. On the other hand, hugetlbfs
>
> The virtio-balloon free page reporting interface accepts a generic sg,
> so it isn't glue to buddy pages. There is no need for a new interface.

OK, then there will be two workers accessing the same vq; we can add a
lock for concurrent access.
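
For example, something like this (a sketch only; 'reporting_lock' would be a
new, hypothetical field in struct virtio_balloon, and the rest follows the
existing virtballoon_free_page_report() flow):

static int virtballoon_report_sg(struct virtio_balloon *vb,
				 struct virtqueue *vq,
				 struct scatterlist *sg, unsigned int nents)
{
	unsigned int unused;
	int err;

	/* Serialize the buddy and hugetlb reporting workers on the vq. */
	mutex_lock(&vb->reporting_lock);
	err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
	if (WARN_ON_ONCE(err)) {
		mutex_unlock(&vb->reporting_lock);
		return err;
	}

	virtqueue_kick(vq);

	/* When the host has read the buffer, this completes via balloon_ack. */
	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
	mutex_unlock(&vb->reporting_lock);

	return 0;
}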

Thanks!

Liang


Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-22 Thread Liang Li
On Tue, Dec 22, 2020 at 4:47 PM David Hildenbrand  wrote:
>
> On 21.12.20 17:25, Liang Li wrote:
> > The first version can be found at: https://lkml.org/lkml/2020/4/12/42
> >
> > Zero out the page content usually happens when allocating pages with
> > the flag of __GFP_ZERO, this is a time consuming operation, it makes
> > the population of a large vma area very slowly. This patch introduce
> > a new feature for zero out pages before page allocation, it can help
> > to speed up page allocation with __GFP_ZERO.
> >
> > My original intention for adding this feature is to shorten VM
> > creation time when SR-IOV devicde is attached, it works good and the
> > VM creation time is reduced by about 90%.
> >
> > Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> > =
> > QEMU use 4K pages, THP is off
> >   round1  round2  round3
> > w/o this patch:23.5s   24.7s   24.6s
> > w/ this patch: 10.2s   10.3s   11.2s
> >
> > QEMU use 4K pages, THP is on
> >   round1  round2  round3
> > w/o this patch:17.9s   14.8s   14.9s
> > w/ this patch: 1.9s1.8s1.9s
> > =
> >
>
> I am still not convinces that we want/need this for this (main) use
> case. Why can't we use huge pages for such use cases (that really care
> about VM creation time) and rather deal with pre-zeroing of huge pages
> instead?
>
> If possible, I'd like to avoid GFP_ZERO (for reasons already discussed).
>

Yes, for VM creation we can simply use hugetlb for that, just like what
I have done in the other series 'mm: support free hugepage pre zero out'.
I sent the v2 because I think VM creation is just one example that can
benefit from this.

> > Obviously, it can do more than this. We can benefit from this feature
> > in the flowing case:
> >
> > Interactive sence
> > =
> > Shorten application lunch time on desktop or mobile phone, it can help
> > to improve the user experience. Test shows on a
> > server [Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz], zero out 1GB RAM by
> > the kernel will take about 200ms, while some mainly used application
> > like Firefox browser, Office will consume 100 ~ 300 MB RAM just after
> > launch, by pre zero out free pages, it means the application launch
> > time will be reduced about 20~60ms (can be visual sensed?). May be
> > we can make use of this feature to speed up the launch of Andorid APP
> > (I didn't do any test for Android).
>
> I am not really sure if you can actually visually sense a difference in
> your examples. Startup time of an application is not just memory
> allocation (page zeroing) time. It would be interesting of much of a
> difference this actually makes in practice. (e.g., firefox startup time
> etc.)

Yes, using Firefox and Office as examples seems unconvincing; maybe a
large game app which consumes several GB of RAM is a better example.

> >
> > Virtulization
> > =
> > Speed up VM creation and shorten guest boot time, especially for PCI
> > SR-IOV device passthrough scenario. Compared with some of the para
> > vitalization solutions, it is easy to deploy because it’s transparent
> > to guest and can handle DMA properly in BIOS stage, while the para
> > virtualization solution can’t handle it well.
>
> What is the "para virtualization" approach you are talking about?

I am referring to two topics from KVM Forum 2020; the docs give more details:
https://static.sched.com/hosted_files/kvmforum2020/48/coIOMMU.pdf
https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf

and the following link is mine:
https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf
>
> >
> > Improve guest performance when use VIRTIO_BALLOON_F_REPORTING for memory
> > overcommit. The VIRTIO_BALLOON_F_REPORTING feature will report guest page
> > to the VMM, VMM will unmap the corresponding host page for reclaim,
> > when guest allocate a page just reclaimed, host will allocate a new page
> > and zero it out for guest, in this case pre zero out free page will help
> > to speed up the proccess of fault in and reduce the performance impaction.
>
> Such faults in the VMM are no different to other faults, when first
> accessing a page to be populated. Again, I wonder how much of a
> difference it actually makes.
>

I am not just referring to faults in the VMM, 

[RFC PATCH 0/3 updated] add support for free hugepage reporting

2020-12-22 Thread Liang Li
A typical usage of hugetlbfs is to reserve an amount of memory during
kernel boot, and the reserved pages are unlikely to return to the
buddy system. When an application needs hugepages, the kernel will allocate
them from the reserved pool. When the application terminates, the huge pages
return to the reserved pool and are kept in the free list for
hugetlb; these free pages will not return to the buddy freelist unless
the size of the reserved pool is changed.
Free page reporting only supports buddy pages; it can't report the
free pages reserved for hugetlbfs. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.
This patch set adds support for reporting hugepages in the free list
of hugetlb. It can be used by the virtio_balloon driver for memory
overcommit and for pre-zeroing free pages to speed up memory
population and page fault handling.

Most of the code is 'copied' from free page reporting because they
work in the same way. So the code can be refined to remove
the duplicated code. Since this is an RFC, I didn't do that.

For the virtio_balloon driver, changes to the virtio spec are needed.
Before that, I need the feedback of the community about this new feature.

This RFC is based on my previous series:
  '[RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO'

Liang Li (3):
  mm: support hugetlb free page reporting
  virtio-balloon: add support for providing free huge page reports to
host
  mm: support free hugepage pre zero out

 drivers/virtio/virtio_balloon.c |  61 ++
 include/linux/hugetlb.h |   3 +
 include/linux/page_reporting.h  |   5 +
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/hugetlb.c|  29 +++
 mm/page_prezero.c   |  17 ++
 mm/page_reporting.c | 287 
 mm/page_reporting.h |  34 
 8 files changed, 437 insertions(+)

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
-- 
2.18.2



[RFC PATCH 3/3] mm: support free hugepage pre zero out

2020-12-21 Thread Liang Li
This patch adds support for pre-zeroing free hugepages; we can use
this feature to speed up page population and page fault handling.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 mm/page_prezero.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/mm/page_prezero.c b/mm/page_prezero.c
index c8ce720bfc54..dff4e0adf402 100644
--- a/mm/page_prezero.c
+++ b/mm/page_prezero.c
@@ -26,6 +26,7 @@ static unsigned long delay_millisecs = 1000;
 static unsigned long zeropage_enable __read_mostly;
 static DEFINE_MUTEX(kzeropaged_mutex);
 static struct page_reporting_dev_info zero_page_dev_info;
+static struct page_reporting_dev_info zero_hugepage_dev_info;
 
 inline void clear_zero_page_flag(struct page *page, int order)
 {
@@ -69,9 +70,17 @@ static int start_kzeropaged(void)
zero_page_dev_info.delay_jiffies = 
msecs_to_jiffies(delay_millisecs);
 
err = page_reporting_register(&zero_page_dev_info);
+
+   zero_hugepage_dev_info.report = zero_free_pages;
+   zero_hugepage_dev_info.mini_order = mini_page_order;
+   zero_hugepage_dev_info.batch_size = batch_size;
+   zero_hugepage_dev_info.delay_jiffies = 
msecs_to_jiffies(delay_millisecs);
+
+   err |= hugepage_reporting_register(&zero_hugepage_dev_info);
pr_info("Zero page enabled\n");
} else {
page_reporting_unregister(&zero_page_dev_info);
+   hugepage_reporting_unregister(&zero_hugepage_dev_info);
pr_info("Zero page disabled\n");
}
 
@@ -90,7 +99,15 @@ static int restart_kzeropaged(void)
zero_page_dev_info.batch_size = batch_size;
zero_page_dev_info.delay_jiffies = 
msecs_to_jiffies(delay_millisecs);
 
+   hugepage_reporting_unregister(&zero_hugepage_dev_info);
+
+   zero_hugepage_dev_info.report = zero_free_pages;
+   zero_hugepage_dev_info.mini_order = mini_page_order;
+   zero_hugepage_dev_info.batch_size = batch_size;
+   zero_hugepage_dev_info.delay_jiffies = 
msecs_to_jiffies(delay_millisecs);
+
err = page_reporting_register(&zero_page_dev_info);
+   err |= hugepage_reporting_register(&zero_hugepage_dev_info);
pr_info("Zero page enabled\n");
}
 
-- 
2.18.2



[RFC PATCH 2/3] virtio-balloon: add support for providing free huge page reports to host

2020-12-21 Thread Liang Li
Free page reporting only supports buddy pages; it can't report the
free pages reserved for the hugetlbfs case. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.  This patch adds support for reporting free hugepages to the
host when the guest uses hugetlbfs.
A new feature bit and a new vq are added for this new feature.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 drivers/virtio/virtio_balloon.c | 61 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 62 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index a298517079bb..61363dfd3c2d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -52,6 +52,7 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
VIRTIO_BALLOON_VQ_REPORTING,
+   VIRTIO_BALLOON_VQ_HPG_REPORTING,
VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -126,6 +127,10 @@ struct virtio_balloon {
/* Free page reporting device */
struct virtqueue *reporting_vq;
struct page_reporting_dev_info pr_dev_info;
+
+   /* Free hugepage reporting device */
+   struct virtqueue *hpg_reporting_vq;
+   struct page_reporting_dev_info hpr_dev_info;
 };
 
 static const struct virtio_device_id id_table[] = {
@@ -192,6 +197,33 @@ static int virtballoon_free_page_report(struct 
page_reporting_dev_info *pr_dev_i
return 0;
 }
 
+static int virtballoon_free_hugepage_report(struct page_reporting_dev_info 
*hpr_dev_info,
+  struct scatterlist *sg, unsigned int nents)
+{
+   struct virtio_balloon *vb =
+   container_of(hpr_dev_info, struct virtio_balloon, hpr_dev_info);
+   struct virtqueue *vq = vb->hpg_reporting_vq;
+   unsigned int unused, err;
+
+   /* We should always be able to add these buffers to an empty queue. */
+   err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
+
+   /*
+* In the extremely unlikely case that something has occurred and we
+* are able to trigger an error we will simply display a warning
+* and exit without actually processing the pages.
+*/
+   if (WARN_ON_ONCE(err))
+   return err;
+
+   virtqueue_kick(vq);
+
+   /* When host has read buffer, this completes via balloon_ack */
+   wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+
+   return 0;
+}
+
 static void set_page_pfns(struct virtio_balloon *vb,
  __virtio32 pfns[], struct page *page)
 {
@@ -515,6 +547,7 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
+   names[VIRTIO_BALLOON_VQ_HPG_REPORTING] = NULL;
 
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -531,6 +564,11 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
}
 
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HPG_REPORTING)) {
+   names[VIRTIO_BALLOON_VQ_HPG_REPORTING] = "hpg_reporting_vq";
+   callbacks[VIRTIO_BALLOON_VQ_HPG_REPORTING] = balloon_ack;
+   }
+
err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 vqs, callbacks, names, NULL, NULL);
if (err)
@@ -566,6 +604,8 @@ static int init_vqs(struct virtio_balloon *vb)
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
 
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HPG_REPORTING))
+   vb->hpg_reporting_vq = vqs[VIRTIO_BALLOON_VQ_HPG_REPORTING];
return 0;
 }
 
@@ -1001,6 +1041,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_unregister_oom;
}
 
+   vb->hpr_dev_info.report = virtballoon_free_hugepage_report;
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HPG_REPORTING)) {
+   unsigned int capacity;
+
+   capacity = virtqueue_get_vring_size(vb->hpg_reporting_vq);
+   if (capacity < PAGE_REPORTING_CAPACITY) {
+   err = -ENOSPC;
+   goto out_unregister_oom;
+   }
+
+   vb->hpr_dev_info.mini_order = 0;
+   vb-

[RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-21 Thread Liang Li
Free page reporting only supports buddy pages; it can't report the
free pages reserved for the hugetlbfs case. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.
This patch adds support for reporting hugepages in the free list
of hugetlb. It can be used by the virtio_balloon driver for memory
overcommit and for pre-zeroing free pages to speed up memory population.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 include/linux/hugetlb.h|   3 +
 include/linux/page_reporting.h |   5 +
 mm/hugetlb.c   |  29 
 mm/page_reporting.c| 287 +
 mm/page_reporting.h|  34 
 5 files changed, 358 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..a72ad25501d3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct ctl_table;
 struct user_struct;
@@ -114,6 +115,8 @@ int hugetlb_treat_movable_handler(struct ctl_table *, int, 
void *, size_t *,
 int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
 
+bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid);
+void putback_isolate_huge_page(struct hstate *h, struct page *page);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct 
vm_area_struct *);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 struct page **, struct vm_area_struct **,
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 63e1e9fbcaa2..0da3d1a6f0cc 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -7,6 +7,7 @@
 
 /* This value should always be a power of 2, see page_reporting_cycle() */
#define PAGE_REPORTING_CAPACITY    32
+#define HUGEPAGE_REPORTING_CAPACITY    1
 
 struct page_reporting_dev_info {
/* function that alters pages to make them "reported" */
@@ -26,4 +27,8 @@ struct page_reporting_dev_info {
 /* Tear-down and bring-up for page reporting devices */
 void page_reporting_unregister(struct page_reporting_dev_info *prdev);
 int page_reporting_register(struct page_reporting_dev_info *prdev);
+
+/* Tear-down and bring-up for hugepage reporting devices */
+void hugepage_reporting_unregister(struct page_reporting_dev_info *prdev);
+int hugepage_reporting_register(struct page_reporting_dev_info *prdev);
 #endif /*_LINUX_PAGE_REPORTING_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cbf32d2824fd..de6ce147dfe2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include "page_reporting.h"
 #include "internal.h"
 
 int hugetlb_max_hstate __read_mostly;
@@ -1028,6 +1029,11 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
+   if (hugepage_reported(page)) {
+   __ClearPageReported(page);
+   pr_info("%s, free_huge_pages=%ld\n", __func__, 
h->free_huge_pages);
+   }
+   hugepage_reporting_notify_free(h->order);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -5531,6 +5537,29 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long 
address, pgd_t *pgd, int fla
return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
PAGE_SHIFT);
 }
 
+bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
+{
+   bool ret = true;
+
+   VM_BUG_ON_PAGE(!PageHead(page), page);
+
+   list_move(&page->lru, &h->hugepage_activelist);
+   set_page_refcounted(page);
+   h->free_huge_pages--;
+   h->free_huge_pages_node[nid]--;
+
+   return ret;
+}
+
+void putback_isolate_huge_page(struct hstate *h, struct page *page)
+{
+   int nid = page_to_nid(page);
+   pr_info("%s, free_huge_pages=%ld\n", __func__, h->free_huge_pages);
+   list_move(&page->lru, &h->hugepage_freelists[nid]);
+   h->free_huge_pages++;
+   h->free_huge_pages_node[nid]++;
+}
+
 bool isolate_huge_page(struct page *page, struct list_head *list)
 {
bool ret = true;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 20ec3fb1afc4..15d4b5372df8 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "page_reporting.h"
 #include "internal.h"
@@ -16,6 +17

[RFC PATCH 0/3] add support for free hugepage reporting

2020-12-21 Thread Liang Li
A typical usage of hugetlbfs is to reserve an amount of memory during
the kernel boot stage, and the reserved pages are unlikely
to return to the buddy system. When an application needs hugepages, the kernel
will allocate them from the reserved pool. When the application terminates,
the huge pages return to the reserved pool and are kept in the free
list for hugetlb; these free pages will not return to the buddy freelist
unless the size of the reserved pool is changed.
Free page reporting only supports buddy pages; it can't report the
free pages reserved for hugetlbfs. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.
This patch set adds support for reporting hugepages in the free list
of hugetlb. It can be used by the virtio_balloon driver for memory
overcommit and for pre-zeroing free pages to speed up memory
population and page fault handling.

Most of the code is 'copied' from free page reporting because they
work in the same way. So the code can be refined to remove
the duplicated code. Since this is an RFC, I didn't do that.

For the virtio_balloon driver, changes to the virtio spec are needed.
Before that, I need the feedback of the community about this new feature.

Liang Li (3):
  mm: support hugetlb free page reporting
  virtio-balloon: add support for providing free huge page reports to
host
  mm: support free hugepage pre zero out

 drivers/virtio/virtio_balloon.c |  61 ++
 include/linux/hugetlb.h |   3 +
 include/linux/page_reporting.h  |   5 +
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/hugetlb.c|  29 +++
 mm/page_prezero.c   |  17 ++
 mm/page_reporting.c | 287 
 mm/page_reporting.h |  34 
 8 files changed, 437 insertions(+)

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
-- 
2.18.2



[RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-21 Thread Liang Li
The first version can be found at: https://lkml.org/lkml/2020/4/12/42

Zeroing out the page content usually happens when allocating pages with
the __GFP_ZERO flag; this is a time-consuming operation, and it makes
the population of a large vma area very slow. This patch introduces
a new feature for zeroing out pages before page allocation, which can help
to speed up page allocation with __GFP_ZERO.

My original intention for adding this feature was to shorten VM
creation time when an SR-IOV device is attached; it works well and the
VM creation time is reduced by about 90%.

Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
=
QEMU use 4K pages, THP is off
  round1  round2  round3
w/o this patch:23.5s   24.7s   24.6s
w/ this patch: 10.2s   10.3s   11.2s

QEMU use 4K pages, THP is on
  round1  round2  round3
w/o this patch:17.9s   14.8s   14.9s
w/ this patch: 1.9s1.8s1.9s
=

Obviously, it can do more than this. We can benefit from this feature
in the following cases:

Interactive scenario
=
Shorten application launch time on desktops or mobile phones; it can help
to improve the user experience. Tests show that on a
server [Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz], zeroing out 1GB of RAM in
the kernel takes about 200ms, while some commonly used applications
like the Firefox browser or Office consume 100 ~ 300 MB of RAM just after
launch, so by pre-zeroing free pages the application launch
time would be reduced by about 20~60ms (can that be visually sensed?). Maybe
we can make use of this feature to speed up the launch of Android apps
(I didn't do any test for Android).

Virtualization
=
Speed up VM creation and shorten guest boot time, especially for the PCI
SR-IOV device passthrough scenario. Compared with some of the
para-virtualization solutions, it is easy to deploy because it's transparent
to the guest and can handle DMA properly in the BIOS stage, while the
para-virtualization solutions can't handle it well.

Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for memory
overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest pages
to the VMM, and the VMM unmaps the corresponding host pages for reclaim;
when the guest allocates a page that was just reclaimed, the host will allocate
a new page and zero it out for the guest. In this case, pre-zeroing free pages
helps to speed up the fault-in process and reduce the performance impact.

Speed up kernel routines
===
This can't be guaranteed because we don't pre-zero all the free pages,
but it is true for most cases. It can help to speed up some important system
calls, such as fork, which allocates zero pages for building page
tables, and speed up the page fault path, especially for huge page
faults. A POC of hugetlb free page pre-zeroing has been done.

Security

This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
boot options", which zeroes out pages in an asynchronous way. For users who
can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1' brings,
this feature provides another choice.

From the feedback on the first version, cache pollution is the main concern
of the mm folks. On the other hand, this feature is really helpful for
some use cases, so maybe we should let the user decide whether to use it.
A switch is added in /sys; users who don't like the feature can turn
off the switch, or configure a large batch size to reduce cache
pollution.
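
As a rough sketch of what the batch size is meant to do (the names and the
exact hook are illustrative; the real gating lives in the page reporting
path): freed pages are accumulated and the worker is only kicked once at
least batch_size bytes are pending, so the zeroing runs in tunable bursts
instead of on every free.

/*
 * Illustrative only: accumulate freed bytes and kick the reporting /
 * zeroing worker once page_report_batch_size is reached.  Locking and
 * per-zone accounting are omitted; the actual hook in this series may
 * differ.
 */
static unsigned long pending_bytes;

static void page_report_batch_notify(struct page_reporting_dev_info *prdev,
				     unsigned int order)
{
	pending_bytes += PAGE_SIZE << order;
	if (pending_bytes < page_report_batch_size)
		return;

	pending_bytes = 0;
	__page_reporting_request(prdev);	/* schedules the delayed worker */
}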

To make the whole feature work, support for pre-zeroing free huge pages
should be added for hugetlbfs; I will send another patch for it.

Liang Li (4):
  mm: let user decide page reporting option
  mm: pre zero out free pages to speed up page allocation for __GFP_ZERO
  mm: make page reporing worker works better for low order page
  mm: Add batch size for free page reporting

 drivers/virtio/virtio_balloon.c |   3 +
 include/linux/highmem.h |  31 +++-
 include/linux/page-flags.h  |  16 +-
 include/linux/page_reporting.h  |   3 +
 include/trace/events/mmflags.h  |   7 +
 mm/Kconfig  |  10 ++
 mm/Makefile |   1 +
 mm/huge_memory.c|   3 +-
 mm/page_alloc.c |   4 +
 mm/page_prezero.c   | 266 
 mm/page_prezero.h   |  13 ++
 mm/page_reporting.c |  49 +-
 mm/page_reporting.h |  16 +-
 13 files changed, 405 insertions(+), 17 deletions(-)
 create mode 100644 mm/page_prezero.c
 create mode 100644 mm/page_prezero.h

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Signed-off-by: Liang Li 
-- 
2.18.2



[RFC v2 PATCH 3/4] mm: let user decide page reporting option

2020-12-21 Thread Liang Li
Some key parameters for page reporting are currently hard-coded; different
users of the framework may have their own special requirements. Make
these parameters configurable and let the user decide them.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Signed-off-by: Liang Li 
---
 drivers/virtio/virtio_balloon.c |  3 +++
 include/linux/page_reporting.h  |  3 +++
 mm/page_reporting.c | 18 ++
 mm/page_reporting.h |  6 +++---
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8985fc2cea86..a298517079bb 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -993,6 +993,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_unregister_oom;
}
 
+   vb->pr_dev_info.mini_order = 6;
+   vb->pr_dev_info.batch_size = 32 * 1024 * 1024; /* 32M */
+   vb->pr_dev_info.delay_jiffies = 2 * HZ; /* 2 seconds */
err = page_reporting_register(&vb->pr_dev_info);
if (err)
goto out_unregister_oom;
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 3b99e0ec24f2..63e1e9fbcaa2 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -13,6 +13,9 @@ struct page_reporting_dev_info {
int (*report)(struct page_reporting_dev_info *prdev,
  struct scatterlist *sg, unsigned int nents);
 
+   unsigned long batch_size;
+   unsigned long delay_jiffies;
+   int mini_order;
/* work struct for processing reports */
struct delayed_work work;
 
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 2f8e3d032fab..20ec3fb1afc4 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -11,12 +11,10 @@
 #include "page_reporting.h"
 #include "internal.h"
 
-#define PAGE_REPORTING_DELAY   (2 * HZ)
 #define MAX_SCAN_NUM 1024
-
-unsigned long page_report_batch_size  __read_mostly = 4 * 1024 * 1024UL;
-
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
+int page_report_mini_order = pageblock_order;
+unsigned long page_report_batch_size = 32 * 1024 * 1024;
 
 enum {
PAGE_REPORTING_IDLE = 0,
@@ -48,7 +46,7 @@ __page_reporting_request(struct page_reporting_dev_info 
*prdev)
 * now we are limiting this to running no more than once every
 * couple of seconds.
 */
-   schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+   schedule_delayed_work(&prdev->work, prdev->delay_jiffies);
 }
 
 /* notify prdev of free page reporting request */
@@ -260,7 +258,7 @@ page_reporting_process_zone(struct page_reporting_dev_info 
*prdev,
 
/* Generate minimum watermark to be able to guarantee progress */
watermark = low_wmark_pages(zone) +
-   (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
+   (PAGE_REPORTING_CAPACITY << prdev->mini_order);
 
/*
 * Cancel request if insufficient free memory or if we failed
@@ -270,7 +268,7 @@ page_reporting_process_zone(struct page_reporting_dev_info 
*prdev,
return err;
 
/* Process each free list starting from lowest order/mt */
-   for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+   for (order = prdev->mini_order; order < MAX_ORDER; order++) {
for (mt = 0; mt < MIGRATE_TYPES; mt++) {
/* We do not pull pages from the isolate free list */
if (is_migrate_isolate(mt))
@@ -337,7 +335,7 @@ static void page_reporting_process(struct work_struct *work)
 */
	state = atomic_cmpxchg(&prdev->state, state, PAGE_REPORTING_IDLE);
if (state == PAGE_REPORTING_REQUESTED)
-   schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+   schedule_delayed_work(&prdev->work, prdev->delay_jiffies);
 }
 
 static DEFINE_MUTEX(page_reporting_mutex);
@@ -365,6 +363,8 @@ int page_reporting_register(struct page_reporting_dev_info 
*prdev)
/* Assign device to allow notifications */
rcu_assign_pointer(pr_dev_info, prdev);
 
+   page_report_mini_order = prdev->mini_order;
+   page_report_batch_size = prdev->batch_size;
/* enable page reporting notification */
if (!static_key_enabled(_reporting_enabled)) {
static_branch_enable(_reporting_enabled);
@@ -382,6 +382,8 @@ void page_reporting_unregister(struct 
page_reporting_dev_info *prdev)
	mutex_lock(&page_reporting_mutex);
 
if (rcu_access_pointer(pr_dev_info) == prdev) {
+   if (static_key_enabled(_reporting_enabled))
+ 

[RFC v2 PATCH 4/4] mm: pre zero out free pages to speed up page allocation for __GFP_ZERO

2020-12-21 Thread Liang Li
Zeroing out the page content usually happens when allocating pages with
the __GFP_ZERO flag. It is a time consuming operation and makes the
population of a large vma area very slow. This patch introduces a new
feature that zeroes out pages before page allocation, which can help to
speed up page allocation with __GFP_ZERO.

My original intention for adding this feature was to shorten VM
creation time when an SR-IOV device is attached; it works well and the
VM creation time is reduced by about 90%.

Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
=====================================================
QEMU uses 4K pages, THP is off
                 round1  round2  round3
w/o this patch:   23.5s   24.7s   24.6s
w/  this patch:   10.2s   10.3s   11.2s

QEMU uses 4K pages, THP is on
                 round1  round2  round3
w/o this patch:   17.9s   14.8s   14.9s
w/  this patch:    1.9s    1.8s    1.9s
=====================================================

Obviously, it can do more than this. We can benefit from this feature
in the following cases:

Interactive scenarios
=====================
Shortening application launch time on desktops or mobile phones helps to
improve the user experience. Tests show that on a
server [Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz], zeroing out 1GB of RAM
by the kernel takes about 200ms, while some commonly used applications
such as the Firefox browser or Office consume 100 ~ 300 MB of RAM just
after launch. By pre zeroing out free pages, the application launch time
can be reduced by about 20~60ms (can that be sensed by people?). Maybe we
can make use of this feature to speed up the launch of Android apps
(I didn't do any tests for Android).
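
As a rough sanity check of the ~200ms/GB figure above, a minimal userspace
sketch (this times a plain memset, not the kernel's clear_page() path, so
treat the result as a ballpark only):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
	size_t size = 1UL << 30;	/* 1 GiB */
	char *buf = malloc(size);
	struct timespec t0, t1;

	if (!buf)
		return 1;
	memset(buf, 0xff, size);	/* fault the pages in first */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(buf, 0, size);		/* the zeroing cost we care about */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("zeroing 1GiB took %.1f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	free(buf);
	return 0;
}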

Virtualization
==============
Speed up VM creation and shorten guest boot time, especially for the PCI
SR-IOV device passthrough scenario. Compared with some of the
para-virtualization solutions, it is easy to deploy because it's
transparent to the guest and can handle DMA properly in the BIOS stage,
while the para-virtualization solutions can't handle that well.

Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for memory
overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest pages to
the VMM, and the VMM unmaps the corresponding host pages for reclaim. When
the guest allocates a page that was just reclaimed, the host has to allocate
a new page and zero it out for the guest; in this case, pre zeroing out free
pages helps to speed up the fault-in process and reduce the performance
impact.

Speed up kernel routines
========================
This can't be guaranteed because we don't pre zero out all the free pages,
but it is true in most cases. It can help to speed up some important system
calls, such as fork, which allocates zeroed pages for building page tables,
and to speed up the page fault path, especially for huge page faults. A POC
of hugetlb free page pre zero out has been done.

Security

This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
boot options", which zeroes out pages in an asynchronous way. For users who
can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1' brings,
this feature provides another choice.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Signed-off-by: Liang Li 
---
 include/linux/highmem.h|  31 +++-
 include/linux/page-flags.h |  16 +-
 include/trace/events/mmflags.h |   7 +
 mm/Kconfig |  10 ++
 mm/Makefile|   1 +
 mm/huge_memory.c   |   3 +-
 mm/page_alloc.c|   4 +
 mm/page_prezero.c  | 266 +
 mm/page_prezero.h  |  13 ++
 9 files changed, 346 insertions(+), 5 deletions(-)
 create mode 100644 mm/page_prezero.c
 create mode 100644 mm/page_prezero.h

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index d2c70d3772a3..badb5d0528f7 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -146,7 +146,13 @@ static inline void invalidate_kernel_vmap_range(void 
*vaddr, int size)
 #ifndef clear_user_highpage
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
-   void *addr = kmap_atomic(page);
+   void *addr;
+
+#ifdef CONFIG_PREZERO_PAGE
+   if (TestClearPageZero(page))
+   return;
+#endif
+   addr = kmap_atomic(page);
clear_user_page(addr, vaddr, page);
kunmap_atomic(addr);
 }
@@ -197,9 +203,30 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct 
*vma,
return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
 }
 
+#ifdef CONFIG_PREZERO_PAGE
+static inline void __clear_highpage(struct page *page)
+{
+   void *kaddr;
+
+   if (PageZero(page))
+   return;
+
+   kaddr = kmap_atomic(page);
+   clear_page(kaddr);
+   SetPag

[RFC v2 PATCH 2/4] mm: Add batch size for free page reporting

2020-12-21 Thread Liang Li
Using the page order as the only threshold for page reporting is not
flexible and has some flaws.
When the system's memory becomes very fragmented, there will be a lot of
low order free pages but very few high order pages; limiting the minimum
order to pageblock_order will prevent most of the free pages from being
reclaimed by the host in the case of page reclaim through the
virtio-balloon driver.
Scanning a long free list is not cheap; it's better to wake up the page
reporting worker when there are more pages, since waking it up for a
single page may not be worth it.
This patch adds a batch size as another threshold to control the waking
up of the reporting worker.
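
Restating the added check as a standalone illustration (see the
page_reporting_notify_free() hunk below): freed bytes are accumulated per
call and the worker is only notified once the configured batch size has
been crossed. With the default 4MB batch, freeing order-0 pages notifies
once per 1024 frees, while a single order-10 free is enough by itself.

/* Illustration only, mirroring the batching check added below;
 * not a drop-in replacement for the kernel code. */
#define PAGE_SHIFT 12

static unsigned long page_report_batch_size = 4 * 1024 * 1024UL;
static long batch_size;	/* bytes freed since the last notification */

static int should_notify(unsigned int order)
{
	batch_size += (1UL << order) << PAGE_SHIFT;
	if (batch_size >= page_report_batch_size) {
		batch_size = 0;
		return 1;	/* caller would invoke __page_reporting_notify() */
	}
	return 0;
}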

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Signed-off-by: Liang Li 
---
 mm/page_reporting.c |  2 ++
 mm/page_reporting.h | 12 ++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 0b22db94ce2a..2f8e3d032fab 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -14,6 +14,8 @@
 #define PAGE_REPORTING_DELAY   (2 * HZ)
 #define MAX_SCAN_NUM 1024
 
+unsigned long page_report_batch_size  __read_mostly = 4 * 1024 * 1024UL;
+
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
 
 enum {
diff --git a/mm/page_reporting.h b/mm/page_reporting.h
index 2c385dd4ddbd..b8fb3bbb345f 100644
--- a/mm/page_reporting.h
+++ b/mm/page_reporting.h
@@ -12,6 +12,8 @@
 
 #define PAGE_REPORTING_MIN_ORDER   pageblock_order
 
+extern unsigned long page_report_batch_size;
+
 #ifdef CONFIG_PAGE_REPORTING
 DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
 void __page_reporting_notify(void);
@@ -33,6 +35,8 @@ static inline bool page_reported(struct page *page)
  */
 static inline void page_reporting_notify_free(unsigned int order)
 {
+   static long batch_size;
+
/* Called from hot path in __free_one_page() */
	if (!static_branch_unlikely(&page_reporting_enabled))
return;
@@ -41,8 +45,12 @@ static inline void page_reporting_notify_free(unsigned int 
order)
if (order < PAGE_REPORTING_MIN_ORDER)
return;
 
-   /* This will add a few cycles, but should be called infrequently */
-   __page_reporting_notify();
+   batch_size += (1 << order) << PAGE_SHIFT;
+   if (batch_size >= page_report_batch_size) {
+   batch_size = 0;
+   /* This add a few cycles, but should be called infrequently */
+   __page_reporting_notify();
+   }
 }
 #else /* CONFIG_PAGE_REPORTING */
 #define page_reported(_page)   false
-- 
2.18.2



[RFC v2 PATCH 1/4] mm: make page reporting worker work better for low order pages

2020-12-21 Thread Liang Li
'page_reporting_cycle' may hold the zone lock for too long when scanning
a low order free page list, because the low order free list may have a
lot of items and very few pages without the PG_reported flag.
The current implementation limits the minimum reported order to
pageblock_order, which is OK for most cases. If we want to report low
order pages, we should prevent the zone lock from being held too long by
the reporting worker, or it may affect system performance. This patch
tries to make 'page_reporting_cycle' work better for low order pages:
the zone lock is released periodically and the CPU is yielded voluntarily
if needed.
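
The shape of the change, restated outside the kernel as a sketch: bound how
many list entries are examined per lock hold (MAX_SCAN_NUM in the patch) and
let an outer loop yield between chunks, instead of walking the whole free
list in one critical section.

/* Userspace illustration of the pattern only, not the kernel code. */
#include <sched.h>
#include <stddef.h>

#define MAX_SCAN_NUM 1024

/* Examine at most MAX_SCAN_NUM entries starting at 'start'; in the
 * kernel this part runs under the zone lock. Returns the next index. */
static size_t scan_chunk(const int *items, size_t start, size_t total)
{
	size_t i = start, scanned = 0;

	while (i < total && scanned < MAX_SCAN_NUM) {
		/* ... skip already-reported entries, batch up the rest ... */
		i++;
		scanned++;
	}
	return i;
}

static void scan_all(const int *items, size_t total)
{
	size_t pos = 0;

	while (pos < total) {
		pos = scan_chunk(items, pos, total);	/* "lock" held here */
		sched_yield();	/* stand-in for cond_resched() between chunks */
	}
}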

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Signed-off-by: Liang Li 

---
 mm/page_reporting.c | 35 ---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index cd8e13d41df4..0b22db94ce2a 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -6,11 +6,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "page_reporting.h"
 #include "internal.h"
 
 #define PAGE_REPORTING_DELAY   (2 * HZ)
+#define MAX_SCAN_NUM 1024
+
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
 
 enum {
@@ -115,7 +118,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, 
struct zone *zone,
unsigned int page_len = PAGE_SIZE << order;
struct page *page, *next;
long budget;
-   int err = 0;
+   int err = 0, scan_cnt = 0;
 
/*
 * Perform early check, if free area is empty there is
@@ -145,8 +148,14 @@ page_reporting_cycle(struct page_reporting_dev_info 
*prdev, struct zone *zone,
/* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
/* We are going to skip over the reported pages. */
-   if (PageReported(page))
+   if (PageReported(page)) {
+   if (++scan_cnt >= MAX_SCAN_NUM) {
+   err = scan_cnt;
+   break;
+   }
continue;
+   }
+
 
/*
 * If we fully consumed our budget then update our
@@ -219,6 +228,26 @@ page_reporting_cycle(struct page_reporting_dev_info 
*prdev, struct zone *zone,
return err;
 }
 
+static int
+reporting_order_type(struct page_reporting_dev_info *prdev, struct zone *zone,
+unsigned int order, unsigned int mt,
+struct scatterlist *sgl, unsigned int *offset)
+{
+   int ret = 0;
+   unsigned long total = 0;
+
+   might_sleep();
+   do {
+   cond_resched();
+   ret = page_reporting_cycle(prdev, zone, order, mt,
+  sgl, offset);
+   if (ret > 0)
+   total += ret;
+   } while (ret > 0 && total < zone->free_area[order].nr_free);
+
+   return ret;
+}
+
 static int
 page_reporting_process_zone(struct page_reporting_dev_info *prdev,
struct scatterlist *sgl, struct zone *zone)
@@ -245,7 +274,7 @@ page_reporting_process_zone(struct page_reporting_dev_info 
*prdev,
if (is_migrate_isolate(mt))
continue;
 
-   err = page_reporting_cycle(prdev, zone, order, mt,
+   err = reporting_order_type(prdev, zone, order, mt,
   sgl, &offset);
if (err)
return err;
-- 
2.18.2



[PATCH RFC 4/4] VMX: Expose the LA57 feature to VM

2016-12-29 Thread Liang Li
This patch exposes the 5 level page table feature to the VM. At the
same time, the canonical virtual address checking is extended to
support both 48-bit and 57-bit address widths, which is the
prerequisite for supporting 5 level paging guests.
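
A small userspace illustration (not part of the patch) of the generalized
canonical check: sign-extend from the top implemented virtual-address bit
and compare with the original value. For example, 0x00ff800000000000 (bit 47
set, bits 56-63 clear) is non-canonical with 48-bit addressing but canonical
with LA57's 57-bit addressing.

#include <stdint.h>
#include <stdio.h>

static uint64_t get_canonical(uint64_t la, unsigned int vaddr_bits)
{
	unsigned int shift = 64 - vaddr_bits;

	/* sign-extend from bit (vaddr_bits - 1), as the patch does */
	return (uint64_t)((int64_t)(la << shift) >> shift);
}

static int is_noncanonical(uint64_t la, unsigned int vaddr_bits)
{
	return get_canonical(la, vaddr_bits) != la;
}

int main(void)
{
	uint64_t la = 0x00ff800000000000ULL;

	printf("48-bit: %s\n", is_noncanonical(la, 48) ? "non-canonical" : "canonical");
	printf("57-bit: %s\n", is_noncanonical(la, 57) ? "non-canonical" : "canonical");
	return 0;
}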

Signed-off-by: Liang Li 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Kirill A. Shutemov 
Cc: Dave Hansen 
Cc: Xiao Guangrong 
Cc: Paolo Bonzini 
Cc: "Radim Kr??m" 
---
 arch/x86/include/asm/kvm_host.h | 12 ++--
 arch/x86/kvm/cpuid.c| 15 +--
 arch/x86/kvm/emulate.c  | 15 ++-
 arch/x86/kvm/kvm_cache_regs.h   |  7 ++-
 arch/x86/kvm/vmx.c  |  4 ++--
 arch/x86/kvm/x86.c  |  8 ++--
 6 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e505dac..57850b3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -86,8 +86,8 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | 
X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP \
- | X86_CR4_PKE))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
+ | X86_CR4_SMAP | X86_CR4_PKE))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
@@ -1269,15 +1269,15 @@ static inline void kvm_inject_gp(struct kvm_vcpu *vcpu, 
u32 error_code)
kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
 }
 
-static inline u64 get_canonical(u64 la)
+static inline u64 get_canonical(u64 la, u8 vaddr_bits)
 {
-   return ((int64_t)la << 16) >> 16;
+   return ((int64_t)la << (64 - vaddr_bits)) >> (64 - vaddr_bits);
 }
 
-static inline bool is_noncanonical_address(u64 la)
+static inline bool is_noncanonical_address(u64 la, u8 vaddr_bits)
 {
 #ifdef CONFIG_X86_64
-   return get_canonical(la) != la;
+   return get_canonical(la, vaddr_bits) != la;
 #else
return false;
 #endif
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index e85f6bd..69e8c1a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -126,13 +126,16 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
kvm_x86_ops->fpu_activate(vcpu);
 
/*
-* The existing code assumes virtual address is 48-bit in the canonical
-* address checks; exit if it is ever changed.
+* The existing code assumes virtual address is 48-bit or 57-bit in the
+* canonical address checks; exit if it is ever changed.
 */
best = kvm_find_cpuid_entry(vcpu, 0x8008, 0);
-   if (best && ((best->eax & 0xff00) >> 8) != 48 &&
-   ((best->eax & 0xff00) >> 8) != 0)
-   return -EINVAL;
+   if (best) {
+   int vaddr_bits = (best->eax & 0xff00) >> 8;
+
+   if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
+   return -EINVAL;
+   }
 
/* Update physical-address width */
vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
@@ -383,7 +386,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
 
/* cpuid 7.0.ecx*/
const u32 kvm_cpuid_7_0_ecx_x86_features =
-   F(AVX512VBMI) | F(PKU) | 0 /*OSPKE*/;
+   F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/;
 
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 56628a4..da01dd7 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -676,6 +676,11 @@ static unsigned insn_alignment(struct x86_emulate_ctxt 
*ctxt, unsigned size)
}
 }
 
+static __always_inline u8 virt_addr_bits(struct x86_emulate_ctxt *ctxt)
+{
+   return (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_LA57) ? 57 : 48;
+}
+
 static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
   struct segmented_address addr,
   unsigned *max_size, unsigned size,
@@ -693,7 +698,7 @@ static __always_inline int __linearize(struct 
x86_emulate_ctxt *ctxt,
switch (mode) {
case X86EMUL_MODE_PROT64:
*linear = la;
-   if (is_noncanonical_address(la))
+   if (is_noncanonical_address(la, virt_addr_bits(ctxt)))
goto bad;
 
*max_size = min_t(u64, ~0u, (1ull << 48) - la);
@@ -1721,7 +1726,7 @@ static int __load_segment_descriptor(struct 
x86_emulate_ctxt *ctxt,
if (ret != X86EMUL_CONTINUE)
return ret;
if (is_noncanonical_address(get_desc_ba

[PATCH RFC 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.

2016-12-29 Thread Liang Li
Future Intel CPUs will extend the max physical address to 52 bits.
To support the new physical address width, EPT is extended to support
5 level page tables.
This patch adds the 5 level EPT and extends the shadow page code to
support 5 level paging guests. As an RFC, this patch enables 5 level
EPT as soon as the hardware supports it, which is not a good choice
because 5 level EPT requires more memory accesses compared to 4 level
EPT. The right thing is to use 5 level EPT only when it's needed; this
will change in a future version.

Signed-off-by: Liang Li 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Kirill A. Shutemov 
Cc: Dave Hansen 
Cc: Xiao Guangrong 
Cc: Paolo Bonzini 
Cc: "Radim Kr??m" 
---
 arch/x86/include/asm/kvm_host.h |   3 +-
 arch/x86/include/asm/vmx.h  |   1 +
 arch/x86/kvm/cpuid.h|   8 ++
 arch/x86/kvm/mmu.c  | 167 +++-
 arch/x86/kvm/mmu_audit.c|   5 +-
 arch/x86/kvm/paging_tmpl.h  |  19 -
 arch/x86/kvm/vmx.c  |  19 +++--
 arch/x86/kvm/x86.h  |  10 +++
 8 files changed, 184 insertions(+), 48 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a7066dc..e505dac 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -124,6 +124,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, 
int level)
 #define KVM_NR_VAR_MTRR 8
 
 #define ASYNC_PF_PER_VCPU 64
+#define PT64_ROOT_5LEVEL 5
 
 enum kvm_reg {
VCPU_REGS_RAX = 0,
@@ -310,7 +311,7 @@ struct kvm_pio_request {
 };
 
 struct rsvd_bits_validate {
-   u64 rsvd_bits_mask[2][4];
+   u64 rsvd_bits_mask[2][PT64_ROOT_5LEVEL];
u64 bad_mt_xwr;
 };
 
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 2b5b2d4..bf2f178 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -442,6 +442,7 @@ enum vmcs_field {
 
 #define VMX_EPT_EXECUTE_ONLY_BIT   (1ull)
 #define VMX_EPT_PAGE_WALK_4_BIT(1ull << 6)
+#define VMX_EPT_PAGE_WALK_5_BIT(1ull << 7)
 #define VMX_EPTP_UC_BIT(1ull << 8)
 #define VMX_EPTP_WB_BIT(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT   (1ull << 16)
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 35058c2..4bdf3dc 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -88,6 +88,14 @@ static inline bool guest_cpuid_has_pku(struct kvm_vcpu *vcpu)
return best && (best->ecx & bit(X86_FEATURE_PKU));
 }
 
+static inline bool guest_cpuid_has_la57(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ecx & bit(X86_FEATURE_LA57));
+}
+
 static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4c40273..0a56f27 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1986,8 +1986,8 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t 
gfn,
 }
 
 struct mmu_page_path {
-   struct kvm_mmu_page *parent[PT64_ROOT_4LEVEL];
-   unsigned int idx[PT64_ROOT_4LEVEL];
+   struct kvm_mmu_page *parent[PT64_ROOT_5LEVEL];
+   unsigned int idx[PT64_ROOT_5LEVEL];
 };
 
 #define for_each_sp(pvec, sp, parents, i)  \
@@ -2198,6 +2198,11 @@ static void shadow_walk_init(struct 
kvm_shadow_walk_iterator *iterator,
!vcpu->arch.mmu.direct_map)
--iterator->level;
 
+   if (iterator->level == PT64_ROOT_5LEVEL &&
+   vcpu->arch.mmu.root_level < PT64_ROOT_5LEVEL &&
+   !vcpu->arch.mmu.direct_map)
+   iterator->level -= 2;
+
if (iterator->level == PT32E_ROOT_LEVEL) {
iterator->shadow_addr
= vcpu->arch.mmu.pae_root[(addr >> 30) & 3];
@@ -3061,9 +3066,12 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
return;
 
-   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
-   (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
-vcpu->arch.mmu.direct_map)) {
+   if ((vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
+(vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
+ vcpu->arch.mmu.direct_map)) ||
+   (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL &&
+(vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL ||
+ vcpu->arch.mmu.direct_map))) {
hpa_t root = vcpu->arch.mmu.root_hpa;
 
	spin_lock(&vcpu->kvm->mmu_lock);
@@ -3114,10 +3122,12 @@ static 

[PATCH RFC 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL

2016-12-29 Thread Liang Li
Now that we have both 4 level and 5 level page tables in 64-bit long
mode, let's rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL, so that we can
use PT64_ROOT_5LEVEL for the 5 level page table; it helps to make the
code clearer.

Signed-off-by: Liang Li 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Kirill A. Shutemov 
Cc: Dave Hansen 
Cc: Xiao Guangrong 
Cc: Paolo Bonzini 
Cc: "Radim Kr??m" 
---
 arch/x86/kvm/mmu.c   | 36 ++--
 arch/x86/kvm/mmu.h   |  2 +-
 arch/x86/kvm/mmu_audit.c |  4 ++--
 arch/x86/kvm/svm.c   |  2 +-
 4 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7012de4..4c40273 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1986,8 +1986,8 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t 
gfn,
 }
 
 struct mmu_page_path {
-   struct kvm_mmu_page *parent[PT64_ROOT_LEVEL];
-   unsigned int idx[PT64_ROOT_LEVEL];
+   struct kvm_mmu_page *parent[PT64_ROOT_4LEVEL];
+   unsigned int idx[PT64_ROOT_4LEVEL];
 };
 
 #define for_each_sp(pvec, sp, parents, i)  \
@@ -2193,8 +2193,8 @@ static void shadow_walk_init(struct 
kvm_shadow_walk_iterator *iterator,
iterator->shadow_addr = vcpu->arch.mmu.root_hpa;
iterator->level = vcpu->arch.mmu.shadow_root_level;
 
-   if (iterator->level == PT64_ROOT_LEVEL &&
-   vcpu->arch.mmu.root_level < PT64_ROOT_LEVEL &&
+   if (iterator->level == PT64_ROOT_4LEVEL &&
+   vcpu->arch.mmu.root_level < PT64_ROOT_4LEVEL &&
!vcpu->arch.mmu.direct_map)
--iterator->level;
 
@@ -3061,8 +3061,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
return;
 
-   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL &&
-   (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL ||
+   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
+   (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
 vcpu->arch.mmu.direct_map)) {
hpa_t root = vcpu->arch.mmu.root_hpa;
 
@@ -3114,10 +3114,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
struct kvm_mmu_page *sp;
unsigned i;
 
-   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
+   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
	spin_lock(&vcpu->kvm->mmu_lock);
make_mmu_pages_available(vcpu);
-   sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_LEVEL, 1, ACC_ALL);
+   sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_4LEVEL, 1, ACC_ALL);
++sp->root_count;
	spin_unlock(&vcpu->kvm->mmu_lock);
vcpu->arch.mmu.root_hpa = __pa(sp->spt);
@@ -3158,14 +3158,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 * Do we shadow a long mode page table? If so we need to
 * write-protect the guests page table root.
 */
-   if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+   if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
hpa_t root = vcpu->arch.mmu.root_hpa;
 
MMU_WARN_ON(VALID_PAGE(root));
 
	spin_lock(&vcpu->kvm->mmu_lock);
make_mmu_pages_available(vcpu);
-   sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL,
+   sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_4LEVEL,
  0, ACC_ALL);
root = __pa(sp->spt);
++sp->root_count;
@@ -3180,7 +3180,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 * the shadow page table may be a PAE or a long mode page table.
 */
pm_mask = PT_PRESENT_MASK;
-   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL)
+   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL)
pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
for (i = 0; i < 4; ++i) {
@@ -3213,7 +3213,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 * If we shadow a 32 bit page table with a long mode page
 * table we enter this path.
 */
-   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
+   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
if (vcpu->arch.mmu.lm_root == NULL) {
/*
 * The additional page necessary for this is only
@@ -3258,7 +3258,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
 
vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
-   if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+   if (vcpu->arch.mmu.root

[PATCH RFC 0/4] 5-level EPT

2016-12-29 Thread Liang Li
x86-64 currently limits the physical address width to 46 bits, which
can support 64 TiB of memory. Some vendors require support for more in
some use cases. Intel plans to extend the physical address width to
52 bits in some future products.

The current EPT implementation only supports 4 level page tables, which
can support a maximum 48-bit guest physical address width, so the EPT
needs to be extended to 5 levels to support a 52-bit physical address
width.

This patchset has been tested in the SIMICS environment with a 5 level
paging guest, which was patched with Kirill's patchset for enabling
5 level page tables, with both EPT and shadow page support. I only
covered the booting process; the guest can boot successfully.

Some parts of this patchset can be improved. Any comments on the design
or the patches would be appreciated.
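
(For reference, the 48/57-bit numbers used throughout this series follow
directly from the paging structure: each table level resolves 9 address
bits on top of the 12-bit page offset, so 4 levels cover 4 * 9 + 12 = 48
bits and 5 levels cover 5 * 9 + 12 = 57 bits.)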

Liang Li (4):
  x86: Add the new CPUID and CR4 bits for 5 level page table
  KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL
  KVM: MMU: Add 5 level EPT & Shadow page table support.
  VMX: Expose the LA57 feature to VM

 arch/x86/include/asm/cpufeatures.h  |   1 +
 arch/x86/include/asm/kvm_host.h |  15 +--
 arch/x86/include/asm/vmx.h  |   1 +
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kvm/cpuid.c|  15 ++-
 arch/x86/kvm/cpuid.h|   8 ++
 arch/x86/kvm/emulate.c  |  15 ++-
 arch/x86/kvm/kvm_cache_regs.h   |   7 +-
 arch/x86/kvm/mmu.c  | 179 +---
 arch/x86/kvm/mmu.h  |   2 +-
 arch/x86/kvm/mmu_audit.c|   5 +-
 arch/x86/kvm/paging_tmpl.h  |  19 ++-
 arch/x86/kvm/svm.c  |   2 +-
 arch/x86/kvm/vmx.c  |  23 ++--
 arch/x86/kvm/x86.c  |   8 +-
 arch/x86/kvm/x86.h  |  10 ++
 16 files changed, 234 insertions(+), 78 deletions(-)

-- 
1.9.1



[PATCH RFC 1/4] x86: Add the new CPUID and CR4 bits for 5 level page table

2016-12-29 Thread Liang Li
Define the related bits for the 5 level page table, which supports a
57-bit virtual address space. This patch may be included in Kirill's
patch set which enables 5 level page tables for x86; because 5 level
EPT doesn't depend on the 5 level page table, we put it here for
independence.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: Dave Hansen <dave.han...@linux.intel.com>
Cc: Xiao Guangrong <guangrong.x...@linux.intel.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: "Radim Kr??m" <rkrc...@redhat.com>
---
 arch/x86/include/asm/cpufeatures.h  | 1 +
 arch/x86/include/uapi/asm/processor-flags.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index eafee31..2cf4018 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -288,6 +288,7 @@
 #define X86_FEATURE_AVX512VBMI  (16*32+ 1) /* AVX512 Vector Bit Manipulation 
instructions*/
 #define X86_FEATURE_PKU(16*32+ 3) /* Protection Keys for 
Userspace */
 #define X86_FEATURE_OSPKE  (16*32+ 4) /* OS Protection Keys Enable */
+#define X86_FEATURE_LA57   (16*32 + 16) /* 5-level page tables */
 #define X86_FEATURE_RDPID  (16*32+ 22) /* RDPID instruction */
 
 /* AMD-defined CPU features, CPUID level 0x8007 (ebx), word 17 */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h 
b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50..185f3d1 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT   12 /* enable 5-level page tables */
+#define X86_CR4_LA57   _BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT   13 /* enable VMX virtualization */
 #define X86_CR4_VMXE   _BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT   14 /* enable safer mode (TXT) */
-- 
1.9.1



[PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2016-12-20 Thread Liang Li
Add a new feature which supports sending the page information as a
range array. The current implementation uses a PFN array, which is not
very efficient. Using ranges can improve the performance of
inflating/deflating significantly.
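
(As a back-of-the-envelope illustration, not the wire format defined later
in this series: describing 1 GiB of contiguous free 4K pages takes 262144
entries as a PFN array, but a single (start, length) entry as a range.)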

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..2f850bf 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,10 +34,14 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_RANGE3 /* Send page info with ranges */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+/* Bits width for the length of the pfn range */
+#define VIRTIO_BALLOON_NR_PFN_BITS 12
+
 struct virtio_balloon_config {
/* Number of pages host wants Guest to give up. */
__u32 num_pages;
@@ -82,4 +86,12 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   __le64 cmd : 8; /* Distinguish different requests type */
+   __le64 flag: 8; /* Mark status for a specific request type */
+   __le64 id : 16; /* Distinguish requests of a specific type */
+   __le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1



[PATCH v6 kernel 4/5] virtio-balloon: define flags and head for host request vq

2016-12-20 Thread Liang Li
Define the flags and header struct for a new host request virtual
queue. The guest can get requests from the host and then respond to
them on this new virtual queue.
The host can make use of this virtual queue to request that the guest
do some operations, e.g. drop the page cache, synchronize the file
system, etc. And the hypervisor can get some of the guest's runtime
information through this virtual queue too, e.g. the guest's unused
page information, which can be used for live migration optimization.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 2f850bf..b367020 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_RANGE3 /* Send page info with ranges */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -94,4 +95,25 @@ struct virtio_balloon_resp_hdr {
__le64 data_len: 32; /* Length of the following data, in bytes */
 };
 
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   __le16 cmd;
+   /* Reserved */
+   __le16 reserved[3];
+   /* Request parameter */
+   __le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1



[PATCH v6 kernel 5/5] virtio-balloon: tell host vm's unused page info

2016-12-20 Thread Liang Li
This patch contains two parts:

One is to add a new API to mm to get the unused page information.
The virtio balloon driver will use this new API to get the unused page
info and send it to the hypervisor (QEMU) to speed up live migration.
While the bitmap is being sent, some of the pages may be modified and
used by the guest; this inaccuracy can be corrected by the dirty page
logging mechanism.

The other is to add support for the host's request for the VM's unused
page information. QEMU can make use of the unused page information and
the dirty page logging mechanism to skip the transfer of some of these
unused pages; this is very helpful to reduce the network traffic and
speed up the live migration process.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 144 ++--
 include/linux/mm.h  |   3 +
 mm/page_alloc.c | 120 +
 3 files changed, 261 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 03383b3..b67f865 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -85,6 +85,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -505,6 +507,80 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void __send_unused_pages(struct virtio_balloon *vb,
+   unsigned long req_id, unsigned int pos, bool done)
+{
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   struct virtqueue *vq = vb->req_vq;
+
+   vb->resp_pos = pos;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   if (!done)
+   hdr->flag = BALLOON_FLAG_CONT;
+   else
+   hdr->flag = BALLOON_FLAG_DONE;
+
+   if (pos > 0 || done)
+   send_resp_data(vb, vq, true);
+
+}
+
+static void send_unused_pages(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned int pos = 0;
+   struct virtqueue *vq = vb->req_vq;
+   int ret, order;
+   struct zone *zone = NULL;
+   bool part_fill = false;
+
+   mutex_lock(>balloon_lock);
+
+   for (order = MAX_ORDER - 1; order >= 0; order--) {
+   ret = mark_unused_pages(, order, vb->resp_data,
+vb->resp_buf_size / sizeof(__le64),
+, VIRTIO_BALLOON_NR_PFN_BITS, part_fill);
+   if (ret == -ENOSPC) {
+   if (pos == 0) {
+   void *new_resp_data;
+
+   new_resp_data = kmalloc(2 * vb->resp_buf_size,
+   GFP_KERNEL);
+   if (new_resp_data) {
+   kfree(vb->resp_data);
+   vb->resp_data = new_resp_data;
+   vb->resp_buf_size *= 2;
+   } else {
+   part_fill = true;
+   dev_warn(>vdev->dev,
+"%s: part fill order: %d\n",
+__func__, order);
+   }
+   } else {
+   __send_unused_pages(vb, req_id, pos, false);
+   pos = 0;
+   }
+
+   if (!part_fill) {
+   order++;
+   continue;
+   }
+   } else
+   zone = NULL;
+
+   if

[PATCH v6 kernel 5/5] virtio-balloon: tell host vm's unused page info

2016-12-20 Thread Liang Li
This patch contains two parts:

One is to add a new API to mm go get the unused page information.
The virtio balloon driver will use this new API added to get the
unused page info and send it to hypervisor(QEMU) to speed up live
migration. During sending the bitmap, some the pages may be modified
and are used by the guest, this inaccuracy can be corrected by the
dirty page logging mechanism.

One is to add support the request for vm's unused page information,
QEMU can make use of unused page information and the dirty page
logging mechanism to skip the transportation of some of these unused
pages, this is very helpful to reduce the network traffic and speed
up the live migration process.

Signed-off-by: Liang Li 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
Cc: Andrea Arcangeli 
Cc: David Hildenbrand 
---
 drivers/virtio/virtio_balloon.c | 144 ++--
 include/linux/mm.h  |   3 +
 mm/page_alloc.c | 120 +
 3 files changed, 261 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 03383b3..b67f865 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -85,6 +85,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -505,6 +507,80 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void __send_unused_pages(struct virtio_balloon *vb,
+   unsigned long req_id, unsigned int pos, bool done)
+{
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   struct virtqueue *vq = vb->req_vq;
+
+   vb->resp_pos = pos;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   if (!done)
+   hdr->flag = BALLOON_FLAG_CONT;
+   else
+   hdr->flag = BALLOON_FLAG_DONE;
+
+   if (pos > 0 || done)
+   send_resp_data(vb, vq, true);
+
+}
+
+static void send_unused_pages(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned int pos = 0;
+   struct virtqueue *vq = vb->req_vq;
+   int ret, order;
+   struct zone *zone = NULL;
+   bool part_fill = false;
+
+   mutex_lock(>balloon_lock);
+
+   for (order = MAX_ORDER - 1; order >= 0; order--) {
+   ret = mark_unused_pages(, order, vb->resp_data,
+vb->resp_buf_size / sizeof(__le64),
+, VIRTIO_BALLOON_NR_PFN_BITS, part_fill);
+   if (ret == -ENOSPC) {
+   if (pos == 0) {
+   void *new_resp_data;
+
+   new_resp_data = kmalloc(2 * vb->resp_buf_size,
+   GFP_KERNEL);
+   if (new_resp_data) {
+   kfree(vb->resp_data);
+   vb->resp_data = new_resp_data;
+   vb->resp_buf_size *= 2;
+   } else {
+   part_fill = true;
+   dev_warn(>vdev->dev,
+"%s: part fill order: %d\n",
+__func__, order);
+   }
+   } else {
+   __send_unused_pages(vb, req_id, pos, false);
+   pos = 0;
+   }
+
+   if (!part_fill) {
+   order++;
+   continue;
+   }
+   } else
+   zone = NULL;
+
+   if (order == 0)
+   __send_unused_pages(vb, req_id, pos, true);
+
+   }
+
+   mutex_unlock(>balloon_lock);
+   sg_init_one(_in, >req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, _in, 1, >req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+}
+
 /*

[PATCH v6 kernel 1/5] virtio-balloon: rework deflate to add page to a list

2016-12-20 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNS, then send these PFNS to
host through virtio and process each PFN one by one. This way is not
efficient when inflating/deflating a large mount of memory because too
many times of the following operations:

1. Virtio data transmission
2. Page allocate/free
3. Address translation(GPA->HVA)
4. madvise

The over head of these operations will consume a lot of CPU cycles and
will take a long time to complete, it may impact the QoS of the guest as
well as the host. The overhead will be reduced a lot if batch processing
is used. E.g. If there are several pages whose address are physical
contiguous in the guest, these bulk pages can be processed in one
operation.

The main idea for the optimization is to reduce the above operations as
much as possible. And it can be achieved by using a {pfn|length} array
instead of a PFN array. Comparing with PFN array, {pfn|length} array can
present more pages and is fit for batch processing.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using the {pfn|length} down the
road. balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(>lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = >vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(>lru, );
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, );
mutex_unlock(>balloon_lock);
return num_freed_pages;
 }
-- 
1.9.1



[PATCH v6 kernel 1/5] virtio-balloon: rework deflate to add page to a list

2016-12-20 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNS, then send these PFNS to
host through virtio and process each PFN one by one. This way is not
efficient when inflating/deflating a large mount of memory because too
many times of the following operations:

1. Virtio data transmission
2. Page allocate/free
3. Address translation(GPA->HVA)
4. madvise

The over head of these operations will consume a lot of CPU cycles and
will take a long time to complete, it may impact the QoS of the guest as
well as the host. The overhead will be reduced a lot if batch processing
is used. E.g. If there are several pages whose address are physical
contiguous in the guest, these bulk pages can be processed in one
operation.

The main idea for the optimization is to reduce the above operations as
much as possible. And it can be achieved by using a {pfn|length} array
instead of a PFN array. Comparing with PFN array, {pfn|length} array can
present more pages and is fit for batch processing.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using the {pfn|length} down the
road. balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li 
Signed-off-by: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
Cc: Andrea Arcangeli 
Cc: David Hildenbrand 
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(>lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = >vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(>lru, );
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, );
mutex_unlock(>balloon_lock);
return num_freed_pages;
 }
-- 
1.9.1



[PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2016-12-20 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient, the time spends on different stages of inflating
the balloon to 7GB of a 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottle neck are the stage b and stage d.

If using {pfn|length} array to send the page info instead of the
PFNs, we can reduce the overhead in stage b quite a lot.
Furthermore, we can do the address translation and call madvise()
with a range of memory, instead of the current page per page way,
the overhead of stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of a 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 348 
 1 file changed, 320 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..03383b3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,20 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   __le64 *resp_data;
+   /* Size of response data buffer. */
+   unsigned int resp_buf_size;
+   /* Pointer offset of the response data. */
+   unsigned int resp_pos;
+   /* Bitmap used to save the pfns info */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +128,180 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(>acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb,
+   unsigned long nr_pfn)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(nr_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i+

[PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-20 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.
 
One is the change for speeding up the inflating & deflating process,
the main idea of this optimization is to use {pfn|length} to present
the page information instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.
 
Another change is for speeding up live migration. By skipping process
guest's unused pages in the first round of data copy, to reduce needless
data processing, this can help to save quite a lot of CPU cycles and
network bandwidth. We put guest's unused page information in a
{pfn|length} array and send it to host with the virt queue of
virtio-balloon. For an idle guest with 8GB RAM, this can help to shorten
the total live migration time from 2Sec to about 500ms in 10Gbps network
environment. For an guest with quite a lot of page cache and with little
unused pages, it's possible to let the guest drop it's page cache before
live migration, this case can benefit from this new feature too.
 
Changes from v5 to v6:
* Drop the bitmap from the virtio ABI, use {pfn|length} only.
* Enhance the API to get the unused page information from mm. 

Changes from v4 to v5:
* Drop the code to get the max_pfn, use another way instead.
* Simplify the API to get the unused page information from mm. 

Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code which is missed in v3 to handle migrate page. 
* Free the memory for bitmap intime once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap
* Address the issues referred in MST's comments

Liang Li (5):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 510 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  34 +++
 mm/page_alloc.c | 120 +
 4 files changed, 621 insertions(+), 46 deletions(-)

-- 
1.9.1



[PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2016-12-20 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient, the time spends on different stages of inflating
the balloon to 7GB of a 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottle neck are the stage b and stage d.

If using {pfn|length} array to send the page info instead of the
PFNs, we can reduce the overhead in stage b quite a lot.
Furthermore, we can do the address translation and call madvise()
with a range of memory, instead of the current page per page way,
the overhead of stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of a 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li 
Suggested-by: Michael S. Tsirkin 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
Cc: Andrea Arcangeli 
Cc: David Hildenbrand 
---
 drivers/virtio/virtio_balloon.c | 348 
 1 file changed, 320 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..03383b3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,20 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   __le64 *resp_data;
+   /* Size of response data buffer. */
+   unsigned int resp_buf_size;
+   /* Pointer offset of the response data. */
+   unsigned int resp_pos;
+   /* Bitmap used to save the pfns info */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +128,180 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(>acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb,
+   unsigned long nr_pfn)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(nr_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   memset(vb->page_bitmap[i], 0, BALLOON_BMAP_SIZE

[PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-20 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.
 
One is the change for speeding up the inflating & deflating process,
the main idea of this optimization is to use {pfn|length} to present
the page information instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.
 
Another change is for speeding up live migration. By skipping process
guest's unused pages in the first round of data copy, to reduce needless
data processing, this can help to save quite a lot of CPU cycles and
network bandwidth. We put guest's unused page information in a
{pfn|length} array and send it to host with the virt queue of
virtio-balloon. For an idle guest with 8GB RAM, this can help to shorten
the total live migration time from 2Sec to about 500ms in 10Gbps network
environment. For an guest with quite a lot of page cache and with little
unused pages, it's possible to let the guest drop it's page cache before
live migration, this case can benefit from this new feature too.
 
Changes from v5 to v6:
* Drop the bitmap from the virtio ABI, use {pfn|length} only.
* Enhance the API to get the unused page information from mm. 

Changes from v4 to v5:
* Drop the code to get the max_pfn, use another way instead.
* Simplify the API to get the unused page information from mm. 

Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code which is missed in v3 to handle migrate page. 
* Free the memory for bitmap intime once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap
* Address the issues referred in MST's comments

Liang Li (5):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 510 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  34 +++
 mm/page_alloc.c | 120 +
 4 files changed, 621 insertions(+), 46 deletions(-)

-- 
1.9.1



[PATCH kernel v5 2/5] virtio-balloon: define new feature bit and head struct

2016-11-30 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses PFNs array, which is not
very efficient. Using bitmap can improve the performance of
inflating/deflating significantly

The page bitmap header will used to tell the host some information
about the page bitmap. e.g. the page size, page bitmap length and
start pfn.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..1be4b1f 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   __le64 cmd : 8; /* Distinguish different requests type */
+   __le64 flag: 8; /* Mark status for a specific request type */
+   __le64 id : 16; /* Distinguish requests of a specific type */
+   __le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
+/* Page bitmap header structure */
+struct virtio_balloon_bmap_hdr {
+   struct {
+   __le64 start_pfn : 52; /* start pfn for the bitmap */
+   __le64 page_shift : 6; /* page shift width, in bytes */
+   __le64 bmap_len : 6;  /* bitmap length, in bytes */
+   } head;
+   __le64 bmap[0];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-11-30 Thread Liang Li
This patch contains two parts:

One is to add a new API to mm go get the unused page information.
The virtio balloon driver will use this new API added to get the
unused page info and send it to hypervisor(QEMU) to speed up live
migration. During sending the bitmap, some the pages may be modified
and are used by the guest, this inaccuracy can be corrected by the
dirty page logging mechanism.

One is to add support the request for vm's unused page information,
QEMU can make use of unused page information and the dirty page
logging mechanism to skip the transportation of some of these unused
pages, this is very helpful to reduce the network traffic and speed
up the live migration process.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 126 +---
 include/linux/mm.h  |   3 +-
 mm/page_alloc.c |  72 +++
 3 files changed, 193 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c3ddec3..2626cc0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -75,6 +75,8 @@ struct virtio_balloon {
void *resp_hdr;
/* Pointer to the start address of response data. */
unsigned long *resp_data;
+   /* Size of response data buffer. */
+   unsigned long resp_buf_size;
/* Pointer offset of the response data. */
unsigned long resp_pos;
/* Bitmap and bitmap count used to tell the host the pages */
@@ -83,6 +85,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -551,6 +555,58 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned long pos = 0;
+   struct virtqueue *vq = vb->req_vq;
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   int ret, order;
+
+   mutex_lock(>balloon_lock);
+
+   for (order = MAX_ORDER - 1; order >= 0; order--) {
+   pos = 0;
+   ret = get_unused_pages(vb->resp_data,
+vb->resp_buf_size / sizeof(unsigned long),
+order, );
+   if (ret == -ENOSPC) {
+   void *new_resp_data;
+
+   new_resp_data = kmalloc(2 * vb->resp_buf_size,
+   GFP_KERNEL);
+   if (new_resp_data) {
+   kfree(vb->resp_data);
+   vb->resp_data = new_resp_data;
+   vb->resp_buf_size *= 2;
+   order++;
+   continue;
+   } else
+   dev_warn(>vdev->dev,
+"%s: omit some %d order pages\n",
+__func__, order);
+   }
+
+   if (pos > 0) {
+   vb->resp_pos = pos;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   if (order > 0)
+   hdr->flag = BALLOON_FLAG_CONT;
+   else
+   hdr->flag = BALLOON_FLAG_DONE;
+
+   send_resp_data(vb, vq, true);
+   }
+   }
+
+   mutex_unlock(>balloon_lock);
+   sg_init_one(_in, >req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, _in, 1, >req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the 
hypervisor,
  * the stats queue operates in r

[PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-11-30 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.
 
One is the change for speeding up the inflating & deflating process,
the main idea of this optimization is to use bitmap to send the page
information to host instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.
 
Another change is for speeding up live migration. By skipping process
guest's unused pages in the first round of data copy, to reduce needless
data processing, this can help to save quite a lot of CPU cycles and
network bandwidth. We put guest's unused page information in a bitmap
and send it to host with the virt queue of virtio-balloon. For an idle
guest with 8GB RAM, this can help to shorten the total live migration
time from 2Sec to about 500ms in 10Gbps network environment.
 
Changes from v4 to v5:
* Drop the code to get the max_pfn, use another way instead.
* Simplify the API to get the unused page information from mm. 

Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code which is missed in v3 to handle migrate page. 
* Free the memory for bitmap intime once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap
* Address the issues referred in MST's comments

Liang Li (5):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 539 
 include/linux/mm.h  |   3 +-
 include/uapi/linux/virtio_balloon.h |  41 +++
 mm/page_alloc.c |  72 +
 4 files changed, 607 insertions(+), 48 deletions(-)

-- 
1.8.3.1



[PATCH kernel v5 4/5] virtio-balloon: define flags and head for host request vq

2016-11-30 Thread Liang Li
Define the flags and head struct for a new host request virtual
queue. Guest can get requests from host and then responds to them on
this new virtual queue.
Host can make use of this virtual queue to request the guest do some
operations, e.g. drop page cache, synchronize file system, etc.
And the hypervisor can get some of guest's runtime information
through this virtual queue too, e.g. the guest's unused page
information, which can be used for live migration optimization.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 1be4b1f..5ac3a40 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct virtio_balloon_bmap_hdr {
__le64 bmap[0];
 };
 
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   __le16 cmd;
+   /* Reserved */
+   __le16 reserved[3];
+   /* Request parameter */
+   __le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[PATCH kernel v5 2/5] virtio-balloon: define new feature bit and head struct

2016-11-30 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses PFNs array, which is not
very efficient. Using bitmap can improve the performance of
inflating/deflating significantly

The page bitmap header will used to tell the host some information
about the page bitmap. e.g. the page size, page bitmap length and
start pfn.

Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..1be4b1f 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   __le64 cmd : 8; /* Distinguish different requests type */
+   __le64 flag: 8; /* Mark status for a specific request type */
+   __le64 id : 16; /* Distinguish requests of a specific type */
+   __le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
+/* Page bitmap header structure */
+struct virtio_balloon_bmap_hdr {
+   struct {
+   __le64 start_pfn : 52; /* start pfn for the bitmap */
+   __le64 page_shift : 6; /* page shift width, in bytes */
+   __le64 bmap_len : 6;  /* bitmap length, in bytes */
+   } head;
+   __le64 bmap[0];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-11-30 Thread Liang Li
This patch contains two parts:

One is to add a new API to mm go get the unused page information.
The virtio balloon driver will use this new API added to get the
unused page info and send it to hypervisor(QEMU) to speed up live
migration. During sending the bitmap, some the pages may be modified
and are used by the guest, this inaccuracy can be corrected by the
dirty page logging mechanism.

One is to add support the request for vm's unused page information,
QEMU can make use of unused page information and the dirty page
logging mechanism to skip the transportation of some of these unused
pages, this is very helpful to reduce the network traffic and speed
up the live migration process.

Signed-off-by: Liang Li 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 drivers/virtio/virtio_balloon.c | 126 +---
 include/linux/mm.h  |   3 +-
 mm/page_alloc.c |  72 +++
 3 files changed, 193 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c3ddec3..2626cc0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -75,6 +75,8 @@ struct virtio_balloon {
void *resp_hdr;
/* Pointer to the start address of response data. */
unsigned long *resp_data;
+   /* Size of response data buffer. */
+   unsigned long resp_buf_size;
/* Pointer offset of the response data. */
unsigned long resp_pos;
/* Bitmap and bitmap count used to tell the host the pages */
@@ -83,6 +85,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -551,6 +555,58 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned long pos = 0;
+   struct virtqueue *vq = vb->req_vq;
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   int ret, order;
+
+   mutex_lock(>balloon_lock);
+
+   for (order = MAX_ORDER - 1; order >= 0; order--) {
+   pos = 0;
+   ret = get_unused_pages(vb->resp_data,
+vb->resp_buf_size / sizeof(unsigned long),
+order, );
+   if (ret == -ENOSPC) {
+   void *new_resp_data;
+
+   new_resp_data = kmalloc(2 * vb->resp_buf_size,
+   GFP_KERNEL);
+   if (new_resp_data) {
+   kfree(vb->resp_data);
+   vb->resp_data = new_resp_data;
+   vb->resp_buf_size *= 2;
+   order++;
+   continue;
+   } else
+   dev_warn(>vdev->dev,
+"%s: omit some %d order pages\n",
+__func__, order);
+   }
+
+   if (pos > 0) {
+   vb->resp_pos = pos;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   if (order > 0)
+   hdr->flag = BALLOON_FLAG_CONT;
+   else
+   hdr->flag = BALLOON_FLAG_DONE;
+
+   send_resp_data(vb, vq, true);
+   }
+   }
+
+   mutex_unlock(>balloon_lock);
+   sg_init_one(_in, >req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, _in, 1, >req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the 
hypervisor,
  * the stats queue operates in reverse.  The driver initializes the virtqueue
@@ -685,18 +741,56 @@ static void update_balloon_size_func(struct work_struct 
*work)
queue_work(system_freezable_wq, work);
 }
 
+static void misc_handle_rq(struct virtio_balloo

[PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-11-30 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.
 
One is the change for speeding up the inflating & deflating process,
the main idea of this optimization is to use bitmap to send the page
information to host instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.
 
Another change is for speeding up live migration. By skipping process
guest's unused pages in the first round of data copy, to reduce needless
data processing, this can help to save quite a lot of CPU cycles and
network bandwidth. We put guest's unused page information in a bitmap
and send it to host with the virt queue of virtio-balloon. For an idle
guest with 8GB RAM, this can help to shorten the total live migration
time from 2Sec to about 500ms in 10Gbps network environment.
 
Changes from v4 to v5:
* Drop the code to get the max_pfn, use another way instead.
* Simplify the API to get the unused page information from mm. 

Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code which is missed in v3 to handle migrate page. 
* Free the memory for bitmap intime once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap
* Address the issues referred in MST's comments

Liang Li (5):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 539 
 include/linux/mm.h  |   3 +-
 include/uapi/linux/virtio_balloon.h |  41 +++
 mm/page_alloc.c |  72 +
 4 files changed, 607 insertions(+), 48 deletions(-)

-- 
1.8.3.1



[PATCH kernel v5 4/5] virtio-balloon: define flags and head for host request vq

2016-11-30 Thread Liang Li
Define the flags and head struct for a new host request virtual
queue. Guest can get requests from host and then responds to them on
this new virtual queue.
Host can make use of this virtual queue to request the guest do some
operations, e.g. drop page cache, synchronize file system, etc.
And the hypervisor can get some of guest's runtime information
through this virtual queue too, e.g. the guest's unused page
information, which can be used for live migration optimization.

Signed-off-by: Liang Li 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 1be4b1f..5ac3a40 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct virtio_balloon_bmap_hdr {
__le64 bmap[0];
 };
 
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   __le16 cmd;
+   /* Reserved */
+   __le16 reserved[3];
+   /* Request parameter */
+   __le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list

2016-11-30 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNS, then send these PFNS to
host through virtio and process each PFN one by one. This way is not
efficient when inflating/deflating a large mount of memory because too
many times of the following operations:

1. Virtio data transmission
2. Page allocate/free
3. Address translation(GPA->HVA)
4. madvise

The over head of these operations will consume a lot of CPU cycles and
will take a long time to complete, it may impact the QoS of the guest as
well as the host. The overhead will be reduced a lot if batch processing
is used. E.g. If there are several pages whose address are physical
contiguous in the guest, these bulk pages can be processed in one
operation.

The main idea for the optimization is to reduce the above operations as
much as possible. And it can be achieved by using a bitmap instead of an
PFN array. Comparing with PFN array, for a specific size buffer, bitmap
can present more pages, which is very important for batch processing.

Using bitmap instead of PFN is not very helpful when inflating/deflating
a small mount of pages, in this case, using PFNs is better. But using
bitmap will not impact the QoS of guest or host heavily because the
operation will be completed very soon for a small mount of pages, and we
will use some methods to make sure the efficiency not drop too much.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(>lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = >vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(>lru, );
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, );
mutex_unlock(>balloon_lock);
return num_freed_pages;
 }
-- 
1.8.3.1



[PATCH kernel v5 3/5] virtio-balloon: speed up inflate/deflate process

2016-11-30 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient, the time spends on different stages of inflating
the balloon to 7GB of a 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottle neck are the stage b and stage d.

If using a bitmap to send the page info instead of the PFNs, we
can reduce the overhead in stage b quite a lot. Furthermore, we
can do the address translation and call madvise() with a bulk of
RAM pages, instead of the current page per page way, the overhead
of stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of a 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 395 +---
 1 file changed, 367 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..c3ddec3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,18 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   unsigned long *resp_data;
+   /* Pointer offset of the response data. */
+   unsigned long resp_pos;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +126,228 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(>acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb,
+   unsigned long nr_pfn)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(nr_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i &

[PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list

2016-11-30 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNs, then sends these PFNs to
the host through virtio and processes each PFN one by one. This way is
not efficient when inflating/deflating a large amount of memory, because
the following operations are repeated too many times:

1. Virtio data transmission
2. Page allocate/free
3. Address translation(GPA->HVA)
4. madvise

The overhead of these operations consumes a lot of CPU cycles and
takes a long time to complete, which may impact the QoS of the guest
as well as the host. The overhead will be reduced a lot if batch
processing is used. E.g. if there are several pages whose addresses
are physically contiguous in the guest, these pages can be processed
in one operation.

The main idea of the optimization is to reduce the above operations
as much as possible, and this can be achieved by using a bitmap
instead of a PFN array. Compared with a PFN array, for a buffer of a
given size, a bitmap can represent more pages, which is very
important for batch processing.
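
As a back-of-the-envelope comparison (an illustrative calculation
only, assuming one 4KB buffer and the 32-bit PFNs used by the balloon
interface):

#include <stdio.h>

int main(void)
{
    const unsigned long buf_bytes = 4096;           /* one 4KB buffer   */
    unsigned long pfn_entries = buf_bytes / 4;      /* 32-bit PFN array */
    unsigned long bitmap_pages = buf_bytes * 8;     /* one bit per page */

    printf("PFN array covers %lu pages\n", pfn_entries);   /* 1024  */
    printf("bitmap covers    %lu pages\n", bitmap_pages);  /* 32768 */
    return 0;
}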

Using a bitmap instead of PFNs is not very helpful when
inflating/deflating a small amount of pages; in this case, using PFNs
is better. But using a bitmap will not impact the QoS of the guest or
host heavily, because the operation completes very quickly for a
small amount of pages, and we will use some methods to make sure the
efficiency does not drop too much.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li 
Signed-off-by: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(>lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = >vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(>lru, );
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, );
mutex_unlock(>balloon_lock);
return num_freed_pages;
 }
-- 
1.8.3.1



[PATCH kernel v5 3/5] virtio-balloon: speed up inflate/deflate process

2016-11-30 Thread Liang Li
The current virtio-balloon implementation is not very efficient.
The time spent on the different stages of inflating the balloon
to 7GB of an 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stages b and d.

If a bitmap is used to send the page info instead of the PFNs, the
overhead of stage b can be reduced quite a lot. Furthermore, the
address translation and the madvise() call can be done on a bulk of
RAM pages instead of page by page, so the overhead of stages c and
d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms; the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li 
Suggested-by: Michael S. Tsirkin 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 drivers/virtio/virtio_balloon.c | 395 +---
 1 file changed, 367 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..c3ddec3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,18 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   unsigned long *resp_data;
+   /* Pointer offset of the response data. */
+   unsigned long resp_pos;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +126,228 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(>acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb,
+   unsigned long nr_pfn)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(nr_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   memset(vb->page_bitmap[i], 0, BALLOON_BMAP_SIZE);
+}
+
+static unsigned long do_set_resp_bitmap(struct virtio_balloon *vb,
+   uns

[PATCH kernel v4 4/7] virtio-balloon: speed up inflate/deflate process

2016-11-02 Thread Liang Li
The current virtio-balloon implementation is not very efficient.
The time spent on the different stages of inflating the balloon
to 7GB of an 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stages b and d.

If a bitmap is used to send the page info instead of the PFNs, the
overhead of stage b can be reduced quite a lot. Furthermore, the
address translation and the madvise() call can be done on a bulk of
RAM pages instead of page by page, so the overhead of stages c and
d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms; the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 398 +---
 1 file changed, 369 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 59ffe5a..c6c94b6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,18 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   unsigned long *resp_data;
+   /* Pointer offset of the response data. */
+   unsigned long resp_pos;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +126,227 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(>acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(get_max_pfn(), BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+  

[PATCH kernel v4 4/7] virtio-balloon: speed up inflate/deflate process

2016-11-02 Thread Liang Li
The current virtio-balloon implementation is not very efficient.
The time spent on the different stages of inflating the balloon
to 7GB of an 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stages b and d.

If a bitmap is used to send the page info instead of the PFNs, the
overhead of stage b can be reduced quite a lot. Furthermore, the
address translation and the madvise() call can be done on a bulk of
RAM pages instead of page by page, so the overhead of stages c and
d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms; the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li 
Suggested-by: Michael S. Tsirkin 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 drivers/virtio/virtio_balloon.c | 398 +---
 1 file changed, 369 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 59ffe5a..c6c94b6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,18 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   unsigned long *resp_data;
+   /* Pointer offset of the response data. */
+   unsigned long resp_pos;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +126,227 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(>acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(get_max_pfn(), BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   memset(vb->page_bitmap[i], 0, BALLOON_BMAP_SIZE);
+}
+
+static unsigned long do_set_resp_bitmap(struct virtio_balloon *vb,
+   unsigned long *bitmap,  unsigned long base_pfn,
+

[PATCH kernel v4 6/7] virtio-balloon: define flags and head for host request vq

2016-11-02 Thread Liang Li
Define the flags and header struct for a new host request virtual
queue. The guest can get requests from the host and then respond to
them on this new virtual queue.
The host can make use of this virtual queue to request that the guest
perform some operations, e.g. drop the page cache, synchronize the
file system, etc. The hypervisor can also get some of the guest's
runtime information through this virtual queue, e.g. the guest's
unused page information, which can be used for live migration
optimization.
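
A minimal sketch of how the guest side might dispatch requests read
from this queue, using the virtio_balloon_req_hdr layout added below;
handle_get_unused_pages() is a hypothetical handler, not part of this
patch, and the kernel definitions from this series are assumed to be
available.

/* Sketch only: dispatch one request header pulled off the request vq. */
static void handle_host_request(struct virtio_balloon_req_hdr *hdr)
{
        switch (le16_to_cpu(hdr->cmd)) {
        case BALLOON_GET_UNUSED_PAGES:
                /* scan the free lists and reply with a page bitmap */
                handle_get_unused_pages(le64_to_cpu(hdr->param));
                break;
        default:
                /* unknown commands are silently ignored */
                break;
        }
}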

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index bed6f41..c4e34d0 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct virtio_balloon_bmap_hdr {
__le64 bmap[0];
 };
 
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   __le16 cmd;
+   /* Reserved */
+   __le16 reserved[3];
+   /* Request parameter */
+   __le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[PATCH kernel v4 5/7] mm: add the related functions to get unused page

2016-11-02 Thread Liang Li
Save the unused page info into a split page bitmap. The virtio
balloon driver will use this new API to get the unused page bitmap
and send the bitmap to the hypervisor (QEMU) to speed up live
migration. While the bitmap is being sent, some of the pages may be
modified and are no longer free; this inaccuracy can be corrected by
the dirty page logging mechanism.
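
A rough sketch of how a caller might drive this API over the whole
pfn space, chunk by chunk, roughly mirroring what the virtio-balloon
driver does later in this series. The report_bitmaps() function and
the pre-allocated 'bitmaps' array are assumptions for the example.

static void scan_unused_pages(unsigned long *bitmaps[], unsigned long bits,
                              unsigned int nr_bmap)
{
        unsigned long pfn = 0, span = bits * nr_bmap;
        unsigned int i;
        int ret;

        do {
                /* clear the bitmaps before each round */
                for (i = 0; i < nr_bmap; i++)
                        memset(bitmaps[i], 0, bits / BITS_PER_BYTE);

                ret = get_unused_pages(pfn, pfn + span, bitmaps,
                                       bits, nr_bmap);
                if (ret < 0)            /* invalid parameters */
                        break;

                /* hypothetical: ship the bitmaps for [pfn, pfn + span) */
                report_bitmaps(bitmaps, pfn, bits, nr_bmap);
                pfn += span;
        } while (ret == 1);             /* more pfns above this window */
}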

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/page_alloc.c| 85 ++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f47862a..7014d8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1773,6 +1773,8 @@ extern void free_area_init_node(int nid, unsigned long * 
zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 extern unsigned long get_max_pfn(void);
+extern int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long len, unsigned int nr_bmap);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 12cc8ed..72537cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4438,6 +4438,91 @@ unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_unused_pages_bitmap(struct zone *zone,
+   unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits,
+   unsigned int nr_bmap)
+{
+   unsigned long pfn, flags, nr_pg, pos, *bmap;
+   unsigned int order, i, t, bmap_idx;
+   struct list_head *curr;
+
+   if (zone_is_empty(zone))
+   return;
+
+   end_pfn = min(start_pfn + nr_bmap * bits, end_pfn);
+   spin_lock_irqsave(>lock, flags);
+
+   for_each_migratetype_order(order, t) {
+   list_for_each(curr, >free_area[order].free_list[t]) {
+   pfn = page_to_pfn(list_entry(curr, struct page, lru));
+   if (pfn < start_pfn || pfn >= end_pfn)
+   continue;
+   nr_pg = 1UL << order;
+   if (pfn + nr_pg > end_pfn)
+   nr_pg = end_pfn - pfn;
+   bmap_idx = (pfn - start_pfn) / bits;
+   if (bmap_idx == (pfn + nr_pg - start_pfn) / bits) {
+   bmap = bitmap[bmap_idx];
+   pos = (pfn - start_pfn) % bits;
+   bitmap_set(bmap, pos, nr_pg);
+   } else
+   for (i = 0; i < nr_pg; i++) {
+   pos = pfn - start_pfn + i;
+   bmap_idx = pos / bits;
+   bmap = bitmap[bmap_idx];
+   pos = pos % bits;
+   bitmap_set(bmap, pos, 1);
+   }
+   }
+   }
+
+   spin_unlock_irqrestore(>lock, flags);
+}
+
+/*
+ * During live migration, page is always discardable unless it's
+ * content is needed by the system.
+ * get_unused_pages provides an API to get the unused pages, these
+ * unused pages can be discarded if there is no modification since
+ * the request. Some other mechanism, like the dirty page logging
+ * can be used to track the modification.
+ *
+ * This function scans the free page list to get the unused pages
+ * whose pfn are range from start_pfn to end_pfn, and set the
+ * corresponding bit in the bitmap if an unused page is found.
+ *
+ * Allocating a large bitmap may fail because of fragmentation,
+ * instead of using a single bitmap, we use a scatter/gather bitmap.
+ * The 'bitmap' is the start address of an array which contains
+ * 'nr_bmap' separate small bitmaps, each bitmap contains 'bits' bits.
+ *
+ * return -1 if parameters are invalid
+ * return 0 when end_pfn >= max_pfn
+ * return 1 when end_pfn < max_pfn
+ */
+int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits, unsigned int nr_bmap)
+{
+   struct zone *zone;
+   int ret = 0;
+
+   if (bitmap == NULL || *bitmap == NULL || nr_bmap == 0 ||
+bits == 0 || start_pfn > end_pfn)
+   return -1;
+   if (end_pfn < max_pfn)
+   ret = 1;
+   if (end_pfn >= max_pfn)
+   ret = 0;
+
+   for_each_

[PATCH kernel v4 3/7] mm: add a function to get the max pfn

2016-11-02 Thread Liang Li
Expose a function to get the max pfn so that it can be used in the
virtio-balloon device driver. Simply including 'linux/bootmem.h'
is not enough: if the device driver is built as a module, referring
to max_pfn directly leads to a build failure.
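
For instance, a driver could size a page bitmap from this hint
roughly as follows (a sketch, matching how the balloon driver sizes
its bitmaps elsewhere in this series):

/* One bit per page, rounded up to a whole long; max_pfn is only a hint. */
static unsigned long page_bitmap_bytes(void)
{
        return ALIGN(get_max_pfn(), BITS_PER_LONG) / BITS_PER_BYTE;
}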

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 10 ++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d7..f47862a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1772,6 +1772,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, 
pmd_t *pmd)
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern unsigned long get_max_pfn(void);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd42aa..12cc8ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4428,6 +4428,16 @@ void show_free_areas(unsigned int filter)
show_swap_cache_info();
 }
 
+/*
+ * The max_pfn can change because of memory hot plug, so it's only good
+ * as a hint. e.g. for sizing data structures.
+ */
+unsigned long get_max_pfn(void)
+{
+   return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
1.8.3.1



[PATCH kernel v4 5/7] mm: add the related functions to get unused page

2016-11-02 Thread Liang Li
Save the unused page info into a split page bitmap. The virtio
balloon driver will use this new API to get the unused page bitmap
and send the bitmap to the hypervisor (QEMU) to speed up live
migration. While the bitmap is being sent, some of the pages may be
modified and are no longer free; this inaccuracy can be corrected by
the dirty page logging mechanism.

Signed-off-by: Liang Li 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 include/linux/mm.h |  2 ++
 mm/page_alloc.c| 85 ++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f47862a..7014d8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1773,6 +1773,8 @@ extern void free_area_init_node(int nid, unsigned long * 
zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 extern unsigned long get_max_pfn(void);
+extern int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long len, unsigned int nr_bmap);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 12cc8ed..72537cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4438,6 +4438,91 @@ unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_unused_pages_bitmap(struct zone *zone,
+   unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits,
+   unsigned int nr_bmap)
+{
+   unsigned long pfn, flags, nr_pg, pos, *bmap;
+   unsigned int order, i, t, bmap_idx;
+   struct list_head *curr;
+
+   if (zone_is_empty(zone))
+   return;
+
+   end_pfn = min(start_pfn + nr_bmap * bits, end_pfn);
+   spin_lock_irqsave(>lock, flags);
+
+   for_each_migratetype_order(order, t) {
+   list_for_each(curr, >free_area[order].free_list[t]) {
+   pfn = page_to_pfn(list_entry(curr, struct page, lru));
+   if (pfn < start_pfn || pfn >= end_pfn)
+   continue;
+   nr_pg = 1UL << order;
+   if (pfn + nr_pg > end_pfn)
+   nr_pg = end_pfn - pfn;
+   bmap_idx = (pfn - start_pfn) / bits;
+   if (bmap_idx == (pfn + nr_pg - start_pfn) / bits) {
+   bmap = bitmap[bmap_idx];
+   pos = (pfn - start_pfn) % bits;
+   bitmap_set(bmap, pos, nr_pg);
+   } else
+   for (i = 0; i < nr_pg; i++) {
+   pos = pfn - start_pfn + i;
+   bmap_idx = pos / bits;
+   bmap = bitmap[bmap_idx];
+   pos = pos % bits;
+   bitmap_set(bmap, pos, 1);
+   }
+   }
+   }
+
+   spin_unlock_irqrestore(>lock, flags);
+}
+
+/*
+ * During live migration, page is always discardable unless it's
+ * content is needed by the system.
+ * get_unused_pages provides an API to get the unused pages, these
+ * unused pages can be discarded if there is no modification since
+ * the request. Some other mechanism, like the dirty page logging
+ * can be used to track the modification.
+ *
+ * This function scans the free page list to get the unused pages
+ * whose pfn are range from start_pfn to end_pfn, and set the
+ * corresponding bit in the bitmap if an unused page is found.
+ *
+ * Allocating a large bitmap may fail because of fragmentation,
+ * instead of using a single bitmap, we use a scatter/gather bitmap.
+ * The 'bitmap' is the start address of an array which contains
+ * 'nr_bmap' separate small bitmaps, each bitmap contains 'bits' bits.
+ *
+ * return -1 if parameters are invalid
+ * return 0 when end_pfn >= max_pfn
+ * return 1 when end_pfn < max_pfn
+ */
+int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits, unsigned int nr_bmap)
+{
+   struct zone *zone;
+   int ret = 0;
+
+   if (bitmap == NULL || *bitmap == NULL || nr_bmap == 0 ||
+bits == 0 || start_pfn > end_pfn)
+   return -1;
+   if (end_pfn < max_pfn)
+   ret = 1;
+   if (end_pfn >= max_pfn)
+   ret = 0;
+
+   for_each_populated_zone(zone)
+   mark_unused_pages_bitmap(zone, start_pfn, end_pfn, bitmap,
+bits, nr_bmap);
+
+   return ret;
+}
+EXPORT_SYMBOL(get_unused_pages);
+
 static void zoneref_set_z

[PATCH kernel v4 3/7] mm: add a function to get the max pfn

2016-11-02 Thread Liang Li
Expose a function to get the max pfn so that it can be used in the
virtio-balloon device driver. Simply including 'linux/bootmem.h'
is not enough: if the device driver is built as a module, referring
to max_pfn directly leads to a build failure.

Signed-off-by: Liang Li 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 10 ++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d7..f47862a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1772,6 +1772,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, 
pmd_t *pmd)
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern unsigned long get_max_pfn(void);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd42aa..12cc8ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4428,6 +4428,16 @@ void show_free_areas(unsigned int filter)
show_swap_cache_info();
 }
 
+/*
+ * The max_pfn can change because of memory hot plug, so it's only good
+ * as a hint. e.g. for sizing data structures.
+ */
+unsigned long get_max_pfn(void)
+{
+   return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
1.8.3.1



[PATCH kernel v4 6/7] virtio-balloon: define flags and head for host request vq

2016-11-02 Thread Liang Li
Define the flags and header struct for a new host request virtual
queue. The guest can get requests from the host and then respond to
them on this new virtual queue.
The host can make use of this virtual queue to request that the guest
perform some operations, e.g. drop the page cache, synchronize the
file system, etc. The hypervisor can also get some of the guest's
runtime information through this virtual queue, e.g. the guest's
unused page information, which can be used for live migration
optimization.

Signed-off-by: Liang Li 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index bed6f41..c4e34d0 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct virtio_balloon_bmap_hdr {
__le64 bmap[0];
 };
 
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   __le16 cmd;
+   /* Reserved */
+   __le16 reserved[3];
+   /* Request parameter */
+   __le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-02 Thread Liang Li
Support the request for the VM's unused page information and respond
with a page bitmap. QEMU can make use of this bitmap and the dirty
page logging mechanism to skip the transfer of some of these unused
pages; this is very helpful to reduce the network traffic and speed
up the live migration process.
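
Conceptually, the host side can fold the reported bitmap into its
first-pass migration bitmap along these lines (an illustrative sketch
only, not QEMU code; dirty page logging re-adds any page the guest
touches afterwards):

#include <stddef.h>

/* Drop pages reported as unused from the first-pass migration bitmap. */
static void filter_unused_pages(unsigned long *migration_bmap,
                                const unsigned long *unused_bmap,
                                size_t nlongs)
{
        size_t i;

        for (i = 0; i < nlongs; i++)
                migration_bmap[i] &= ~unused_bmap[i];
}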

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 128 +---
 1 file changed, 121 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c6c94b6..ba2d37b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -83,6 +83,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -552,6 +554,63 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
+   struct virtqueue *vq = vb->req_vq;
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   int ret = 1, used_nr_bmap = 0, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP) &&
+   vb->nr_page_bmap == 1)
+   extend_page_bitmap(vb);
+
+   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+   mutex_lock(>balloon_lock);
+   last_pfn = get_max_pfn();
+
+   while (ret) {
+   clear_page_bitmap(vb);
+   ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
+PFNS_PER_BMAP, vb->nr_page_bmap);
+   if (ret < 0)
+   break;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   bmap_len = BALLOON_BMAP_SIZE * vb->nr_page_bmap;
+
+   if (!ret) {
+   hdr->flag = BALLOON_FLAG_DONE;
+   nr_pfn = last_pfn - pfn;
+   used_nr_bmap = nr_pfn / PFNS_PER_BMAP;
+   if (nr_pfn % PFNS_PER_BMAP)
+   used_nr_bmap++;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   } else {
+   hdr->flag = BALLOON_FLAG_CONT;
+   used_nr_bmap = vb->nr_page_bmap;
+   }
+   for (i = 0; i < used_nr_bmap; i++) {
+   unsigned int bmap_size = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == used_nr_bmap)
+   bmap_size = bmap_len - BALLOON_BMAP_SIZE * i;
+   set_bulk_pages(vb, vq, pfn + i * PFNS_PER_BMAP,
+vb->page_bitmap[i], bmap_size, true);
+   }
+   if (vb->resp_pos > 0)
+   send_resp_data(vb, vq, true);
+   pfn += pfn_limit;
+   }
+
+   mutex_unlock(>balloon_lock);
+   sg_init_one(_in, >req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, _in, 1, >req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the 
hypervisor,
  * the stats queue operates in reverse.  The driver initializes the virtqueue
@@ -686,18 +745,56 @@ static void update_balloon_size_func(struct work_struct 
*work)
queue_work(system_freezable_wq, work);
 }
 
+static void misc_handle_rq(struct virtio_balloon *vb)
+{
+   struct virtio_balloon_req_hdr *ptr_hdr;
+   unsigned int len;
+
+   ptr_hdr = virtqueue_get_buf(vb->req_vq, );
+   if (!ptr_hdr || len != sizeof(vb->req_hdr))
+   return;
+
+   switch (ptr_hdr->cmd) {
+   case BALLOON_GET_UNUSED_PAGES:
+   send_unused_pages_info(vb, ptr_hdr->param);
+   break;
+   default:
+   break;
+   }
+}
+
+static void misc

[PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-02 Thread Liang Li
Support the request for the VM's unused page information and respond
with a page bitmap. QEMU can make use of this bitmap and the dirty
page logging mechanism to skip the transfer of some of these unused
pages; this is very helpful to reduce the network traffic and speed
up the live migration process.

Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 drivers/virtio/virtio_balloon.c | 128 +---
 1 file changed, 121 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c6c94b6..ba2d37b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -83,6 +83,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -552,6 +554,63 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
+   struct virtqueue *vq = vb->req_vq;
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   int ret = 1, used_nr_bmap = 0, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP) &&
+   vb->nr_page_bmap == 1)
+   extend_page_bitmap(vb);
+
+   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+   mutex_lock(>balloon_lock);
+   last_pfn = get_max_pfn();
+
+   while (ret) {
+   clear_page_bitmap(vb);
+   ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
+PFNS_PER_BMAP, vb->nr_page_bmap);
+   if (ret < 0)
+   break;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   bmap_len = BALLOON_BMAP_SIZE * vb->nr_page_bmap;
+
+   if (!ret) {
+   hdr->flag = BALLOON_FLAG_DONE;
+   nr_pfn = last_pfn - pfn;
+   used_nr_bmap = nr_pfn / PFNS_PER_BMAP;
+   if (nr_pfn % PFNS_PER_BMAP)
+   used_nr_bmap++;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   } else {
+   hdr->flag = BALLOON_FLAG_CONT;
+   used_nr_bmap = vb->nr_page_bmap;
+   }
+   for (i = 0; i < used_nr_bmap; i++) {
+   unsigned int bmap_size = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == used_nr_bmap)
+   bmap_size = bmap_len - BALLOON_BMAP_SIZE * i;
+   set_bulk_pages(vb, vq, pfn + i * PFNS_PER_BMAP,
+vb->page_bitmap[i], bmap_size, true);
+   }
+   if (vb->resp_pos > 0)
+   send_resp_data(vb, vq, true);
+   pfn += pfn_limit;
+   }
+
+   mutex_unlock(>balloon_lock);
+   sg_init_one(_in, >req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, _in, 1, >req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the 
hypervisor,
  * the stats queue operates in reverse.  The driver initializes the virtqueue
@@ -686,18 +745,56 @@ static void update_balloon_size_func(struct work_struct 
*work)
queue_work(system_freezable_wq, work);
 }
 
+static void misc_handle_rq(struct virtio_balloon *vb)
+{
+   struct virtio_balloon_req_hdr *ptr_hdr;
+   unsigned int len;
+
+   ptr_hdr = virtqueue_get_buf(vb->req_vq, );
+   if (!ptr_hdr || len != sizeof(vb->req_hdr))
+   return;
+
+   switch (ptr_hdr->cmd) {
+   case BALLOON_GET_UNUSED_PAGES:
+   send_unused_pages_info(vb, ptr_hdr->param);
+   break;
+   default:
+   break;
+   }
+}
+
+static void misc_request(struct virtqueue *vq)
+{
+   struct virtio_balloon *vb = vq->vdev->priv;
+
+   misc_handle_rq(vb);
+}
+
 static int init_vqs(struct virtio_ba

[PATCH kernel v4 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-11-02 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.

One is a change for speeding up the inflating & deflating process.
The main idea of this optimization is to use a bitmap instead of the
PFNs to send the page information to the host, which reduces the
overhead of virtio data transmission, address translation and
madvise(). This can help to improve the performance by about 85%.

Another change is for speeding up live migration. By skipping the
guest's unused pages in the first round of data copy, needless data
processing is reduced, which can help to save quite a lot of CPU
cycles and network bandwidth. We put the guest's unused page
information in a bitmap and send it to the host through a virtqueue
of the virtio-balloon device. For an idle guest with 8GB RAM, this
can help to shorten the total live migration time from 2s to about
500ms in a 10Gbps network environment.
 
Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code which was missing in v3 to handle page migration.
* Free the memory for the bitmap in time once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Address some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap.
* Address the issues referred to in MST's comments.

Liang Li (7):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  mm: add a function to get the max pfn
  virtio-balloon: speed up inflate/deflate process
  mm: add the related functions to get unused page
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 546 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  41 +++
 mm/page_alloc.c |  95 +++
 4 files changed, 636 insertions(+), 49 deletions(-)

-- 
1.8.3.1



[PATCH kernel v4 2/7] virtio-balloon: define new feature bit and head struct

2016-11-02 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses a PFN array, which is not
very efficient. Using a bitmap can improve the performance of
inflating/deflating significantly.

The page bitmap header will be used to tell the host some information
about the page bitmap, e.g. the page size, the page bitmap length and
the start pfn.
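
As an illustration of the header layout below (start_pfn:52,
page_shift:6, bmap_len:6 inside one little-endian 64-bit word), the
same packing can be written with explicit shifts. This is only a
sketch and assumes LSB-first bitfield allocation, as on little-endian
x86.

#include <stdint.h>
#include <stdio.h>

static uint64_t pack_bmap_hdr(uint64_t start_pfn, unsigned int page_shift,
                              unsigned int bmap_len)
{
    return (start_pfn & ((1ULL << 52) - 1)) |        /* bits  0..51 */
           ((uint64_t)(page_shift & 0x3f) << 52) |   /* bits 52..57 */
           ((uint64_t)(bmap_len & 0x3f) << 58);      /* bits 58..63 */
}

int main(void)
{
    /* e.g. a bitmap starting at pfn 0x12345 with 4KB pages */
    printf("hdr = 0x%016llx\n",
           (unsigned long long)pack_bmap_hdr(0x12345, 12, 8));
    return 0;
}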

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..bed6f41 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   __le64 cmd : 8; /* Distinguish different requests type */
+   __le64 flag: 8; /* Mark status for a specific request type */
+   __le64 id : 16; /* Distinguish requests of a specific type */
+   __le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
+/* Page bitmap header structure */
+struct virtio_balloon_bmap_hdr {
+   struct {
+   __le64 start_pfn : 52; /* start pfn for the bitmap */
+   __le64 page_shift : 6; /* page shift width, in bytes */
+   __le64 bmap_len : 6;  /* bitmap length, in bytes */
+   } head;
+   __le64 bmap[0];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[PATCH kernel v4 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-11-02 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.

One is a change for speeding up the inflating & deflating process.
The main idea of this optimization is to use a bitmap instead of the
PFNs to send the page information to the host, which reduces the
overhead of virtio data transmission, address translation and
madvise(). This can help to improve the performance by about 85%.

Another change is for speeding up live migration. By skipping the
guest's unused pages in the first round of data copy, needless data
processing is reduced, which can help to save quite a lot of CPU
cycles and network bandwidth. We put the guest's unused page
information in a bitmap and send it to the host through a virtqueue
of the virtio-balloon device. For an idle guest with 8GB RAM, this
can help to shorten the total live migration time from 2s to about
500ms in a 10Gbps network environment.
 
Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code which was missing in v3 to handle page migration.
* Free the memory for the bitmap in time once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Address some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap.
* Address the issues referred to in MST's comments.

Liang Li (7):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  mm: add a function to get the max pfn
  virtio-balloon: speed up inflate/deflate process
  mm: add the related functions to get unused page
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 546 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  41 +++
 mm/page_alloc.c |  95 +++
 4 files changed, 636 insertions(+), 49 deletions(-)

-- 
1.8.3.1



[PATCH kernel v4 2/7] virtio-balloon: define new feature bit and head struct

2016-11-02 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses a PFN array, which is not
very efficient. Using a bitmap can improve the performance of
inflating/deflating significantly.

The page bitmap header will be used to tell the host some information
about the page bitmap, e.g. the page size, the page bitmap length and
the start pfn.

Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..bed6f41 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   __le64 cmd : 8; /* Distinguish different requests type */
+   __le64 flag: 8; /* Mark status for a specific request type */
+   __le64 id : 16; /* Distinguish requests of a specific type */
+   __le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
+/* Page bitmap header structure */
+struct virtio_balloon_bmap_hdr {
+   struct {
+   __le64 start_pfn : 52; /* start pfn for the bitmap */
+   __le64 page_shift : 6; /* page shift width, in bytes */
+   __le64 bmap_len : 6;  /* bitmap length, in bytes */
+   } head;
+   __le64 bmap[0];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[PATCH kernel v4 1/7] virtio-balloon: rework deflate to add page to a list

2016-11-02 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNs, then sends these PFNs to
the host through virtio and processes each PFN one by one. This way is
not efficient when inflating/deflating a large amount of memory, because
the following operations are repeated too many times:

1. Virtio data transmission
2. Page allocate/free
3. Address translation(GPA->HVA)
4. madvise

The overhead of these operations consumes a lot of CPU cycles and
takes a long time to complete, which may impact the QoS of the guest
as well as the host. The overhead will be reduced a lot if batch
processing is used. E.g. if there are several pages whose addresses
are physically contiguous in the guest, these pages can be processed
in one operation.

The main idea of the optimization is to reduce the above operations
as much as possible, and this can be achieved by using a bitmap
instead of a PFN array. Compared with a PFN array, for a buffer of a
given size, a bitmap can represent more pages, which is very
important for batch processing.

Using a bitmap instead of PFNs is not very helpful when
inflating/deflating a small amount of pages; in this case, using PFNs
is better. But using a bitmap will not impact the QoS of the guest or
host heavily, because the operation completes very quickly for a
small amount of pages, and we will use some methods to make sure the
efficiency does not drop too much.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..59ffe5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(>lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = >vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(>lru, );
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, );
mutex_unlock(>balloon_lock);
return num_freed_pages;
 }
-- 
1.8.3.1



[PATCH kernel v4 1/7] virtio-balloon: rework deflate to add page to a list

2016-11-02 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNs, then sends these PFNs to
the host through virtio and processes each PFN one by one. This way is
not efficient when inflating/deflating a large amount of memory, because
the following operations are repeated too many times:

1. Virtio data transmission
2. Page allocate/free
3. Address translation(GPA->HVA)
4. madvise

The overhead of these operations consumes a lot of CPU cycles and
takes a long time to complete, which may impact the QoS of the guest
as well as the host. The overhead will be reduced a lot if batch
processing is used. E.g. if there are several pages whose addresses
are physically contiguous in the guest, these pages can be processed
in one operation.

The main idea of the optimization is to reduce the above operations
as much as possible, and this can be achieved by using a bitmap
instead of a PFN array. Compared with a PFN array, for a buffer of a
given size, a bitmap can represent more pages, which is very
important for batch processing.

Using a bitmap instead of PFNs is not very helpful when
inflating/deflating a small amount of pages; in this case, using PFNs
is better. But using a bitmap will not impact the QoS of the guest or
host heavily, because the operation completes very quickly for a
small amount of pages, and we will use some methods to make sure the
efficiency does not drop too much.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li 
Signed-off-by: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
Cc: Dave Hansen 
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..59ffe5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(>lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = >vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(>lru, );
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, );
mutex_unlock(>balloon_lock);
return num_freed_pages;
 }
-- 
1.8.3.1



[RESEND PATCH v3 kernel 6/7] virtio-balloon: define feature bit and head for misc virt queue

2016-10-21 Thread Liang Li
Define a new feature bit which indicates support for a new virtual
queue. This new virtual queue is used for information exchange
between the hypervisor and the guest. The hypervisor can use it to
request that the guest perform some operations, e.g. drop the page
cache or synchronize the file system, and it can also obtain some of
the guest's runtime information through this queue, e.g. the guest's
unused page information, which can be used to optimize live
migration.
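
For illustration only (not part of the patch): a minimal user-space
sketch of how a request travelling over such a queue could be laid
out and dispatched on the guest side. Plain fixed-width integers
stand in for the __virtio16/__virtio64 types and endianness
conversion is ignored; 'req_hdr' and 'handle_request' are made-up
names, not part of the proposed interface.

#include <stdint.h>
#include <stdio.h>

/* Mirrors the layout of struct balloon_req_hdr from the patch,
 * using host-endian integers for simplicity. */
struct req_hdr {
	uint16_t cmd;          /* which operation the host requests */
	uint16_t reserved[3];
	uint64_t param;        /* request parameter, e.g. a request id */
};

enum { GET_UNUSED_PAGES = 0 };   /* mirrors BALLOON_GET_UNUSED_PAGES */

/* Guest-side dispatch for a request read from the misc virtqueue. */
static void handle_request(const struct req_hdr *hdr)
{
	switch (hdr->cmd) {
	case GET_UNUSED_PAGES:
		printf("host asked for unused page info, request id %llu\n",
		       (unsigned long long)hdr->param);
		break;
	default:
		printf("unknown request %u, ignoring\n", (unsigned)hdr->cmd);
		break;
	}
}

int main(void)
{
	struct req_hdr hdr = { .cmd = GET_UNUSED_PAGES, .param = 42 };

	handle_request(&hdr);
	return 0;
}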

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index d3b182a..3a9d633 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_MISC_VQ   4 /* Misc info virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct balloon_bmap_hdr {
__virtio64 bmap_len;
 };
 
+enum balloon_req_id {
+   /* Get unused pages information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct balloon_req_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Reserved */
+   __virtio16 reserved[3];
+   /* Request parameter */
+   __virtio64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1



[RESEND PATCH v3 kernel 4/7] virtio-balloon: speed up inflate/deflate process

2016-10-21 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient. The time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest breaks down as:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stages b and d.

Using a bitmap to send the page info instead of the PFNs reduces
the overhead of stage b quite a lot. Furthermore, doing the address
translation and the madvise() call on a bulk of RAM pages instead
of the current page-by-page way also reduces the overhead of
stages c and d a lot.

This patch is the kernel-side implementation, intended to speed up
the inflating & deflating process by adding a new feature to the
virtio-balloon device. With this new feature, inflating the balloon
to 7GB of an 8GB idle guest takes only 590ms; the performance
improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.
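
For illustration only (not part of the patch): a small user-space
calculation of why a bitmap shrinks the stage b traffic, assuming
4KB pages, 4-byte PFN entries and a fully contiguous 7GB range
(real guests are more fragmented, so the gain is smaller in
practice).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t page_size = 4096;              /* assumed 4KB pages */
	const uint64_t inflate_bytes = 7ULL << 30;    /* 7GB balloon target */
	const uint64_t npages = inflate_bytes / page_size;

	/* Current scheme: one 32-bit PFN per page is sent to the host. */
	uint64_t pfn_array_bytes = npages * sizeof(uint32_t);

	/* Bitmap scheme: one bit per page in the covered range. */
	uint64_t bitmap_bytes = (npages + 7) / 8;

	printf("pages to report: %llu\n", (unsigned long long)npages);
	printf("PFN array: %llu KB\n", (unsigned long long)(pfn_array_bytes >> 10));
	printf("bitmap:    %llu KB\n", (unsigned long long)(bitmap_bytes >> 10));
	printf("ratio:     %.1fx smaller\n",
	       (double)pfn_array_bytes / (double)bitmap_bytes);
	return 0;
}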

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 233 +++-
 1 file changed, 209 insertions(+), 24 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 59ffe5a..c31839c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,13 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer of the bitmap header. */
+   void *bmap_hdr;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,16 +121,66 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(&vb->acked);
 }
 
+static inline void init_pfn_range(struct virtio_balloon *vb)
+{
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   if (balloon_pfn < vb->min_pfn)
+   vb->min_pfn = balloon_pfn;
+   if (balloon_pfn > vb->max_pfn)
+   vb->max_pfn = balloon_pfn;
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
-   struct scatterlist sg;
-   unsigned int len;
+   struct scatterlist sg, sg2[BALLOON_BMAP_COUNT + 1];
+   unsigned int len, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   unsigned long bmap_len;
+   int nr_pfn, nr_used_bmap, nr_buf;
+
+   nr_pfn = vb->end_pfn - vb->start_pfn + 1;
+   nr_pfn = roundup(nr_pfn, BITS_PER_LONG);
+   nr_used_bmap = nr_pfn / PFNS_PER_BMAP;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   nr_buf = nr_used_bmap + 1;
+
+   /* cmd, reserved and req_id are init to 0, unused here */
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->start_pfn = cpu_to_virtio64(vb->vdev, vb->start_pfn);
+   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+   sg_init_table(sg2, nr_buf);
+   sg_set_buf(&sg2[0], hdr, sizeof(struct balloon_bmap_hdr));
+   for (i = 0; i < nr_used_bmap; i++) {
+   unsigned int  buf_len = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == nr_used_bmap)
+   buf_len = bmap_len - BALLOON_BMAP_SIZE * i;
+   sg_set_buf(&sg2[i + 1], vb->page_bitmap[i], buf_len);
+   }
 
-   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+   while (vq->num_free < nr_buf)
+   msleep(2);
+   if (virtqueue_add_outbuf(vq, sg2, nr_buf, vb, GFP_KERNEL) == 0)

[RESEND PATCH v3 kernel 3/7] mm: add a function to get the max pfn

2016-10-21 Thread Liang Li
Expose a function to get the max pfn, so it can be used in the
virtio-balloon device driver. Simply including 'linux/bootmem.h'
is not enough: if the device driver is built as a module, referring
to max_pfn directly leads to a build failure.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffbd729..2a89da0e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1776,6 +1776,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, 
pmd_t *pmd)
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern unsigned long get_max_pfn(void);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b3bf67..e5f63a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4426,6 +4426,16 @@ void show_free_areas(unsigned int filter)
show_swap_cache_info();
 }
 
+/*
+ * The max_pfn can change because of memory hot plug, so it's only good
+ * as a hint. e.g. for sizing data structures.
+ */
+unsigned long get_max_pfn(void)
+{
+   return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
1.8.3.1



[RESEND PATCH v3 kernel 7/7] virtio-balloon: tell host vm's unused page info

2016-10-21 Thread Liang Li
Support the hypervisor's request for the VM's unused page information
and respond with a page bitmap. QEMU can use this bitmap together
with its dirty page logging mechanism to skip the transfer of these
unused pages, which is very helpful for speeding up the live
migration process.
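
For illustration only (not part of the patch): a user-space sketch of
how a receiver such as QEMU could walk one returned chunk (start pfn,
page shift, bitmap) and skip the pages whose bits are set. The field
meanings follow the patch's balloon_bmap_hdr; 'skip_unused_pages' is
a made-up helper, not QEMU's actual API.

#include <stdint.h>
#include <stdio.h>

/* Walk one bitmap chunk: every set bit marks an unused guest page that
 * the migration code may skip instead of transferring. */
static void skip_unused_pages(uint64_t start_pfn, unsigned int page_shift,
			      const unsigned long *bitmap, uint64_t nbits)
{
	const unsigned int bits_per_long = 8 * sizeof(unsigned long);
	uint64_t i;

	for (i = 0; i < nbits; i++) {
		if (bitmap[i / bits_per_long] & (1UL << (i % bits_per_long))) {
			uint64_t gpa = (start_pfn + i) << page_shift;

			printf("skip unused page at guest address 0x%llx\n",
			       (unsigned long long)gpa);
		}
	}
}

int main(void)
{
	/* Pretend the guest reported pages 1, 3 and 4 of a chunk as unused. */
	unsigned long bitmap[1] = { (1UL << 1) | (1UL << 3) | (1UL << 4) };

	skip_unused_pages(0x1000, 12, bitmap, 8);
	return 0;
}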

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 143 +---
 1 file changed, 134 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c31839c..f10bb8b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *misc_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -78,6 +78,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -423,6 +425,78 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in, sg_out[BALLOON_BMAP_COUNT + 1];
+   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
+   struct virtqueue *vq = vb->misc_vq;
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   int ret = 1, nr_buf, used_nr_bmap = 0, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP) &&
+   vb->nr_page_bmap == 1)
+   extend_page_bitmap(vb);
+
+   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+   mutex_lock(&vb->balloon_lock);
+   last_pfn = get_max_pfn();
+
+   while (ret) {
+   clear_page_bitmap(vb);
+   ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
+PFNS_PER_BMAP, vb->nr_page_bmap);
+   if (ret < 0)
+   break;
+   hdr->cmd = cpu_to_virtio16(vb->vdev, BALLOON_GET_UNUSED_PAGES);
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
+   hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
+   bmap_len = BALLOON_BMAP_SIZE * vb->nr_page_bmap;
+
+   if (!ret) {
+   hdr->flag = cpu_to_virtio16(vb->vdev,
+BALLOON_FLAG_DONE);
+   nr_pfn = last_pfn - pfn;
+   used_nr_bmap = nr_pfn / PFNS_PER_BMAP;
+   if (nr_pfn % PFNS_PER_BMAP)
+   used_nr_bmap++;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   } else {
+   hdr->flag = cpu_to_virtio16(vb->vdev,
+   BALLOON_FLAG_CONT);
+   used_nr_bmap = vb->nr_page_bmap;
+   }
+   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+   nr_buf = used_nr_bmap + 1;
+   sg_init_table(sg_out, nr_buf);
+   sg_set_buf(&sg_out[0], hdr, sizeof(struct balloon_bmap_hdr));
+   for (i = 0; i < used_nr_bmap; i++) {
+   unsigned int buf_len = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == used_nr_bmap)
+   buf_len = bmap_len - BALLOON_BMAP_SIZE * i;
+   sg_set_buf(&sg_out[i + 1], vb->page_bitmap[i], buf_len);
+   }
+
+   while (vq->num_free < nr_buf)
+   msleep(2);
+   if (virtqueue_add_outbuf(vq, sg_out, nr_buf, vb,
+GFP_KERNEL) == 0) {
+   virtqueue_kick(vq);
+   while (!virtqueue_get_buf(vq, )
+   && !virtqueue_is_broken(vq))
+   cpu_relax();
+   }
+   pfn += pfn_limit;
+   }
+
+   mutex_unlock(&vb->balloon_lock);
+   sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the 
hypervisor,
  * the stats queue operat
